End-to-end streaming data pipeline that ingests Reddit posts via the PRAW API, processes them with PySpark Structured Streaming, stores results in a serverless lakehouse (S3 + Glue + Athena), and visualises live trends in a Streamlit dashboard.
```
Reddit API (PRAW)
      │  OAuth2 polling every 30 s
      ▼
confluent-kafka Producer
      │  JSON events keyed by subreddit
      ▼
Kafka topic: reddit-posts
      │
      ├──▶ console_consumer.py          (Phase 1 — verify end-to-end)
      │
      └──▶ PySpark Structured Streaming
              │  5-min tumbling windows
              │  watermark: 10 min
              ▼
           Parquet files (partitioned by subreddit/date/hour)
              │
              ├── local filesystem      (Phase 2)
              └── S3 data lake          (Phase 3)
                    │
                    ▼
              AWS Glue Crawler (hourly)
                    │
                    ▼
              Glue Data Catalog
                    │
                    ▼
              Amazon Athena (SQL)
                    │
                    ▼
              Streamlit Dashboard       (Phase 4)
```
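The ingestion step at the top of the diagram (PRAW → Kafka, JSON events keyed by subreddit) can be sketched roughly as below. This is an illustrative sketch, not the repo's `producer/reddit_producer.py`: the event field names, the `reddit-trend-pipeline` user agent, and the `r/all` listing are all assumptions.

```python
import json
import os
import time

def post_to_event(post: dict) -> tuple[str, bytes]:
    """Build a Kafka (key, value) pair: keyed by subreddit, JSON-encoded value."""
    event = {
        "id": post["id"],
        "subreddit": post["subreddit"],
        "title": post["title"],
        "score": post["score"],
        # Event time used later by Spark's windows/watermark.
        "ingested_at": post.get("ingested_at", time.time()),
    }
    return event["subreddit"], json.dumps(event).encode("utf-8")

def run_producer() -> None:
    """Poll Reddit every 30 s and publish new posts to Kafka.

    Requires praw + confluent-kafka and the credentials from .env;
    call this from a __main__ entry point.
    """
    import praw
    from confluent_kafka import Producer

    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="reddit-trend-pipeline",  # assumed name
    )
    producer = Producer({"bootstrap.servers": "localhost:9092"})
    while True:
        for post in reddit.subreddit("all").new(limit=100):
            key, value = post_to_event({
                "id": post.id,
                "subreddit": str(post.subreddit),
                "title": post.title,
                "score": post.score,
            })
            producer.produce("reddit-posts", key=key, value=value)
        producer.flush()
        time.sleep(30)  # matches the 30 s polling in the diagram
```

Keying by subreddit means all events for one subreddit land on the same partition, preserving per-subreddit ordering downstream.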
| Layer | Technology | Why |
|---|---|---|
| Ingestion | PRAW + confluent-kafka | OAuth2 auth; production Kafka client |
| Broker | Kafka (Docker Compose) | Decouples producer from consumer; replay |
| Processing | PySpark Structured Streaming | Same API as batch; windowed aggregations |
| Storage | S3 Parquet | Columnar; cheap; Glue-compatible |
| Catalog | AWS Glue | Auto-schema discovery; Athena integration |
| Query | Amazon Athena | Serverless SQL; pay-per-query |
| Dashboard | Streamlit | Fast to build; Python-native |
| IaC | Terraform | Reproducible AWS infra |
- Docker + Docker Compose
- Python 3.11+
- Java 11+ (for PySpark)
- AWS account (Phase 3+)
```bash
# 1. Clone & install deps
git clone <repo-url>
cd realtime_reddit_trend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 2. Configure credentials
cp .env.example .env
# Edit .env: add REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET

# 3. Start Kafka
docker compose up -d

# 4. Run the producer (terminal 1)
python producer/reddit_producer.py

# 5. Run the console consumer (terminal 2)
python consumer/console_consumer.py
```

Open http://localhost:8080 to inspect the Kafka topic in the browser UI.
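The Phase 1 console consumer only needs to tail the topic and print each event. A minimal sketch with confluent-kafka, assuming the event fields shown earlier and a `console-debug` consumer group (both illustrative, not taken from `consumer/console_consumer.py`):

```python
import json

def format_event(raw: bytes) -> str:
    """Render one Kafka message as a single console line."""
    e = json.loads(raw)
    return f"[{e['subreddit']}] {e['title']!r} (score={e['score']})"

def consume_forever() -> None:
    """Tail reddit-posts and print each event; Ctrl-C to stop."""
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "console-debug",       # assumed group name
        "auto.offset.reset": "earliest",   # replay the topic from the start
    })
    consumer.subscribe(["reddit-posts"])
    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None or msg.error():
                continue
            print(format_event(msg.value()))
    finally:
        consumer.close()
```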
```bash
# In a new terminal (Kafka must still be running)
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
  spark/spark_streaming.py
```

Parquet files land in `./output/trending/`.
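The windowed aggregation this job performs could look like the sketch below. It assumes the JSON field names used elsewhere in this README; the real `spark/spark_streaming.py` is authoritative, and the write side (partitioned Parquet sink plus checkpointing) is omitted here.

```python
WINDOW = "5 minutes"       # tumbling window size
WATERMARK = "10 minutes"   # how long to wait for late events
MAX_OFFSETS = 10_000       # backpressure cap per micro-batch (maxOffsetsPerTrigger)

def build_trending_stream(spark):
    """Kafka source → parse JSON → 5-min windowed stats per subreddit."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StringType, LongType, DoubleType

    schema = (StructType()
              .add("id", StringType())
              .add("subreddit", StringType())
              .add("title", StringType())
              .add("score", LongType())
              .add("ingested_at", DoubleType()))   # epoch seconds (event time)

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "reddit-posts")
           .option("maxOffsetsPerTrigger", MAX_OFFSETS)
           .load())

    events = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
              .select("e.*")
              # epoch seconds → timestamp, so watermarking works on event time
              .withColumn("event_time", F.col("ingested_at").cast("timestamp")))

    return (events
            .withWatermark("event_time", WATERMARK)
            .groupBy(F.window("event_time", WINDOW), F.col("subreddit"))
            .agg(F.count("*").alias("posts"), F.avg("score").alias("avg_score")))
```

The watermark lets Spark drop window state once it is sure no sufficiently old events can still arrive, which is what makes append-mode Parquet output possible.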
```bash
# Provision AWS infrastructure
cd aws/terraform
cp terraform.tfvars.example terraform.tfvars  # set your bucket name
terraform init && terraform apply

# Update .env:
#   OUTPUT_PATH=s3a://your-bucket/trending
#   CHECKPOINT_PATH=s3a://your-bucket/checkpoints
#   + AWS credentials
```
```bash
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,\
org.apache.hadoop:hadoop-aws:3.3.4,\
com.amazonaws:aws-java-sdk-bundle:1.12.262 \
  spark/spark_streaming.py
```

Trigger the Glue crawler manually once to populate the catalog, then query with Athena using the SQL in `queries/trending_topics.sql`.
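`queries/trending_topics.sql` holds the project's actual SQL; if you want to run an Athena query programmatically instead of from the console, a boto3 sketch could look like this. The `reddit_trending` table name, the query text, and the database/output-location parameters are all assumptions, not taken from the repo.

```python
import time

# Hypothetical query; the crawler-created table and partition names may differ.
TRENDING_SQL = """
SELECT subreddit, sum(posts) AS posts
FROM reddit_trending
GROUP BY subreddit
ORDER BY posts DESC
LIMIT 20
"""

def run_athena_query(sql: str, database: str, output_s3: str) -> list[list[str]]:
    """Start an Athena query, poll until it finishes, and return the result rows."""
    import boto3

    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]

    while True:
        state = athena.get_query_execution(
            QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query finished in state {state}")

    result = athena.get_query_results(QueryExecutionId=qid)
    # First row is the header; each cell is a VarCharValue.
    return [[c.get("VarCharValue", "") for c in row["Data"]]
            for row in result["ResultSet"]["Rows"]]
```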
```bash
# Set ATHENA_* vars in .env, then:
streamlit run dashboard/app.py
```
```
├── producer/
│   └── reddit_producer.py    # PRAW → Kafka
├── consumer/
│   └── console_consumer.py   # Phase 1 debug consumer
├── spark/
│   └── spark_streaming.py    # PySpark Structured Streaming
├── aws/
│   └── terraform/
│       ├── main.tf           # S3, Glue, Athena resources
│       ├── variables.tf
│       └── outputs.tf
├── queries/
│   └── trending_topics.sql   # Athena query library
├── dashboard/
│   └── app.py                # Streamlit live leaderboard
├── docker-compose.yml        # Kafka + Zookeeper + Kafka UI
├── requirements.txt
└── .env.example
```
- Event time vs. processing time — windows are based on `ingested_at` (event time) with a 10-minute watermark to handle late data correctly.
- Backpressure — `maxOffsetsPerTrigger` caps how many Kafka messages Spark reads per micro-batch, preventing OOM on spikes.
- Lakehouse pattern — raw Parquet in S3 → Glue schema catalog → Athena SQL. Adding dbt on top of Athena is a natural next step.
- Rate-limit handling — PRAW respects Reddit's OAuth2 rate limits (60 req/min); the 30-second poll interval stays well within quota.
- Idempotency — the producer tracks seen post IDs in-memory within a session; Spark checkpointing handles exactly-once on the consumer side.
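The in-memory dedup mentioned above can be as simple as a set of seen post IDs. A bounded variant (an assumption; the repo's producer may use a plain set) keeps memory flat during long sessions by evicting the oldest IDs:

```python
from collections import OrderedDict

class SeenPosts:
    """Bounded in-memory dedup of Reddit post IDs (oldest-first eviction)."""

    def __init__(self, max_size: int = 100_000):
        self._seen: OrderedDict[str, None] = OrderedDict()
        self._max = max_size

    def add(self, post_id: str) -> bool:
        """Return True if the ID is new (produce it), False if already seen."""
        if post_id in self._seen:
            return False
        self._seen[post_id] = None
        if len(self._seen) > self._max:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True
```

Because Reddit post IDs are monotonic-ish per listing, oldest-first eviction rarely re-admits a duplicate in practice; and even when it does, the Spark side's checkpointed aggregation limits the damage to a slightly inflated count.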