Reddit Real-Time Trending Topics Pipeline

End-to-end streaming data pipeline that ingests Reddit posts via the PRAW API, processes them with PySpark Structured Streaming, stores results in a serverless lakehouse (S3 + Glue + Athena), and visualises live trends in a Streamlit dashboard.

Architecture

Reddit API (PRAW)
      │  OAuth2 polling every 30s
      ▼
confluent-kafka Producer
      │  JSON events keyed by subreddit
      ▼
Kafka Topic: reddit-posts
      │  
      ├──▶ console_consumer.py   (Phase 1 — verify end-to-end)
      │
      └──▶ PySpark Structured Streaming
                │  5-min tumbling windows
                │  watermark: 10 min
                ▼
           Parquet files (partitioned subreddit/date/hour)
                │
                ├── Local filesystem  (Phase 2)
                └── S3 data lake      (Phase 3)
                          │
                     AWS Glue Crawler (hourly)
                          │
                     Glue Data Catalog
                          │
                     Amazon Athena (SQL)
                          │
                     Streamlit Dashboard  (Phase 4)

Tech stack

| Layer | Technology | Why |
| --- | --- | --- |
| Ingestion | PRAW + confluent-kafka | OAuth2 auth; production-grade Kafka client |
| Broker | Kafka (Docker Compose) | Decouples producer from consumer; replay |
| Processing | PySpark Structured Streaming | Same API as batch; windowed aggregations |
| Storage | S3 Parquet | Columnar; cheap; Glue-compatible |
| Catalog | AWS Glue | Auto schema discovery; Athena integration |
| Query | Amazon Athena | Serverless SQL; pay-per-query |
| Dashboard | Streamlit | Fast to build; Python-native |
| IaC | Terraform | Reproducible AWS infra |

Quickstart

Prerequisites

  • Docker + Docker Compose
  • Python 3.11+
  • Java 11+ (for PySpark)
  • AWS account (Phase 3+)

Phase 1 — Local: Reddit → Kafka → Console

# 1. Clone & install deps
git clone <repo-url>
cd realtime_reddit_trend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 2. Configure credentials
cp .env.example .env
# Edit .env: add REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET

# 3. Start Kafka
docker compose up -d

# 4. Run the producer (terminal 1)
python producer/reddit_producer.py

# 5. Run the console consumer (terminal 2)
python consumer/console_consumer.py

Open http://localhost:8080 to inspect the Kafka topic in the browser UI.
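The producer's core transformation — turning a Reddit post into a JSON event keyed by subreddit — can be sketched in plain Python. This is a hedged sketch: the field names (`ingested_at`, etc.) follow the README's description, but the exact schema lives in `producer/reddit_producer.py` and may differ.

```python
import json
import time


def build_event(post: dict) -> tuple[bytes, bytes]:
    """Serialize a Reddit post into a (key, value) pair for Kafka.

    The key is the subreddit name, so all posts from one subreddit
    land in the same partition and retain their relative order.
    """
    event = {
        "id": post["id"],
        "subreddit": post["subreddit"],
        "title": post["title"],
        "score": post["score"],
        "ingested_at": post.get("ingested_at", time.time()),
    }
    return event["subreddit"].encode("utf-8"), json.dumps(event).encode("utf-8")


# A confluent-kafka Producer would then send it roughly like:
#   producer.produce("reddit-posts", key=key, value=value)
```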

Phase 2 — Add PySpark (local filesystem output)

# In a new terminal (Kafka must still be running)
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
  spark/spark_streaming.py

Parquet files land in ./output/trending/.
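Conceptually, the 5-minute tumbling windows and 10-minute watermark that `spark/spark_streaming.py` applies behave as below — a plain-Python illustration of the semantics, not Spark's implementation:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)      # tumbling window size
WATERMARK = timedelta(minutes=10)  # how late an event may arrive


def window_start(event_time: datetime) -> datetime:
    """Floor an event timestamp to the start of its 5-minute window."""
    seconds = int(WINDOW.total_seconds())
    epoch = int(event_time.timestamp())
    return datetime.fromtimestamp(epoch - epoch % seconds)


def is_late(event_time: datetime, max_event_time: datetime) -> bool:
    """An event is dropped once the watermark has passed its timestamp."""
    return event_time < max_event_time - WATERMARK
```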

Phase 3 — AWS (S3 + Glue + Athena)

# Provision AWS infrastructure
cd aws/terraform
cp terraform.tfvars.example terraform.tfvars   # set your bucket name
terraform init && terraform apply

# Update .env:
#   OUTPUT_PATH=s3a://your-bucket/trending
#   CHECKPOINT_PATH=s3a://your-bucket/checkpoints
#   + AWS credentials

spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,\
org.apache.hadoop:hadoop-aws:3.3.4,\
com.amazonaws:aws-java-sdk-bundle:1.12.262 \
  spark/spark_streaming.py

Trigger the Glue crawler manually once to populate the catalog, then query with Athena using the SQL in queries/trending_topics.sql.
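The `subreddit/date/hour` partitioning mentioned in the architecture is what lets the Glue crawler infer partition columns. A sketch of the Hive-style path layout (assuming `key=value` directory names; the actual naming in `spark_streaming.py` may differ):

```python
from datetime import datetime


def partition_path(base: str, subreddit: str, ts: datetime) -> str:
    """Build a Hive-style partition path under the S3 data lake root.

    Glue maps key=value directory segments to partition columns,
    which Athena can then prune on for cheaper queries.
    """
    return (f"{base}/subreddit={subreddit}"
            f"/date={ts:%Y-%m-%d}/hour={ts:%H}")
```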

Phase 4 — Streamlit Dashboard

# Set ATHENA_* vars in .env, then:
streamlit run dashboard/app.py
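The dashboard's leaderboard logic amounts to merging per-window counts and ranking subreddits. A minimal sketch, assuming Athena returns `(subreddit, post_count)` rows (the real query columns live in `queries/trending_topics.sql`):

```python
from collections import Counter


def leaderboard(rows: list[tuple[str, int]], top_n: int = 10) -> list[tuple[str, int]]:
    """Merge per-window counts and rank subreddits by total posts."""
    totals = Counter()
    for subreddit, count in rows:
        totals[subreddit] += count
    return totals.most_common(top_n)
```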

Project structure

.
├── producer/
│   └── reddit_producer.py      # PRAW → Kafka
├── consumer/
│   └── console_consumer.py     # Phase 1 debug consumer
├── spark/
│   └── spark_streaming.py      # PySpark Structured Streaming
├── aws/
│   └── terraform/
│       ├── main.tf             # S3, Glue, Athena resources
│       ├── variables.tf
│       └── outputs.tf
├── queries/
│   └── trending_topics.sql     # Athena query library
├── dashboard/
│   └── app.py                  # Streamlit live leaderboard
├── docker-compose.yml          # Kafka + Zookeeper + Kafka UI
├── requirements.txt
└── .env.example

Key engineering talking points

  • Event time vs. processing time — windows are based on ingested_at (event time) with a 10-minute watermark to handle late data correctly.
  • Backpressure — maxOffsetsPerTrigger caps how many Kafka messages Spark reads per micro-batch, preventing OOM on traffic spikes.
  • Lakehouse pattern — raw Parquet in S3 → Glue schema catalog → Athena SQL. Adding dbt on top of Athena is a natural next step.
  • Rate-limit handling — PRAW respects Reddit's OAuth2 rate limits (60 req/min); the 30-second poll interval stays well within quota.
  • Idempotency — the producer tracks seen post IDs in memory within a session; Spark checkpointing provides exactly-once semantics on the consumer side.
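The in-memory dedup described above can be sketched as follows (a hypothetical helper, not necessarily how `reddit_producer.py` structures it):

```python
def deduplicate(posts, seen: set):
    """Yield only posts whose ID has not been produced this session.

    In-memory only: restarting the producer empties the set, so
    end-to-end exactly-once relies on Spark checkpointing downstream.
    """
    for post in posts:
        if post["id"] not in seen:
            seen.add(post["id"])
            yield post
```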
