End-to-end streaming data pipeline that ingests Reddit posts via the PRAW API, processes them with PySpark Structured Streaming, stores results in a serverless lakehouse (S3 + Glue + Athena), and visualises live trends in a Streamlit dashboard.
```
Reddit API (PRAW)
      │  OAuth2 polling every 30 s
      ▼
confluent-kafka Producer
      │  JSON events keyed by subreddit
      ▼
Kafka topic: reddit-posts
      │
      ├──▶ console_consumer.py          (Phase 1 — verify end-to-end)
      │
      └──▶ PySpark Structured Streaming
              │  5-min tumbling windows
              │  watermark: 10 min
              ▼
           Parquet files (partitioned by subreddit/date/hour)
              │
              ├── local filesystem      (Phase 2)
              └── S3 data lake          (Phase 3)
                    │
                    ▼
              AWS Glue Crawler (hourly)
                    │
                    ▼
              Glue Data Catalog
                    │
                    ▼
              Amazon Athena (SQL)
                    │
                    ▼
              Streamlit Dashboard       (Phase 4)
```
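The ingestion step at the top of the diagram (PRAW → Kafka, JSON events keyed by subreddit) can be sketched roughly as below. This is an illustrative sketch, not the repo's `producer/reddit_producer.py`: the event field names, the `reddit-trend-pipeline` user agent, and the `r/all` listing are all assumptions.

```python
import json
import os
import time

def post_to_event(post: dict) -> tuple[str, bytes]:
    """Build a Kafka (key, value) pair: keyed by subreddit, JSON-encoded value."""
    event = {
        "id": post["id"],
        "subreddit": post["subreddit"],
        "title": post["title"],
        "score": post["score"],
        # Event time used later by Spark's windows/watermark.
        "ingested_at": post.get("ingested_at", time.time()),
    }
    return event["subreddit"], json.dumps(event).encode("utf-8")

def run_producer() -> None:
    """Poll Reddit every 30 s and publish new posts to Kafka.

    Requires praw + confluent-kafka and the credentials from .env;
    call this from a __main__ entry point.
    """
    import praw
    from confluent_kafka import Producer

    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="reddit-trend-pipeline",  # assumed name
    )
    producer = Producer({"bootstrap.servers": "localhost:9092"})
    while True:
        for post in reddit.subreddit("all").new(limit=100):
            key, value = post_to_event({
                "id": post.id,
                "subreddit": str(post.subreddit),
                "title": post.title,
                "score": post.score,
            })
            producer.produce("reddit-posts", key=key, value=value)
        producer.flush()
        time.sleep(30)  # matches the 30 s polling in the diagram
```

Keying by subreddit means all events for one subreddit land on the same partition, preserving per-subreddit ordering downstream.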
| Layer | Technology | Why |
|---|---|---|
| Ingestion | PRAW + confluent-kafka | OAuth2 auth; production Kafka client |
| Broker | Kafka (Docker Compose) | Decouples producer from consumer; replay |
| Processing | PySpark Structured Streaming | Same API as batch; windowed aggregations |
| Storage | S3 Parquet | Columnar; cheap; Glue-compatible |
| Catalog | AWS Glue | Auto-schema discovery; Athena integration |
| Query | Amazon Athena | Serverless SQL; pay-per-query |
| Dashboard | Streamlit | Fast to build; Python-native |
| IaC | Terraform | Reproducible AWS infra |
- Docker + Docker Compose
- Python 3.11+
- Java 11+ (for PySpark)
- AWS account (Phase 3+)
```bash
# 1. Clone & install deps
git clone <repo-url>
cd realtime_reddit_trend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 2. Configure credentials
cp .env.example .env
# Edit .env: add REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET

# 3. Start Kafka
docker compose up -d

# 4. Run the producer (terminal 1)
python producer/reddit_producer.py

# 5. Run the console consumer (terminal 2)
python consumer/console_consumer.py
```

Open http://localhost:8080 to inspect the Kafka topic in the browser UI.
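The Phase 1 console consumer only needs to tail the topic and print each event. A minimal sketch with confluent-kafka, assuming the event fields shown earlier and a `console-debug` consumer group (both illustrative, not taken from `consumer/console_consumer.py`):

```python
import json

def format_event(raw: bytes) -> str:
    """Render one Kafka message as a single console line."""
    e = json.loads(raw)
    return f"[{e['subreddit']}] {e['title']!r} (score={e['score']})"

def consume_forever() -> None:
    """Tail reddit-posts and print each event; Ctrl-C to stop."""
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "console-debug",       # assumed group name
        "auto.offset.reset": "earliest",   # replay the topic from the start
    })
    consumer.subscribe(["reddit-posts"])
    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None or msg.error():
                continue
            print(format_event(msg.value()))
    finally:
        consumer.close()
```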
```bash
# In a new terminal (Kafka must still be running)
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
  spark/spark_streaming.py
```

Parquet files land in `./output/trending/`.
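The windowed aggregation this job performs could look like the sketch below. It assumes the JSON field names used elsewhere in this README; the real `spark/spark_streaming.py` is authoritative, and the write side (partitioned Parquet sink plus checkpointing) is omitted here.

```python
WINDOW = "5 minutes"       # tumbling window size
WATERMARK = "10 minutes"   # how long to wait for late events
MAX_OFFSETS = 10_000       # backpressure cap per micro-batch (maxOffsetsPerTrigger)

def build_trending_stream(spark):
    """Kafka source → parse JSON → 5-min windowed stats per subreddit."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StringType, LongType, DoubleType

    schema = (StructType()
              .add("id", StringType())
              .add("subreddit", StringType())
              .add("title", StringType())
              .add("score", LongType())
              .add("ingested_at", DoubleType()))   # epoch seconds (event time)

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "reddit-posts")
           .option("maxOffsetsPerTrigger", MAX_OFFSETS)
           .load())

    events = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
              .select("e.*")
              # epoch seconds → timestamp, so watermarking works on event time
              .withColumn("event_time", F.col("ingested_at").cast("timestamp")))

    return (events
            .withWatermark("event_time", WATERMARK)
            .groupBy(F.window("event_time", WINDOW), F.col("subreddit"))
            .agg(F.count("*").alias("posts"), F.avg("score").alias("avg_score")))
```

The watermark lets Spark drop window state once it is sure no sufficiently old events can still arrive, which is what makes append-mode Parquet output possible.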
```bash
# Provision AWS infrastructure
cd aws/terraform
cp terraform.tfvars.example terraform.tfvars  # set your bucket name
terraform init && terraform apply

# Update .env:
#   OUTPUT_PATH=s3a://your-bucket/trending
#   CHECKPOINT_PATH=s3a://your-bucket/checkpoints
#   + AWS credentials
```
```bash
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,\
org.apache.hadoop:hadoop-aws:3.3.4,\
com.amazonaws:aws-java-sdk-bundle:1.12.262 \
  spark/spark_streaming.py
```

Trigger the Glue crawler manually once to populate the catalog, then query with Athena using the SQL in `queries/trending_topics.sql`.
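`queries/trending_topics.sql` holds the project's actual SQL; if you want to run an Athena query programmatically instead of from the console, a boto3 sketch could look like this. The `reddit_trending` table name, the query text, and the database/output-location parameters are all assumptions, not taken from the repo.

```python
import time

# Hypothetical query; the crawler-created table and partition names may differ.
TRENDING_SQL = """
SELECT subreddit, sum(posts) AS posts
FROM reddit_trending
GROUP BY subreddit
ORDER BY posts DESC
LIMIT 20
"""

def run_athena_query(sql: str, database: str, output_s3: str) -> list[list[str]]:
    """Start an Athena query, poll until it finishes, and return the result rows."""
    import boto3

    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]

    while True:
        state = athena.get_query_execution(
            QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query finished in state {state}")

    result = athena.get_query_results(QueryExecutionId=qid)
    # First row is the header; each cell is a VarCharValue.
    return [[c.get("VarCharValue", "") for c in row["Data"]]
            for row in result["ResultSet"]["Rows"]]
```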
```bash
# Set ATHENA_* vars in .env, then:
streamlit run dashboard/app.py
```
```
├── producer/
│   └── reddit_producer.py    # PRAW → Kafka
├── consumer/
│   └── console_consumer.py   # Phase 1 debug consumer
├── spark/
│   └── spark_streaming.py    # PySpark Structured Streaming
├── aws/
│   └── terraform/
│       ├── main.tf           # S3, Glue, Athena resources
│       ├── variables.tf
│       └── outputs.tf
├── queries/
│   └── trending_topics.sql   # Athena query library
├── dashboard/
│   └── app.py                # Streamlit live leaderboard
├── docker-compose.yml        # Kafka + Zookeeper + Kafka UI
├── requirements.txt
└── .env.example
```
- Event time vs. processing time — windows are based on `ingested_at` (event time) with a 10-minute watermark to handle late data correctly.
- Backpressure — `maxOffsetsPerTrigger` caps how many Kafka messages Spark reads per micro-batch, preventing OOM on spikes.
- Lakehouse pattern — raw Parquet in S3 → Glue schema catalog → Athena SQL. Adding dbt on top of Athena is a natural next step.
- Rate-limit handling — PRAW respects Reddit's OAuth2 rate limits (60 req/min); the 30-second poll interval stays well within quota.
- Idempotency — the producer tracks seen post IDs in-memory within a session; Spark checkpointing handles exactly-once on the consumer side.
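The in-memory dedup mentioned above can be as simple as a set of seen post IDs. A bounded variant (an assumption; the repo's producer may use a plain set) keeps memory flat during long sessions by evicting the oldest IDs:

```python
from collections import OrderedDict

class SeenPosts:
    """Bounded in-memory dedup of Reddit post IDs (oldest-first eviction)."""

    def __init__(self, max_size: int = 100_000):
        self._seen: OrderedDict[str, None] = OrderedDict()
        self._max = max_size

    def add(self, post_id: str) -> bool:
        """Return True if the ID is new (produce it), False if already seen."""
        if post_id in self._seen:
            return False
        self._seen[post_id] = None
        if len(self._seen) > self._max:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True
```

Because Reddit post IDs are monotonic-ish per listing, oldest-first eviction rarely re-admits a duplicate in practice; and even when it does, the Spark side's checkpointed aggregation limits the damage to a slightly inflated count.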