An ML-powered system to evaluate the quality and relevancy of Google location reviews. Project Veritas uses modern Natural Language Processing (NLP) techniques to detect spam, advertisements, and policy violations, improving the reliability of review platforms.
Identify low-quality or policy-violating reviews while preserving useful user feedback to enhance the trustworthiness of online review platforms.
Our solution follows an iterative, day-by-day workflow across:
- 📊 Data collection and integration from multiple sources
- 📈 Comprehensive exploratory data analysis (EDA)
- 🧩 Feature engineering to extract relevant signals
- 🤖 Model development using state-of-the-art NLP techniques
- 📝 Rigorous model evaluation and optimization
- 🔄 Multi-source data integration (UCSD and Delaware datasets)
- 📚 Linguistic feature extraction and sentiment analysis
- 🚫 Detection of spam patterns and policy violations
- ✅ Preservation of genuine user feedback
-
Day1: Setup, Data Collection, and Exploratory Data Analysis (EDA)
- Notebook:
TechJam_Day1_Data_Collection_and_Integration.ipynb - Inputs:
Day1 .../assets/downloaded_files/(meta-Delaware.json.gz, review-Delaware_10.json.gz, reviews.csv) - Outputs:
Day1 .../assets/output/ucsd_delaware_reviews_combined.csv - Visuals:
Day1 .../assets/png/*.png
- Notebook:
-
Day2: Feature Engineering and Initial Model Development
- Notebook:
TechJam_Day2_Feature_Engineering_and_Initial_Model_Development.ipynb - Input:
Day2 .../assets/input/ucsd_delaware_reviews_combined.csv(from Day 1) - Output:
Day2 .../assets/output/final_augmented_reviews_for_day3.csv
- Notebook:
-
Day3: Model Refinement, Evaluation, and Iteration
- Notebook:
TechJam_Day3_Model_Refinement_Evaluation_and_Iteration.ipynb - Input:
Day3 .../assets/input/final_augmented_reviews_for_day3.csv(from Day 2) - Visuals:
Day3 .../assets/png/confusion_matrix_final_optimized_model.png
- Notebook:
- Python 3.12+
- 4GB+ free disk space for datasets and intermediate artifacts
- Compatible with macOS, Linux, and Windows (with minor path adjustments)
# From the project root
source venv/bin/activate # On Windows: venv\Scripts\activate
python --version # Verify Python activationIf activation succeeds and Python is available, proceed to the notebooks.
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Update pip and install dependencies
python -m pip install --upgrade pip
pip install jupyter pandas numpy scikit-learn matplotlib seaborn nltk tqdm transformers datasets torchjupyter lab
# or
jupyter notebookFollow these steps to reproduce our results by running the notebooks sequentially:
- Open:
Day1 Setup, Data Collection, and Exploratory Data Analysis (EDA)/TechJam_Day1_Data_Collection_and_Integration.ipynb - Run all cells to:
- Load and preprocess the UCSD and Delaware datasets
- Perform data cleaning and integration
- Conduct exploratory data analysis with visualizations
- Generate the combined dataset:
Day1 .../assets/output/ucsd_delaware_reviews_combined.csv - Create EDA plots:
Day1 .../assets/png/*.png
- Open:
Day2 Feature Engineering and Initial Model Development/TechJam_Day2_Feature_Engineering_and_Initial_Model_Development.ipynb - Ensure the Day 1 combined CSV is available at:
Day2 .../assets/input/ucsd_delaware_reviews_combined.csv - Run all cells to:
- Extract linguistic features from review text
- Perform feature selection and transformation
- Develop initial classification models
- Generate the augmented dataset:
Day2 .../assets/output/final_augmented_reviews_for_day3.csv
- Open:
Day3 Model Refinement, Evaluation, and Iteration/TechJam_Day3_Model_Refinement_Evaluation_and_Iteration.ipynb - Ensure the Day 2 augmented CSV is available at:
Day3 .../assets/input/final_augmented_reviews_for_day3.csv - Run all cells to:
- Fine-tune model hyperparameters
- Evaluate model performance with various metrics
- Generate evaluation visualizations:
Day3 .../assets/png/confusion_matrix_final_optimized_model.png
- If any cell installs packages, restart the kernel after installation
- Set random seeds when prompted for reproducibility (default seeds are provided)
- For large datasets, ensure sufficient memory is available
- All paths in notebooks are relative to their respective directories
- Nguyen Tran Thanh Lam
- Pham Hong Phuc
- Pham Tran Minh Tuan
- Dinh Quang Anh
- Nguyen Le Tam
This project is licensed under the MIT License. See the LICENSE file for details.