A text-based hate speech classification project using XGBoost, NLP, and sklearn.
This project is part of a Project-Based Learning (PBL) showcase at Symbiosis Institute of Technology, Nagpur. It focuses on detecting hate speech in textual data using machine learning and natural language processing techniques.
The model classifies text into three categories:
- Hate Speech
- Offensive Language
- Neither
This classification is achieved using TF-IDF vectorization and the XGBoost algorithm. The dataset consists of labeled Twitter data, with each tweet annotated by the type of speech it contains.
- Preprocessing of raw text data including cleaning and tokenization.
- Feature extraction using TF-IDF.
- Training and evaluation using the XGBoost classifier.
- Performance metrics including accuracy, precision, recall, and F1-score.
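The cleaning and tokenization step could look something like the sketch below. The exact rules (lowercasing, dropping URLs, @mentions, and non-letter characters) are assumptions for illustration; the project's own preprocessing may differ. The function name `clean_tweet` is hypothetical.

```python
import re
from typing import List

# Patterns for the pieces of a tweet this sketch discards (assumed rules).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text: str) -> List[str]:
    """Lowercase a tweet, strip URLs, mentions, and non-letters, then split on whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return text.split()


tokens = clean_tweet("RT @user: Check THIS out!! https://t.co/abc #wow")
print(tokens)  # ['rt', 'check', 'this', 'out', 'wow']
```

The resulting tokens can be joined back into a string before TF-IDF vectorization, or passed to a vectorizer configured with a custom tokenizer.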
- Source: Publicly available hate speech Twitter dataset.
- Format: CSV file with columns such as tweet, class, and label.
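A minimal sketch of loading such a CSV with pandas and checking the class balance is shown below. The in-memory sample rows are placeholders, and the integer-to-name mapping in `CLASS_NAMES` is an assumption; check the dataset's own documentation for the actual encoding of the class column.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV file (placeholder rows).
sample_csv = io.StringIO(
    "tweet,class\n"
    "placeholder tweet one,2\n"
    "placeholder tweet two,1\n"
    "placeholder tweet three,2\n"
    "placeholder tweet four,0\n"
)
df = pd.read_csv(sample_csv)

# Assumed mapping from integer class codes to category names.
CLASS_NAMES = {0: "Hate Speech", 1: "Offensive Language", 2: "Neither"}
df["label_name"] = df["class"].map(CLASS_NAMES)

# Class counts reveal any imbalance before training.
counts = df["label_name"].value_counts()
print(counts)
```

For the real dataset, replace `sample_csv` with the path to the downloaded file.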
- Python 3.8+
- Scikit-learn
- XGBoost
- Pandas
- NumPy
- Jupyter Notebook
Install dependencies with:
pip install -r requirements.txt
The dataset used in this project is publicly available and can be downloaded from: Kaggle - Hate Speech and Offensive Language Dataset