This repository contains code for the American Express Default Prediction Kaggle competition. The goal is to predict credit card default risk using historical customer transaction data.
This project implements a machine learning pipeline for predicting credit card defaults using LightGBM with DART boosting. The solution includes comprehensive feature engineering, cross-validation, and model evaluation using the custom Amex metric.
Based on high-performing solutions from the competition, here's a comprehensive approach that achieved strong results (Public: ~0.80, Private: ~0.81):
- Apply denoising operation: `np.floor(x*100)/100` to round numerical features to two decimal places, reducing noise and dimensionality
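As a minimal sketch, the denoising step can be applied to every float column of a pandas frame (`P_2` and `B_1` are illustrative column names, not taken from the repo):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw transaction data.
df = pd.DataFrame({"P_2": [0.9371, 0.9402], "B_1": [0.0088, 0.0121]})

# Floor every float feature to two decimal places to strip noise.
num_cols = df.select_dtypes(include="float").columns
df[num_cols] = np.floor(df[num_cols] * 100) / 100
```

Flooring (rather than rounding) means values such as 0.9371 and 0.9402 both collapse to a smaller set of distinct levels, which shrinks the effective cardinality of each feature.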
One-Hot Category Features:
- From all data: Compute `mean`, `std`, `sum`, `last` aggregations across all customer transactions
- From last 3 rows: Compute `mean`, `std`, `sum` aggregations from the last 3 transaction rows per customer
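A sketch of the one-hot aggregation idea on a toy frame (`D_63` is an illustrative categorical column name, not confirmed from the repo):

```python
import pandas as pd

# Toy frame: one row per statement, one categorical column.
df = pd.DataFrame({"customer_ID": ["a", "a", "b"], "D_63": ["CR", "CO", "CR"]})

# One-hot encode the category, then aggregate the indicator columns per customer.
onehot = pd.get_dummies(df["D_63"], prefix="D_63").astype(int)
onehot["customer_ID"] = df["customer_ID"]
cat_agg = onehot.groupby("customer_ID").agg(["mean", "std", "sum", "last"])
```

The same `groupby` can be restricted to `df.groupby("customer_ID").tail(3)` first to get the last-3-rows variant.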
Numerical Features - All Data:
- Raw values: Original numerical features
- Diff: Differences between consecutive numerical values
- Global rank: Rank of numerical values across the entire dataset
- User-based rank: Rank of numerical values within each user's transaction history
- Month-based rank: Rank of numerical values within each month
- Aggregations: Apply `{mean, std, min, max, sum, last}` to all the above feature types
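The four feature types plus the per-customer aggregation can be sketched on a toy long-format frame (one row per monthly statement; `customer_ID`, `month`, and `P_2` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b"],
    "month": [1, 2, 3, 1, 2],
    "P_2": [0.3, 0.5, 0.4, 0.9, 0.8],
})

df["P_2_diff"] = df.groupby("customer_ID")["P_2"].diff()       # consecutive diffs
df["P_2_global_rank"] = df["P_2"].rank()                       # rank over all rows
df["P_2_user_rank"] = df.groupby("customer_ID")["P_2"].rank()  # rank within user
df["P_2_month_rank"] = df.groupby("month")["P_2"].rank()       # rank within month

# Per-customer aggregations; "last" keeps the most recent statement's value.
agg = df.groupby("customer_ID")["P_2"].agg(["mean", "std", "min", "max", "sum", "last"])
```

In the real pipeline the same aggregation list would be applied to the raw, diff, and rank columns alike.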
Numerical Features - Last 3 Rows:
- Raw values: Original numerical values from last 3 rows
- Diff: Differences between consecutive values from last 3 rows
- Aggregations: Apply `{mean, std, min, max, sum}`
Numerical Features - Last 6 Rows:
- Raw values: Original numerical values from last 6 rows
- Aggregations: Apply `{mean, std, min, max, sum}`
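Both last-N windows reduce to the same pattern: take the tail of each customer's history, then aggregate. A sketch, assuming rows are already sorted by customer and statement date so `tail(n)` picks the most recent rows (`P_2` again illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_ID": ["a"] * 5 + ["b"] * 4,
    "P_2": [0.1, 0.2, 0.3, 0.4, 0.5, 0.9, 0.8, 0.7, 0.6],
})

def last_n_agg(frame, n, stats=("mean", "std", "min", "max", "sum")):
    tail = frame.groupby("customer_ID").tail(n)   # last n rows per customer
    return tail.groupby("customer_ID")["P_2"].agg(list(stats)).add_suffix(f"_last{n}")

feats = last_n_agg(df, 3).join(last_n_agg(df, 6))  # last-3 and last-6 windows
```

Customers with fewer than N statements simply contribute all their rows, so no special-casing is needed.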
LGB Series OOF Features:
- Train a LightGBM model on the row-level series data with seed 42; every row of a customer shares that customer's target
- Generate out-of-fold (OOF) predictions, so they can be used as downstream features without data leakage
- Pad OOF predictions and final predictions to length 13 with NaN values (corresponding to maximum sequence length)
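The padding step can be sketched as follows. The right-alignment (so the most recent statement stays last) is an assumption about the original implementation, and `oof` is a hypothetical mapping from customer to per-statement OOF predictions:

```python
import numpy as np

MAX_LEN = 13  # maximum number of monthly statements per customer

def pad_oof(preds, max_len=MAX_LEN):
    # Pad a variable-length prediction sequence to a fixed length with NaN.
    padded = np.full(max_len, np.nan)
    padded[-len(preds):] = preds  # right-align: last statement stays last
    return padded

oof = {"a": [0.1, 0.2, 0.3], "b": [0.7]}
padded = {cid: pad_oof(p) for cid, p in oof.items()}
```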
Cross-Validation:
- Use stratified 5-fold cross-validation with seed 42 for reproducibility
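The split itself is standard scikit-learn; a sketch with dummy data standing in for the customer-level feature matrix `X` and binary default target `y`:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# Stratified 5-fold split with a fixed seed, as used throughout the pipeline.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(skf.split(X, y))
for fold, (train_idx, valid_idx) in enumerate(folds):
    X_tr, y_tr = X[train_idx], y[train_idx]   # train one model per fold here
    X_va, y_va = X[valid_idx], y[valid_idx]   # validation rows yield OOF preds
```

Stratification keeps the default rate roughly constant across folds, which matters because the target is imbalanced.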
Ensemble Components:
- Model 1 (Weight: 0.30)
  - Public Score: 0.80044, Private Score: 0.80874
  - Description: Heavy ensemble of early-stage models built on manual features, using many feature combinations
- Model 2 (Weight: 0.35)
  - Public Score: 0.80052, Private Score: 0.80859
  - Description: LightGBM with manual features and LGB series OOF features
- Model 3 (Weight: 0.10)
  - Public Score: 0.79008, Private Score: 0.79997
  - Description: GRU (Gated Recurrent Unit) with denoised series features
- Model 4 (Weight: 0.15)
  - Public Score: 0.79713, Private Score: 0.80454
  - Description: GRU with denoised series features + DNN (Deep Neural Network) with GreedyBins for manual features and LGB series OOF features
Final Ensemble:
- Weighted combination of the above models
- Note: The listed weights sum to 0.90, suggesting additional components or weight adjustments may be needed
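A sketch of the blend. Since the listed weights sum to 0.90, they are renormalized to 1.0 here; the original solution may instead have included an additional component (the prediction values below are dummies):

```python
import numpy as np

weights = np.array([0.30, 0.35, 0.10, 0.15])
weights = weights / weights.sum()   # renormalize 0.90 -> 1.0 (an assumption)

preds = np.array([
    [0.8, 0.2, 0.5],   # model 1 predictions for 3 customers (dummy values)
    [0.7, 0.3, 0.6],   # model 2
    [0.9, 0.1, 0.4],   # model 3
    [0.6, 0.2, 0.5],   # model 4
])
final = weights @ preds   # weighted average per customer
```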
This approach demonstrates the importance of:
- Comprehensive feature engineering across multiple time windows
- Combining tabular (LightGBM) and sequential (GRU) models
- Using OOF predictions as features to prevent data leakage
- Ensemble methods to combine diverse model architectures
- Time-series Feature Engineering: Aggregates customer transaction history into statistical features (mean, std, min, max, skew, kurtosis, last values)
- LightGBM Models: Uses DART (Dropouts meet Multiple Additive Regression Trees) boosting for robust predictions
- Stratified K-Fold Cross-Validation: 5-fold cross-validation to ensure robust model evaluation
- Custom Amex Metric: Implements the competition-specific evaluation metric combining Gini coefficient and top 4% capture rate
- Feature Selection: Removes highly correlated and low-importance features to improve model performance
- Data Augmentation: Includes noise augmentation variants for improved generalization
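The correlation-based part of the feature selection can be sketched as dropping one column from each near-duplicate pair (the 0.99 threshold is an assumption, not taken from the repo):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = df["a"] * 2 + 1e-6 * rng.normal(size=100)   # near-duplicate of "a"
df["c"] = rng.normal(size=100)                        # independent feature

# Keep only the upper triangle so each pair is inspected once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.99).any()]
df = df.drop(columns=to_drop)
```

Low-importance features would be pruned separately, using the per-fold importance CSVs the training scripts emit.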
- `FeatureEngineering/` - Feature engineering scripts for creating aggregated features from time-series data
  - `ProcessData.py` - Main feature engineering pipeline with statistical aggregations
- `Models/` - Trained model files (excluded from git)
- `OOF/` - Out-of-fold predictions for cross-validation evaluation (excluded from git)
- `Predictions/` - Test predictions for submission (excluded from git)
- `agModels-predictClass/` - AutoGluon model outputs (excluded from git)
- `baseline_train.py` - Main baseline training script
  - Trains LightGBM models with DART boosting
  - Performs 5-fold stratified cross-validation
  - Generates feature importance files
  - Saves out-of-fold predictions for evaluation
  - Uses features from processed parquet files (e.g., `train_fe_skew.parquet`)
- `baseline_train_3m_6m_noise.py` - Training with noise augmentation
  - Similar to the baseline but uses noise-augmented data
  - Helps improve model generalization
  - Uses `train_fe_noise.parquet` as input
- `process_data_v2.py` - Main data processing pipeline
  - Processes raw time-series transaction data
  - Creates aggregated features (mean, std, min, max, skew, kurtosis, last)
  - Handles 3-month and 6-month time windows for temporal features
  - Calculates differences between consecutive transactions
  - Removes highly correlated and low-importance features
  - Outputs processed parquet files ready for training
- `process_test_data.py` - Test data processing
  - Applies the same feature engineering pipeline to the test data
  - Ensures consistency with training data preprocessing
- `processs_data_sample.py` - Sample data processing for testing
  - Processes sample data for quick testing and development
- `OneFoldPred.py` - Single-fold prediction script
  - Generates predictions using a single trained model
  - Useful for quick inference and testing
  - Can be extended for ensemble predictions
- `autoGluon_quickStart.py` - AutoGluon quick-start script
  - Automated machine learning using AutoGluon
  - Useful for baseline comparisons and quick experiments
- `autoGluon_test.py` - AutoGluon testing script
  - Tests and evaluates AutoGluon models
- `compress_data.py` - Data compression utilities
  - Compresses data files to save storage space
  - Converts data types for memory efficiency
The feature engineering pipeline creates the following types of features:
- Statistical Aggregations: For each numerical feature, computes:
  - Mean, standard deviation, minimum, maximum
  - Skewness and kurtosis (higher-order statistics)
  - Last value (most recent transaction)
- Categorical Aggregations: For categorical features, computes:
  - Count, last value, number of unique values
- Temporal Features:
  - 3-month and 6-month window aggregations
  - Differences between consecutive transactions
- Feature Transformations:
  - Label encoding for categorical features
  - Rounding to 2 decimal places for float features
  - Difference between last and mean values
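Two of the transformations, sketched on a toy frame (`D_63` and `P_2` are illustrative column names):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_ID": ["a", "a", "b", "b"],
    "D_63": ["CR", "CO", "CR", "XZ"],       # categorical feature
    "P_2": [0.937, 0.941, 0.412, 0.398],    # numerical feature
})

# Label-encode the categorical column (codes assigned in sorted category order).
df["D_63"] = df["D_63"].astype("category").cat.codes

# Per-customer last and mean values, plus their difference: a compact signal
# for whether a customer's recent behavior deviates from their average.
agg = df.groupby("customer_ID")["P_2"].agg(["last", "mean"])
agg["last_mean_diff"] = agg["last"] - agg["mean"]
```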
The baseline model uses the following LightGBM parameters:
- Boosting Type: DART (Dropouts meet Multiple Additive Regression Trees)
- Learning Rate: 0.01
- Number of Leaves: 100
- Feature Fraction: 0.20
- Bagging Fraction: 0.50
- Bagging Frequency: 10
- Lambda L2: 2
- Min Data in Leaf: 40
- Early Stopping: 1500 rounds
- Max Boosting Rounds: 10500
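The hyperparameters above, expressed as a LightGBM params dict. This is a sketch: the `objective` key and the exact key names used in `baseline_train.py` are assumptions, though the values mirror the list above.

```python
params = {
    "objective": "binary",       # assumed: default prediction is a binary task
    "boosting": "dart",
    "learning_rate": 0.01,
    "num_leaves": 100,
    "feature_fraction": 0.20,
    "bagging_fraction": 0.50,
    "bagging_freq": 10,
    "lambda_l2": 2,
    "min_data_in_leaf": 40,
    "seed": 42,
}
# Training would call lgb.train(params, dtrain, num_boost_round=10500),
# stopping early after 1500 rounds without validation improvement.
```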
The competition uses a custom Amex metric that combines:
- Gini Coefficient: Measures the model's ability to rank customers by default risk
- Top 4% Capture Rate: Percentage of actual defaults captured in the top 4% of predictions (weighted)
The final metric is: 0.5 * (Gini_normalized + top_four_percent_capture)
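A compact NumPy sketch of the metric, following the public competition definition (non-defaults are counted at weight 20 in both components):

```python
import numpy as np

def amex_metric(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    def weighted_gini(y, order):
        y = y[order]
        weight = np.where(y == 0, 20.0, 1.0)
        random = np.cumsum(weight) / weight.sum()   # baseline diagonal
        lorentz = np.cumsum(y) / y.sum()            # defaults carry weight 1
        return ((lorentz - random) * weight).sum()

    pred_order = np.argsort(-y_pred)                # rank by predicted risk
    true_order = np.argsort(-y_true)                # the "perfect" ranking

    # Top 4% capture rate: defaults found in the highest-risk 4% by weight.
    y = y_true[pred_order]
    weight = np.where(y == 0, 20.0, 1.0)
    top4 = np.cumsum(weight) <= 0.04 * weight.sum()
    d = y[top4].sum() / y_true.sum()

    g = weighted_gini(y_true, pred_order) / weighted_gini(y_true, true_order)
    return 0.5 * (g + d)
```

A perfect ranking gives a normalized Gini of 1.0, so the score's upper bound is set by how many defaults physically fit in the top 4% bucket.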
Data files (CSV/Parquet) are excluded from git due to size. Please add your data files locally in the appropriate directories:
- Raw data: Place in `Sample_Data/` or `data/amex-default-prediction/`
- Processed data: Generated by the processing scripts (e.g., `train_fe_skew.parquet`, `test_fe.parquet`)
Trained models are excluded from git and can be regenerated by running the training scripts. Files are saved as:
- Model files: `lgbm_{boosting_type}_fold{fold}_seed{seed}.pkl`
- Feature importance files: `lgbm_{boosting_type}_fold{fold}_seed{seed}_importance.csv`
Install dependencies with `pip install -r requirements.txt`. Key dependencies include:
- pandas
- numpy
- lightgbm
- scikit-learn
- scipy
- joblib
- tqdm
- pyarrow (for parquet files)
- dateutil
First, process the raw training data with `python process_data_v2.py`. This generates feature-engineered parquet files in the `Sample_Data/` directory.
Train the baseline model with `python baseline_train.py`. This will:
- Load the processed training data
- Perform 5-fold cross-validation
- Save models in the `Models/` directory
- Generate out-of-fold predictions in the `OOF/` directory
- Output feature importance CSV files
To generate predictions, first process the test data with `python process_test_data.py`, then run `python OneFoldPred.py`. Check the out-of-fold prediction CSV files in the `OOF/` directory to evaluate model performance using the Amex metric.
- The code uses fixed random seeds (seed=42) for reproducibility
- Memory optimization is implemented through data type conversions (float16, int32)
- Feature importance files are generated for each fold to analyze which features contribute most to predictions
- The pipeline handles categorical features through label encoding
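The memory optimization note above amounts to downcasting numeric columns before writing parquet; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "f0": np.random.rand(10_000),             # float64 by default
    "cnt": np.arange(10_000, dtype=np.int64),
})
before = df.memory_usage(deep=True).sum()

df["f0"] = df["f0"].astype(np.float16)   # 8 bytes/value -> 2 bytes/value
df["cnt"] = df["cnt"].astype(np.int32)   # 8 bytes/value -> 4 bytes/value
after = df.memory_usage(deep=True).sum()
```

float16 keeps only ~3 significant decimal digits, which is acceptable here because the features are floored to two decimals during denoising anyway.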