gjunjie/amex_credit_card_prediction

American Express Default Prediction - Kaggle Competition

This repository contains code for the American Express Default Prediction Kaggle competition. The goal is to predict credit card default risk using historical customer transaction data.

Overview

This project implements a machine learning pipeline for predicting credit card defaults using LightGBM with DART boosting. The solution includes comprehensive feature engineering, cross-validation, and model evaluation using the custom Amex metric.

Competition Solution Summary

The following approach, based on high-performing solutions from the competition, scored about 0.80 on the public leaderboard and about 0.81 on the private leaderboard:

Data Processing Pipeline

1. Denoising

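  • Apply the denoising operation np.floor(x*100)/100, which floors (truncates toward negative infinity) numerical features at two decimal places. This removes low-order noise and reduces the number of distinct values per feature; the number of features is unchanged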
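As a minimal sketch of the denoising step above, the snippet below applies the floor operation to a small array. The helper name `denoise` and the sample values are illustrative assumptions, not taken from the repository.

```python
import numpy as np

def denoise(values):
    """Floor values at two decimal places (hypothetical helper name)."""
    arr = np.asarray(values, dtype=np.float64)
    return np.floor(arr * 100) / 100

# Made-up sample values; NaN is passed through unchanged.
sample = np.array([0.128, 0.5, -0.011, np.nan])
denoised = denoise(sample)
print(denoised)
```

Note that flooring differs from rounding for negative inputs: -0.011 becomes -0.02, not -0.01.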

2. Feature Engineering

One-Hot Category Features:

  • From all data: Compute mean, std, sum, last aggregations across all customer transactions
  • From last 3 rows: Compute mean, std, sum aggregations from the last 3 transaction rows per customer

Numerical Features - All Data:

  • Raw values: Original numerical features
  • Diff: Differences between consecutive numerical values
  • Global rank: Rank of numerical values across the entire dataset
  • User-based rank: Rank of numerical values within each user's transaction history
  • Month-based rank: Rank of numerical values within each month
  • Aggregations: Apply {mean, std, min, max, sum, last} to all the above feature types
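The numerical feature families listed above could be built with pandas roughly as follows. This is a sketch under assumptions: the column names `customer_ID`, `S_2` (statement date) and `P_2` are used for illustration with made-up values, and percentile ranks (`pct=True`) are one possible choice of rank that the text does not specify.

```python
import pandas as pd

# Toy data: three statements for customer "a", two for customer "b".
df = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b"],
    "S_2": pd.to_datetime(["2018-01-05", "2018-02-05", "2018-03-05",
                           "2018-02-10", "2018-03-10"]),
    "P_2": [0.50, 0.55, 0.40, 0.90, 0.95],
})
df = df.sort_values(["customer_ID", "S_2"])
by_customer = df.groupby("customer_ID")["P_2"]

# Diff between consecutive statements of the same customer.
df["P_2_diff"] = by_customer.diff()
# Rank across the entire dataset, within each customer, and within each month.
df["P_2_global_rank"] = df["P_2"].rank(pct=True)
df["P_2_user_rank"] = by_customer.rank(pct=True)
df["P_2_month_rank"] = df.groupby(df["S_2"].dt.to_period("M"))["P_2"].rank(pct=True)

# Aggregate every derived column down to one row per customer.
derived = ["P_2", "P_2_diff", "P_2_global_rank", "P_2_user_rank", "P_2_month_rank"]
features = df.groupby("customer_ID")[derived].agg(
    ["mean", "std", "min", "max", "sum", "last"]
)
features.columns = ["_".join(col) for col in features.columns]
```

Each of the five derived columns yields six aggregates, so this toy example produces 30 features per customer.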

Numerical Features - Last 3 Rows:

  • Raw values: Original numerical values from last 3 rows
  • Diff: Differences between consecutive values from last 3 rows
  • Aggregations: Apply {mean, std, min, max, sum}

Numerical Features - Last 6 Rows:

  • Raw values: Original numerical values from last 6 rows
  • Aggregations: Apply {mean, std, min, max, sum}
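The last-3 and last-6 window aggregations can be sketched with `groupby(...).tail(n)`. The helper `last_n_aggregates`, the column `B_1`, and the output naming scheme are assumptions made for this example; rows are assumed to be sorted by date within each customer already.

```python
import pandas as pd

def last_n_aggregates(df, n, value_cols, key="customer_ID"):
    """Aggregate the most recent n rows per customer (hypothetical helper)."""
    recent = df.groupby(key).tail(n)  # keeps at most n trailing rows per customer
    out = recent.groupby(key)[value_cols].agg(["mean", "std", "min", "max", "sum"])
    out.columns = [f"{col}_{stat}_last{n}" for col, stat in out.columns]
    return out

# Toy data: customer "a" has five rows, customer "b" only two.
df = pd.DataFrame({
    "customer_ID": ["a"] * 5 + ["b"] * 2,
    "B_1": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0],
})
last3 = last_n_aggregates(df, 3, ["B_1"])
last6 = last_n_aggregates(df, 6, ["B_1"])
```

Customers with fewer than n rows simply contribute all the rows they have, which is why customer "b" still gets a value in both windows.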

LGB Series OOF Features:

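  • Train a LightGBM model on the per-row series data with seed 42, the same seed used for the downstream cross-validation, so that no target information leaks into later models
  • Label every row of a customer with that customer's target, then generate out-of-fold (OOF) predictions for each row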
  • Pad OOF predictions and final predictions to length 13 with NaN values (corresponding to maximum sequence length)
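The padding step above might look like the following sketch. The function name `pad_series_predictions`, the column name `oof_pred`, and the choice to left-pad (so the most recent prediction always lands in the last slot) are assumptions; the text only states that sequences are padded to length 13 with NaN.

```python
import numpy as np
import pandas as pd

MAX_LEN = 13  # maximum sequence length per customer, as stated above

def pad_series_predictions(df, pred_col="oof_pred", key="customer_ID", max_len=MAX_LEN):
    """Return one row per customer with row-level predictions NaN-padded to max_len."""
    rows = {}
    for customer, preds in df.groupby(key, sort=False)[pred_col]:
        padded = np.full(max_len, np.nan)
        values = preds.to_numpy()[-max_len:]
        # Assumption: left-pad, keeping the latest prediction in the final column.
        padded[max_len - len(values):] = values
        rows[customer] = padded
    cols = [f"{pred_col}_{i}" for i in range(max_len)]
    return pd.DataFrame.from_dict(rows, orient="index", columns=cols)

# Toy row-level predictions: two statements for "a", one for "b".
df = pd.DataFrame({"customer_ID": ["a", "a", "b"], "oof_pred": [0.1, 0.2, 0.9]})
padded = pad_series_predictions(df)
```

The result has a fixed width of 13 columns, so it can be joined to the per-customer manual features regardless of how many statements each customer has.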

Model Ensemble Strategy

Cross-Validation:

  • Use stratified 5-fold cross-validation with seed 42 for reproducibility
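A stratified 5-fold split with a fixed seed can be set up with scikit-learn as sketched below. The data here is synthetic, and `shuffle=True` is an assumption (scikit-learn only applies `random_state` when shuffling is enabled).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
y = (rng.random(200) < 0.25).astype(int)  # synthetic default labels
X = rng.normal(size=(200, 3))             # synthetic features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_rates = []
for train_idx, valid_idx in skf.split(X, y):
    # Stratification keeps the default rate nearly equal in every validation fold.
    fold_rates.append(y[valid_idx].mean())
```

Reusing the same splitter configuration across models keeps their OOF predictions aligned, which matters when those predictions are later blended or fed in as features.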

Ensemble Components:

  1. Model 1 (Weight: 0.30)

    • Public Score: 0.80044, Private Score: 0.80874
    • Description: Heavy ensemble of manual features from early stage, using many combinations of features
  2. Model 2 (Weight: 0.35)

    • Public Score: 0.80052, Private Score: 0.80859
    • Description: LightGBM with manual features and LGB series OOF features
  3. Model 3 (Weight: 0.10)

    • Public Score: 0.79008, Private Score: 0.79997
    • Description: GRU (Gated Recurrent Unit) with denoised series features
  4. Model 4 (Weight: 0.15)

    • Public Score: 0.79713, Private Score: 0.80454
    • Description: GRU with denoised series features + DNN (Deep Neural Network) with GreedyBins for manual features and LGB series OOF features

Final Ensemble:

  • Weighted combination of the above models
  • Note: The listed weights sum to 0.90 rather than 1.00, so either a component is missing from this list or the weights need to be renormalized before blending
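One way to blend the four models with the weights listed above is sketched here. Renormalizing the weights so they sum to 1 is an assumption about how to handle the 0.90 total, not something the original solution is known to have done, and the prediction values are made up.

```python
import numpy as np

weights = np.array([0.30, 0.35, 0.10, 0.15])  # weights listed above (sum to 0.90)
normalized = weights / weights.sum()           # rescaled to sum to 1.0

def blend(predictions, w):
    """Weighted average over models; predictions has shape (n_models, n_rows)."""
    return np.average(np.asarray(predictions), axis=0, weights=w)

# Made-up predictions from four models for two customers.
preds = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.6, 0.4],
])
blended = blend(preds, normalized)
```

Because both parts of the Amex metric depend only on how customers are ranked, rescaling all weights by a common factor changes the blended values but not the resulting score.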

This approach demonstrates the importance of:

  • Comprehensive feature engineering across multiple time windows
  • Combining tabular (LightGBM) and sequential (GRU) models
  • Using OOF predictions as features to prevent data leakage
  • Ensemble methods to combine diverse model architectures

Key Features

  • Time-series Feature Engineering: Aggregates customer transaction history into statistical features (mean, std, min, max, skew, kurtosis, last values)
  • LightGBM Models: Uses DART (Dropouts meet Multiple Additive Regression Trees) boosting for robust predictions
  • Stratified K-Fold Cross-Validation: 5-fold cross-validation to ensure robust model evaluation
  • Custom Amex Metric: Implements the competition-specific evaluation metric combining Gini coefficient and top 4% capture rate
  • Feature Selection: Removes highly correlated and low-importance features to improve model performance
  • Data Augmentation: Includes noise augmentation variants for improved generalization

Project Structure

  • FeatureEngineering/ - Feature engineering scripts for creating aggregated features from time-series data
    • ProcessData.py - Main feature engineering pipeline with statistical aggregations
  • Models/ - Trained model files (excluded from git)
  • OOF/ - Out-of-fold predictions for cross-validation evaluation (excluded from git)
  • Predictions/ - Test predictions for submission (excluded from git)
  • agModels-predictClass/ - AutoGluon model outputs (excluded from git)

Scripts

Training Scripts

  • baseline_train.py - Main baseline training script

    • Trains LightGBM models with DART boosting
    • Performs 5-fold stratified cross-validation
    • Generates feature importance files
    • Saves out-of-fold predictions for evaluation
    • Uses features from processed parquet files (e.g., train_fe_skew.parquet)
  • baseline_train_3m_6m_noise.py - Training with noise augmentation

    • Similar to baseline but uses noise-augmented data
    • Helps improve model generalization
    • Uses train_fe_noise.parquet as input

Data Processing Scripts

  • process_data_v2.py - Main data processing pipeline

    • Processes raw time-series transaction data
    • Creates aggregated features (mean, std, min, max, skew, kurtosis, last)
    • Handles 3-month and 6-month time windows for temporal features
    • Calculates differences between consecutive transactions
    • Removes highly correlated and low-importance features
    • Outputs processed parquet files ready for training
  • process_test_data.py - Test data processing

    • Applies the same feature engineering pipeline to test data
    • Ensures consistency with training data preprocessing
  • processs_data_sample.py - Sample data processing for testing

    • Processes sample data for quick testing and development

Prediction Scripts

  • OneFoldPred.py - Single fold prediction script
    • Generates predictions using a single trained model
    • Useful for quick inference and testing
    • Can be extended for ensemble predictions

AutoGluon Scripts

  • autoGluon_quickStart.py - AutoGluon quick start script

    • Automated machine learning using AutoGluon
    • Useful for baseline comparisons and quick experiments
  • autoGluon_test.py - AutoGluon testing script

    • Tests and evaluates AutoGluon models

Utility Scripts

  • compress_data.py - Data compression utilities
    • Compresses data files to save storage space
    • Converts data types for memory efficiency

Feature Engineering

The feature engineering pipeline creates the following types of features:

  1. Statistical Aggregations: For each numerical feature, computes:

    • Mean, standard deviation, minimum, maximum
    • Skewness and kurtosis (higher-order statistics)
    • Last value (most recent transaction)
  2. Categorical Aggregations: For categorical features, computes:

    • Count, last value, number of unique values
  3. Temporal Features:

    • 3-month and 6-month window aggregations
    • Differences between consecutive transactions
  4. Feature Transformations:

    • Label encoding for categorical features
    • Rounding to 2 decimal places for float features
    • Difference between last and mean values
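The categorical aggregations and feature transformations above can be sketched in pandas as follows. The column names `D_63` and `P_2` and all values are illustrative, and pandas category codes are used here as a stand-in for label encoding; the repository's actual encoder and the order of operations are not specified in the text.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b"],
    "D_63": ["CO", "CO", "CR", "XZ", "XZ"],   # categorical column (toy values)
    "P_2": [0.4567, 0.5, 0.6, 0.9, 0.7],      # numerical column (toy values)
})

# Label encoding: map each category to an integer code.
df["D_63"] = df["D_63"].astype("category").cat.codes

grouped = df.groupby("customer_ID")
cat_feats = grouped["D_63"].agg(["count", "last", "nunique"]).add_prefix("D_63_")
num_feats = grouped["P_2"].agg(["mean", "last"]).add_prefix("P_2_")

# Difference between the most recent value and the customer's mean.
num_feats["P_2_last_minus_mean"] = num_feats["P_2_last"] - num_feats["P_2_mean"]

# Round float features to two decimal places; integer columns are unaffected.
features = cat_feats.join(num_feats).round(2)
```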

Model Configuration

The baseline model uses the following LightGBM parameters:

  • Boosting Type: DART (Dropouts meet Multiple Additive Regression Trees)
  • Learning Rate: 0.01
  • Number of Leaves: 100
  • Feature Fraction: 0.20
  • Bagging Fraction: 0.50
  • Bagging Frequency: 10
  • Lambda L2: 2
  • Min Data in Leaf: 40
  • Early Stopping: 1500 rounds
  • Max Boosting Rounds: 10500
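Expressed as a LightGBM parameter dictionary, the settings above might look like this sketch. The `objective` and `seed` entries are assumptions (the seed follows the reproducibility note elsewhere in this README), and the two constant names are hypothetical.

```python
params = {
    "objective": "binary",     # assumption: binary default / no-default target
    "boosting": "dart",
    "learning_rate": 0.01,
    "num_leaves": 100,
    "feature_fraction": 0.20,
    "bagging_fraction": 0.50,
    "bagging_freq": 10,
    "lambda_l2": 2,
    "min_data_in_leaf": 40,
    "seed": 42,                # assumption: matches the seed used elsewhere
}

NUM_BOOST_ROUND = 10500        # max boosting rounds listed above
EARLY_STOPPING_ROUNDS = 1500   # early stopping patience listed above
```

A dictionary like this would typically be passed to `lightgbm.train` together with `num_boost_round=NUM_BOOST_ROUND`. It is worth verifying how early stopping behaves with DART boosting in the LightGBM version you have installed.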

Evaluation Metric

The competition uses a custom Amex metric that combines:

  • Gini Coefficient: Measures the model's ability to rank customers by default risk
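  • Top 4% Capture Rate: Share of actual defaults captured among the 4% of customers with the highest predicted risk, computed with sample weights (in the competition's definition, non-default rows carry a weight of 20 to offset downsampling)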

The final metric is: 0.5 * (Gini_normalized + top_four_percent_capture)
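The formula above can be sketched in NumPy as follows. The weight of 20 on non-default rows follows the competition's published metric definition as I understand it and should be treated as an assumption here; the function name `amex_metric` is likewise illustrative rather than the repository's own.

```python
import numpy as np

def amex_metric(y_true, y_pred):
    """0.5 * (normalized weighted Gini + weighted top-4% capture rate)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    def sort_by_prediction(y_t, y_p):
        order = np.argsort(-y_p, kind="stable")        # highest risk first
        y_sorted = y_t[order]
        weight = np.where(y_sorted == 0, 20.0, 1.0)    # assumption: negatives weigh 20
        return y_sorted, weight

    def top_four_percent_captured(y_t, y_p):
        y_sorted, weight = sort_by_prediction(y_t, y_p)
        in_top = np.cumsum(weight) <= 0.04 * weight.sum()
        return y_sorted[in_top].sum() / y_sorted.sum()

    def weighted_gini(y_t, y_p):
        y_sorted, weight = sort_by_prediction(y_t, y_p)
        random = np.cumsum(weight / weight.sum())
        lorentz = np.cumsum(y_sorted * weight) / (y_sorted * weight).sum()
        return ((lorentz - random) * weight).sum()

    gini_normalized = weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true)
    return 0.5 * (gini_normalized + top_four_percent_captured(y_true, y_pred))

# Toy check: 10 defaults among 100 customers, scored by a perfect ranking.
y = np.array([1] * 10 + [0] * 90)
perfect_score = amex_metric(y, y)
reversed_score = amex_metric(y, 1 - y)
```

A perfect ranking scores 1.0, while a fully reversed ranking produces a negative score because the normalized Gini term goes negative.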

Data

Data files (CSV/Parquet) are excluded from git due to size. Please add your data files locally in the appropriate directories:

  • Raw data: Place in Sample_Data/ or data/amex-default-prediction/
  • Processed data: Generated by processing scripts (e.g., train_fe_skew.parquet, test_fe.parquet)

Models

Trained models are excluded from git. Models can be regenerated by running the training scripts. Model files are saved with the format:

  • lgbm_{boosting_type}_fold{fold}_seed{seed}.pkl
  • Feature importance files: lgbm_{boosting_type}_fold{fold}_seed{seed}_importance.csv
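The naming scheme above can be captured in a small helper. `artifact_names` is a hypothetical function, and zero-based fold numbering is an assumption.

```python
def artifact_names(boosting_type, fold, seed):
    """Build the model and feature-importance file names for one fold."""
    stem = f"lgbm_{boosting_type}_fold{fold}_seed{seed}"
    return f"{stem}.pkl", f"{stem}_importance.csv"

model_file, importance_file = artifact_names("dart", 0, 42)
print(model_file)
print(importance_file)
```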

Requirements

Install dependencies using:

pip install -r requirements.txt

Key dependencies include:

  • pandas
  • numpy
  • lightgbm
  • scikit-learn
  • scipy
  • joblib
  • tqdm
  • pyarrow (for parquet files)
  • python-dateutil (imported as dateutil)

Usage

1. Data Processing

First, process the raw training data:

python process_data_v2.py

This will generate feature-engineered parquet files in the Sample_Data/ directory.

2. Training

Train the baseline model:

python baseline_train.py

This will:

  • Load the processed training data
  • Perform 5-fold cross-validation
  • Save models in the Models/ directory
  • Generate out-of-fold predictions in the OOF/ directory
  • Output feature importance CSV files

3. Prediction

Process the test data first so that it goes through the same feature engineering as the training data:

python process_test_data.py

Then generate predictions on the processed test data:

python OneFoldPred.py

4. Evaluation

Check the out-of-fold predictions CSV files in the OOF/ directory to evaluate model performance using the Amex metric.

Notes

  • The code uses fixed random seeds (seed=42) for reproducibility
  • Memory optimization is implemented through data type conversions (float16, int32)
  • Feature importance files are generated for each fold to analyze which features contribute most to predictions
  • The pipeline handles categorical features through label encoding
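The data type conversion mentioned in the notes above could be done along these lines. The helper name `downcast` and the blanket rule (every float column to float16, every integer column to int32) are assumptions for illustration; the repository may convert columns more selectively.

```python
import numpy as np
import pandas as pd

def downcast(df):
    """Return a copy with floats stored as float16 and integers as int32."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_float_dtype(out[col]):
            out[col] = out[col].astype(np.float16)
        elif pd.api.types.is_integer_dtype(out[col]):
            out[col] = out[col].astype(np.int32)
    return out

# Toy frame with 64-bit columns.
df = pd.DataFrame({
    "x": [0.12, 0.5, 1.25],
    "n": np.array([1, 2, 3], dtype=np.int64),
})
small = downcast(df)
```

float16 keeps only about three significant decimal digits, so this trade-off suits features that have already been reduced to two decimal places better than raw high-precision values.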
