
AI/ML · Santa Clara University — ML/AI Course

Lending Club Loan Default Prediction

Ensemble ML pipeline predicting loan defaults across 887K loans with 0.80 AUC, using only pre-decision features to prevent data leakage

ML Engineer (Team Project) · 2025 · GitHub
  • Test AUC: 0.80
  • Loans analyzed: 887K
  • Raw features: 74
  • Models compared: 5

The Challenge

Before approving a loan, Lending Club needs to predict whether the borrower will default. The dataset contains 887K loans with 74 features, but most features capture post-loan behavior (payment history, recoveries, balances). Using them would leak the outcome into the inputs and produce misleadingly high accuracy that fails in production.

My Role & Approach

Made the critical decision to exclude all post-loan features and use only information available at loan origination. This reduced the feature set dramatically but ensured the model would actually work in a real lending decision pipeline. Engineered domain-specific features and compared 5 model architectures using GridSearchCV with 10-fold cross-validation.
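As a rough illustration of the exclusion step, the sketch below drops columns that only exist after origination. The column names in POST_LOAN_COLUMNS and the helper keep_pre_decision_features are assumptions made for this example; the project's actual exclusion list is not given here and should come from the data dictionary.

```python
import pandas as pd

# Hypothetical list of post-origination columns. These names are assumptions
# for illustration only; verify each one against the real data dictionary.
POST_LOAN_COLUMNS = [
    "total_pymnt",      # payments received so far
    "recoveries",       # amounts recovered after charge-off
    "out_prncp",        # outstanding principal balance
    "last_pymnt_amnt",  # most recent payment amount
]


def keep_pre_decision_features(df: pd.DataFrame, leaky=POST_LOAN_COLUMNS) -> pd.DataFrame:
    """Return a copy of df without columns that only exist after origination."""
    present = [col for col in leaky if col in df.columns]
    return df.drop(columns=present)


# Tiny made-up frame mixing pre-decision and post-loan columns.
loans = pd.DataFrame({
    "loan_amnt": [10000, 5000],
    "annual_inc": [60000, 42000],
    "total_pymnt": [11000.0, 1200.0],
    "recoveries": [0.0, 350.0],
})
clean = keep_pre_decision_features(loans)
print(list(clean.columns))
```

Keeping the exclusion list as explicit data, rather than scattering drop calls through the pipeline, makes the "available at decision time" rule easy to review.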

Product Thinking

Problem Framing

The naive framing was 'predict loan defaults with maximum accuracy.' The product framing is: 'predict defaults using only information available when a borrower applies.' This distinction — which features exist at decision time — is a product question disguised as a technical one. It's the difference between a research exercise and a deployable model.

Key Tradeoffs

Accuracy vs. production viability: using all 74 features gave roughly 0.99 AUC, which is impressive but useless because post-loan features won't exist for new applicants. We chose 0.80 AUC with legitimate features over 0.99 with leaked data. We also chose ensemble averaging over stacking for interpretability, since the lending team needs to explain decisions to regulators.
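To make the averaging choice concrete, here is a minimal sketch of probability averaging across three classifiers. It uses synthetic data and scikit-learn models as stand-ins; the actual ensemble members, weights, and hyperparameters used in the project are not specified here, so everything below is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (assumption).
X, y = make_classification(n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Stand-in ensemble members; the real project's top 3 models may differ.
models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
]

# One column of default probabilities per model.
member_probs = np.column_stack(
    [m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in models]
)

# Simple averaging: every model contributes with the same fixed weight.
ensemble_prob = member_probs.mean(axis=1)
print("ensemble AUC:", round(roc_auc_score(y_test, ensemble_prob), 3))
```

Because the combined score is a plain mean, each model's contribution to a given prediction can be read directly from its column in member_probs, which is harder to do with a learned stacking layer.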

What We Didn't Build

Didn't build a real-time scoring API, credit limit optimizer, or loan pricing engine. The assignment was prediction, and the product insight was that proving which features are leaky mattered more than maximizing AUC.

Solution & Execution

End-to-end ML pipeline: feature selection based on domain knowledge (excluding post-loan data), ordinal encoding for sub_grade (A1=1 to G5=35) preserving risk order, target encoding for categorical variables, and engineered features like installment-to-income ratio, DTI × interest rate interaction, and delinquency flags. Trained 5 models (Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, XGBoost) with GridSearchCV (10-fold CV, AUC scoring). Best performance: XGBoost and the top-3 ensemble, both at roughly 0.80 test AUC (0.8049 for the ensemble).
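The encoding and feature steps above can be sketched as follows. The A1=1 to G5=35 mapping follows the description; the input column names (installment, annual_inc, dti, int_rate, delinq_2yrs), the output column names, and the exact ratio definition (monthly installment over monthly income) are assumptions for illustration.

```python
import pandas as pd

GRADES = "ABCDEFG"  # A is lowest risk, G is highest


def encode_sub_grade(sub_grade: str) -> int:
    """Map 'A1'..'G5' to 1..35 so that the risk ordering is preserved."""
    letter, number = sub_grade[0], int(sub_grade[1])
    return GRADES.index(letter) * 5 + number


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add illustrative features; input column names are assumptions."""
    out = df.copy()
    out["sub_grade_ord"] = out["sub_grade"].map(encode_sub_grade)
    # Monthly installment relative to monthly income (assumed definition).
    out["installment_to_income"] = out["installment"] / (out["annual_inc"] / 12)
    # Interaction between debt-to-income ratio and interest rate.
    out["dti_x_int_rate"] = out["dti"] * out["int_rate"]
    # Flag borrowers with any recorded delinquency.
    out["has_delinq"] = (out["delinq_2yrs"] > 0).astype(int)
    return out


# Made-up rows for demonstration.
sample = pd.DataFrame({
    "sub_grade": ["A1", "B3", "G5"],
    "installment": [300.0, 450.0, 800.0],
    "annual_inc": [60000.0, 54000.0, 48000.0],
    "dti": [10.0, 18.0, 30.0],
    "int_rate": [6.0, 11.0, 25.0],
    "delinq_2yrs": [0, 1, 3],
})
features = add_engineered_features(sample)
print(features[["sub_grade_ord", "installment_to_income", "has_delinq"]])
```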

Tech: Python, XGBoost, Scikit-learn, Pandas, NumPy, Matplotlib, Seaborn

Impact

  • 0.80 test AUC with XGBoost — using only pre-decision features (no data leakage)
  • 887K loans processed with engineered feature pipeline
  • 5 models benchmarked with 10-fold cross-validation and GridSearchCV
  • Domain-aware feature engineering: installment-to-income ratio, DTI × interest rate, credit history length
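The benchmarking setup (GridSearchCV, 10-fold CV, AUC scoring) might look roughly like the sketch below. It swaps in a DecisionTreeClassifier on synthetic data so that it runs with scikit-learn alone; the parameter grid, the stratified shuffling, and the data are assumptions, not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the loan data (assumption).
X, y = make_classification(n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0)

# 10 stratified folds keep the default rate similar across folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6]},  # illustrative grid only
    scoring="roc_auc",                    # rank candidates by AUC
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same search object can wrap any of the five model types by changing the estimator and its grid, which keeps the comparison on equal footing.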

Challenges & Pivots

Challenge: A naive approach using all 74 features produced a suspiciously high AUC (~0.99) because of data leakage from post-loan features.

Resolution: Applied domain knowledge to identify and exclude all features that wouldn't be available at loan decision time. AUC dropped to 0.80 — a realistic, production-viable number.

Challenge: High-cardinality categorical features (state, loan purpose) caused a dimensionality explosion under one-hot encoding.

Resolution: Used target encoding for categorical variables and ordinal encoding for sub_grade that preserved the inherent risk ordering (A1=lowest risk to G5=highest).
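How the target encoding was fitted is not described here. One common way to keep the encoding itself from leaking the label is to compute it out-of-fold with smoothing, as in the sketch below. The helper target_encode_oof, its parameters, and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def target_encode_oof(cat: pd.Series, y: pd.Series, n_splits: int = 5,
                      smoothing: float = 10.0, seed: int = 0) -> pd.Series:
    """Out-of-fold target encoding: each row is encoded with statistics
    computed only from the other folds, so its own label is never used."""
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(cat):
        y_fit, cat_fit = y.iloc[fit_idx], cat.iloc[fit_idx]
        prior = y_fit.mean()  # default rate in the fitting folds
        stats = y_fit.groupby(cat_fit).agg(["sum", "count"])
        # Shrink small categories toward the prior.
        smoothed = (stats["sum"] + prior * smoothing) / (stats["count"] + smoothing)
        # Categories unseen in the fitting folds fall back to the prior.
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(smoothed).fillna(prior).to_numpy()
    return encoded


# Made-up example: state codes and default labels.
states = pd.Series(["CA", "CA", "NY", "NY", "TX", "CA", "NY", "TX", "CA", "TX"])
default = pd.Series([0, 1, 0, 0, 1, 0, 1, 1, 0, 0])
state_encoded = target_encode_oof(states, default, n_splits=5)
print(state_encoded.round(3).tolist())
```

Each category collapses to a single numeric column, which avoids the width of one-hot encoding for features such as state.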

Screenshots

Model comparison — Ensemble of top 3 models achieves 0.8049 AUC on held-out test set

What I Learned

  • Data leakage is the most dangerous pitfall in ML — a 0.99 AUC model that can't work in production is worse than useless.
  • Domain knowledge for feature selection matters more than model complexity — knowing which features are available at decision time is a product question, not just a technical one.
  • Feature engineering grounded in business logic (installment-to-income ratio) outperformed blindly throwing all features at the model.
