AI/ML · Santa Clara University — ML/AI Course

Lending Club Loan Default Prediction

Ensemble ML pipeline predicting loan defaults across 887K loans with 0.80 AUC, using only pre-decision features to prevent data leakage

ML Engineer, Team Project · 2025 · GitHub

Python / XGBoost / Scikit-learn / Pandas

0.80

Test AUC

887K

Loans Analyzed

74

Raw Features

5

Models Compared

The Challenge

Before approving a loan, Lending Club needs to predict whether the borrower will default. The dataset contains 887K loans with 74 features, but most features capture post-loan behavior (payment history, recoveries, balances). Using them would create data leakage and produce misleadingly high accuracy that fails in production.

My Role & Approach

Made the critical decision to exclude all post-loan features and use only information available at loan origination. This shrank the feature set substantially but ensured the model would actually work in a real lending decision pipeline. Engineered domain-specific features and compared 5 model architectures using 10-fold cross-validation.
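A minimal sketch of the exclusion step, assuming the columns are split by when their values become known. The column names follow the public Lending Club export, but the exact exclusion list and the helper name origination_features are illustrative assumptions, not the project's actual code.

```python
# Hypothetical set of columns that only exist after a loan has been issued.
# Names follow the public Lending Club export; the real list is an assumption.
POST_ORIGINATION = {
    "total_pymnt", "total_rec_prncp", "total_rec_int", "recoveries",
    "collection_recovery_fee", "out_prncp", "last_pymnt_amnt",
}


def origination_features(columns, target="loan_status"):
    """Return the columns usable at decision time, preserving input order."""
    return [c for c in columns if c not in POST_ORIGINATION and c != target]


columns = ["loan_amnt", "int_rate", "sub_grade", "annual_inc", "dti",
           "total_pymnt", "recoveries", "out_prncp", "loan_status"]
kept = origination_features(columns)
print(kept)
```

Keeping the exclusion list explicit makes the decision reviewable: anyone can check each column against the question "would this value exist when the borrower applies?"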

Product Thinking

Problem Framing

The naive framing was 'predict loan defaults with maximum accuracy.' The product framing is: 'predict defaults using only information available when a borrower applies.' This distinction — which features exist at decision time — is a product question disguised as a technical one. It's the difference between a research exercise and a deployable model.

Key Tradeoffs

Accuracy vs. production viability: using all 74 features gave ~0.99 AUC, which is impressive but useless because post-loan features won't exist for new applicants. We chose 0.80 AUC with legitimate features over ~0.99 with leaked data. We also chose ensemble averaging over stacking for interpretability, since the lending team needs to explain decisions to regulators.
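Ensemble averaging can be sketched in a few lines. The example below assumes an unweighted mean of each model's predicted default probability; the model names and probabilities are made up for illustration, and the project's actual weighting is not stated.

```python
def average_ensemble(prob_lists):
    """Unweighted mean of per-model default probabilities, row by row."""
    n_models = len(prob_lists)
    return [sum(row) / n_models for row in zip(*prob_lists)]


# Made-up probabilities from three hypothetical models for four applicants.
p_xgb = [0.10, 0.80, 0.40, 0.30]
p_gb = [0.20, 0.70, 0.50, 0.30]
p_rf = [0.30, 0.60, 0.30, 0.30]

blended = average_ensemble([p_xgb, p_gb, p_rf])
print([round(p, 3) for p in blended])
```

Because the blend is a plain mean, each model's contribution to a score can be read off directly, which is harder to do when a stacked meta-model sits on top.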

What We Didn't Build

Didn't build a real-time scoring API, credit limit optimizer, or loan pricing engine. The assignment was prediction, and the product insight was that proving which features are leaky was a more valuable contribution than maximizing AUC.

Solution & Execution

End-to-end ML pipeline: feature selection based on domain knowledge (excluding post-loan data), ordinal encoding for sub_grade (A1=1 to G5=35) preserving risk order, target encoding for categorical variables, and engineered features like installment-to-income ratio, DTI × interest rate interaction, and delinquency flags. Trained 5 models (Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, XGBoost) with GridSearchCV (10-fold CV, AUC scoring). Best performance: XGBoost/Ensemble at 0.80 AUC.
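The tuning loop described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions: the data is synthetic (not Lending Club data), Logistic Regression stands in for the five models, and the parameter grid is invented for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic, imbalanced stand-in for the origination-time feature matrix.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},  # illustrative grid, an assumption
    scoring="roc_auc",                    # AUC scoring, as in the write-up
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),  # 10-fold CV
)
search.fit(X_train, y_train)

# Evaluate the refitted best model once on the held-out test split.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(search.best_score_, 3), round(test_auc, 3))
```

The test split is touched only once, after the search, so the reported test AUC is not influenced by hyperparameter selection.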

Tech: Python, XGBoost, Scikit-learn, Pandas, NumPy, Matplotlib, Seaborn

Impact

0.80 test AUC with XGBoost — using only pre-decision features (no data leakage)
887K loans processed with engineered feature pipeline
5 models benchmarked with 10-fold cross-validation and GridSearchCV
Domain-aware feature engineering: installment-to-income ratio, DTI × interest rate, credit history length
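The engineered features listed above can be sketched on a single loan record. The field names (annual_inc, installment, dti, int_rate) follow the public Lending Club export; issue_year and earliest_cr_year are simplified hypothetical fields, and the exact formulas the team used are assumptions.

```python
def engineer(loan):
    """Add three origination-time features to one loan record (a dict)."""
    monthly_income = loan["annual_inc"] / 12.0
    out = dict(loan)
    # Share of monthly income consumed by the loan payment.
    out["installment_to_income"] = (
        loan["installment"] / monthly_income if monthly_income > 0 else None)
    # Interaction term: debt burden scaled by the price of the loan.
    out["dti_x_int_rate"] = loan["dti"] * loan["int_rate"]
    # Years between the first credit line and the loan issue date.
    out["credit_history_years"] = loan["issue_year"] - loan["earliest_cr_year"]
    return out


loan = {"annual_inc": 60000.0, "installment": 500.0, "dti": 20.0,
        "int_rate": 12.5, "issue_year": 2015, "earliest_cr_year": 2003}
features = engineer(loan)
print(features["installment_to_income"], features["dti_x_int_rate"],
      features["credit_history_years"])
```

In a pandas pipeline the same three features would be column-wise expressions on the DataFrame rather than per-record dictionary updates.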

Challenges & Pivots

Challenge: Naive approach using all 74 features produced suspiciously high AUC (~0.99) due to data leakage from post-loan features

Resolution: Applied domain knowledge to identify and exclude all features that wouldn't be available at loan decision time. AUC dropped to 0.80 — a realistic, production-viable number.
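One quick check that can complement a domain review is to score each feature on its own: a single column that separates defaults almost perfectly is a leakage suspect. This screen is an illustration, not necessarily what the team ran, and the toy values below are invented.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win (the rank-based definition of ROC AUC).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = defaulted. Values are made up for the example.
defaulted = [0, 0, 0, 1, 1, 0, 1, 0]
recoveries = [0, 0, 0, 120, 80, 0, 45, 0]   # post-loan: nonzero only after default
dti = [10, 25, 18, 22, 30, 12, 15, 20]      # known at origination

auc_leaky = auc(defaulted, recoveries)
auc_legit = auc(defaulted, dti)
print(auc_leaky, round(auc_legit, 3))
```

A lone feature scoring near 1.0 does not prove leakage, but it marks the column for the "is this known at decision time?" question.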

Challenge: High-cardinality categorical features (state, loan purpose) caused dimensionality explosion with one-hot encoding

Resolution: Used target encoding for categorical variables and ordinal encoding for sub_grade that preserved the inherent risk ordering (A1=lowest risk to G5=highest).
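Both encodings can be sketched briefly. The sub_grade mapping follows the A1=1 to G5=35 scheme stated above. The target encoder is a generic smoothed-mean variant: the smoothing parameter and the fallback to the global rate are assumptions, since the write-up does not say which variant was used.

```python
def sub_grade_to_ordinal(sub_grade):
    """Map 'A1'..'G5' to 1..35 so that a higher value means higher risk."""
    letter, digit = sub_grade[0], int(sub_grade[1])
    return "ABCDEFG".index(letter) * 5 + digit


def target_encode(categories, targets, smoothing=10.0):
    """Smoothed default rate per category; fit on training rows only.

    Returns (encoding dict, global prior). Rare categories are pulled
    toward the prior, which limits overfitting to small groups.
    """
    prior = sum(targets) / len(targets)
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0) + t
        counts[c] = counts.get(c, 0) + 1
    encoding = {c: (sums[c] + smoothing * prior) / (counts[c] + smoothing)
                for c in counts}
    return encoding, prior


purposes = ["car", "car", "debt", "debt", "debt", "debt"]
defaults = [0, 0, 1, 1, 0, 1]
encoding, prior = target_encode(purposes, defaults, smoothing=2.0)

# Unseen categories at scoring time fall back to the global default rate.
unseen = encoding.get("wedding", prior)
print(sub_grade_to_ordinal("B3"), encoding, unseen)
```

Each categorical column becomes a single numeric column, so a high-cardinality feature such as state adds one dimension instead of dozens.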

Screenshots

Model comparison — Ensemble of top 3 models achieves 0.8049 AUC on held-out test set

What I Learned

Data leakage is the most dangerous pitfall in ML — a 0.99 AUC model that can't work in production is worse than useless.
Domain knowledge for feature selection matters more than model complexity — knowing which features are available at decision time is a product question, not just a technical one.
Feature engineering grounded in business logic (for example, installment-to-income ratio) proved more valuable than blindly throwing all features at the model.
