AI/ML · Santa Clara University — ML/AI Course
Lending Club Loan Default Prediction
Ensemble ML pipeline predicting loan defaults across 887K loans with 0.80 AUC, using only pre-decision features to prevent data leakage

The Challenge
Lending Club needs to predict which borrowers will default before approving loans. The dataset contains 887K loans with 74 features, but most features capture post-loan behavior (payment history, recoveries, balances) — using them would create data leakage and produce misleadingly high accuracy that fails in production.
My Role & Approach
Made the critical decision to exclude all post-loan features and use only information available at loan origination. This shrank the original 74-feature set substantially but ensured the model would actually work in a real lending decision pipeline. Engineered domain-specific features and compared 5 models using 10-fold cross-validation.
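The exclusion step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names are assumptions modeled on the public Lending Club data dictionary, and the exact exclusion list used in the project is not given in this write-up.

```python
import pandas as pd

# Assumed post-loan column names (hypothetical list for illustration only).
POST_LOAN_COLUMNS = [
    "total_pymnt", "total_rec_prncp", "total_rec_int", "total_rec_late_fee",
    "recoveries", "collection_recovery_fee", "out_prncp",
    "last_pymnt_d", "last_pymnt_amnt",
]

def keep_origination_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns that are only populated after a loan has been issued."""
    leaky = [col for col in POST_LOAN_COLUMNS if col in df.columns]
    return df.drop(columns=leaky)

# Tiny made-up frame: two origination-time columns, two post-loan columns.
loans = pd.DataFrame({
    "loan_amnt": [10000, 5000],
    "int_rate": [13.5, 7.9],
    "recoveries": [0.0, 120.5],
    "total_pymnt": [11200.0, 900.0],
})
clean = keep_origination_features(loans)
print(list(clean.columns))
```

Keeping the exclusion list as an explicit, reviewable constant makes the "available at decision time" rule something a domain expert can audit, rather than a side effect buried in preprocessing.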
Product Thinking
Problem Framing
The naive framing was 'predict loan defaults with maximum accuracy.' The product framing is: 'predict defaults using only information available when a borrower applies.' This distinction — which features exist at decision time — is a product question disguised as a technical one. It's the difference between a research exercise and a deployable model.
Key Tradeoffs
Accuracy vs. production viability: using all 74 features gave roughly 0.99 AUC (impressive but useless, since post-loan features won't exist for new applicants). We chose 0.80 AUC with legitimate features over 0.99 with leaked data. Also chose ensemble averaging over stacking for interpretability: with simple averaging every model contributes equally to the final score, and the lending team needs to explain decisions to regulators.
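To make the averaging choice concrete, here is a small sketch of blending per-model default probabilities with an unweighted mean. The probabilities are made-up numbers, and the function name is hypothetical; the project's own ensemble code is not shown in this write-up.

```python
import numpy as np

def average_ensemble(prob_matrix: np.ndarray) -> np.ndarray:
    """Average default probabilities across models.

    Rows are models, columns are loans. Each model carries the same
    weight (1 / number of models), so no meta-learner has to be explained.
    """
    return prob_matrix.mean(axis=0)

# Three hypothetical models scoring the same three loans.
probs = np.array([
    [0.10, 0.60, 0.30],
    [0.20, 0.80, 0.50],
    [0.30, 0.70, 0.40],
])
blended = average_ensemble(probs)
print(blended)
```

Stacking would instead fit a second model on these rows, which can help accuracy but adds another layer that has to be justified to a reviewer.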
What We Didn't Build
Didn't build a real-time scoring API, credit limit optimizer, or loan pricing engine. The assignment was prediction, but the product insight was that the most valuable contribution was proving which features are leaky, not maximizing AUC.
Solution & Execution
End-to-end ML pipeline: feature selection based on domain knowledge (excluding post-loan data), ordinal encoding for sub_grade (A1=1 to G5=35) preserving risk order, target encoding for categorical variables, and engineered features like installment-to-income ratio, DTI × interest rate interaction, and delinquency flags. Trained 5 models (Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, XGBoost) with GridSearchCV (10-fold CV, AUC scoring). Best performance: XGBoost and the averaged ensemble, at 0.80 AUC.
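A minimal sketch of the sub_grade encoding and the engineered features named above. The column names (installment, annual_inc, dti, int_rate, delinq_2yrs) are assumptions based on the public Lending Club schema, and the exact feature definitions are assumptions too: the ratio here uses monthly income, and the delinquency flag is simply "any delinquency recorded".

```python
import pandas as pd

def encode_sub_grade(sub_grade: str) -> int:
    """Map 'A1'..'G5' to 1..35 so a larger value means higher risk."""
    letter, step = sub_grade[0], int(sub_grade[1])
    return (ord(letter) - ord("A")) * 5 + step

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the illustrative engineered columns added."""
    out = df.copy()
    out["sub_grade_ord"] = out["sub_grade"].map(encode_sub_grade)
    # Assumption: monthly installment divided by monthly income.
    out["installment_to_income"] = out["installment"] / (out["annual_inc"] / 12)
    # Interaction term: debt-to-income ratio times interest rate.
    out["dti_x_int_rate"] = out["dti"] * out["int_rate"]
    # Assumption: flag is 1 when any delinquency is recorded.
    out["has_delinq"] = (out["delinq_2yrs"] > 0).astype(int)
    return out

# Two made-up applicants at opposite ends of the grade scale.
sample = pd.DataFrame({
    "sub_grade": ["A1", "G5"],
    "installment": [300.0, 450.0],
    "annual_inc": [60000.0, 36000.0],
    "dti": [20.0, 30.0],
    "int_rate": [10.0, 25.0],
    "delinq_2yrs": [0, 2],
})
features = add_engineered_features(sample)
print(features[["sub_grade_ord", "installment_to_income",
                "dti_x_int_rate", "has_delinq"]])
```

Every input here is known when the borrower applies, which is what keeps these features on the safe side of the leakage line.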
Tech: Python, XGBoost, Scikit-learn, Pandas, NumPy, Matplotlib, Seaborn
Impact
- 0.80 test AUC with XGBoost — using only pre-decision features (no data leakage)
- 887K loans processed with engineered feature pipeline
- 5 models benchmarked with 10-fold cross-validation and GridSearchCV
- Domain-aware feature engineering: installment-to-income ratio, DTI × interest rate, credit history length
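The benchmarking setup (GridSearchCV, 10-fold CV, AUC scoring) might look something like the sketch below. It is an illustration under stated assumptions: the data is synthetic, Logistic Regression stands in for all five models, and the parameter grid and stratified shuffling are my choices here, not settings taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the loan data (about 20% positives).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

# 10 stratified folds keep the default rate similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0]},  # hypothetical grid
    scoring="roc_auc",                   # rank candidates by AUC
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same search object can be reused per model by swapping the estimator and grid, which keeps the comparison across models on identical folds and the same metric.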
Challenges & Pivots
Challenge: Naive approach using all 74 features produced suspiciously high AUC (~0.99) due to data leakage from post-loan features
Resolution: Applied domain knowledge to identify and exclude all features that wouldn't be available at loan decision time. AUC dropped to 0.80 — a realistic, production-viable number.
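The resolution above relied on domain knowledge. As a complementary sanity check (a swapped-in technique, not something this write-up says the project used), one could score each feature on its own and flag any that separates defaults almost perfectly. The data below is synthetic and the variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def single_feature_auc(values: np.ndarray, target: np.ndarray) -> float:
    """AUC of one raw feature used as the score, made direction-agnostic."""
    auc = roc_auc_score(target, values)
    return max(auc, 1.0 - auc)

rng = np.random.default_rng(0)
defaulted = rng.integers(0, 2, size=1000)

# A post-loan style feature: nonzero only for loans that defaulted.
recoveries = defaulted * rng.uniform(50, 500, size=1000)
# An origination-time style feature, generated independently of the target.
loan_amnt = rng.uniform(1000, 35000, size=1000)

leaky_auc = single_feature_auc(recoveries, defaulted)
benign_auc = single_feature_auc(loan_amnt, defaulted)
print(round(leaky_auc, 3), round(benign_auc, 3))
```

A near-perfect single-feature AUC is only a hint. Deciding whether a column actually exists at decision time still needs someone who understands the lending process.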
Challenge: High-cardinality categorical features (state, loan purpose) caused dimensionality explosion with one-hot encoding
Resolution: Used target encoding for categorical variables and ordinal encoding for sub_grade that preserved the inherent risk ordering (A1=lowest risk to G5=highest).
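A rough sketch of target encoding for a categorical column such as state. The write-up does not say whether smoothing or out-of-fold fitting was used, so the smoothing term and the function names here are assumptions. The mapping is fitted on training rows only, so the encoding itself does not leak the labels of held-out rows.

```python
import pandas as pd

def fit_target_encoding(categories: pd.Series, target: pd.Series,
                        smoothing: float = 2.0):
    """Learn a smoothed default rate per category from training data."""
    global_rate = target.mean()
    stats = target.groupby(categories).agg(["sum", "count"])
    # Shrink small categories toward the global default rate.
    encoded = (stats["sum"] + smoothing * global_rate) / (stats["count"] + smoothing)
    return encoded.to_dict(), global_rate

def apply_target_encoding(categories: pd.Series, mapping: dict,
                          global_rate: float) -> pd.Series:
    """Replace categories with learned rates; unseen ones get the global rate."""
    return categories.map(mapping).fillna(global_rate)

# Made-up training data: one numeric column replaces many one-hot columns.
train = pd.DataFrame({
    "addr_state": ["CA", "CA", "CA", "NY", "NY", "TX"],
    "defaulted": [1, 0, 0, 1, 1, 0],
})
mapping, global_rate = fit_target_encoding(train["addr_state"], train["defaulted"])

new_loans = pd.Series(["CA", "NY", "WA"])  # "WA" never appeared in training
encoded = apply_target_encoding(new_loans, mapping, global_rate)
print(encoded.tolist())
```

However many distinct categories the column holds, the result is a single numeric feature, which is what avoids the one-hot dimensionality problem described above.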
What I Learned
- Data leakage is one of the most dangerous pitfalls in ML: a 0.99 AUC model that can't work in production is worse than useless, because the score invites trust the model hasn't earned.
- Domain knowledge for feature selection matters more than model complexity — knowing which features are available at decision time is a product question, not just a technical one.
- Feature engineering grounded in business logic (such as installment-to-income ratio) proved more valuable than blindly throwing all features at the model, which only inflated AUC through leakage.