Loan Default Prediction Pipeline

A scikit-learn machine learning pipeline that predicts whether a loan applicant is at risk of defaulting, based on their financial profile and personal details.

Model: Random Forest Classifier
Task: Binary Classification (default risk: 0 = Low Risk, 1 = High Risk)
Validation Accuracy: ~89.7%

Model Description

This pipeline was trained on 252,000 loan applicant records. It chains data preprocessing with a tuned Random Forest Classifier in a single object, so raw applicant records can be passed straight to `predict`. The preprocessing handles:

Missing value imputation
Column name normalization to snake_case
Boolean encoding (married/single status, car ownership, house ownership)
Job stability ordinal encoding
City tier ordinal encoding (Tier 1 / Tier 2 / Tier 3)
State-level default rate target encoding
Standard scaling of numerical features
One-hot encoding of nominal categorical features
Feature selection
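
Three of the encoding steps listed above (boolean, ordinal, and state-level target encoding) can be sketched with plain pandas. This is only an illustration of the techniques, not the code inside the published pipeline: the column names `city_tier` and `risk_flag`, the tier ordering, the fallback to the global default rate, and the toy data are all assumptions made for the example.

```python
import pandas as pd

# Toy training data; every value here is invented for illustration.
train = pd.DataFrame({
    "car_ownership": ["yes", "no", "no", "yes"],
    "city_tier": ["Tier 1", "Tier 3", "Tier 2", "Tier 1"],
    "state": ["Maharashtra", "Kerala", "Maharashtra", "Kerala"],
    "risk_flag": [1, 0, 0, 0],  # hypothetical target column name
})

# Hypothetical ordering; the order used by the published pipeline is not documented.
TIER_ORDER = {"Tier 1": 0, "Tier 2": 1, "Tier 3": 2}


def fit_state_rates(frame, target="risk_flag"):
    """Learn per-state default rates and a global fallback from training rows."""
    rates = frame.groupby("state")[target].mean().to_dict()
    return rates, frame[target].mean()


def encode(frame, state_rates, global_rate):
    """Apply boolean, ordinal, and target encoding to a copy of `frame`."""
    out = frame.copy()
    out["car_ownership"] = (out["car_ownership"] == "yes").astype(int)
    out["city_tier"] = out["city_tier"].map(TIER_ORDER)
    # States not seen during fitting fall back to the global default rate.
    out["state"] = out["state"].map(state_rates).fillna(global_rate)
    return out


state_rates, global_rate = fit_state_rates(train)
encoded = encode(train, state_rates, global_rate)
print(encoded)
```

The rates are learned from training rows only and then reused for new data, which is the usual way to keep target encoding from leaking validation or test labels.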

How to Use

Install dependencies

pip install scikit-learn joblib pandas huggingface_hub
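
Pipelines saved with joblib are generally only safe to load with the scikit-learn version that produced them, so it may be worth checking the installed version before loading. The sketch below compares it against the version listed under Training Details; the helper name `same_minor_version` and the major.minor comparison rule are assumptions for illustration, not part of this repository.

```python
def same_minor_version(installed: str, expected: str) -> bool:
    """Return True when two version strings share the same major.minor prefix."""
    return installed.split(".")[:2] == expected.split(".")[:2]


# Version listed under Training Details in this card.
EXPECTED_SKLEARN = "1.8.0"

try:
    import sklearn

    if not same_minor_version(sklearn.__version__, EXPECTED_SKLEARN):
        print(
            f"Warning: scikit-learn {sklearn.__version__} is installed, "
            f"but the pipeline was saved with {EXPECTED_SKLEARN}."
        )
except ImportError:
    print("scikit-learn is not installed.")
```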

Load and use the pipeline

import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

# Download the pipeline from Hugging Face Hub
model_path = hf_hub_download(
    repo_id="amanbokaro/loan-default-prediction-pipeline",
    filename="loan_default_rf_pipeline.joblib"
)

# Load the pipeline
pipeline = joblib.load(model_path)

# Example input (single applicant)
applicant = pd.DataFrame([{
    "Income":          500000,
    "Age":             35,
    "Experience":      8,
    "Married/Single":  "married",
    "House_Ownership": "rented",
    "Car_Ownership":   "yes",
    "Profession":      "Software_Developer",
    "CITY":            "Mumbai",
    "STATE":           "Maharashtra",
    "CURRENT_JOB_YRS": 4,
    "CURRENT_HOUSE_YRS": 3
}])

# Predict
prediction = pipeline.predict(applicant)
probability = pipeline.predict_proba(applicant)

print("Prediction:", "High Risk" if prediction[0] == 1 else "Low Risk")
print(f"Default probability: {probability[0][1]:.2%}")
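
If the default decision is too lenient or too strict for a given use case, the probabilities from `predict_proba` can be turned into labels with a custom cut-off. The following is a minimal sketch: the helper `label_by_threshold` and the example probabilities are hypothetical, and a suitable threshold would have to be chosen on validation data.

```python
def label_by_threshold(default_probabilities, threshold=0.5):
    """Map default probabilities to "High Risk" / "Low Risk" labels."""
    return [
        "High Risk" if p > threshold else "Low Risk"
        for p in default_probabilities
    ]


# Stand-in for pipeline.predict_proba(applicants)[:, 1]; values are invented.
example_probs = [0.08, 0.35, 0.72]

print(label_by_threshold(example_probs))                 # 0.5 cut-off
print(label_by_threshold(example_probs, threshold=0.3))  # stricter cut-off
```

Lowering the threshold flags more applicants as high risk, trading extra false alarms for fewer missed defaults.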

Input Features

Feature	Type	Description
`Income`	Numerical	Annual income of the applicant (INR)
`Age`	Numerical	Age of the applicant (years)
`Experience`	Numerical	Years of work experience
`Married/Single`	Categorical	Marital status (`married` / `single`)
`House_Ownership`	Categorical	Housing status (`rented` / `owned` / `norent_noown`)
`Car_Ownership`	Categorical	Car ownership (`yes` / `no`)
`Profession`	Categorical	Applicant's profession/job title
`CITY`	Categorical	City of residence
`STATE`	Categorical	State of residence
`CURRENT_JOB_YRS`	Numerical	Years at current job
`CURRENT_HOUSE_YRS`	Numerical	Years at current residence
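
Before calling `predict`, it can help to confirm that the input carries exactly the columns in the table above. The check below is a small sketch that assumes the pipeline expects these names with this exact spelling and casing; `check_columns` is a hypothetical helper, not something shipped with the model.

```python
EXPECTED_COLUMNS = [
    "Income", "Age", "Experience", "Married/Single", "House_Ownership",
    "Car_Ownership", "Profession", "CITY", "STATE",
    "CURRENT_JOB_YRS", "CURRENT_HOUSE_YRS",
]


def check_columns(columns):
    """Return (missing, unexpected) column names relative to the feature table."""
    present = set(columns)
    missing = [name for name in EXPECTED_COLUMNS if name not in present]
    unexpected = sorted(present - set(EXPECTED_COLUMNS))
    return missing, unexpected


# Works with any iterable of names, e.g. `applicant.columns` from a DataFrame.
missing, unexpected = check_columns(["Income", "Age", "city", "STATE"])
print("Missing:", missing)
print("Unexpected:", unexpected)
```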

Output

Value	Meaning
`0`	Low Risk
`1`	High Risk

Training Details

Detail	Value
Dataset size	252,000 records
Train / Val / Test	80% / 10% / 10%
Algorithm	Random Forest Classifier
Validation Accuracy	~89.7%
Random Seed	42
Framework	scikit-learn 1.8.0
Python	3.11
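
An 80% / 10% / 10% split like the one listed above is commonly produced with two calls to `train_test_split`. The sketch below shows one way to do that on synthetic data; the stratification, the `n_estimators` value, and the data itself are assumptions, since the original training script and tuned hyperparameters are not part of this card.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42  # random seed listed in the table above

# Synthetic stand-in data; the real training set is not bundled with this card.
rng = np.random.default_rng(SEED)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 1.0).astype(int)

# Hold out 20% first, then split that portion evenly into validation and test.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=SEED
)

# Hyperparameters are illustrative; the tuned values are not documented here.
model = RandomForestClassifier(n_estimators=50, random_state=SEED)
model.fit(X_train, y_train)

print(len(X_train), len(X_val), len(X_test))
print("Validation accuracy:", model.score(X_val, y_val))
```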

Files

File	Description
`loan_default_rf_pipeline.joblib`	Serialized sklearn pipeline (405 MB)

License

MIT License — see LICENSE for details.
