Loan Default Prediction Pipeline

A scikit-learn machine learning pipeline that predicts whether a loan applicant is at risk of defaulting, based on their financial profile and personal details.

Model: Random Forest Classifier
Task: Binary Classification (default risk: 0 = Low Risk, 1 = High Risk)
Validation Accuracy: ~89.7%


Model Description

The pipeline was trained on 252,000 loan applicant records. It chains data preprocessing with a tuned Random Forest Classifier, so raw applicant records can be passed to it directly. The preprocessing steps are:

  • Missing value imputation
  • Snake-case column name formatting
  • Boolean encoding (married/single status, car ownership, house ownership)
  • Job stability ordinal encoding
  • City tier ordinal encoding (Tier 1 / Tier 2 / Tier 3)
  • State-level default rate target encoding
  • Standard scaling of numerical features
  • One-hot encoding of nominal categorical features
  • Feature selection
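
Two of the steps above, snake-case column formatting and state-level target encoding, can be sketched in plain Python. This is a minimal illustration and not the pipeline's actual implementation: the helper names `to_snake_case`, `fit_state_default_rates`, and `encode_state`, the toy data, and the fallback to the global default rate for unseen states are all assumptions.

```python
import re
from collections import defaultdict


def to_snake_case(name: str) -> str:
    """Lower-case a column name and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


def fit_state_default_rates(rows):
    """Return ({state: mean default rate}, global default rate) from training rows.

    Each row is a (state, label) pair with label 0 (low risk) or 1 (high risk).
    """
    counts = defaultdict(lambda: [0, 0])  # state -> [defaults, total rows]
    for state, label in rows:
        counts[state][0] += label
        counts[state][1] += 1
    rates = {state: d / n for state, (d, n) in counts.items()}
    total_defaults = sum(d for d, _ in counts.values())
    total_rows = sum(n for _, n in counts.values())
    return rates, total_defaults / total_rows


def encode_state(state, rates, global_rate):
    """Map a state to its training default rate; unseen states get the global rate (assumed fallback)."""
    return rates.get(state, global_rate)


# Toy training data, made up for illustration only
train = [("Maharashtra", 0), ("Maharashtra", 1), ("Kerala", 0), ("Kerala", 0)]
rates, global_rate = fit_state_default_rates(train)

columns = [to_snake_case(c) for c in ["Married/Single", "House_Ownership", "CURRENT_JOB_YRS"]]
```

Whatever the exact implementation, a target encoding of this kind should be fitted on the training split only, so that validation and test labels do not leak into the feature.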

How to Use

Install dependencies

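# scikit-learn is pinned to the version listed under Training Details;
# a pickled pipeline may fail to load under a different version.
pip install scikit-learn==1.8.0 joblib pandas huggingface_hub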

Load and use the pipeline

import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

# Download the pipeline from Hugging Face Hub
model_path = hf_hub_download(
    repo_id="amanbokaro/loan-default-prediction-pipeline",
    filename="loan_default_rf_pipeline.joblib"
)

# Load the pipeline
pipeline = joblib.load(model_path)

# Example input (single applicant)
applicant = pd.DataFrame([{
    "Income":          500000,
    "Age":             35,
    "Experience":      8,
    "Married/Single":  "married",
    "House_Ownership": "rented",
    "Car_Ownership":   "yes",
    "Profession":      "Software_Developer",
    "CITY":            "Mumbai",
    "STATE":           "Maharashtra",
    "CURRENT_JOB_YRS": 4,
    "CURRENT_HOUSE_YRS": 3
}])

# Predict
prediction = pipeline.predict(applicant)
probability = pipeline.predict_proba(applicant)

print("Prediction:", "High Risk" if prediction[0] == 1 else "Low Risk")
print(f"Default probability: {probability[0][1]:.2%}")
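
If the final step is a plain `RandomForestClassifier`, `predict` picks the class with the higher averaged probability, which amounts to roughly a 0.5 cutoff. A different cutoff can be applied to the `predict_proba` output instead. The sketch below assumes the column order `[P(low risk), P(high risk)]` used in the example above; the helper name `label_risk` and the 0.3 threshold are hypothetical, and the probabilities are made up.

```python
def label_risk(proba_rows, threshold=0.5):
    """Turn predict_proba-style rows [P(low), P(high)] into risk labels.

    `threshold` is a hypothetical cutoff applied to P(high risk).
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return ["High Risk" if row[1] >= threshold else "Low Risk" for row in proba_rows]


# Made-up probabilities standing in for pipeline.predict_proba(applicants)
example_proba = [[0.92, 0.08], [0.65, 0.35], [0.30, 0.70]]

default_labels = label_risk(example_proba)                # cutoff 0.5
strict_labels = label_risk(example_proba, threshold=0.3)  # flags more applicants
```

With the loaded pipeline, the same helper could be called as `label_risk(pipeline.predict_proba(applicant), threshold=0.3)`. Lowering the threshold flags more applicants as high risk at the cost of more false alarms.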

Input Features

| Feature | Type | Description |
|---|---|---|
| Income | Numerical | Annual income of the applicant (INR) |
| Age | Numerical | Age of the applicant (years) |
| Experience | Numerical | Work experience (years) |
| Married/Single | Categorical | Marital status (married / single) |
| House_Ownership | Categorical | Housing status (rented / owned / norent_noown) |
| Car_Ownership | Categorical | Car ownership (yes / no) |
| Profession | Categorical | Applicant's profession or job title |
| CITY | Categorical | City of residence |
| STATE | Categorical | State of residence |
| CURRENT_JOB_YRS | Numerical | Time in current job (years) |
| CURRENT_HOUSE_YRS | Numerical | Time at current residence (years) |
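
Because the pipeline expects these exact column names, it can help to check a record before building the DataFrame. The following is a minimal sketch, assuming the categorical values listed in the table are the only ones seen in training; the helper `validate_applicant` is hypothetical and not part of the published pipeline.

```python
REQUIRED_COLUMNS = [
    "Income", "Age", "Experience", "Married/Single", "House_Ownership",
    "Car_Ownership", "Profession", "CITY", "STATE",
    "CURRENT_JOB_YRS", "CURRENT_HOUSE_YRS",
]

# Allowed values copied from the feature table
ALLOWED_VALUES = {
    "Married/Single": {"married", "single"},
    "House_Ownership": {"rented", "owned", "norent_noown"},
    "Car_Ownership": {"yes", "no"},
}


def validate_applicant(record: dict) -> list:
    """Return a list of problems found in one applicant record (empty if none)."""
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in record]
    for column, allowed in ALLOWED_VALUES.items():
        if column in record and record[column] not in allowed:
            problems.append(f"unexpected value for {column}: {record[column]!r}")
    return problems


applicant = {
    "Income": 500000, "Age": 35, "Experience": 8,
    "Married/Single": "married", "House_Ownership": "rented",
    "Car_Ownership": "yes", "Profession": "Software_Developer",
    "CITY": "Mumbai", "STATE": "Maharashtra",
    "CURRENT_JOB_YRS": 4, "CURRENT_HOUSE_YRS": 3,
}

# A deliberately broken copy: bad category and a missing column
bad_applicant = {**applicant, "Car_Ownership": "maybe"}
del bad_applicant["STATE"]

ok_problems = validate_applicant(applicant)
bad_problems = validate_applicant(bad_applicant)
```

Records that pass the check can then be wrapped as `pd.DataFrame([applicant])` and passed to `pipeline.predict`.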

Output

| Value | Meaning |
|---|---|
| 0 | Low Risk |
| 1 | High Risk |

Training Details

| Detail | Value |
|---|---|
| Dataset size | 252,000 records |
| Train / Val / Test | 80% / 10% / 10% |
| Algorithm | Random Forest Classifier |
| Validation Accuracy | ~89.7% |
| Random Seed | 42 |
| Framework | scikit-learn 1.8.0 |
| Python | 3.11 |
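
For a sense of scale, the 80% / 10% / 10% split of 252,000 records works out to 201,600 training rows and 25,200 rows each for validation and test. The sketch below reproduces those sizes with a seeded shuffle of row indices. It is an illustration only: the original split procedure is not documented here (it may, for example, have been stratified or done with scikit-learn utilities), and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_rows, seed=42, val_frac=0.10, test_frac=0.10):
    """Shuffle row indices with a fixed seed and cut them into train/val/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_val = round(n_rows * val_frac)
    n_test = round(n_rows * test_frac)
    n_train = n_rows - n_val - n_test
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(252_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 201600 25200 25200
```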

Files

| File | Description |
|---|---|
| loan_default_rf_pipeline.joblib | Serialized scikit-learn pipeline (405 MB) |

License

MIT License — see LICENSE for details.
