---
license: mit
library_name: sklearn
tags:
- multi-label-classification
- movie-genre-classification
- tfidf
- svc
- huggingface
- mlops
- github-actions
- serverless
- pipeline
- tmdb
- sklearn
datasets:
- tmdb
model_name: TMDB Multi-Label Genre Classifier
author: Arjun Varma
language:
- en
pretty_name: TMDB Movie Genre Classifier
task:
- text-classification
- multi-label-classification
---
# 🎬 TMDB Multi-Label Movie Genre Classifier
_Serverless Machine Learning Pipeline — TF-IDF + Linear SVC — Fully Automated & Deployed_
---
## Summary
This project shows how to **design, automate, and deploy a real-world machine learning system** without relying on paid cloud services.
It covers:
- **MLOps & CI/CD**
- **Automated retraining & scheduled jobs**
- **Model deployment & UI interface**
- **Testing, documentation, reproducibility**
The model predicts **multiple genres** for a movie based on its description — similar to how streaming platforms tag content for recommendations.
➡ Live Demo: https://huggingface.co/spaces/arjun-varma/tmdb-genre-classifier
➡ Model Hub: https://huggingface.co/arjun-varma/tmdb-genre-classifier
---
## 🧠 Problem — Why Multi‑Label Classification?
Movies are **not mutually exclusive**:
| Plot Summary | Correct Genres |
|-------------|----------------|
| Soldier returns from war, struggling with trauma | Drama, War |
| AI becomes sentient and turns against creators | Sci‑Fi, Thriller |
| A musician finds love on tour | Music, Romance |
A single‑label classifier can assign only one genre per movie, so it cannot represent these cases.
Multi‑label learning predicts **all genres that apply at the same time**.
This creates challenges:
- Soft labels
- Ambiguity
- Genre co‑occurrence patterns
- Long‑tail imbalance (Documentary vs Thriller vs Music)
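To make the multi-label target concrete, here is a minimal sketch of how the genre sets from the table above could be encoded as a binary indicator matrix. It assumes scikit-learn's `MultiLabelBinarizer`; the project may encode labels differently.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Genre sets taken from the example table above, one set per movie.
genres_per_movie = [
    {"Drama", "War"},
    {"Sci-Fi", "Thriller"},
    {"Music", "Romance"},
]

# Each row is a movie, each column a genre; 1 means the genre applies.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres_per_movie)

print(list(mlb.classes_))  # column order (sorted genre names)
print(Y)                   # shape: (n_movies, n_genres)
```

Every row can contain several ones, which is exactly what a single-label target cannot express.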
---
## 🧱 Architecture — Serverless ML Pipeline
![pipeline-architecture](pipeline-architecture.png)
No AWS SageMaker, no GCP Vertex AI.
**Infrastructure cost = $0**
---
## ⚙️ Model — Why TF‑IDF + Linear SVC vs Transformers?
| Choice | Reason |
|--------|-------|
| Transformers | Expensive & slow for nightly retraining |
| Neural Networks | Need GPUs / infra |
| Logistic Regression | High precision, low recall |
| **Linear SVC + TF‑IDF** | Fast, scalable, interpretable 👈 Best for pipeline |
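The biggest improvement came from switching models:
- Logistic Regression predicted almost no genres (micro recall 0.006), effectively playing it “safe”
- Linear SVC learned margin-based decision boundaries, which gave better multi‑genre recall
- Passing the SVC decision scores through a sigmoid and applying a threshold gives a configurable precision/recall trade‑off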
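The approach described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's training code: the one-vs-rest wrapper, the toy training texts, and the helper name `predict_genres` are all hypothetical. Note that a sigmoid applied to an SVC margin is a squashed score, not a calibrated probability, and that a score threshold of 0.25 corresponds to a margin of about -1.10, since `sigmoid(m) >= 0.25` exactly when `m >= ln(1/3)`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data for illustration only; the real pipeline trains on TMDB descriptions.
texts = [
    "Soldier returns from war, struggling with trauma",
    "AI becomes sentient and turns against creators",
    "A musician finds love on tour",
    "A veteran soldier confronts memories of the war",
    "A rogue AI hunts its creators",
    "A singer falls in love during a world tour",
]
labels = [
    ["Drama", "War"], ["Sci-Fi", "Thriller"], ["Music", "Romance"],
    ["Drama", "War"], ["Sci-Fi", "Thriller"], ["Music", "Romance"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# One binary LinearSVC per genre on top of shared TF-IDF features (assumed setup).
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(texts, Y)


def predict_genres(text, threshold=0.25):
    """Return every genre whose sigmoid-squashed margin reaches `threshold`."""
    margins = model.decision_function([text])[0]   # one signed margin per genre
    scores = 1.0 / (1.0 + np.exp(-margins))        # squash margins into (0, 1)
    return [g for g, s in zip(mlb.classes_, scores) if s >= threshold]


print(predict_genres("A soldier comes home from the war"))
```

Lowering `threshold` admits more genres per movie (higher recall, lower precision); raising it does the opposite.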
---
## 📊 Performance Metrics
| Model | Precision_micro | Recall_micro | F1_macro | Result |
|------|----------------|--------------|----------|--------|
| Logistic Regression | 0.83 | 0.006 | ~0.03 | Almost no predictions |
| **Linear SVC + threshold 0.25** | 0.16 | **0.99** | **0.27** | Usable predictions |
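Interpretation:
- Micro recall of 0.99 at threshold 0.25 means the model recovers almost every true genre, but micro precision of 0.16 means most of the genres it predicts are wrong
- The threshold lets *each application choose its own precision/recall trade‑off*
If this were powering **recommendations**, the threshold would need tuning for that use case.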
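The metrics in the table can be computed with scikit-learn's metric functions. The sketch below uses small made-up scores (not the project's data) to show how micro precision, micro recall, and macro F1 move as the threshold changes; `evaluate` is a hypothetical helper name.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical ground truth and model scores for 4 movies x 3 genres.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
scores = np.array([
    [0.90, 0.30, 0.60],
    [0.20, 0.80, 0.30],
    [0.70, 0.40, 0.10],
    [0.30, 0.20, 0.55],
])


def evaluate(threshold):
    """Binarize scores at `threshold` and compute the three reported metrics."""
    y_pred = (scores >= threshold).astype(int)
    return {
        "precision_micro": precision_score(y_true, y_pred, average="micro", zero_division=0),
        "recall_micro": recall_score(y_true, y_pred, average="micro", zero_division=0),
        "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }


for t in (0.25, 0.50, 0.75):
    print(t, evaluate(t))
```

On this toy data a low threshold maximizes recall at the cost of precision, which is the same pattern the table shows for threshold 0.25.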
---
## 🧪 Testing
This project includes:
- Unit tests for vectorization & data transformation
- Mocked API tests for dataset ingestion
- End‑to‑end pipeline test verifying artifacts & metrics
Tools used:
- `pytest`
- `monkeypatch` (pytest fixture, used to mock the API)
- `tmp_path` (pytest fixture)
- GitHub Actions CI
This demonstrates **reliability in automation-focused ML environments**.
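A mocked ingestion test of the kind listed above might look like the sketch below. `TmdbClient`, `fetch_page`, and `ingest` are hypothetical stand-ins invented for this example; the repository's actual module and function names may differ. Only the pytest fixtures `monkeypatch` and `tmp_path` come from the tool list.

```python
import json
from pathlib import Path


class TmdbClient:
    """Hypothetical stand-in for the project's TMDB ingestion client."""

    def fetch_page(self, page):
        # A real client would call the TMDB API here; tests must never reach it.
        raise RuntimeError("network access is not allowed in tests")


def ingest(client, out_dir, pages=1):
    """Fetch `pages` pages and write the combined records to movies.json."""
    records = []
    for page in range(1, pages + 1):
        records.extend(client.fetch_page(page))
    out_path = Path(out_dir) / "movies.json"
    out_path.write_text(json.dumps(records))
    return out_path


def test_ingest_writes_artifact(monkeypatch, tmp_path):
    # Replace the network call with a canned response.
    def fake_fetch_page(self, page):
        return [{"id": page,
                 "overview": "A musician finds love on tour",
                 "genres": ["Music", "Romance"]}]

    monkeypatch.setattr(TmdbClient, "fetch_page", fake_fetch_page)

    # tmp_path is a fresh temporary directory that pytest provides per test.
    out_path = ingest(TmdbClient(), tmp_path, pages=2)

    records = json.loads(out_path.read_text())
    assert out_path.exists()
    assert [r["id"] for r in records] == [1, 2]
```

Saved in a file such as `test_ingest.py` (a hypothetical name), this runs with plain `pytest` and needs no network access or API key.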
---
## 🖥 Demo & Integration
| Component | Link |
|----------|------|
| 🔥 Live App (HF Space) | [link](https://arjun-varma-tmdb-genre-demo.hf.space/?__theme=system&deep_link=uuDed8RzLJI) |
| 📁 GitHub repo | [link](https://github.com/ArjunXvarma/Serverless-ML-pipeline.git) |
The model provides:
- ⭐ Ranked genre probabilities
- ⭐ Adjustable confidence threshold
- ⭐ Real‑time inference
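The ranked output with an adjustable threshold can be illustrated with a small helper. This is a sketch, not the demo's code: `rank_genres` and the example scores are hypothetical.

```python
def rank_genres(scores, threshold=0.25, top_k=None):
    """Return (genre, score) pairs at or above `threshold`, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    kept = [(genre, score) for genre, score in ranked if score >= threshold]
    return kept if top_k is None else kept[:top_k]


# Hypothetical per-genre scores for one movie description.
example_scores = {"Drama": 0.81, "War": 0.64, "Thriller": 0.31, "Music": 0.05}

print(rank_genres(example_scores, threshold=0.25))
print(rank_genres(example_scores, threshold=0.50))
```

Raising the threshold shortens the list to the most confident genres, which mirrors the slider behaviour described above.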
---
## 🚀 Future Enhancements
| Idea | Value |
|-----|------|
| Compare vs MiniLM Transformer | Benchmark credibility |
| Add FastAPI inference service | Deployable microservice |
| Visualize confidence & confusion | Explainable AI |
---
## ✍ Author
**Arjun Varma**
Machine Learning Engineer & Systems Developer
Designed for real-world ML infrastructure readiness.
---