	---
	license: mit
	library_name: sklearn
	tags:
	- multi-label-classification
	- movie-genre-classification
	- tfidf
	- svc
	- huggingface
	- mlops
	- github-actions
	- serverless
	- pipeline
	- tmdb
	- sklearn
	datasets:
	- tmdb
	model_name: TMDB Multi-Label Genre Classifier
	author: Arjun Varma
	language:
	- en
	pretty_name: TMDB Movie Genre Classifier
	task:
	- text-classification
	- multi-label-classification
	---

	# 🎬 TMDB Multi-Label Movie Genre Classifier
	_Serverless Machine Learning Pipeline — TF-IDF + Linear SVC — Fully Automated & Deployed_

	---

	## Summary

	This project demonstrates the ability to design, automate, and deploy a real-world Machine Learning system without relying on paid cloud services.

	It showcases strong understanding and application of:
	- MLOps & CI/CD
	- Automated retraining & scheduled jobs
	- Model deployment & UI interface
	- Testing, documentation, reproducibility

	The model predicts multiple genres for a movie based on its description — similar to how streaming platforms tag content for recommendations.

	➡ Live Demo: https://huggingface.co/spaces/arjun-varma/tmdb-genre-classifier
	➡ Model Hub: https://huggingface.co/arjun-varma/tmdb-genre-classifier

	---

	## 🧠 Problem — Why Multi‑Label Classification?

	Movie genres are not mutually exclusive:

	| Plot Summary | Correct Genres |
	|--------------|----------------|
	| Soldier returns from war, struggling with trauma | Drama, War |
	| AI becomes sentient and turns against creators | Sci‑Fi, Thriller |
	| A musician finds love on tour | Music, Romance |

	Single‑label classifiers fail here.
	Multi‑label learning predicts all genres that simultaneously apply.

	This creates challenges:
	- Soft labels
	- Ambiguity
	- Genre co‑occurrence patterns
	- Long‑tail imbalance (Documentary vs Thriller vs Music)
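
A minimal sketch of how the genre lists from the table above could be turned into a multi-label target. It assumes scikit-learn's `MultiLabelBinarizer`; the project's actual preprocessing may differ.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Genre lists taken from the example table above.
genres = [
    ["Drama", "War"],
    ["Sci-Fi", "Thriller"],
    ["Music", "Romance"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)  # shape (n_movies, n_genres), entries are 0 or 1

print(list(mlb.classes_))
# ['Drama', 'Music', 'Romance', 'Sci-Fi', 'Thriller', 'War']
print(Y.tolist())
# [[1, 0, 0, 0, 0, 1], [0, 0, 0, 1, 1, 0], [0, 1, 1, 0, 0, 0]]

# Column sums give per-genre counts, a quick way to spot long-tail imbalance.
print(dict(zip(mlb.classes_, Y.sum(axis=0).tolist())))
```

Each row can contain several ones, which is what separates this setup from single-label classification.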

	---

	## 🧱 Architecture — Serverless ML Pipeline

	![pipeline-architecture](pipeline-architecture.png)

	No AWS SageMaker, no GCP Vertex AI.
	Infrastructure cost = $0

	---

	## ⚙️ Model — Why TF‑IDF + Linear SVC vs Transformers?

	| Choice | Reason |
	|--------|--------|
	| Transformers | Expensive & slow for nightly retraining |
	| Neural Networks | Need GPUs / infra |
	| Logistic Regression | High precision, low recall |
	| Linear SVC + TF‑IDF | Fast, scalable, interpretable 👈 Best for pipeline |

	The biggest improvement came from switching models:
	- Logistic Regression predicted almost nothing → it stayed “safe” (high precision, near-zero recall)
	- Linear SVC learned margin-based decision boundaries → better multi‑genre recall
	- Applying a sigmoid to the SVC decision scores, then a threshold → configurable precision/recall trade‑off
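
The sigmoid-plus-threshold step can be sketched as follows. This is an illustration under assumptions, not the project's actual training code: the toy plots, the default hyperparameters, the one-vs-rest wrapper, and the helper name `predict_genres` are all made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data for illustration only (not the TMDB dataset).
plots = [
    "Soldier returns from war, struggling with trauma",
    "A veteran soldier faces the trauma of war at home",
    "AI becomes sentient and turns against creators",
    "A rogue AI hunts its creators through the city",
    "A musician finds love on tour",
    "Two singers fall in love during a concert tour",
]
genres = [
    ["Drama", "War"], ["Drama", "War"],
    ["Sci-Fi", "Thriller"], ["Sci-Fi", "Thriller"],
    ["Music", "Romance"], ["Music", "Romance"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)

# One linear SVM per genre on top of TF-IDF features.
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(plots, Y)


def predict_genres(texts, threshold=0.25):
    """Hypothetical helper: squash SVM margins with a sigmoid, then threshold."""
    margins = model.decision_function(texts)   # shape (n_texts, n_genres)
    scores = 1.0 / (1.0 + np.exp(-margins))    # pseudo-probabilities in (0, 1)
    labels = [
        [g for g, s in zip(mlb.classes_, row) if s >= threshold]
        for row in scores
    ]
    return scores, labels


scores, labels = predict_genres(["A soldier struggles with trauma after the war"])
print(labels[0])
```

The sigmoid scores are not calibrated probabilities. A cutoff of 0.25 on the sigmoid corresponds to a raw margin of about -1.10, which lies below the usual SVM decision boundary at 0, so lowering the threshold trades precision for recall.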

	---

	## 📊 Performance Metrics

	| Model | Precision (micro) | Recall (micro) | F1 (macro) | Result |
	|-------|-------------------|----------------|------------|--------|
	| Logistic Regression | 0.83 | 0.006 | ~0.03 | Almost no predictions |
	| Linear SVC + threshold 0.25 | 0.16 | 0.99 | 0.27 | Usable predictions |

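	Interpretation:
	- High recall means the model rarely misses a genre that applies; the low precision (0.16) at threshold 0.25 shows that it also predicts many genres that do not apply
	- The threshold lets each application choose its own precision/recall balance

	If this were powering recommendations, the choice of threshold would matter.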
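
For reference, the three metrics in the table can be computed with scikit-learn as sketched below. The indicator matrices are made up for illustration and are not the project's evaluation data.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Rows are movies, columns are genres (toy values, for illustration only).
y_true = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
])
# An over-eager model that predicts almost every genre for every movie.
y_pred = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
])

# Micro averaging pools every (movie, genre) decision before computing the score.
precision_micro = precision_score(y_true, y_pred, average="micro")
recall_micro = recall_score(y_true, y_pred, average="micro")
# Macro averaging computes F1 per genre and then takes the unweighted mean.
f1_macro = f1_score(y_true, y_pred, average="macro")

print(f"precision_micro={precision_micro:.3f}")  # 4 of 11 predictions correct: 0.364
print(f"recall_micro={recall_micro:.3f}")        # all 4 true labels found: 1.000
print(f"f1_macro={f1_macro:.3f}")                # mean of 0.5, 0.5, 0.5, 0.667: 0.542
```

The toy model shows the same pattern as the Linear SVC row: predicting nearly everything yields very high recall and low precision.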

	---

	## 🧪 Testing

	This project includes:
	- Unit tests for vectorization & data transformation
	- Mocked API tests for dataset ingestion
	- End‑to‑end pipeline test verifying artifacts & metrics

	Tools used:
	- `pytest`
	- `monkeypatch`
	- `tmp_path`
	- GitHub CI

	This demonstrates reliability in automation-focused ML environments.
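
A mocked ingestion test using the `monkeypatch` and `tmp_path` fixtures might look like the sketch below. `TmdbClient`, `get_page`, `ingest`, and the CSV columns are hypothetical names chosen for this example; the repository's actual code and tests may be structured differently.

```python
import csv
from pathlib import Path


class TmdbClient:
    """Hypothetical API client; stands in for the real ingestion code."""

    def get_page(self, page: int) -> list:
        raise RuntimeError("network access is not allowed in tests")


def ingest(client: TmdbClient, out_path: Path, pages: int = 1) -> int:
    """Fetch `pages` pages of movies and write them to a CSV file."""
    rows = [movie for p in range(1, pages + 1) for movie in client.get_page(p)]
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["title", "overview", "genres"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def test_ingest_writes_csv(monkeypatch, tmp_path):
    # Replace the network call with canned data so the test runs offline.
    def fake_get_page(self, page):
        return [{"title": f"Movie {page}", "overview": "A plot.", "genres": "Drama|War"}]

    monkeypatch.setattr(TmdbClient, "get_page", fake_get_page)

    # tmp_path is a per-test temporary directory, so no artifacts leak between runs.
    out_file = tmp_path / "movies.csv"
    n_rows = ingest(TmdbClient(), out_file, pages=2)

    assert n_rows == 2
    with out_file.open(newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    assert [r["title"] for r in rows] == ["Movie 1", "Movie 2"]
```

Saved as a `test_*.py` file, this would be collected and run by `pytest`, which supplies both fixtures automatically.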

	---

	## 🖥 Demo & Integration

	| Component | Link |
	|-----------|------|
	| 🔥 Live App (HF Space) | [link](https://arjun-varma-tmdb-genre-demo.hf.space/?__theme=system&deep_link=uuDed8RzLJI) |
	| 📁 GitHub repo | [link](https://github.com/ArjunXvarma/Serverless-ML-pipeline.git) |

	The model provides:
	- ⭐ Ranked genre probabilities
	- ⭐ Adjustable confidence threshold
	- ⭐ Real‑time inference
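
Ranking with an adjustable threshold can be sketched in a few lines. The helper name `rank_genres` and the scores are assumptions for illustration, not taken from the demo's source.

```python
def rank_genres(scores, threshold=0.25):
    """Return (genre, score) pairs at or above `threshold`, highest score first."""
    kept = [(genre, score) for genre, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up scores for a single plot summary.
scores = {"Thriller": 0.30, "Drama": 0.81, "Music": 0.08, "War": 0.64}

print(rank_genres(scores))
# [('Drama', 0.81), ('War', 0.64), ('Thriller', 0.3)]
print(rank_genres(scores, threshold=0.5))
# [('Drama', 0.81), ('War', 0.64)]
```

Raising the threshold shortens the list to the most confident genres, while lowering it favours recall.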

	---

	## 🚀 Future Enhancements

	| Idea | Value |
	|------|-------|
	| Compare vs MiniLM Transformer | Benchmark credibility |
	| Add FastAPI inference service | Deployable microservice |
	| Visualize confidence & confusion | Explainable AI |

	---

	## ✍ Author

	Arjun Varma
	Machine Learning Engineer & Systems Developer
	Designed for real-world ML infrastructure readiness.

	---