---
license: mit
library_name: sklearn
tags:
- multi-label-classification
- movie-genre-classification
- tfidf
- svc
- huggingface
- mlops
- github-actions
- serverless
- pipeline
- tmdb
- sklearn
datasets:
- tmdb
model_name: TMDB Multi-Label Genre Classifier
author: Arjun Varma
language:
- en
pretty_name: TMDB Movie Genre Classifier
task:
- text-classification
- multi-label-classification
---

# 🎬 TMDB Multi-Label Movie Genre Classifier
_Serverless Machine Learning Pipeline · TF-IDF + Linear SVC · Fully Automated & Deployed_

---

## Summary

This project demonstrates how to **design, automate, and deploy a real-world machine learning system** without relying on paid cloud services.

It showcases:
- **MLOps & CI/CD**
- **Automated retraining & scheduled jobs**
- **Model deployment & UI interface**
- **Testing, documentation, reproducibility**

The model predicts **multiple genres** for a movie from its description, similar to how streaming platforms tag content for recommendations.

- ➡ Live Demo: https://huggingface.co/spaces/arjun-varma/tmdb-genre-classifier
- ➡ Model Hub: https://huggingface.co/arjun-varma/tmdb-genre-classifier

---

## 🧠 Problem: Why Multi-Label Classification?

Movie genres are **not mutually exclusive**:

| Plot Summary | Correct Genres |
|-------------|----------------|
| Soldier returns from war, struggling with trauma | Drama, War |
| AI becomes sentient and turns against creators | Sci-Fi, Thriller |
| A musician finds love on tour | Music, Romance |

A single-label classifier can assign only one genre per movie, so it fails here.
Multi-label learning predicts **all genres that apply simultaneously**.

This creates challenges:
- Soft labels
- Ambiguity
- Genre co-occurrence patterns
- Long-tail imbalance (Documentary vs Thriller vs Music)
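
To make the multi-label target concrete, the sketch below encodes the three example rows from the table above as a binary indicator matrix with one column per genre. It is plain Python for illustration; the helper name `to_indicator_matrix` is hypothetical and not taken from the project code.

```python
def to_indicator_matrix(label_sets):
    """Return (classes, rows) where rows[i][j] is 1 if sample i has classes[j]."""
    # Sorted vocabulary of every genre seen in the data.
    classes = sorted({genre for labels in label_sets for genre in labels})
    column = {genre: j for j, genre in enumerate(classes)}

    rows = []
    for labels in label_sets:
        row = [0] * len(classes)
        for genre in labels:
            row[column[genre]] = 1  # several columns can be 1 in the same row
        rows.append(row)
    return classes, rows


# Genre sets from the example table above.
genres = [["Drama", "War"], ["Sci-Fi", "Thriller"], ["Music", "Romance"]]
classes, Y = to_indicator_matrix(genres)

print(classes)
for row in Y:
    print(row)
```

Each row can contain several ones, which is what separates this target from a single-label one. scikit-learn's `MultiLabelBinarizer` produces the same kind of matrix.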

---

## 🧱 Architecture: Serverless ML Pipeline

No AWS SageMaker, no GCP Vertex AI.
**Infrastructure cost = $0**

---

## ⚙️ Model: Why TF-IDF + Linear SVC Instead of Transformers?

| Choice | Reason |
|--------|-------|
| Transformers | Expensive & slow for nightly retraining |
| Neural Networks | Need GPUs / infra |
| Logistic Regression | High precision, low recall |
| **Linear SVC + TF-IDF** | Fast, scalable, interpretable 👈 Best for pipeline |

The biggest improvement came from switching classifiers:
- Logistic Regression predicted almost nothing → it stayed "safe" and rarely assigned any genre
- Linear SVC learned margin-based boundaries → better multi-genre recall
- Applying a sigmoid to the SVC scores, then a threshold → configurable precision/recall trade-off
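
A minimal sketch of this setup in scikit-learn is shown below. It makes several assumptions that are not confirmed by the project: the toy texts stand in for TMDB descriptions, `OneVsRestClassifier` is one common way to train one `LinearSVC` per genre, and `predict_genres` is a hypothetical helper. Note that the sigmoid of an SVC margin is a score in (0, 1), not a calibrated probability.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data for illustration only (assumption: real training uses TMDB descriptions).
texts = [
    "A soldier returns from war and struggles with trauma",
    "A war veteran fights his memories of the battlefield",
    "An AI becomes sentient and turns against its creators",
    "A rogue robot hunts the scientists who built it",
    "A musician finds love while on tour",
    "A singer falls for her guitarist during a concert tour",
]
genres = [
    ["Drama", "War"], ["Drama", "War"],
    ["Sci-Fi", "Thriller"], ["Sci-Fi", "Thriller"],
    ["Music", "Romance"], ["Music", "Romance"],
]

# Encode genre sets as a 0/1 indicator matrix, one column per genre.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)

# One linear SVC per genre on top of TF-IDF features.
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(texts, Y)


def predict_genres(text, threshold=0.25):
    """Return {genre: score} for every genre whose sigmoid score passes the threshold."""
    margins = model.decision_function([text])[0]  # signed distance per genre
    scores = 1.0 / (1.0 + np.exp(-margins))       # squash margins into (0, 1)
    return {g: float(s) for g, s in zip(mlb.classes_, scores) if s >= threshold}


print(predict_genres("A soldier falls in love with a singer during the war"))
```

Lowering `threshold` returns more genres per movie (higher recall); raising it returns fewer (higher precision).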

---

## 📊 Performance Metrics

| Model | Precision_micro | Recall_micro | F1_macro | Result |
|------|----------------|--------------|----------|--------|
| Logistic Regression | 0.83 | 0.006 | ~0.03 | Almost no predictions |
| **Linear SVC + threshold 0.25** | 0.16 | **0.99** | **0.27** | Usable predictions |

Interpretation:
- Recall of 0.99 means the model flags almost every true genre; precision of 0.16 means it also flags many incorrect ones at this threshold
- The threshold lets *each application choose its own precision/recall balance*

If this model were powering **recommendations**, the choice of threshold would matter.
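
The sketch below shows how the three metrics in the table are computed and how they move with the threshold. The scores and labels are made up for illustration and are unrelated to the reported results; all function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return [tp, fp, fn] per label (column) for two 0/1 matrices."""
    counts = [[0, 0, 0] for _ in y_true[0]]
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                counts[j][0] += 1  # true positive
            elif p:
                counts[j][1] += 1  # false positive
            elif t:
                counts[j][2] += 1  # false negative
    return counts


def micro_precision_recall(counts):
    """Pool counts over all labels, then compute precision and recall once."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def macro_f1(counts):
    """Compute F1 per label, then average; labels with no support score 0."""
    f1_scores = []
    for tp, fp, fn in counts:
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def apply_threshold(scores, threshold):
    return [[1 if s >= threshold else 0 for s in row] for row in scores]


# Made-up data: 3 movies x 3 genres.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
scores = [[0.9, 0.4, 0.3], [0.3, 0.6, 0.3], [0.7, 0.45, 0.2]]

results = {}
for threshold in (0.25, 0.5):
    counts = confusion_counts(y_true, apply_threshold(scores, threshold))
    precision, recall = micro_precision_recall(counts)
    results[threshold] = (precision, recall, macro_f1(counts))
    print(f"threshold={threshold}: precision={precision:.2f} "
          f"recall={recall:.2f} f1_macro={results[threshold][2]:.2f}")
```

On this toy data the lower threshold gives perfect recall at half the precision, which mirrors the direction of the trade-off in the table.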

---

## 🧪 Testing

This project includes:
- Unit tests for vectorization & data transformation
- Mocked API tests for dataset ingestion
- End-to-end pipeline test verifying artifacts & metrics

Tools used:
- `pytest`
- `monkeypatch` (pytest fixture)
- `tmp_path` (pytest fixture)
- GitHub CI

These tests demonstrate **reliability in automation-focused ML environments**.
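
As an illustration of the mocked-ingestion style of test, here is a small sketch using the `monkeypatch` and `tmp_path` fixtures. `TMDBClient`, `fetch_movies`, and `ingest` are hypothetical stand-ins invented for this example; the project's real module and function names may differ.

```python
import json


class TMDBClient:
    """Hypothetical stand-in for the project's TMDB API wrapper."""

    def fetch_movies(self):
        # The real method would call the TMDB API; unit tests must never reach it.
        raise RuntimeError("network access is not allowed in unit tests")


def ingest(client, out_dir):
    """Fetch movies and write them to out_dir/movies.json; return the file path."""
    path = out_dir / "movies.json"
    path.write_text(json.dumps(client.fetch_movies()))
    return path


def test_ingest_writes_mocked_rows(monkeypatch, tmp_path):
    fake_rows = [
        {"overview": "A musician finds love on tour", "genres": ["Music", "Romance"]}
    ]
    # Replace the network call with a canned response for this test only.
    monkeypatch.setattr(TMDBClient, "fetch_movies", lambda self: fake_rows)

    # tmp_path is a fresh temporary directory provided by pytest.
    path = ingest(TMDBClient(), tmp_path)

    assert path.exists()
    assert json.loads(path.read_text()) == fake_rows
```

Saved in a file such as `test_ingest.py` (name assumed), this runs with `pytest` and needs neither an API key nor network access.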

---

## 🖥 Demo & Integration

| Component | Link |
|----------|------|
| 🔥 Live App (HF Space) | [link](https://arjun-varma-tmdb-genre-demo.hf.space/?__theme=system&deep_link=uuDed8RzLJI) |
| 📁 GitHub repo | [link](https://github.com/ArjunXvarma/Serverless-ML-pipeline.git) |

The model provides:
- ⭐ Ranked genre probabilities
- ⭐ Adjustable confidence threshold
- ⭐ Real-time inference
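
The ranked, thresholded output could be produced along the lines of the sketch below. The function `rank_genres` and the example margins are hypothetical, and the demo app's actual code may differ.

```python
import math


def rank_genres(margins, threshold=0.25):
    """Turn raw classifier margins into (genre, score) pairs, highest score first."""
    scored = {g: 1.0 / (1.0 + math.exp(-m)) for g, m in margins.items()}
    kept = [(g, s) for g, s in scored.items() if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up margins for one movie description.
margins = {"Drama": 1.2, "War": 0.4, "Comedy": -2.0, "Romance": -0.5}

for genre, score in rank_genres(margins):
    print(f"{genre}: {score:.2f}")
```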

---

## 🚀 Future Enhancements

| Idea | Value |
|-----|------|
| Compare against a MiniLM transformer | Benchmark credibility |
| Add FastAPI inference service | Deployable microservice |
| Visualize confidence & confusion | Explainable AI |

---

## ✍ Author

**Arjun Varma**
Machine Learning Engineer & Systems Developer
Designed for real-world ML infrastructure readiness.

---