# IndoHoaxDetector
A comprehensive machine learning system for detecting hoax-style news articles written in Indonesian. This project provides multiple models trained on linguistic features of Indonesian news to identify articles written in a style typical of hoaxes or fake news. The system analyzes writing patterns, sensationalism, and other stylistic indicators rather than checking factual accuracy.
## Features
- Multiple Models: Logistic Regression, Linear SVM, Random Forest, Naive Bayes, and IndoBERT
- High Accuracy: Best model (IndoBERT) achieves 99.89% accuracy on test data
- Web Interface: Interactive Gradio app for easy prediction
- Lightweight Options: TF-IDF based models for fast inference
- Transformer Support: IndoBERT for state-of-the-art performance
## Model Comparison
All models were trained and evaluated on a dataset of 62,972 Indonesian news articles, with 12,595 samples (a stratified 20% split) held out for testing; a sketch of how the macro-averaged metrics can be computed follows the table:
| Model | Accuracy | Precision (Macro) | Recall (Macro) | F1 (Macro) | Type |
|---|---|---|---|---|---|
| IndoBERT | 0.9989 | 0.9989 | 0.9989 | 0.9989 | Transformer |
| Linear SVM | 0.9819 | 0.9820 | 0.9817 | 0.9818 | TF-IDF + SVM |
| Logistic Regression | 0.9782 | 0.9787 | 0.9777 | 0.9781 | TF-IDF + LR |
| Random Forest | 0.9765 | 0.9768 | 0.9760 | 0.9764 | TF-IDF + RF |
| Multinomial Naive Bayes | 0.9398 | 0.9414 | 0.9381 | 0.9393 | TF-IDF + NB |
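For reference, a minimal sketch of how macro-averaged metrics like those above can be computed with scikit-learn; `y_test` and `y_pred` are placeholders, not variables taken from this project's code:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholders: replace with the true test labels and model predictions (0 = FAKTA, 1 = HOAX)
y_test = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="macro")
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```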
## Model Details
### IndoBERT (Recommended)
- Type: Fine-tuned IndoBenchmark/indobert-base-p1
- Training: 3 epochs, batch_size=16, learning_rate=2e-5, max_length=128 (see the fine-tuning sketch after this list)
- Accuracy: 99.89%
- Best For: Highest accuracy, contextual understanding
- Requirements: transformers, torch (GPU recommended)
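The fine-tuning script itself is not reproduced in this README, but a minimal sketch matching the hyperparameters above, using the Hugging Face Trainer API, could look like the following; the CSV file names and the `text`/`label` column names are illustrative assumptions:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Illustrative file and column names: a CSV with "text" and "label" (0 = FAKTA, 1 = HOAX)
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

model_name = "indobenchmark/indobert-base-p1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    # max_length=128; padded so the default collator can batch without a tokenizer
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="indobert_model",
    num_train_epochs=3,              # 3 epochs
    per_device_train_batch_size=16,  # batch_size=16
    learning_rate=2e-5,              # learning_rate=2e-5
)

trainer = Trainer(model=model, args=args, train_dataset=dataset["train"])
trainer.train()
trainer.save_model("indobert_model")  # directory loadable by the API example below
```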
### TF-IDF Based Models
- Vectorizer: TF-IDF (unigrams, max_features=5000, sublinear_tf=True)
- Preprocessing: lowercasing, URL removal, punctuation removal, Indonesian stopword removal, Sastrawi stemming (see the sketch after this list)
- Logistic Regression: L2 regularization, max_iter=1000 (97.82% accuracy)
- Linear SVM: C=1.0, max_iter=10000 (98.19% accuracy)
- Random Forest: 100 trees, random_state=42 (97.65% accuracy)
- Naive Bayes: Multinomial, alpha=1.0 (93.98% accuracy)
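A minimal sketch of this pipeline, assuming the Sastrawi library for stopword removal and stemming; the exact preprocessing order, stopword list, and training code may differ from what was actually used:

```python
import pickle
import re

from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

stemmer = StemmerFactory().create_stemmer()
stopword_remover = StopWordRemoverFactory().create_stop_word_remover()

def preprocess(text):
    text = text.lower()                         # lowercase
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # strip punctuation and digits
    text = stopword_remover.remove(text)        # Indonesian stopwords
    return stemmer.stem(text)                   # Sastrawi stemming

# Placeholder corpus; in training this would be the full labeled dataset
texts = ["Contoh berita hoaks yang sensasional ...", "Contoh berita faktual ..."]
labels = [1, 0]  # 1 = HOAX, 0 = FAKTA

vectorizer = TfidfVectorizer(ngram_range=(1, 1), max_features=5000, sublinear_tf=True)
X = vectorizer.fit_transform([preprocess(t) for t in texts])

clf = LogisticRegression(penalty="l2", max_iter=1000)  # one of the four TF-IDF classifiers
clf.fit(X, labels)

# Save artifacts under the file names used in the API example below
with open("logreg_model.pkl", "wb") as f:
    pickle.dump(clf, f)
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
```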
## Usage
### Web Interface
Visit the Hugging Face Space to use the interactive app:
- Select a model from the dropdown
- Enter Indonesian news text
- Click "Detect"
- View the prediction and confidence
### Local Usage
To run locally:

```bash
pip install gradio scikit-learn transformers torch
python app.py
```
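The repository's app.py is not reproduced here, but a minimal sketch of a Gradio app serving one of the TF-IDF models (file names taken from the API example below) might look like this; the real app additionally offers the model dropdown described above:

```python
import pickle

import gradio as gr

# Illustrative: load the logistic regression model and its TF-IDF vectorizer
with open("tfidf_vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("logreg_model.pkl", "rb") as f:
    model = pickle.load(f)

def detect(text):
    X = vectorizer.transform([text])
    proba = model.predict_proba(X)[0]  # [P(FAKTA), P(HOAX)]
    return {"FAKTA": float(proba[0]), "HOAX": float(proba[1])}

demo = gr.Interface(
    fn=detect,
    inputs=gr.Textbox(lines=8, label="Indonesian news text"),
    outputs=gr.Label(label="Prediction"),
    title="IndoHoaxDetector",
)

if __name__ == "__main__":
    demo.launch()
```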
### API Usage
Load and use models programmatically:
```python
import pickle
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# For TF-IDF models (Logistic Regression, SVM, RF, NB)
with open('logreg_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Load vectorizer
with open('tfidf_vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)

text = "Your Indonesian news text here"
X = vectorizer.transform([text])
prediction = model.predict(X)  # 0 for FAKTA, 1 for HOAX

# For IndoBERT
tokenizer = AutoTokenizer.from_pretrained('indobert_model')
model = AutoModelForSequenceClassification.from_pretrained('indobert_model')
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=-1).item()
```
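The web interface also reports a confidence score. Continuing from the IndoBERT snippet above, the logits can be converted to probabilities with a softmax; for the scikit-learn models, `predict_proba` is available on Logistic Regression, Random Forest, and Naive Bayes, while a linear SVM typically exposes only a decision function rather than probabilities:

```python
# Continuing from the snippet above: convert IndoBERT logits to class probabilities.
# Label order assumed to match the TF-IDF convention (0 = FAKTA, 1 = HOAX).
labels = ["FAKTA", "HOAX"]
probs = torch.softmax(outputs.logits, dim=-1)[0]
print(f"{labels[prediction]} ({probs[prediction].item():.2%} confidence)")
```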
## Training Data
- Source: TurnBackHoax fact-checks, Kompas/CekFakta.com articles
- Size: 62,972 labeled articles (balanced HOAX/FAKTA)
- Preprocessing: Indonesian text normalization, stemming, stopword removal
- Split: 80% train, 20% test (stratified)
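For reference, a stratified 80/20 split like the one described above can be produced with scikit-learn; the CSV file and column names are illustrative assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative: assumes a CSV with "text" and "label" columns (0 = FAKTA, 1 = HOAX)
df = pd.read_csv("dataset.csv")

train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["text"], df["label"],
    test_size=0.20,          # 20% held-out test set
    stratify=df["label"],    # preserve the HOAX/FAKTA balance in both splits
    random_state=42,         # fixed seed for reproducibility (illustrative)
)
```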
## Limitations
- Stylistic Analysis Only: Detects hoax-like writing style, not factual accuracy
- Language Specific: Trained on Indonesian news patterns
- Domain Shift: May perform differently on social media vs. formal news
- False Positives/Negatives: Sensationally written but legitimate news may be flagged as hoax, and plainly written hoaxes may be missed
- Transformer Requirements: IndoBERT benefits from a GPU for training and fast inference; CPU inference works but is slower
## Ethical Considerations
- Responsible Use: Augment human fact-checkers, not replace them
- Bias Awareness: Training data may reflect source biases
- Transparency: All models and code are open-source
- Misuse Prevention: Clearly labeled as stylistic detector, not truth verifier
## Contributing
Contributions welcome! Open issues or PRs for improvements.
## License
MIT License - see LICENSE file for details
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference