IndoHoaxDetector

A machine learning system for detecting hoax-style news articles written in Indonesian. The project provides several models trained on linguistic features of Indonesian news to identify articles written in a style typical of hoaxes or fake news. The system analyzes writing patterns, sensationalism, and other stylistic indicators; it does not check factual accuracy.

Features

Multiple Models: Logistic Regression, Linear SVM, Random Forest, Naive Bayes, and IndoBERT
High Accuracy: Best model (IndoBERT) achieves 99.89% accuracy on test data
Web Interface: Interactive Gradio app for easy prediction
Lightweight Options: TF-IDF based models for fast inference
Transformer Support: IndoBERT for state-of-the-art performance

Model Comparison

All models were trained and evaluated on a dataset of 62,972 Indonesian news articles, using a stratified 80/20 split. The metrics below are computed on the 12,595 held-out test samples:

Model	Accuracy	Precision (Macro)	Recall (Macro)	F1 (Macro)	Type
IndoBERT	0.9989	0.9989	0.9989	0.9989	Transformer
Linear SVM	0.9819	0.9820	0.9817	0.9818	TF-IDF + SVM
Logistic Regression	0.9782	0.9787	0.9777	0.9781	TF-IDF + LR
Random Forest	0.9765	0.9768	0.9760	0.9764	TF-IDF + RF
Multinomial Naive Bayes	0.9398	0.9414	0.9381	0.9393	TF-IDF + NB
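
The macro-averaged columns give both classes equal weight regardless of class size. As a minimal sketch of how such figures can be computed from true and predicted labels (the function name `macro_metrics` and the toy labels are illustrative, not taken from the project code):

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but the true label was t
            fn[t] += 1  # an instance of t was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    # Macro averaging: unweighted mean of the per-class scores.
    n = len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels for illustration only (0 = FAKTA, 1 = HOAX).
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]
acc, prec, rec, f1 = macro_metrics(y_true, y_pred)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

On a balanced test set, macro and micro averages tend to be close, which is consistent with the near-identical columns in the table.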

Model Details

IndoBERT (Recommended)

Type: Fine-tuned IndoBenchmark/indobert-base-p1
Training: 3 epochs, batch_size=16, learning_rate=2e-5, max_length=128
Accuracy: 99.89%
Best For: Highest accuracy, contextual understanding
Requirements: transformers, torch (GPU recommended)

TF-IDF Based Models

Vectorizer: TF-IDF (unigrams, max_features=5000, sublinear_tf=True)
Preprocessing: Lowercasing, URL removal, punctuation removal, Indonesian stopword removal, Sastrawi stemming
Logistic Regression: L2 regularization, max_iter=1000 (97.82% accuracy)
Linear SVM: C=1.0, max_iter=10000 (98.19% accuracy)
Random Forest: 100 trees, random_state=42 (97.65% accuracy)
Naive Bayes: Multinomial, alpha=1.0 (93.98% accuracy)
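
The hyperparameters listed above map naturally onto scikit-learn estimators. The following is a sketch under that assumption, not the project's actual training script: the `build_models` helper, the dictionary keys, and the toy texts are hypothetical, and any setting not listed above is left at the scikit-learn default.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_models():
    """Build one TF-IDF + classifier pipeline per listed model."""

    def vectorizer():
        # Unigrams only, vocabulary capped at 5000 terms, 1 + log(tf) scaling.
        return TfidfVectorizer(ngram_range=(1, 1), max_features=5000, sublinear_tf=True)

    return {
        # L2 regularization is the scikit-learn default for LogisticRegression.
        "logreg": make_pipeline(vectorizer(), LogisticRegression(max_iter=1000)),
        "svm": make_pipeline(vectorizer(), LinearSVC(C=1.0, max_iter=10000)),
        "rf": make_pipeline(
            vectorizer(), RandomForestClassifier(n_estimators=100, random_state=42)
        ),
        "nb": make_pipeline(vectorizer(), MultinomialNB(alpha=1.0)),
    }


# Made-up, already-preprocessed texts for illustration (not from the dataset).
texts = [
    "viral sebar segera bahaya",
    "heboh viral sebar cepat",
    "pemerintah umum anggaran baru",
    "menteri resmi buka rapat anggaran",
]
labels = [1, 1, 0, 0]  # 1 = HOAX, 0 = FAKTA

models = build_models()
for name, pipeline in models.items():
    pipeline.fit(texts, labels)
    print(name, pipeline.predict(["viral sebar bahaya"]))
```

Wrapping the vectorizer and classifier in a single pipeline keeps the TF-IDF vocabulary fitted on training data only, which matters when evaluating on a held-out split.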

Usage

Web Interface

Visit the Hugging Face Space to use the interactive app:

Select a model from the dropdown
Enter Indonesian news text
Click "Detect"
View the prediction and confidence

Local Usage

To run locally:

pip install gradio scikit-learn transformers torch
python app.py

API Usage

Load and use models programmatically:

import pickle
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# For TF-IDF models (Logistic Regression, SVM, RF, NB)
with open('logreg_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Load vectorizer
with open('tfidf_vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)

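text = "Your Indonesian news text here"
# Note: the TF-IDF models were trained on preprocessed text (see "TF-IDF Based Models"),
# so apply the same preprocessing here unless the pickled vectorizer already does it.
X = vectorizer.transform([text])
prediction = model.predict(X)  # 0 for FAKTA, 1 for HOAX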

# For IndoBERT
tokenizer = AutoTokenizer.from_pretrained('indobert_model')
model = AutoModelForSequenceClassification.from_pretrained('indobert_model')

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=-1).item()
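
The web interface reports a confidence alongside the prediction. One common way to derive such a value from classifier logits is a softmax, sketched here in plain Python. This is an assumption about how a confidence could be computed, not a description of the app's code; the `ID2LABEL` mapping reuses the 0 = FAKTA, 1 = HOAX convention from the snippet above and is assumed to hold for IndoBERT as well, and the example logits are made up.

```python
import math

# Assumed label order (0 = FAKTA, 1 = HOAX), as in the TF-IDF example.
ID2LABEL = {0: "FAKTA", 1: "HOAX"}


def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Return the predicted label and its softmax probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[idx], probs[idx]


# With the IndoBERT snippet, the logits could be obtained as:
#   logits = outputs.logits[0].tolist()
example_logits = [-1.2, 2.3]  # made-up values for illustration
label, confidence = label_with_confidence(example_logits)
print(f"{label} ({confidence:.2%})")
```

For the scikit-learn models, LogisticRegression, RandomForestClassifier, and MultinomialNB expose `predict_proba`, while LinearSVC only provides `decision_function`, so its scores are margins rather than probabilities.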

Training Data

Source: TurnBackHoax fact-checks, Kompas/CekFakta.com articles
Size: 62,972 labeled articles (balanced HOAX/FAKTA)
Preprocessing: Indonesian text normalization, stemming, stopword removal
Split: 80% train, 20% test (stratified)
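
The preprocessing steps named in this README (lowercasing, URL removal, punctuation removal, stopword removal, stemming) can be sketched as below. The step order, the regular expression, the tiny `STOPWORDS` set, and the `preprocess` function are assumptions for illustration; the project's own pipeline may differ. Stemming is passed in as an optional callable so the sketch runs without Sastrawi installed.

```python
import re
import string

# Tiny illustrative stopword set; a real pipeline would use a full Indonesian list.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu", "untuk"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, stopwords=STOPWORDS, stem=None):
    """Lowercase, strip URLs and punctuation, drop stopwords, optionally stem."""
    text = text.lower()
    # URLs are removed before punctuation so that they are dropped whole.
    text = URL_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in stopwords]
    if stem is not None:
        tokens = [stem(tok) for tok in tokens]
    return " ".join(tokens)


# To plug in Sastrawi (requires `pip install Sastrawi`):
#   from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
#   stem = StemmerFactory().create_stemmer().stem

sample = "BREAKING!!! Vaksin ini BERBAHAYA, sebarkan ke semua orang: https://contoh.example/berita"
cleaned = preprocess(sample)
print(cleaned)
```

Text passed to the TF-IDF models at inference time should go through the same steps as the training data, otherwise the vectorizer will see tokens it never learned.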

Limitations

Stylistic Analysis Only: Detects hoax-like writing style, not factual accuracy
Language Specific: Trained on Indonesian news patterns
Domain Shift: May perform differently on social media vs. formal news
False Positives/Negatives: Sensational legitimate news may be flagged as hoax, and hoaxes written in a neutral style may be missed
Transformer Requirements: IndoBERT benefits from a GPU for training and inference; CPU inference works but is slower

Ethical Considerations

Responsible Use: Intended to augment human fact-checkers, not to replace them
Bias Awareness: Training data may reflect source biases
Transparency: All models and code are open-source
Misuse Prevention: Clearly labeled as stylistic detector, not truth verifier

Contributing

Contributions welcome! Open issues or PRs for improvements.

License

MIT License - see LICENSE file for details
