---
license: apache-2.0
tags:
- intrusion-detection
- host-based-ids
- adfa-ld
- distilbert
- sequence-classification
- security
- cybersecurity
- binary-classification
datasets:
- ADFA-LD
model-index:
- name: distilbert-base-uncased-hids-adfa
  results:
  - task:
      type: text-classification
      name: Host-based Intrusion Detection
    dataset:
      name: ADFA-LD
      type: custom
    metrics:
    - type: accuracy
      value: 0.9403
    - type: f1
      value: 0.9450
    - type: precision
      value: 0.9245
    - type: recall
      value: 0.9664
    - type: auc
      value: 0.9630
---

# DistilBERT for Host-based Intrusion Detection System (HIDS)

This is a DistilBERT model fine-tuned for binary classification of system call sequences, used to detect intrusions in the ADFA-LD dataset. The training configuration listed below was selected through hyperparameter tuning.

## Model Details

### Base Model

- **Architecture**: DistilBERT (DistilBertForSequenceClassification)
- **Base Model**: `distilbert-base-uncased`
- **Task**: Binary Sequence Classification (Normal vs Attack)
- **Number of Labels**: 2

### Training Configuration

- **Training Epochs**: 8
- **Batch Size**: 32
- **Learning Rate**: 2e-05
- **Weight Decay**: 0.0
- **Warmup Ratio**: 0.1
- **Optimizer**: AdamW
- **Scheduler**: LinearLR

### Dataset

- **Dataset**: ADFA-LD (Australian Defence Force Academy Linux Dataset)
- **Preprocessing**: 18-gram sequences

## Performance

### Validation Metrics

- **Accuracy**: 94.03%
- **F1 Score**: 94.50%
- **Precision**: 92.45%
- **Recall**: 96.64%
- **AUC-ROC**: 96.30%

## Usage

You can use this model directly with a pipeline for text classification. Passing `top_k=None` returns the scores for both labels, where `LABEL_0` is Normal and `LABEL_1` is Attack:

```python
>>> from transformers import pipeline
>>> classifier = pipeline('text-classification', model='salsazufar/distilbert-base-hids-adfa')
>>> classifier("1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18", top_k=None)

[{'label': 'LABEL_0', 'score': 0.9876},
 {'label': 'LABEL_1', 'score': 0.0124}]
```

Here is how to use this model to get the classification of a given text in PyTorch:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('salsazufar/distilbert-base-hids-adfa')
model = AutoModelForSequenceClassification.from_pretrained('salsazufar/distilbert-base-hids-adfa')
model.eval()  # inference mode (disables dropout)

# Prepare input (18-gram system call sequence)
text = "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18"
encoded_input = tokenizer(
    text,
    return_tensors='pt',
    padding='max_length',
    truncation=True,
    max_length=20,
)

# Forward pass
with torch.no_grad():
    output = model(**encoded_input)

logits = output.logits
probabilities = torch.softmax(logits, dim=-1)
predicted_class = torch.argmax(logits, dim=-1).item()

# Interpret results (index 0 = Normal, index 1 = Attack)
class_names = ["Normal", "Attack"]
print(f"Predicted class: {class_names[predicted_class]}")
print(f"Confidence: {probabilities[0][predicted_class].item():.4f}")
print(f"Probabilities: Normal={probabilities[0][0].item():.4f}, Attack={probabilities[0][1].item():.4f}")
```

### Data Preprocessing

This model expects input in 18-gram format. If you have raw system call traces, you need to:

1. Extract system calls from trace files
2. Convert to n-grams (n=18)
3. Format each n-gram as a space-separated string
4. Ensure sequences are exactly 18 tokens (pad or truncate if necessary)

Example preprocessing pipeline:

```python
def create_ngrams(trace, n=18):
    """Convert a system call trace (a sequence of call numbers) to n-gram strings.

    Returns one space-separated string per sliding window of length n.
    A trace shorter than n yields an empty list.
    """
    ngrams = []
    for i in range(len(trace) - n + 1):
        ngram = trace[i:i+n]
        ngrams.append(" ".join(map(str, ngram)))
    return ngrams
```

### Limitations and Considerations

1. **Domain Specific**: This model is trained specifically on the ADFA-LD dataset and may not generalize well to other system call datasets without retraining.
2. **Input Format**: The model expects 18-gram sequences. Raw system calls must be preprocessed accordingly.
3. **Binary Classification**: The model only distinguishes between the "Normal" and "Attack" classes. It does not classify specific attack types.
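To make the input-format requirement in item 2 concrete, here is a minimal sketch that turns the text of one raw trace into 18-gram strings. It assumes the trace is a whitespace-separated list of integer system call numbers. The helper names `parse_trace` and `trace_to_windows` are hypothetical, and the handling of traces shorter than 18 calls is an assumption, since this card does not state how short traces were treated during training.

```python
from typing import List


def parse_trace(text: str) -> List[int]:
    """Parse a whitespace-separated string of system call numbers."""
    return [int(tok) for tok in text.split()]


def trace_to_windows(trace: List[int], n: int = 18) -> List[str]:
    """Slide a window of length n over the trace, one string per window."""
    if len(trace) < n:
        # Assumption: a short trace is passed through unchanged and left to
        # the tokenizer's padding. The model card does not specify this.
        return [" ".join(map(str, trace))] if trace else []
    return [
        " ".join(map(str, trace[i:i + n]))
        for i in range(len(trace) - n + 1)
    ]


# Illustrative trace of 20 calls (made-up values, not taken from ADFA-LD)
raw = "6 6 63 6 42 120 6 195 120 6 6 114 114 1 1 252 252 252 1 1"
windows = trace_to_windows(parse_trace(raw))
print(len(windows))   # number of 18-gram windows
print(windows[0])     # first window, ready to pass to the tokenizer
```

Each string in `windows` can be passed to the tokenizer or pipeline as a separate input. How per-window predictions are combined into a verdict for the whole trace is left to the caller.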

### BibTeX entry and citation info

```bibtex
@misc{distilbert-hids-adfa,
  title={DistilBERT for Host-based Intrusion Detection on ADFA-LD Dataset},
  author={salsazufar},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/salsazufar/distilbert-base-hids-adfa}}
}
```

## References

- ADFA-LD Dataset: [ADFA-LD: An Anomaly Detection Dataset for Linux-based Host Intrusion Detection Systems](https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-LD-Dataset/)
- DistilBERT: [DistilBERT, a distilled version of BERT](https://arxiv.org/abs/1910.01108)

## License

This model is licensed under the Apache 2.0 license.