---
license: apache-2.0
tags:
- intrusion-detection
- host-based-ids
- adfa-ld
- distilbert
- sequence-classification
- security
- cybersecurity
- binary-classification
datasets:
- ADFA-LD
model-index:
- name: distilbert-base-uncased-hids-adfa
  results:
  - task:
      type: text-classification
      name: Host-based Intrusion Detection
    dataset:
      name: ADFA-LD
      type: custom
    metrics:
    - type: accuracy
      value: 0.9403
    - type: f1
      value: 0.9450
    - type: precision
      value: 0.9245
    - type: recall
      value: 0.9664
    - type: auc
      value: 0.9630
---

# DistilBERT for Host-based Intrusion Detection System (HIDS)

This model is a DistilBERT model fine-tuned for binary classification of system call sequences, detecting intrusions in the ADFA-LD dataset. Hyperparameters were tuned to maximize validation performance for host-based intrusion detection.

## Model Details

### Base Model
- **Architecture**: DistilBERT (`DistilBertForSequenceClassification`)
- **Base Model**: `distilbert-base-uncased`
- **Task**: Binary sequence classification (Normal vs. Attack)
- **Number of Labels**: 2

### Training Configuration
- **Training Epochs**: 8
- **Batch Size**: 32
- **Learning Rate**: 2e-05
- **Weight Decay**: 0.0
- **Warmup Ratio**: 0.1
- **Optimizer**: AdamW
- **Scheduler**: LinearLR
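
For reference, a minimal sketch of this configuration with the Hugging Face `Trainer` API (illustrative only; `output_dir` and the dataset objects are placeholders, not the original training script):

```python
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Mirrors the configuration listed above; AdamW and a linear
# schedule with warmup are the Trainer defaults.
training_args = TrainingArguments(
    output_dir="distilbert-hids-adfa",  # placeholder path
    num_train_epochs=8,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.0,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```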

### Dataset
- **Dataset**: ADFA-LD (Australian Defence Force Academy Linux Dataset)
- **Preprocessing**: 18-gram system call sequences

## Performance

### Validation Metrics
- **Accuracy**: 94.03%
- **F1 Score**: 94.50%
- **Precision**: 92.45%
- **Recall**: 96.64%
- **AUC-ROC**: 96.30%
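
As a consistency check, F1 is the harmonic mean of precision and recall: 2 × 0.9245 × 0.9664 / (0.9245 + 0.9664) ≈ 0.9450, matching the reported value.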

## Usage

You can use this model directly with a pipeline for text classification:

```python
>>> from transformers import pipeline

>>> classifier = pipeline('text-classification', model='salsazufar/distilbert-base-hids-adfa')
>>> classifier("1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18", top_k=None)
[{'label': 'LABEL_0', 'score': 0.9876},
 {'label': 'LABEL_1', 'score': 0.0124}]
```
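
Here `LABEL_0` corresponds to Normal and `LABEL_1` to Attack, matching the class order used in the PyTorch example below. Note that `top_k=None` is needed to get scores for both labels; by default the pipeline returns only the top label.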

Here is how to use this model to classify a system call sequence in PyTorch:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('salsazufar/distilbert-base-hids-adfa')
model = AutoModelForSequenceClassification.from_pretrained('salsazufar/distilbert-base-hids-adfa')

# Prepare input (18-gram system call sequence)
text = "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18"
encoded_input = tokenizer(text, return_tensors='pt', padding='max_length', truncation=True, max_length=20)

# Forward pass
with torch.no_grad():
    output = model(**encoded_input)
    logits = output.logits
    probabilities = torch.softmax(logits, dim=-1)
    predicted_class = torch.argmax(logits, dim=-1).item()

# Interpret results
class_names = ["Normal", "Attack"]
print(f"Predicted class: {class_names[predicted_class]}")
print(f"Confidence: {probabilities[0][predicted_class].item():.4f}")
print(f"Probabilities: Normal={probabilities[0][0].item():.4f}, Attack={probabilities[0][1].item():.4f}")
```
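
Note that `max_length=20` leaves room for the 18 system call tokens plus the `[CLS]` and `[SEP]` special tokens (assuming each numeric ID maps to a single token in the `distilbert-base-uncased` vocabulary).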

### Data Preprocessing

This model expects input in 18-gram format. If you have raw system call traces, you need to:

1. Extract system calls from the trace files
2. Convert them to n-grams (n=18) with a sliding window
3. Format each n-gram as a space-separated string
4. Ensure sequences are exactly 18 tokens (pad or truncate if necessary; see the sketch after the example below)

Example preprocessing pipeline:

```python
def create_ngrams(trace, n=18):
    """Convert a system call trace to overlapping n-grams."""
    ngrams = []
    for i in range(len(trace) - n + 1):
        ngram = trace[i:i + n]
        ngrams.append(" ".join(map(str, ngram)))
    return ngrams
```
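
`create_ngrams` covers steps 2 and 3 but yields nothing for traces shorter than 18 calls. A minimal sketch of step 4, assuming integer system call IDs and `0` as a hypothetical padding ID (both assumptions, not part of the original pipeline):

```python
def pad_or_truncate(trace, n=18, pad_id=0):
    """Force a trace to exactly n system calls, then format as a string."""
    if len(trace) >= n:
        trace = trace[:n]                            # truncate long traces
    else:
        trace = trace + [pad_id] * (n - len(trace))  # pad short traces
    return " ".join(map(str, trace))

# Usage: short traces are padded; longer ones become sliding windows
short_trace = [6, 6, 63, 42, 120]
print(pad_or_truncate(short_trace))
# "6 6 63 42 120 0 0 0 0 0 0 0 0 0 0 0 0 0"
```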

### Limitations and Considerations

1. **Domain Specific**: This model is trained specifically on the ADFA-LD dataset and may not generalize well to other system call datasets without retraining.

2. **Input Format**: The model expects 18-gram sequences; raw system calls must be preprocessed accordingly.

3. **Binary Classification**: The model only distinguishes between "Normal" and "Attack" classes; it does not identify specific attack types.

### BibTeX entry and citation info

```bibtex
@misc{distilbert-hids-adfa,
  title={DistilBERT for Host-based Intrusion Detection on ADFA-LD Dataset},
  author={salsazufar},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/salsazufar/distilbert-base-hids-adfa}}
}
```

## References

- ADFA-LD Dataset: [ADFA-LD: An Anomaly Detection Dataset for Linux-based Host Intrusion Detection Systems](https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-LD-Dataset/)
- DistilBERT: [DistilBERT, a distilled version of BERT](https://arxiv.org/abs/1910.01108)

## License

This model is licensed under the Apache 2.0 license.