---
license: apache-2.0
tags:
- intrusion-detection
- host-based-ids
- adfa-ld
- distilbert
- sequence-classification
- security
- cybersecurity
- binary-classification
datasets:
- ADFA-LD
model-index:
- name: distilbert-base-uncased-hids-adfa
  results:
  - task:
      type: text-classification
      name: Host-based Intrusion Detection
    dataset:
      name: ADFA-LD
      type: custom
    metrics:
    - type: accuracy
      value: 0.9403
    - type: f1
      value: 0.9450
    - type: precision
      value: 0.9245
    - type: recall
      value: 0.9664
    - type: auc
      value: 0.9630
---

# DistilBERT for Host-based Intrusion Detection System (HIDS)

This model is `distilbert-base-uncased` fine-tuned for binary classification of system call sequences from the ADFA-LD dataset, labeling each sequence as normal or attack for host-based intrusion detection. The training configuration listed below was selected through hyperparameter tuning.

## Model Details

### Base Model
- **Architecture**: DistilBERT (DistilBertForSequenceClassification)
- **Base Model**: `distilbert-base-uncased`
- **Task**: Binary Sequence Classification (Normal vs Attack)
- **Number of Labels**: 2

### Training Configuration
- **Training Epochs**: 8
- **Batch Size**: 32
- **Learning Rate**: 2e-05
- **Weight Decay**: 0.0
- **Warmup Ratio**: 0.1
- **Optimizer**: AdamW
- **Scheduler**: LinearLR
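
As a rough illustration of how the learning rate and warmup ratio above interact, the sketch below computes the learning rate at a given optimizer step. It assumes the linear-warmup-then-linear-decay shape of `get_linear_schedule_with_warmup` in `transformers`, which this card does not confirm; `lr_at_step` and the step count of 1,000 are hypothetical.

```python
def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step` under linear warmup followed by linear decay.

    Assumption: this mirrors the common Hugging Face linear schedule; the
    exact scheduler used for this model is only described as "LinearLR".
    """
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run of 1,000 optimizer steps (the real count depends on dataset size)
total = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total))
```

With a warmup ratio of 0.1, the first 10% of steps ramp the rate up to 2e-05, and the remaining steps bring it back down.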

### Dataset
- **Dataset**: ADFA-LD (Australian Defence Force Academy Linux Dataset)
- **Preprocessing**: System call traces split into 18-grams (sequences of 18 consecutive system calls)


## Performance

### Validation Metrics
- **Accuracy**: 94.03%
- **F1 Score**: 94.50%
- **Precision**: 92.45%
- **Recall**: 96.64%
- **AUC-ROC**: 96.30%
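
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the other two figures. The variable names below are illustrative.

```python
# Reported validation metrics from the list above
precision = 0.9245
recall = 0.9664

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 recomputed from precision and recall: {f1:.4f}")  # 0.9450
```

The recomputed value agrees with the reported F1 score to four decimal places.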

## Usage

You can use this model directly with a pipeline for text classification:

```python
>>> from transformers import pipeline

>>> classifier = pipeline('text-classification', model='salsazufar/distilbert-base-hids-adfa')
>>> # top_k=None returns the score for every label instead of only the top one
>>> classifier("1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18", top_k=None)

[{'label': 'LABEL_0',
  'score': 0.9876},
 {'label': 'LABEL_1',
  'score': 0.0124}]
```
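
The pipeline reports generic `LABEL_0` and `LABEL_1` names. A small helper can translate them, assuming `LABEL_0` is Normal and `LABEL_1` is Attack, which is the order used by `class_names` in the PyTorch example in this card. `LABEL_NAMES` and `name_scores` are hypothetical names, not part of `transformers`.

```python
# Assumed mapping, matching the class order Normal=0, Attack=1
LABEL_NAMES = {"LABEL_0": "Normal", "LABEL_1": "Attack"}


def name_scores(results):
    """Map pipeline output dicts to a {class name: score} dictionary."""
    return {LABEL_NAMES[r["label"]]: r["score"] for r in results}


# Example using the output shown above
results = [{"label": "LABEL_0", "score": 0.9876},
           {"label": "LABEL_1", "score": 0.0124}]
scores = name_scores(results)
print(scores)                        # {'Normal': 0.9876, 'Attack': 0.0124}
print(max(scores, key=scores.get))   # Normal
```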

To classify a sequence directly in PyTorch and inspect the class probabilities:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('salsazufar/distilbert-base-hids-adfa')
model = AutoModelForSequenceClassification.from_pretrained('salsazufar/distilbert-base-hids-adfa')

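# Prepare input (18-gram system call sequence)
text = "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18"
# max_length=20 leaves room for [CLS] and [SEP] around the 18 calls,
# assuming each system call number is tokenized as a single token
encoded_input = tokenizer(text, return_tensors='pt', padding='max_length', truncation=True, max_length=20)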

# Forward pass
with torch.no_grad():
    output = model(**encoded_input)
    logits = output.logits
    probabilities = torch.softmax(logits, dim=-1)
    predicted_class = torch.argmax(logits, dim=-1).item()

# Interpret results
class_names = ["Normal", "Attack"]
print(f"Predicted class: {class_names[predicted_class]}")
print(f"Confidence: {probabilities[0][predicted_class].item():.4f}")
print(f"Probabilities: Normal={probabilities[0][0].item():.4f}, Attack={probabilities[0][1].item():.4f}")
```

### Data Preprocessing

The model expects each input to be an 18-gram: 18 consecutive system call numbers joined by single spaces. To prepare raw system call traces:

1. Extract system calls from trace files
2. Convert to n-grams (n=18)
3. Format as space-separated string
4. Ensure sequences are exactly 18 tokens (pad or truncate if necessary)

Example preprocessing pipeline:

```python
def create_ngrams(trace, n=18):
    """Convert a system call trace into overlapping n-grams.

    Returns one space-separated string per sliding window of length n;
    a trace shorter than n yields an empty list.
    """
    return [
        " ".join(map(str, trace[i:i + n]))
        for i in range(len(trace) - n + 1)
    ]
```
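
To connect the steps above end to end, here is a minimal sketch that parses a raw trace and produces model-ready strings. It assumes a trace is whitespace-separated integer system call numbers; `parse_trace` and `trace_to_inputs` are hypothetical helpers, and the short trace is made up for illustration.

```python
def parse_trace(raw):
    """Parse whitespace-separated system call numbers into a list of ints."""
    return [int(token) for token in raw.split()]


def trace_to_inputs(raw, n=18):
    """Turn raw trace text into space-separated n-gram strings (stride 1)."""
    calls = parse_trace(raw)
    return [
        " ".join(str(c) for c in calls[i:i + n])
        for i in range(len(calls) - n + 1)
    ]


# Made-up trace of 20 calls -> 20 - 18 + 1 = 3 windows
raw_trace = "6 6 63 6 42 120 6 195 120 6 6 114 114 1 1 252 252 252 1 1"
inputs = trace_to_inputs(raw_trace)
print(len(inputs))   # 3
print(inputs[0])     # the first 18 calls as one string
```

Each resulting string can be passed to the tokenizer or pipeline as a separate input.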

### Limitations and Considerations

1. **Domain Specific**: This model is trained specifically on the ADFA-LD dataset and may not generalize well to other system call datasets without retraining.

2. **Input Format**: The model expects 18-gram sequences. Raw system calls must be preprocessed accordingly.

3. **Binary Classification**: The model only distinguishes between "Normal" and "Attack" classes. It does not classify specific attack types.
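
Because inputs that are not 18-grams fall outside what the model was trained on, a lightweight check before inference can catch malformed strings. This is a sketch under the assumption that every token is a non-negative integer system call number; `is_valid_window` is a hypothetical helper.

```python
def is_valid_window(text, n=18):
    """Return True if `text` is exactly n whitespace-separated digit-only tokens."""
    tokens = text.split()
    return len(tokens) == n and all(t.isdigit() for t in tokens)


print(is_valid_window("1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18"))  # True
print(is_valid_window("1 2 3"))                                          # False
print(is_valid_window("open read write " * 6))                           # False
```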

### BibTeX entry and citation info

```bibtex
@misc{distilbert-hids-adfa,
  title={DistilBERT for Host-based Intrusion Detection on ADFA-LD Dataset},
  author={salsazufar},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/salsazufar/distilbert-base-hids-adfa}}
}
```

## References

- ADFA-LD Dataset: [ADFA-LD: An Anomaly Detection Dataset for Linux-based Host Intrusion Detection Systems](https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-LD-Dataset/)
- DistilBERT: [DistilBERT, a distilled version of BERT](https://arxiv.org/abs/1910.01108)

## License

This model is licensed under the Apache 2.0 license.