---
license: apache-2.0
tags:
- intrusion-detection
- host-based-ids
- adfa-ld
- distilbert
- sequence-classification
- security
- cybersecurity
- binary-classification
datasets:
- ADFA-LD
model-index:
- name: distilbert-base-uncased-hids-adfa
  results:
  - task:
      type: text-classification
      name: Host-based Intrusion Detection
    dataset:
      name: ADFA-LD
      type: custom
    metrics:
    - type: accuracy
      value: 0.9403
    - type: f1
      value: 0.9450
    - type: precision
      value: 0.9245
    - type: recall
      value: 0.9664
    - type: auc
      value: 0.9630
---

# DistilBERT for Host-based Intrusion Detection System (HIDS)

This model is a DistilBERT model fine-tuned for binary classification of system call sequences, detecting intrusions in the ADFA-LD dataset. Hyperparameters were tuned to maximize validation performance for host-based intrusion detection.

## Model Details

### Base Model
- **Architecture**: DistilBERT (`DistilBertForSequenceClassification`)
- **Base Model**: `distilbert-base-uncased`
- **Task**: Binary sequence classification (Normal vs. Attack)
- **Number of Labels**: 2

### Training Configuration
- **Training Epochs**: 8
- **Batch Size**: 32
- **Learning Rate**: 2e-05
- **Weight Decay**: 0.0
- **Warmup Ratio**: 0.1
- **Optimizer**: AdamW
- **Scheduler**: LinearLR
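
For reference, a minimal sketch of this configuration with the Hugging Face `Trainer` API (illustrative only; `output_dir` and the dataset objects are placeholders, not the original training script):

```python
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Mirrors the configuration listed above; AdamW and a linear
# schedule with warmup are the Trainer defaults.
training_args = TrainingArguments(
    output_dir="distilbert-hids-adfa",  # placeholder path
    num_train_epochs=8,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.0,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```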

### Dataset
- **Dataset**: ADFA-LD (Australian Defence Force Academy Linux Dataset)
- **Preprocessing**: 18-gram system call sequences

## Performance

### Validation Metrics
- **Accuracy**: 94.03%
- **F1 Score**: 94.50%
- **Precision**: 92.45%
- **Recall**: 96.64%
- **AUC-ROC**: 96.30%
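
As a consistency check, F1 is the harmonic mean of precision and recall: 2 × 0.9245 × 0.9664 / (0.9245 + 0.9664) ≈ 0.9450, matching the reported value.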

## Usage

You can use this model directly with a pipeline for text classification:

```python
>>> from transformers import pipeline

>>> classifier = pipeline('text-classification', model='salsazufar/distilbert-base-hids-adfa')
>>> classifier("1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18", top_k=None)
[{'label': 'LABEL_0', 'score': 0.9876},
 {'label': 'LABEL_1', 'score': 0.0124}]
```
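
Here `LABEL_0` corresponds to Normal and `LABEL_1` to Attack, matching the class order used in the PyTorch example below. Note that `top_k=None` is needed to get scores for both labels; by default the pipeline returns only the top label.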

Here is how to use this model to classify a system call sequence in PyTorch:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('salsazufar/distilbert-base-hids-adfa')
model = AutoModelForSequenceClassification.from_pretrained('salsazufar/distilbert-base-hids-adfa')

# Prepare input (18-gram system call sequence)
text = "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18"
encoded_input = tokenizer(text, return_tensors='pt', padding='max_length', truncation=True, max_length=20)

# Forward pass
with torch.no_grad():
    output = model(**encoded_input)
    logits = output.logits
    probabilities = torch.softmax(logits, dim=-1)
    predicted_class = torch.argmax(logits, dim=-1).item()

# Interpret results
class_names = ["Normal", "Attack"]
print(f"Predicted class: {class_names[predicted_class]}")
print(f"Confidence: {probabilities[0][predicted_class].item():.4f}")
print(f"Probabilities: Normal={probabilities[0][0].item():.4f}, Attack={probabilities[0][1].item():.4f}")
```
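
Note that `max_length=20` leaves room for the 18 system call tokens plus the `[CLS]` and `[SEP]` special tokens (assuming each numeric ID maps to a single token in the `distilbert-base-uncased` vocabulary).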

### Data Preprocessing

This model expects input in 18-gram format. If you have raw system call traces, you need to:

1. Extract system calls from the trace files
2. Convert them to n-grams (n=18) with a sliding window
3. Format each n-gram as a space-separated string
4. Ensure sequences are exactly 18 tokens (pad or truncate if necessary; see the sketch after the example below)

Example preprocessing pipeline:

```python
def create_ngrams(trace, n=18):
    """Convert a system call trace to overlapping n-grams."""
    ngrams = []
    for i in range(len(trace) - n + 1):
        ngram = trace[i:i + n]
        ngrams.append(" ".join(map(str, ngram)))
    return ngrams
```
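
`create_ngrams` covers steps 2 and 3 but yields nothing for traces shorter than 18 calls. A minimal sketch of step 4, assuming integer system call IDs and `0` as a hypothetical padding ID (both assumptions, not part of the original pipeline):

```python
def pad_or_truncate(trace, n=18, pad_id=0):
    """Force a trace to exactly n system calls, then format as a string."""
    if len(trace) >= n:
        trace = trace[:n]                            # truncate long traces
    else:
        trace = trace + [pad_id] * (n - len(trace))  # pad short traces
    return " ".join(map(str, trace))

# Usage: short traces are padded; longer ones become sliding windows
short_trace = [6, 6, 63, 42, 120]
print(pad_or_truncate(short_trace))
# "6 6 63 42 120 0 0 0 0 0 0 0 0 0 0 0 0 0"
```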

### Limitations and Considerations

1. **Domain Specific**: This model is trained specifically on the ADFA-LD dataset and may not generalize well to other system call datasets without retraining.

2. **Input Format**: The model expects 18-gram sequences; raw system calls must be preprocessed accordingly.

3. **Binary Classification**: The model only distinguishes between "Normal" and "Attack" classes; it does not identify specific attack types.

### BibTeX entry and citation info

```bibtex
@misc{distilbert-hids-adfa,
  title={DistilBERT for Host-based Intrusion Detection on ADFA-LD Dataset},
  author={salsazufar},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/salsazufar/distilbert-base-hids-adfa}}
}
```

## References

- ADFA-LD Dataset: [ADFA-LD: An Anomaly Detection Dataset for Linux-based Host Intrusion Detection Systems](https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-LD-Dataset/)
- DistilBERT: [DistilBERT, a distilled version of BERT](https://arxiv.org/abs/1910.01108)

## License

This model is licensed under the Apache 2.0 license.