---
license: apache-2.0
datasets:
- custom
language:
- en
metrics:
- f1
- accuracy
- precision
- recall
base_model: distilbert-base-uncased
pipeline_tag: text-classification
library_name: transformers
tags:
- cybersecurity
- vulnerability-detection
- text-classification
- dursgo
---

# durs-llm-web-scanner


## Model Description

`durs-llm-web-scanner` is a language model (LLM) fine-tuned to **classify various types of cybersecurity-related inputs**. It is trained to recognize and differentiate between:

- **Injection Payloads**: Such as XSS, SQLi, LFI, SSRF, etc.
- **Contextual Data**: Such as vulnerable parameter names (e.g., `user_id` for IDOR) or error patterns (e.g., SQL error messages).
- **Scanner Logic**: Textual descriptions of the workflow and decision-making processes of a security scanner.
- **Crawler Logic**: Descriptions of how to discover new endpoints, forms, and parameters.

The primary goal of this model is to act as the "brain" of an autonomous security scanning agent, enabling it to understand context and make strategic decisions.

**Project Status: Beta**

This model is still in the early stages of development (beta). Its dataset will be updated and enriched periodically to improve accuracy and detection coverage.

## How to Use

This model is designed to be used with the `transformers` library in Python.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Replace 'kangali/durs-llm-web-scanner' with your repo name if different
model_name = "kangali/durs-llm-web-scanner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example inputs
text_inputs = [
    "<script>alert(1)</script>",  # XSS Payload (representative example)
    "user_id",  # IDOR Context
    "A probe string is reflected inside an HTML tag...",  # Scanner Logic
    "This is a normal comment.",  # Benign
]

# Prediction: tokenize the batch and take the highest-scoring class per input
inputs = tokenizer(text_inputs, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_ids = torch.argmax(logits, dim=1)

# To get the string labels, use label_encoder.joblib from the repository,
# or map them from the id2label field in the model's config.json (as done here)
for i, text in enumerate(text_inputs):
    predicted_id = predicted_class_ids[i].item()
    label = model.config.id2label[predicted_id]
    print(f"Input: '{text[:50]}...' -> Predicted Label: {label}")
```

## Training Data

This model was trained on a custom-built dataset, `master_training_dataset.csv`, which contains **over 2,700 samples** extracted and synthesized from the codebase of [Dursgo](https://github.com/roomkangali/dursgo), an open-source web security scanner.

The dataset includes three main categories of data:

1. **Injection Payloads**: Concrete examples of attack payloads (XSS, SQLi, LFI, and the other scanner types in Dursgo).
2. **Contextual Definitions**: Keywords, parameter names, and error patterns that provide context for attacks (e.g., IDOR parameter names, SQL error messages).
3. **Scanner & Crawler Logic**: Textual descriptions of the workflows and decision rules used by the scanner and crawler (e.g., "If the 'url' parameter is found, test for SSRF").

## Training Procedure

This model is a `distilbert-base-uncased` checkpoint fine-tuned for 50 epochs using the `Trainer` class from the Hugging Face Transformers library.

The complete workflow for creating the dataset and retraining this model is kept in the project's GitHub repository, [Tunning-AI](https://github.com/roomkangali/Tunning-AI). The repository is currently private and is planned to be opened later.
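
As a rough illustration of the data-preparation step that would precede fine-tuning, the sketch below builds label mappings and a per-class train/evaluation split from a CSV using only the Python standard library. It is not the project's actual pipeline: the column names `text` and `label`, the label strings, the sample rows, and the helper functions `load_rows`, `build_label_maps`, and `stratified_split` are all assumptions made for this example.

```python
import csv
import io
import random
from collections import defaultdict

# Hypothetical rows standing in for master_training_dataset.csv.
# The column names ("text", "label") and the label strings are assumptions.
SAMPLE_CSV = """text,label
"' OR 1=1 --",sqli_payload
"1 UNION SELECT NULL",sqli_payload
user_id,idor_context
account_id,idor_context
"If the 'url' parameter is found, test for SSRF",scanner_logic
"Follow every form action discovered on the page",crawler_logic
"""


def load_rows(csv_text):
    """Parse CSV text into a list of (text, label) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["text"], row["label"]) for row in reader]


def build_label_maps(rows):
    """Assign a stable integer id to each label, in sorted order."""
    labels = sorted({label for _, label in rows})
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def stratified_split(rows, eval_fraction=0.2, seed=0):
    """Hold out a fraction of each class so every class stays in training."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    rng = random.Random(seed)
    train, evaluation = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_eval = int(len(group) * eval_fraction)
        if len(group) > 1:
            # Hold out at least one row when the class has more than one.
            n_eval = max(1, n_eval)
        evaluation.extend(group[:n_eval])
        train.extend(group[n_eval:])
    return train, evaluation


rows = load_rows(SAMPLE_CSV)
label2id, id2label = build_label_maps(rows)
train_rows, eval_rows = stratified_split(rows)
print(label2id)
print(len(train_rows), "training rows,", len(eval_rows), "evaluation rows")
```

Mappings like these could then be passed to `AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label2id), id2label=id2label, label2id=label2id)`, with `TrainingArguments(num_train_epochs=50)` matching the epoch count stated above. The real workflow in the Tunning-AI repository may differ.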