# OpenSOC: Self-Play SOC Triage Environment
An OpenEnv environment for training cybersecurity defender LLMs against an attacker LLM that auto-generates novel incidents. Built for the OpenEnv Hackathon, April 2026.
Humans cannot watch every alert in a Security Operations Center 24/7, and as stronger generative models start writing exploits and phishing lures at industrial scale, that gap only widens. OpenSOC is an environment where a defender LLM learns to triage attacks generated by another LLM in a self-play loop. The trick is RLVR: triage ground truth is computed by a deterministic, schema-side verifier from the structured incident parameters, never from any text the attacker writes, so neither side can hack the reward.
## Try it

| Link | What it is |
|---|---|
| HF Space → shivam2k3-opensoc-env.hf.space | Deployed env (Running). OpenEnv judge can hit `/reset` `/step` `/state` `/grade`. |
| Live /demo → shivam2k3-opensoc-env.hf.space/demo | Gradio "before vs after" UI. Click Next incident to compare baseline vs trained. |
| Trained model → `shivam2k3/opensoc-defender-grpo` | GRPO-trained Qwen2.5-3B-Instruct LoRA defender adapter. |
| Training notebook → `train_grpo.ipynb` | End-to-end SFT warm-start + GRPO curriculum using Unsloth + TRL. |
| Mini-blog → `docs/blog.md` | ~600-word write-up of the project. |
## Table of contents
- Architecture
- Why the reward cannot be hacked
- Action space and reward
- Run locally
- Run the training pipeline
- Headline results
- Deploy to Hugging Face Spaces
- Repo map
- Submission deliverables
## Build status

| Build artifact | Status |
|---|---|
| Pure-python env (`OpenSOCEnv`, FastAPI) | ✅ shipped |
| Verifier + plausibility checker | ✅ shipped, 17-test adversarial suite |
| Rubric (defender + attacker rewards) | ✅ shipped, anti-hack regression tests |
| 600-example SFT dataset (`data/sft_train.jsonl`) | ✅ shipped |
| 200-incident frozen hold-out (`data/holdout.jsonl`) | ✅ shipped |
| SFT warm-start adapter | ✅ trained → `opensoc-defender-grpo-sft` |
| GRPO curriculum (4 stages) | ✅ trained → adapters for each stage on HF |
| Final GRPO adapter | ✅ `shivam2k3/opensoc-defender-grpo` |
| GRPO training notebook (`train_grpo.ipynb`) | ✅ shipped (ran on HF Jupyter with Unsloth + TRL) |
| Gradio "before vs after" UI | ✅ live at `/demo` |
| Eval harness + plotters (`eval/`) | ✅ shipped |
| Pytest suite | ✅ 93 tests, all green |
| HF Space | ✅ live at `shivam2k3/opensoc-env` |
## Architecture

```mermaid
flowchart LR
    Defender[Defender LLM trainee]
    Attacker[Attacker LLM trainee]
    Env[OpenSOC FastAPI Environment]
    Verifier[Deterministic verifier + plausibility check]
    Defender -->|submit_triage| Env
    Attacker -->|craft_incident| Env
    Env -->|observation reward| Defender
    Env -->|attacker reward| Attacker
    Env --> Verifier
    Verifier -->|ground truth label| Env
```
An episode has exactly two turns: the attacker proposes incident params → the env validates them and materializes a SIEM-style alert + log window → the defender submits a triage action. The verifier computes the ground-truth action from the events alone and scores both sides; the attacker's free-text narrative is never read by the labeler.

In `defender_only` mode (used for SFT, eval, smoke tests, and the `/demo` UI) the env auto-generates the incident from `tasks/registry.py` and skips straight to the defender turn.
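For concreteness, here is a sketch of one self-play episode through the bundled Python client. The `craft_incident` / `submit_triage` payload shapes follow the action-space section below; the specific event fields, the seed/task plumbing, and the assumption that the default mode runs the two-role game are illustrative rather than the canonical API.

```python
# Illustrative two-turn self-play episode against a local server.
# Assumption: the default mode (i.e. not defender_only) runs the two-role game
# and exposes craft_incident on turn 1 and submit_triage on turn 2.
from client import OpenSOCClient

c = OpenSOCClient()
obs = c.reset(task="stage4_adversarial", seed=7)          # attacker's turn

attack = {
    "craft_incident": {
        "target_label": "quarantine_host",                # ignored by the verifier
        "category": "malware_execution",
        "events": [
            {"event_type": "lolbin_use",                  # example event; fields are hypothetical
             "fields": {"process": "certutil.exe", "host": "WS-042"},
             "timestamp": "2026-04-01T09:12:00Z",
             "log_id": "L1-0"},
        ],
        "narrative": "certutil pulled a payload from a mirror site",
    }
}
obs = c.step(attack, task="stage4_adversarial", seed=7)   # env materializes the alert

defense = {
    "submit_triage": {
        "action": "quarantine_host",
        "cited_log_id": "L1-0",
        "rationale": "certutil download-and-execute on a workstation",
    }
}
result = c.step(defense, task="stage4_adversarial", seed=7)
print(result)   # both roles are scored by the deterministic verifier
```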
## Why the reward cannot be hacked

- The verifier is a transparent rule set in `verifier.compute_ground_truth(params)`; the only inputs are the structured events. The attacker's `narrative` and even its self-claimed `target_label` are ignored.
- The plausibility checker (`verifier.check_plausibility(params)`) refuses incoherent stories, for example a "data exfiltration" claim with a purely-internal destination, or a `lolbin_use` event with no `process` field. The attacker's reward is gated on plausibility passing.
- Schema-violation incidents floor the attacker reward at -0.5, so trying to short-circuit pydantic's validators is strictly worse than playing along.
The anti-hack invariants are pinned in `tests/test_verifier.py` and `tests/test_rubric.py`.
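To make "transparent rule set" concrete, here is a minimal sketch of the kind of rule the labeler applies. Only `lolbin_use` and the five triage actions are taken from this README; the other event types and the mapping below are hypothetical stand-ins, not the shipped `compute_ground_truth`.

```python
# Minimal sketch of a schema-side labeling rule, not the shipped verifier.
# Event types other than lolbin_use are hypothetical; the real rule set covers
# every category in tasks/registry.py.
def compute_ground_truth(params: dict) -> str:
    """Label an incident from its structured events; the narrative is never read."""
    types = {event["event_type"] for event in params["events"]}

    if "outbound_beacon" in types:
        return "block_ip"            # periodic callbacks to an external IP
    if "lolbin_use" in types and "file_write" in types:
        return "quarantine_host"     # living-off-the-land execution chain on one host
    if not types - {"dns_lookup", "scheduled_task"}:
        return "dismiss"             # nothing but low-signal noise
    return "monitor"                 # ambiguous evidence: watch, don't act
```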
## Action space and reward

Tool names are deliberately non-reserved: there is no `reset`/`step`/`state`/`close` clash with the OpenEnv `MCPEnvironment` reserved-name list.
```yaml
action_space:
  craft_incident:
    target_label: dismiss | monitor | quarantine_host | block_ip | escalate
    category: malware_execution | c2_beacon | data_exfiltration | ...
    events: [ { event_type, fields, timestamp, log_id }, ... ]
    narrative: string   # ignored by the verifier
  submit_triage:
    action: <one of the five triage actions>
    cited_log_id: <id of the log line that drove the decision>
    rationale: short string
```
- Defender: +1 correct, -1 missed-malicious, -0.3 over-react on benign, -0.05 unnecessary escalate, +0.1 bonus for citing the right triggering log id, -0.1 floor for format violation.
- Attacker: +1 iff defender wrong AND incident plausible, -0.5 if schema validation fails, +0.2 novelty bonus, 0 for gibberish.

Full breakdown: `openenv.yaml` and `rubric.py`.
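Read as code, the defender side of that rubric looks roughly like the sketch below. It is a paraphrase of the bullet points, not the shipped `rubric.py`; in particular, whether the citation bonus applies on wrong answers and how a wrong-but-active containment action is penalized are assumptions here.

```python
# Paraphrase of the defender scoring rules listed above; rubric.py is the
# source of truth and may order or weight these checks differently.
def defender_reward(action: str, cited_log_id: str, truth: str,
                    triggering_log_id: str, format_ok: bool) -> float:
    if not format_ok:
        return -0.1                              # format-violation floor
    if action == truth:
        bonus = 0.1 if cited_log_id == triggering_log_id else 0.0
        return 1.0 + bonus                       # correct triage (+ citation bonus)
    if truth == "dismiss":
        return -0.3                              # over-reacted to a benign incident
    if action == "dismiss":
        return -1.0                              # missed-malicious: the cardinal failure
    if action == "escalate":
        return -0.05                             # acted, but escalated unnecessarily
    return -0.05                                 # wrong containment action (assumed)
```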
## Run locally

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python server.py   # serves on :7860
```
Smoke test from another shell:
```bash
curl -s http://localhost:7860/health | jq .
curl -s -X POST 'http://localhost:7860/reset?task=stage1_basic&mode=defender_only' | jq .
curl -s -X POST 'http://localhost:7860/step?task=stage1_basic&mode=defender_only' \
  -H 'content-type: application/json' \
  -d '{"submit_triage": {"action": "monitor", "cited_log_id": "L1-0", "rationale": "smoke"}}' | jq .
open http://localhost:7860/demo   # Gradio before-vs-after UI
```
Run the test suite (CPU only, no GPU deps):
```bash
pytest -q   # 93 passed
```
Or via the bundled Python client:
```python
from client import OpenSOCClient

c = OpenSOCClient()
obs = c.reset(task="stage1_basic", mode="defender_only", seed=1)
result = c.step({"submit_triage": {"action": "monitor", "cited_log_id": "L1-0", "rationale": "ok"}},
                task="stage1_basic", mode="defender_only", seed=1)
print(result)
```
## Run the training pipeline

Full end-to-end procedure: `TRAIN.md`. TL;DR: on an HF Jupyter L4 (~$3 of credits, ~3.5 h wall time):

```bash
bash scripts/run_full_pipeline.sh
```
Or step-by-step inside `train_grpo.ipynb`:

- SFT warm-start (~12 min): pushes P(format-OK) from ~0% to ~95%.
- GRPO curriculum across 4 stages (~3 h): verifier-grounded reward, group size 8.
- Eval on the frozen 200-incident hold-out (~5 min).
- `eval.plot_results` + `eval.plot_training` render four PNGs; `eval.bake_demo` writes 50 before-vs-after pairs to `data/demo_examples.json` for the Gradio UI.
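For orientation, one GRPO stage in the notebook boils down to the pattern below. The dataset column names, hyperparameters, and the toy reward are simplified assumptions; the real loop scores completions with the verifier-grounded `rubric.py` reward for each curriculum stage.

```python
# Simplified GRPO stage in the spirit of train_grpo.ipynb (Unsloth + TRL).
# Column names ("prompt", "ground_truth") and the stand-in reward are assumptions.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct", max_seq_length=2048, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# One row per incident: a "prompt" column plus the verifier label baked in.
dataset = load_dataset("json", data_files="data/sft_train.jsonl", split="train")

def triage_reward(completions, ground_truth, **kwargs):
    # Stand-in for rubric.defender_reward: +1 for the verifier's label,
    # -0.1 for anything that doesn't parse as a triage action.
    rewards = []
    for completion, truth in zip(completions, ground_truth):
        action = completion.strip().split()[0].lower() if completion.strip() else ""
        rewards.append(1.0 if action == truth else -0.1)
    return rewards

args = GRPOConfig(
    output_dir="outputs/stage1_basic",
    num_generations=8,              # group size used in the curriculum
    max_completion_length=128,
    learning_rate=5e-6,
)
GRPOTrainer(model=model, processing_class=tokenizer, reward_funcs=triage_reward,
            args=args, train_dataset=dataset).train()
```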
## Headline results
The defender model was trained using GRPO with a 4-stage curriculum on Qwen2.5-3B-Instruct with LoRA. All trained adapters are published on HuggingFace:
| Stage | Adapter | Difficulty |
|---|---|---|
| SFT warm-start | `opensoc-defender-grpo-sft` | Format learning |
| Stage 1 | `opensoc-defender-grpo-stage1_basic` | Easy: single-event templates |
| Stage 2 | `opensoc-defender-grpo-stage2_multi` | Medium: multi-event windows |
| Stage 3 | `opensoc-defender-grpo-stage3_mixed` | Hard: benign decoys interleaved |
| Stage 4 | `opensoc-defender-grpo-stage4_adversarial` | Adversarial: attacker-controlled |
| Final | `opensoc-defender-grpo` | Combined final adapter |
The four evaluation plots in `eval/results/` cover: dismiss-on-malicious rate (the cardinal failure mode), macro F1 across the 200-incident hold-out, confusion matrices, and reward across the curriculum.
| Model | Accuracy | Macro F1 | Dismiss-on-malicious | Over-react |
|---|---|---|---|---|
| `always_dismiss` (floor) | 0.13 | 0.05 | 1.00 | 0.00 |
| `verifier_oracle` (ceiling) | 1.00 | 1.00 | 0.00 | 0.00 |
## Deploy to Hugging Face Spaces

Full recipe: `DEPLOY.md`. The fast version, after `huggingface-cli login`:

```bash
export HF_USER=<your-username>
bash scripts/deploy_to_hf.sh
# Build takes ~5 minutes; then:
open https://${HF_USER}-opensoc-env.hf.space/demo
```
The Space runs FastAPI + Gradio in a single container. /reset, /step, /state, /grade, /tasks, /health continue to work for the OpenEnv judge bot; /demo is the human-readable UI.
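The single-container trick is just Gradio's FastAPI mount. A minimal sketch of what `server.py` does, assuming `demo_app` exports a Blocks object named `demo`:

```python
# Sketch of serving both surfaces from one process; server.py may differ in
# details, but gradio.mount_gradio_app is the standard pattern.
import gradio as gr
import uvicorn
from app_runtime import app   # FastAPI app with /reset, /step, /state, /grade
from demo_app import demo     # Gradio Blocks UI (export name assumed)

app = gr.mount_gradio_app(app, demo, path="/demo")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=7860)   # the port the Space exposes
```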
## Repo map

| File / dir | Purpose |
|---|---|
| `openenv.yaml` | OpenEnv manifest (tasks, action space, reward range, endpoints) |
| `schema.py` | Incident / event / action schema with strict validators |
| `generator.py` | Materializes incidents for defender_only mode (eval, SFT) |
| `verifier.py` | Deterministic ground-truth labeler + plausibility checker |
| `rubric.py` | Layered defender + attacker reward functions |
| `env.py` | Two-role `OpenSOCEnv` (reset / step / state / grade) |
| `app_runtime.py` | FastAPI app exposing the OpenEnv API |
| `demo_app.py` | Gradio Blocks app mounted at `/demo` |
| `demo_data.py` | Pure-python helpers for the demo UI |
| `server.py` | Container entry point: imports demo_app then starts uvicorn |
| `tasks/registry.py` | Curriculum stages: stage1_basic → stage4_adversarial |
| `client/` | Thin HTTP client (server-internals-free) |
| `train/` | SFT warm-start + GRPO loop + reusable prompt format |
| `eval/` | Hold-out generator, metrics, eval driver, plot renderers, bake_demo |
| `scripts/run_full_pipeline.sh` | One-shot training + eval + bake-demo |
| `scripts/deploy_to_hf.sh` | One-shot HF Space push |
| `docs/` | Blog post, video script, slide deck builder |
| `tests/` | Pytest suite (93 tests, anti-hack regressions included) |
## Submission deliverables
Mapped to the four judging criteria:
| Criterion | Weight | Where it lives |
|---|---|---|
| Environment Innovation | 40% | openenv.yaml, schema.py, verifier.py, env.py, this README's Architecture and Why the reward cannot be hacked sections |
| Storytelling & Presentation | 30% | /demo Gradio UI + 90s video + HF blog |
| Showing Improvement in Rewards | 20% | eval/results/*.png (training curves + confusion + headline bar) embedded above |
| Reward & Training Pipeline | 10% | rubric.py + 93-test anti-hack suite + train_grpo.ipynb + scripts/run_full_pipeline.sh |
Submission checklist:

- OpenEnv-compatible env (gym-style API, manifest, non-reserved tool names)
- Deterministic RLVR verifier + plausibility checker
- Layered defender + attacker reward
- SFT warm-start dataset (committed)
- Frozen 200-incident hold-out (committed)
- GRPO curriculum notebook + one-shot training script
- Eval harness + plotters
- Pytest suite (93 tests, anti-hack regressions included)
- Gradio `/demo` UI mounted on the same Space (free-CPU-tier compatible)
- Blog post (`docs/blog.md`)
- HF Space pushed and running: `shivam2k3/opensoc-env`
- SFT adapter trained and pushed: `opensoc-defender-grpo-sft`
- GRPO adapters trained and pushed (4 stages): `stage1` `stage2` `stage3` `stage4`
- Final adapter pushed: `opensoc-defender-grpo`
## License
BSD-3-Clause.




