Kodeseer-9B
A LoRA fine-tune of Qwen3.5-9B for GUI element grounding: given a screenshot and a natural-language instruction, it predicts the (x, y) coordinates of the target UI element.
Results
| Benchmark | Score | Rank |
|---|---|---|
| ScreenSpot-V2 | 94.7% | #7 overall |
| ScreenSpot-Pro | 65.0% | #9 overall |
| ScreenSpot Original | 92.1% | #1 overall |
ScreenSpot-V2 Breakdown
| Split | Accuracy |
|---|---|
| Mobile | 95.2% |
| Desktop | 94.6% |
| Web | 92.9% |
| Overall | 94.7% |
ScreenSpot-Pro Full Breakdown (1581 samples)
| Category | Accuracy | | Category | Accuracy |
|---|---|---|---|---|
| eviews | 90.0% | | word | 88.1% |
| powerpoint | 82.9% | | unreal_engine | 80.0% |
| vmware | 78.0% | | matlab | 77.4% |
| davinci | 75.0% | | solidworks | 72.7% |
| linux_common | 70.0% | | photoshop | 68.6% |
| android_studio | 66.2% | | pycharm | 66.7% |
| quartus | 64.4% | | inventor | 64.3% |
| vivado | 63.7% | | vscode | 61.8% |
| blender | 60.6% | | windows_common | 59.3% |
| illustrator | 58.1% | | macos_common | 53.8% |
| excel | 51.6% | | premiere | 48.1% |
| stata | 46.9% | | autocad | 41.2% |
| fruitloops | 40.4% | | origin | 38.7% |
| Overall | 65.0% | | | |
Comparison with State-of-the-Art
ScreenSpot-V2
| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | MAI-UI | 32B | 96.5% |
| 2 | OmegaUse | 30B-A3B MoE | 96.3% |
| 3 | UI-Venus-1.5 | 30B-A3B MoE | 96.2% |
| 4 | UI-Venus-1.5 | 8B | 95.9% |
| 5 | UI-Venus-1.0 | 72B | 95.3% |
| 6 | MAI-UI / GTA1 | 8B / 32B | 95.2% |
| 7 | Kodeseer-9B | 9B | 94.7% |
| 8 | UI-TARS 1.5 | 7B | 94.2% |
| 9 | UI-Venus-1.0 | 7B | 94.1% |
| 10 | Step-GUI | 4B | 93.6% |
ScreenSpot-Pro
| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | Holo2 (3-step) | 235B-A22B MoE | 78.5% |
| 2 | MAI-UI + zoom-in | 32B | 73.5% |
| 3 | Holo2 (1-step) | 235B-A22B MoE | 70.6% |
| 4 | UI-Venus-1.5 | 30B-A3B MoE | 69.6% |
| 5 | UI-Venus-1.5 | 8B | 68.4% |
| 6 | MAI-UI | 32B | 67.9% |
| 7 | Holo2 | 30B-A3B MoE | 66.1% |
| 8 | MAI-UI | 8B | 65.8% |
| 9 | Kodeseer-9B | 9B | 65.0% |
| 10 | Qwen3-VL + MVP | 8B | 65.3%* |
| 11 | GTA1 | 32B | 63.6% |
| 12 | UI-TARS 1.5 | 7B | 61.6% |
*MVP is a training-free inference-time technique applied on top of Qwen3-VL; this entry is listed out of score order.
ScreenSpot Original
| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | Kodeseer-9B | 9B | 92.1% |
| 2 | GUI-G2 | 7B | 92.0% |
| 3 | GUI-Actor-7B + Verifier | 7B | 89.7% |
| 4 | UI-TARS-7B | 7B | 89.5% |
| 5 | UGround-V1 | 72B | 89.4% |
Usage
import torch
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from peft import PeftModel
from PIL import Image

base_model = "Qwen/Qwen3.5-9B"
adapter = "mdabis/qwen35-9b-gui-grounding-v1"

# Load the base model in bfloat16, then attach the LoRA adapter (step-3100 checkpoint)
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter, subfolder="checkpoint-3100")
model.eval()

image = Image.open("screenshot.png").convert("RGB")
instruction = "Click the submit button"

# The system prompt fixes the output format the adapter was trained to produce
messages = [
    {"role": "system", "content": (
        "You are a GUI grounding assistant. Given a screenshot and a user instruction, "
        "return the exact coordinates of the target UI element using the format: "
        "<|box_start|>(x,y)<|box_end|> where x and y are in [0, 1000] range."
    )},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": instruction},
    ]},
]

# Build the prompt with thinking mode disabled
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

# Greedy decoding; the answer is short, so 64 new tokens is enough
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Drop the prompt tokens and keep special tokens so the box markers survive decoding
generated = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=False)[0]
print(response)
Coordinate Format
The model predicts click coordinates in <|box_start|>(x,y)<|box_end|> format where x and y are in [0, 1000] range. To convert to pixel coordinates:
pixel_x = int(x / 1000 * image_width)
pixel_y = int(y / 1000 * image_height)
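The conversion above can be combined with parsing of the raw response string. The following is a minimal sketch, not part of the released code: the helper name `parse_click`, the tolerance for whitespace and decimals in the pattern, and the clamping of `x == 1000` to the last valid pixel are all assumptions made for illustration.

```python
import re

# Matches the documented format, e.g. "<|box_start|>(512,304)<|box_end|>".
# Accepting spaces and decimal values is an assumption, not documented behaviour.
_BOX_RE = re.compile(
    r"<\|box_start\|>\s*\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)\s*<\|box_end\|>"
)


def parse_click(response, image_width, image_height):
    """Return (pixel_x, pixel_y) for the first box in `response`, or None."""
    match = _BOX_RE.search(response)
    if match is None:
        return None
    x, y = float(match.group(1)), float(match.group(2))
    # Same scaling as above: [0, 1000] range -> pixel coordinates.
    pixel_x = int(x / 1000 * image_width)
    pixel_y = int(y / 1000 * image_height)
    # Clamp so that a value of 1000 maps to the last valid pixel index.
    pixel_x = min(max(pixel_x, 0), image_width - 1)
    pixel_y = min(max(pixel_y, 0), image_height - 1)
    return pixel_x, pixel_y
```

Returning `None` for a response without a well-formed box lets the caller decide whether to retry or skip the action instead of clicking at a default position.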
Training Details
- Base model: Qwen/Qwen3.5-9B (9.65B parameters)
- Method: LoRA (rank 32, alpha 64, all-linear targets)
- Frozen: ViT + aligner (only LLM LoRA trained)
- MAX_PIXELS: 3,014,656 (3M — critical for ScreenSpot-Pro's tiny targets)
- Epochs: 3
- Learning rate: 5e-5, cosine scheduler, 5% warmup
- Effective batch size: 24 (1 per device × 3 grad_accum × 8 GPUs)
- Hardware: 8x NVIDIA A40 (48GB each)
- Training time: ~4.5 hours
- Best checkpoint: step 3100 (selected by eval_loss)
- dtype: bfloat16
- Framework: ms-swift 4.0.2, transformers 5.2.0
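As a rough cross-check of the numbers above, the effective batch size, step count, and learning-rate curve can be sketched in plain Python. This is an illustration only: the function names are hypothetical, the step count assumes all ~26,055 samples are used for training (any held-out validation split would lower it), and the schedule is the common linear-warmup plus cosine-decay-to-zero shape, which may differ in detail from what ms-swift applies.

```python
import math


def effective_batch_size(per_device, grad_accum, num_gpus):
    """Samples consumed per optimizer step."""
    return per_device * grad_accum * num_gpus


def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_ratio=0.05):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


batch = effective_batch_size(per_device=1, grad_accum=3, num_gpus=8)  # 24
# Assumption: every one of the ~26,055 samples is a training sample.
steps_per_epoch = math.ceil(26_055 / batch)  # 1086
total_steps = steps_per_epoch * 3            # 3258
```

Under these assumptions the run has roughly 3,258 optimizer steps, so the selected checkpoint at step 3100 falls late in the third epoch, where the cosine schedule has already brought the learning rate close to zero.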
Training Data (~26K samples)
| Source | Samples | Description |
|---|---|---|
| ShowUI-desktop | 7,496 | General desktop UI screenshots |
| UGround-V1-8k (filtered) | ~6,920 | Web UI, quality filtered (removed <3 word instructions, duplicates, OOB points) |
| AMEX-8k | 8,000 | Mobile UI (e-commerce/financial) |
| Hcompany/WebClick | 1,639 | Web interaction data |
| Paraphrased instructions | 2,000 | Augmented 1-4 word instructions into 7-12 word natural language |
| Total | ~26,055 | |
Data Filtering (UGround)
Original UGround-V1-8k (~8K samples) was filtered to ~6.9K:
- Removed instructions with fewer than 3 words (too vague)
- Removed duplicate (image, instruction) pairs
- Removed out-of-bounds coordinate points
- Removed corrupted/missing images
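The four filters above could be implemented along the following lines. This is a sketch under stated assumptions rather than the actual preprocessing script: the sample schema (`image_path`, `instruction`, `point`, `image_size`), the pixel-coordinate convention for `point`, and the use of a simple file-existence check for "corrupted/missing" are all hypothetical.

```python
import os


def filter_samples(samples, image_ok=os.path.isfile):
    """Apply the four filters listed above to a list of sample dicts.

    Assumed (hypothetical) schema per sample:
      "image_path": str, "instruction": str,
      "point": (x, y) in pixels, "image_size": (width, height).
    """
    seen = set()
    kept = []
    for sample in samples:
        instruction = sample["instruction"].strip()
        # 1. Drop instructions with fewer than 3 words.
        if len(instruction.split()) < 3:
            continue
        # 2. Drop corrupted/missing images (here: a simple existence check).
        if not image_ok(sample["image_path"]):
            continue
        # 3. Drop points that fall outside the image.
        x, y = sample["point"]
        width, height = sample["image_size"]
        if not (0 <= x < width and 0 <= y < height):
            continue
        # 4. Drop exact duplicate (image, instruction) pairs.
        key = (sample["image_path"], instruction)
        if key in seen:
            continue
        seen.add(key)
        kept.append(sample)
    return kept
```

The `image_ok` parameter is injectable so that a stricter check (for example, attempting to decode the image) can replace the existence test without changing the rest of the function.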
Training Curve
- Eval loss decreased steadily from 0.38 (step 100) to ~0.229 (step 2100), then plateaued
- Token accuracy reached 92%+ on validation set
- No significant overfitting observed over 3 epochs (unlike the 4B v4 run, which overfit from epoch 3 onward)
- VRAM usage: ~31 GB per GPU (of 48 GB available)
Key Design Decisions
- 3M pixels (MAX_PIXELS=3,014,656): Critical for ScreenSpot-Pro where average UI target is only 0.07% of screen area on 2560x1440+ screenshots
- LoRA rank 32 (vs 16 on 4B): Bigger model benefits from more trainable parameters
- LR 5e-5 (vs 1e-4 on 4B): Lower learning rate for larger model stability
- 3 epochs (vs 4 on 4B): Avoided overfitting observed in 4B v4 training
- Frozen ViT + aligner: Only LLM layers trained via LoRA — preserves visual encoder quality
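To make the pixel-budget argument concrete, the sketch below estimates how many pixels a small target keeps once a screenshot is scaled to fit `MAX_PIXELS`. It is a simplified model with hypothetical helper names: it assumes a uniform downscale to exactly the budget and ignores the rounding to patch-size multiples that the real image preprocessing performs.

```python
import math

MAX_PIXELS = 3_014_656  # value from the training configuration above


def scale_to_budget(width, height, max_pixels=MAX_PIXELS):
    """Uniform scale factor that brings width * height within max_pixels."""
    pixels = width * height
    if pixels <= max_pixels:
        return 1.0
    return math.sqrt(max_pixels / pixels)


def target_pixels_after_resize(width, height, area_fraction, max_pixels=MAX_PIXELS):
    """Approximate pixel count of a target covering `area_fraction` of the screen."""
    scale = scale_to_budget(width, height, max_pixels)
    return area_fraction * width * height * scale * scale


# A 2560x1440 screenshot (3,686,400 px) exceeds the budget and is scaled down;
# a target covering 0.07% of the screen keeps about 0.0007 * MAX_PIXELS pixels.
remaining = target_pixels_after_resize(2560, 1440, 0.0007)
side = math.sqrt(remaining)  # side length if the target were square
```

For the 0.07% figure quoted above, this leaves on the order of 2,100 pixels for the target, or a square of roughly 46 pixels per side. The same function can be called with a smaller `max_pixels` to see how quickly such targets shrink under a tighter budget.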
Limitations
- Trained on English instructions only
- Weakest on niche professional software (AutoCAD 41.2%, FruitLoops 40.4%, Origin 38.7%)
- SFT only: no RL/GRPO has been applied yet, so further gains may still be available
- No training data from professional software domains (all training data is general desktop/mobile/web)
License
Apache 2.0 (same as base model)