HORAMA-BTP V2
Enhanced Vision-Language Model for Construction Site Analysis & Safety Inspection
Image → Structured JSON | Safety-Enhanced | HPO-Optimized | Built on Qwen2.5-VL
Horama-BTP V2 builds on V1's structured analysis capabilities with significantly enhanced safety inspection -- better PPE detection, improved hazard recognition, and more reliable risk assessment, trained on 10,000+ construction site images with hyperparameter-optimized LoRA.
What's New in V2
| Aspect | V1 | V2 |
|---|---|---|
| Safety detection | Baseline PPE and hazard detection | Enhanced PPE compliance, multi-hazard recognition |
| Training scale | Initial domain adaptation | 10,000+ construction site images |
| LoRA capacity | r=32 (lightweight) | r=128 (HPO-optimized, 4x capacity) |
| Hyperparameters | Manual tuning | Bayesian optimization (Optuna) |
| Focus | Structured JSON output learning | Safety inspection depth + structured output |
Overview
Horama-BTP V2 is the safety-enhanced evolution of Horama-BTP. Starting from V1's structured JSON output capabilities, V2 was fine-tuned on a large-scale construction safety dataset (10,000+ images) with hyperparameter-optimized LoRA to improve:
- PPE detection accuracy -- helmets, vests, harnesses, boots, goggles
- Hazard recognition -- fall risks, open trenches, unstable loads, electrical hazards
- Risk level assessment -- more calibrated overall risk scoring
- Safety control measures -- guardrails, barriers, signage, netting identification
The model retains V1's full 15-dimension analysis pipeline (progress, quality, logistics, environment) while excelling at safety compliance -- making it ideal for automated site safety audits.
Key Capabilities
| Dimension | What the model extracts |
|---|---|
| Safety | PPE compliance per worker (8 equipment types), hazard identification (9 types), control measures, overall risk level |
| Progress | Construction stage (earthworks → commissioning), estimated % completion, milestones |
| Quality | Structural defects (cracks, corrosion, misalignment...), non-conformities |
| Observations | Objects, materials, equipment, personnel, vehicles with attributes and confidence |
| Logistics | Materials inventory, equipment status (idle/operating), access constraints |
| Environment | Dust, waste, spills; waste management assessment |
| Evidence | Traceable evidence entries with unique IDs linking every finding to visual proof |
Architecture
                   ┌───────────────────────────────────────────────┐
                   │                 HORAMA-BTP V2                 │
                   │                                               │
Input Image ───┐   │  Qwen2.5-VL-3B      V1 LoRA        V2 LoRA    │
               ├──►│   (backbone)  ──►   (merged)  ──►  (merged)   │──► Structured JSON
System Prompt ─┘   │                       r=32          r=128     │
                   │                      domain         safety    │
                   │                    adaptation      enhanced   │
                   └───────────────────────────────────────────────┘
V2 is a two-stage fine-tuned model:
- Stage 1 (V1): LoRA fine-tuning (r=32) on domain-specific annotations to learn the Horama-BTP JSON schema and construction vocabulary
- Stage 2 (V2): LoRA fine-tuning (r=128, HPO-optimized) on 10,000+ safety-focused construction images to deepen detection capabilities
Both LoRA adapters are merged into the backbone -- V2 is a standalone model with no runtime adapter dependencies.
| Component | Details |
|---|---|
| Backbone | Qwen2.5-VL-3B-Instruct -- 3B parameter multimodal transformer |
| Stage 1 adaptation | LoRA r=32, alpha=64, targeting all attention + MLP projections |
| Stage 2 adaptation | LoRA r=128, alpha=256, HPO-optimized on 10k+ safety images |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Precision | BF16 (GPU) / FP32 (CPU/MPS) |
| Output | Deterministic JSON (temperature=0, greedy decoding) |
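To make "merged into the backbone" concrete, here is a minimal sketch of how a LoRA update is folded into a weight matrix. It assumes the standard LoRA scaling of alpha / r (the card does not say whether a variant such as rank-stabilized scaling was used), and the tiny matrices and helper names are purely illustrative, not the actual training code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), assuming standard LoRA scaling.

    weight: out x in, lora_a: rank x in, lora_b: out x rank.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example (hypothetical values): a 2x2 weight with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # rank x in
lora_b = [[0.5], [0.0]]  # out x rank
merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)

# With the values in the table above, both stages use the same scale factor.
stage1_scale = 64 / 32    # V1: alpha=64, r=32
stage2_scale = 256 / 128  # V2: alpha=256, r=128
print(stage1_scale, stage2_scale)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the released checkpoint can be loaded without an adapter library.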
Quick Start
import json

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load model and processor
model_id = "Horama/Horama_BTP_v2"
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image
image = Image.open("construction_site.jpg").convert("RGB")

# System prompt (the output schema is still named "Horama-BTP v1")
system_prompt = """You are Horama-BTP v1. Analyze construction site images. Output ONLY valid JSON. No text before/after.
CRITICAL RULES:
1. ONLY describe what you can CLEARLY SEE in the image
2. If you cannot see something -> use empty array [] or "unknown"
3. Output must follow the Horama-BTP v1 JSON schema exactly"""
user_prompt = "Analyze this construction site image and return the Horama-BTP v1 JSON output."

# Prepare messages
messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": user_prompt},
        ],
    },
]

# Generate (greedy decoding for deterministic output)
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

# Decode only the newly generated tokens, not the echoed prompt
prompt_length = inputs["input_ids"].shape[1]
result = processor.decode(output[0][prompt_length:], skip_special_tokens=True)

# Extract the outermost JSON object: first "{" through last "}"
json_start = result.find("{")
json_end = result.rfind("}") + 1
analysis = json.loads(result[json_start:json_end])
print(json.dumps(analysis, indent=2))
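Slicing between brace positions works when the response contains exactly one well-formed object, but it can fail if the model emits stray text containing braces. A more defensive option, sketched below, uses the standard library's `json.JSONDecoder.raw_decode` to parse the first valid object and ignore anything after it. The function name `extract_first_json_object` is hypothetical and not part of the model's tooling.

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Return the first decodable JSON object found in `text`.

    Tries each "{" in order and lets raw_decode determine where the
    object ends, so trailing text after the object is tolerated.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _end = decoder.raw_decode(text, start)
            return value
        except json.JSONDecodeError:
            # Not a valid object at this position; try the next brace.
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


# Illustrative response string (not real model output).
response = 'Result:\n{"job_type": "construction", "safety": {"overall_risk_level": "high"}}\nDone.'
parsed = extract_first_json_object(response)
print(parsed["safety"]["overall_risk_level"])
```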
Output Schema
Identical to V1 -- the model outputs a single JSON object with 15 required top-level fields:
{
"job_type": "construction" | "renovation" | "infrastructure" | "unknown",
"asset_type": "house" | "building" | "road" | "bridge" | "tunnel" | "site" | "unknown",
"scene_context": { location_hint, weather_light, viewpoint },
"summary": { one_liner, confidence },
"progress": { overall_stage, stage_confidence, progress_percent_estimate, milestones_detected },
"work_activities": [{ activity, status, confidence, evidence_ids }],
"observations": [{ type, label, attributes, confidence, evidence_ids }],
"safety": { overall_risk_level, ppe[], hazards[], control_measures[] },
"quality": { issues[], non_conformities[] },
"logistics": { materials_on_site[], equipment_on_site[], access_constraints[] },
"environment": { impacts[], waste_management },
"evidence": [{ evidence_id, source, bbox_xyxy, description }],
"unknown": [{ question, why_unknown, needed_input }],
"domain_fields": { custom_kpis, lot_breakdown, client_specific },
"metadata": { model, version, generated_at }
}
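Because all 15 top-level fields are required, a quick presence check before downstream processing can catch truncated or malformed generations. The sketch below only checks key presence, taking the field names from the schema above; the helper name `missing_fields` is an assumption, and full validation of nested types would need a proper schema validator.

```python
# Field names copied from the schema listing above.
REQUIRED_FIELDS = (
    "job_type", "asset_type", "scene_context", "summary", "progress",
    "work_activities", "observations", "safety", "quality", "logistics",
    "environment", "evidence", "unknown", "domain_fields", "metadata",
)


def missing_fields(analysis: dict) -> list:
    """Return the required top-level fields absent from `analysis`, in schema order."""
    return [field for field in REQUIRED_FIELDS if field not in analysis]


# Illustrative partial output (hypothetical), e.g. cut off by max_new_tokens.
partial = {"job_type": "construction", "asset_type": "building", "safety": {}}
print(missing_fields(partial))
```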
Safety-Specific Schema Detail
V2 particularly excels at populating the safety section:
{
"safety": {
"overall_risk_level": "low | medium | high | unknown",
"ppe": [
{
"role": "worker | visitor | unknown",
"ppe_item": "helmet | vest | gloves | goggles | harness | boots | mask | other",
"status": "compliant | non_compliant | unknown",
"confidence": 0.0,
"evidence_ids": ["ev_XXX"]
}
],
"hazards": [
{
"hazard_type": "fall_risk | open_trench | moving_vehicle | electrical | fire | unstable_load | poor_housekeeping | restricted_area | other",
"severity": "low | medium | high | unknown",
"confidence": 0.0,
"evidence_ids": ["ev_XXX"]
}
],
"control_measures": [
{
"measure": "guardrail | barrier | signage | netting | cones | spotter | lockout_tagout | other",
"status": "present | missing | unknown",
"confidence": 0.0,
"evidence_ids": ["ev_XXX"]
}
]
}
}
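One way to consume this section in an audit workflow is to pull out only the actionable findings: non-compliant PPE, high-severity hazards, and missing control measures. The following is a sketch under assumptions: the function name and the 0.6 confidence cut-off are hypothetical choices for illustration, not values recommended by the model authors.

```python
def flag_safety_findings(safety: dict, min_confidence: float = 0.6) -> dict:
    """Collect actionable items from a `safety` section, skipping low-confidence entries."""

    def confident(entry: dict) -> bool:
        return entry.get("confidence", 0.0) >= min_confidence

    return {
        "ppe_gaps": [
            p["ppe_item"] for p in safety.get("ppe", [])
            if p.get("status") == "non_compliant" and confident(p)
        ],
        "high_severity_hazards": [
            h["hazard_type"] for h in safety.get("hazards", [])
            if h.get("severity") == "high" and confident(h)
        ],
        "missing_controls": [
            c["measure"] for c in safety.get("control_measures", [])
            if c.get("status") == "missing" and confident(c)
        ],
    }


# Small illustrative input following the schema above (values are made up).
safety = {
    "overall_risk_level": "high",
    "ppe": [
        {"role": "worker", "ppe_item": "helmet", "status": "compliant", "confidence": 0.92},
        {"role": "worker", "ppe_item": "harness", "status": "non_compliant", "confidence": 0.75},
    ],
    "hazards": [
        {"hazard_type": "fall_risk", "severity": "high", "confidence": 0.88},
        {"hazard_type": "unstable_load", "severity": "medium", "confidence": 0.65},
    ],
    "control_measures": [
        {"measure": "guardrail", "status": "present", "confidence": 0.85},
        {"measure": "netting", "status": "missing", "confidence": 0.70},
    ],
}
findings = flag_safety_findings(safety)
print(findings)
```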
Example Output
Given a photograph of an active construction site with workers:
{
"job_type": "construction",
"asset_type": "building",
"scene_context": {
"location_hint": "outdoor",
"weather_light": "day",
"viewpoint": "ground"
},
"summary": {
"one_liner": "Active multi-story building construction site with scaffolding, multiple workers performing structural work with mixed PPE compliance.",
"confidence": 0.90
},
"progress": {
"overall_stage": "structure",
"stage_confidence": 0.88,
"progress_percent_estimate": 45,
"progress_confidence": 0.40,
"milestones_detected": [
{ "name": "Foundation complete", "status": "done", "confidence": 0.85, "evidence_ids": ["ev_001"] },
{ "name": "Structural framing in progress", "status": "in_progress", "confidence": 0.90, "evidence_ids": ["ev_002"] }
]
},
"safety": {
"overall_risk_level": "high",
"ppe": [
{ "role": "worker", "ppe_item": "helmet", "status": "compliant", "confidence": 0.92, "evidence_ids": ["ev_003"] },
{ "role": "worker", "ppe_item": "vest", "status": "compliant", "confidence": 0.90, "evidence_ids": ["ev_003"] },
{ "role": "worker", "ppe_item": "harness", "status": "non_compliant", "confidence": 0.75, "evidence_ids": ["ev_004"] },
{ "role": "worker", "ppe_item": "boots", "status": "compliant", "confidence": 0.80, "evidence_ids": ["ev_003"] }
],
"hazards": [
{ "hazard_type": "fall_risk", "severity": "high", "confidence": 0.88, "evidence_ids": ["ev_005"] },
{ "hazard_type": "unstable_load", "severity": "medium", "confidence": 0.65, "evidence_ids": ["ev_006"] }
],
"control_measures": [
{ "measure": "guardrail", "status": "present", "confidence": 0.85, "evidence_ids": ["ev_007"] },
{ "measure": "netting", "status": "missing", "confidence": 0.70, "evidence_ids": ["ev_005"] }
]
},
"evidence": [
{ "evidence_id": "ev_001", "source": "image", "description": "Completed concrete foundation visible at ground level" },
{ "evidence_id": "ev_003", "source": "image", "description": "Workers wearing hard hats, high-vis vests, and safety boots on scaffolding" },
{ "evidence_id": "ev_004", "source": "image", "description": "Worker at height without visible safety harness attachment" },
{ "evidence_id": "ev_005", "source": "image", "description": "Open edges on upper floors without safety netting" }
]
}
(Truncated for readability -- full output includes all 15 top-level fields)
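Since every finding links to visual proof through `evidence_ids`, it can be useful to verify that each referenced ID actually has an entry in `evidence`. The sketch below walks the whole output and reports references with no matching entry. The helper name is hypothetical, and the sample data is a made-up miniature rather than real model output.

```python
def dangling_evidence_ids(analysis: dict) -> set:
    """Return evidence IDs that are referenced somewhere but not defined in `evidence`."""
    known = {entry["evidence_id"] for entry in analysis.get("evidence", [])}
    referenced = set()

    def walk(node):
        # Recurse through nested dicts and lists, collecting every evidence_ids list.
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "evidence_ids":
                    referenced.update(value)
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(analysis)
    return referenced - known


sample = {
    "safety": {
        "ppe": [{"ppe_item": "helmet", "evidence_ids": ["ev_003"]}],
        "hazards": [{"hazard_type": "fall_risk", "evidence_ids": ["ev_005", "ev_006"]}],
    },
    "evidence": [
        {"evidence_id": "ev_003", "source": "image", "description": "Workers wearing hard hats"},
        {"evidence_id": "ev_005", "source": "image", "description": "Open edges on upper floors"},
    ],
}
print(dangling_evidence_ids(sample))
```

Run against the truncated example above, such a check would flag the IDs whose evidence entries were omitted for brevity; on a full output, an empty result is the expected outcome.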
Training Details
Stage 2 (V2 Safety Enhancement)
| Parameter | Value |
|---|---|
| Base model | Horama/Horama_BTP (V1) |
| Method | LoRA (Parameter-Efficient Fine-Tuning) |
| Training images | 10,000+ construction site photographs |
| Focus | Safety inspection, PPE detection, hazard recognition |
| LoRA rank | r=128 (HPO-optimized, 4x V1 capacity) |
| LoRA alpha | 256 (2x rank) |
| LoRA dropout | 0.05 (HPO-optimized) |
| Epochs | 3 |
| Effective batch size | 8 (batch=2, accumulation=4) |
| Learning rate | 2.52e-4 (HPO-optimized, cosine schedule) |
| Warmup | 3% of training steps |
| Weight decay | 0.0 (HPO-optimized) |
| Gradient checkpointing | Enabled |
| Framework | Transformers + PEFT |
| Hyperparameter search | Bayesian optimization via Optuna |
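To show how the batch, warmup, and learning-rate rows fit together, here is a rough worked example. It assumes exactly 10,000 training images (the card says "10,000+"), linear warmup followed by cosine decay to zero, and warmup steps rounded down; the actual trainer's rounding and schedule details may differ, so treat the numbers as approximate.

```python
import math

# Values from the table above; num_images is an assumption ("10,000+" in the card).
num_images = 10_000
epochs = 3
per_device_batch = 2
grad_accumulation = 4
peak_lr = 2.52e-4
warmup_ratio = 0.03

effective_batch = per_device_batch * grad_accumulation
steps_per_epoch = math.ceil(num_images / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)  # rounding down is an assumption


def lr_at_step(step: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, lr_at_step(step))
```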
Hyperparameter Optimization
V2 hyperparameters were selected through Bayesian optimization (Optuna) searching over:
| Hyperparameter | Search space | Optimal value |
|---|---|---|
| Learning rate | [1e-5, 5e-4] | 2.52e-4 |
| LoRA rank | {16, 32, 64, 128} | 128 |
| LoRA dropout | {0.0, 0.1, 0.2} | 0.05 |
| Weight decay | {0.0, 0.01, 0.05, 0.1} | 0.0 |
| Gradient accumulation | {4, 8, 16} | 4 |
V1 vs V2: When to Use Which
| Use case | Recommended |
|---|---|
| Safety audits & PPE compliance | V2 -- significantly better safety detection |
| Hazard identification | V2 -- trained on diverse hazard scenarios |
| General progress tracking | Both work well |
| Quality defect detection | Both comparable |
| Resource-constrained deployment | Either -- V1 and V2 share the same merged 3B architecture, so inference cost is identical |
| Safety-critical applications | V2 -- deeper safety understanding |
Intended Uses
Primary use cases:
- Automated safety compliance auditing from site photographs
- PPE verification across construction teams
- Hazard detection and risk level assessment
- Construction progress reporting
- Quality control and defect identification
- Environmental impact documentation
Input requirements:
- Single construction site image (JPEG, PNG, WebP, BMP)
- Supports ground-level, drone, and fixed-camera viewpoints
- Works best with daylight, well-lit images
Limitations
- Single-image analysis: Analyzes one image at a time; no temporal comparison between images
- Visible elements only: Cannot detect hidden structural issues or elements behind walls/coverings
- No sensory data: Cannot measure noise levels, dust concentration, or air quality from static images
- PPE at distance: Small or distant workers may have lower PPE detection confidence
- Schema-bound: Output follows the Horama-BTP v1 schema strictly -- custom fields use the domain_fields extension
- Not a replacement for human inspectors: The model assists and augments human safety inspections but should not be the sole decision-maker for safety-critical assessments
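The last limitation suggests keeping a person in the loop. One possible way to act on it, sketched here under assumptions, is to route an analysis to a human inspector whenever the overall risk is high or unknown, or whenever any safety finding falls below a confidence threshold. The function name and the 0.7 threshold are hypothetical and should be tuned to each site's own policy.

```python
def needs_human_review(analysis: dict, min_confidence: float = 0.7) -> bool:
    """Return True if the safety section should be checked by a human inspector."""
    safety = analysis.get("safety", {})

    # High or undetermined overall risk always goes to a person.
    if safety.get("overall_risk_level", "unknown") in {"high", "unknown"}:
        return True

    # Otherwise, escalate only if some finding is uncertain.
    findings = (
        safety.get("ppe", [])
        + safety.get("hazards", [])
        + safety.get("control_measures", [])
    )
    return any(f.get("confidence", 0.0) < min_confidence for f in findings)


# Illustrative inputs (made-up values following the safety schema).
confident_low_risk = {
    "safety": {
        "overall_risk_level": "low",
        "ppe": [{"ppe_item": "helmet", "status": "compliant", "confidence": 0.92}],
        "hazards": [],
        "control_measures": [{"measure": "guardrail", "status": "present", "confidence": 0.85}],
    }
}
uncertain_low_risk = {
    "safety": {
        "overall_risk_level": "low",
        "ppe": [{"ppe_item": "harness", "status": "unknown", "confidence": 0.40}],
    }
}
print(needs_human_review(confident_low_risk), needs_human_review(uncertain_low_risk))
```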
Hardware Requirements
| Setup | VRAM / RAM | Precision | Notes |
|---|---|---|---|
| NVIDIA GPU | ~8 GB VRAM | BF16 | Recommended for production |
| Apple Silicon | ~8 GB RAM | FP32 | Supported via MPS backend |
| CPU | ~12 GB RAM | FP32 | Functional but slower |
License
AGPL-3.0 -- This model can be freely used, modified, and redistributed as long as derivative work remains open-source under the same license.
For commercial or closed-source usage, please contact Horama for a commercial license.
Citation
@misc{horama-btp-v2-2025,
title = {Horama-BTP V2: Safety-Enhanced Vision-Language Model for Construction Site Analysis},
author = {Horama},
year = {2025},
url = {https://huggingface.co/Horama/Horama_BTP_v2},
note = {Two-stage LoRA fine-tuning from Qwen2.5-VL-3B-Instruct with HPO-optimized safety training}
}
Built by Horama | Construction intelligence, powered by vision AI