HORAMA-BTP V2

Enhanced Vision-Language Model for Construction Site Analysis & Safety Inspection

Image → Structured JSON | Safety-Enhanced | HPO-Optimized | Built on Qwen2.5-VL

Horama-BTP V2 builds on V1's structured analysis capabilities with enhanced safety inspection: better PPE detection, improved hazard recognition, and more reliable risk assessment. It was trained on 10,000+ construction site images using LoRA with Bayesian hyperparameter optimization.

What's New in V2

Aspect | V1 | V2
Safety detection | Baseline PPE and hazard detection | Enhanced PPE compliance, multi-hazard recognition
Training scale | Initial domain adaptation | 10,000+ construction site images
LoRA capacity | r=32 (lightweight) | r=128 (HPO-optimized, 4x capacity)
Hyperparameters | Manual tuning | Bayesian optimization (Optuna)
Focus | Structured JSON output learning | Safety inspection depth + structured output

Overview

Horama-BTP V2 is the safety-enhanced evolution of Horama-BTP. Starting from V1's structured JSON output capabilities, V2 was fine-tuned on a large-scale construction safety dataset (10,000+ images) with hyperparameter-optimized LoRA to improve:

  • PPE detection accuracy -- helmets, vests, harnesses, boots, goggles
  • Hazard recognition -- fall risks, open trenches, unstable loads, electrical hazards
  • Risk level assessment -- more calibrated overall risk scoring
  • Safety control measures -- guardrails, barriers, signage, netting identification

The model retains V1's full 15-dimension analysis pipeline (progress, quality, logistics, environment) while excelling at safety compliance -- making it ideal for automated site safety audits.

Key Capabilities

Dimension | What the model extracts
Safety | PPE compliance per worker (8 equipment types), hazard identification (9 types), control measures, overall risk level
Progress | Construction stage (earthworks → commissioning), estimated % completion, milestones
Quality | Structural defects (cracks, corrosion, misalignment...), non-conformities
Observations | Objects, materials, equipment, personnel, vehicles with attributes and confidence
Logistics | Materials inventory, equipment status (idle/operating), access constraints
Environment | Dust, waste, spills; waste management assessment
Evidence | Traceable evidence entries with unique IDs linking every finding to visual proof

Architecture

                   ┌──────────────────────────────────────────┐
                   │              HORAMA-BTP V2               │
                   │                                          │
Input Image ───┐   │  Qwen2.5-VL-3B     V1 LoRA      V2 LoRA  │
               ├──►│  (backbone)    ──► (merged) ──► (merged) │──► Structured JSON
System Prompt ─┘   │                    r=32         r=128    │
                   │                    domain       safety   │
                   │                    adaptation   enhanced │
                   └──────────────────────────────────────────┘

V2 is a two-stage fine-tuned model:

  1. Stage 1 (V1): LoRA fine-tuning (r=32) on domain-specific annotations to learn the Horama-BTP JSON schema and construction vocabulary
  2. Stage 2 (V2): LoRA fine-tuning (r=128, HPO-optimized) on 10,000+ safety-focused construction images to deepen detection capabilities

Both LoRA adapters are merged into the backbone -- V2 is a standalone model with no runtime adapter dependencies.
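
As a rough illustration of what "merged" means here, the sketch below folds a LoRA update into a weight matrix using the standard formulation W' = W + (alpha / r) * B * A. It uses a toy 2x2 matrix and plain Python lists, not the actual model weights, and the helper names `matmul` and `merge_lora` are hypothetical (they are not PEFT functions or part of the Horama training code). With the values listed for this model, both stages use the same scale: 64 / 32 = 256 / 128 = 2.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    Shapes: weight is d_out x d_in, lora_b is d_out x rank, lora_a is rank x d_in.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy layer with a rank-1 adapter (illustrative values only).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # rank x d_in
B = [[0.5], [0.0]]    # d_out x rank

merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

After merging, only the combined matrix is stored, which is why no adapter weights need to be loaded at inference time.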

Component | Details
Backbone | Qwen2.5-VL-3B-Instruct -- 3B parameter multimodal transformer
Stage 1 adaptation | LoRA r=32, alpha=64, targeting all attention + MLP projections
Stage 2 adaptation | LoRA r=128, alpha=256, HPO-optimized on 10k+ safety images
Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Precision | BF16 (GPU) / FP32 (CPU/MPS)
Output | Deterministic JSON (temperature=0, greedy decoding)

Quick Start

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

# Load model and processor
model_id = "Horama/Horama_BTP_v2"

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image
image = Image.open("construction_site.jpg").convert("RGB")

# System prompt
system_prompt = """You are Horama-BTP v1. Analyze construction site images. Output ONLY valid JSON. No text before/after.
CRITICAL RULES:
1. ONLY describe what you can CLEARLY SEE in the image
2. If you cannot see something -> use empty array [] or "unknown"
3. Output must follow the Horama-BTP v1 JSON schema exactly"""

user_prompt = "Analyze this construction site image and return the Horama-BTP v1 JSON output."

# Prepare messages
messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": user_prompt},
        ],
    },
]

# Generate
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

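# Decode only the newly generated tokens (generate() returns prompt + completion)
generated_ids = output[0][inputs["input_ids"].shape[1]:]
result = processor.decode(generated_ids, skip_special_tokens=True)

# Extract the outermost JSON object: first "{" through last "}"
import json
json_start = result.find("{")
json_end = result.rfind("}") + 1
analysis = json.loads(result[json_start:json_end])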

print(json.dumps(analysis, indent=2))

Output Schema

Identical to V1 -- the model outputs a single JSON object with 15 required top-level fields:

{
  "job_type":        "construction" | "renovation" | "infrastructure" | "unknown",
  "asset_type":      "house" | "building" | "road" | "bridge" | "tunnel" | "site" | "unknown",
  "scene_context":   { location_hint, weather_light, viewpoint },
  "summary":         { one_liner, confidence },
  "progress":        { overall_stage, stage_confidence, progress_percent_estimate, milestones_detected },
  "work_activities": [{ activity, status, confidence, evidence_ids }],
  "observations":    [{ type, label, attributes, confidence, evidence_ids }],
  "safety":          { overall_risk_level, ppe[], hazards[], control_measures[] },
  "quality":         { issues[], non_conformities[] },
  "logistics":       { materials_on_site[], equipment_on_site[], access_constraints[] },
  "environment":     { impacts[], waste_management },
  "evidence":        [{ evidence_id, source, bbox_xyxy, description }],
  "unknown":         [{ question, why_unknown, needed_input }],
  "domain_fields":   { custom_kpis, lot_breakdown, client_specific },
  "metadata":        { model, version, generated_at }
}
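
Since all 15 top-level fields are described as required, a downstream consumer may want to check for them before using an output. The following is a minimal sketch of such a check; the field names are copied from the schema summary above, while `REQUIRED_FIELDS`, `missing_top_level_fields`, and the partial sample are hypothetical and only verify key presence, not the nested structure.

```python
import json

# Top-level keys as listed in the schema summary.
REQUIRED_FIELDS = (
    "job_type", "asset_type", "scene_context", "summary", "progress",
    "work_activities", "observations", "safety", "quality", "logistics",
    "environment", "evidence", "unknown", "domain_fields", "metadata",
)


def missing_top_level_fields(analysis):
    """Return the required top-level keys absent from a parsed output."""
    return [name for name in REQUIRED_FIELDS if name not in analysis]


# Hypothetical partial output, used only to exercise the check.
partial = json.loads('{"job_type": "construction", "asset_type": "building", "safety": {}}')
missing = missing_top_level_fields(partial)
print(len(REQUIRED_FIELDS), len(missing))  # 15 12
```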

Safety-Specific Schema Detail

V2 particularly excels at populating the safety section:

{
  "safety": {
    "overall_risk_level": "low | medium | high | unknown",
    "ppe": [
      {
        "role": "worker | visitor | unknown",
        "ppe_item": "helmet | vest | gloves | goggles | harness | boots | mask | other",
        "status": "compliant | non_compliant | unknown",
        "confidence": 0.0,
        "evidence_ids": ["ev_XXX"]
      }
    ],
    "hazards": [
      {
        "hazard_type": "fall_risk | open_trench | moving_vehicle | electrical | fire | unstable_load | poor_housekeeping | restricted_area | other",
        "severity": "low | medium | high | unknown",
        "confidence": 0.0,
        "evidence_ids": ["ev_XXX"]
      }
    ],
    "control_measures": [
      {
        "measure": "guardrail | barrier | signage | netting | cones | spotter | lockout_tagout | other",
        "status": "present | missing | unknown",
        "confidence": 0.0,
        "evidence_ids": ["ev_XXX"]
      }
    ]
  }
}
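
One way to consume this section is to pull out the items that usually need attention first. The sketch below assumes the field names and enum values shown above; the function `safety_findings` and the sample values are illustrative and not part of the model or its tooling.

```python
def safety_findings(safety):
    """Collect non-compliant PPE items, high-severity hazards, and missing controls."""
    return {
        "non_compliant_ppe": [
            p["ppe_item"] for p in safety.get("ppe", [])
            if p.get("status") == "non_compliant"
        ],
        "high_hazards": [
            h["hazard_type"] for h in safety.get("hazards", [])
            if h.get("severity") == "high"
        ],
        "missing_controls": [
            c["measure"] for c in safety.get("control_measures", [])
            if c.get("status") == "missing"
        ],
    }


# Illustrative safety section following the schema above.
sample_safety = {
    "overall_risk_level": "high",
    "ppe": [
        {"role": "worker", "ppe_item": "helmet", "status": "compliant",
         "confidence": 0.92, "evidence_ids": ["ev_003"]},
        {"role": "worker", "ppe_item": "harness", "status": "non_compliant",
         "confidence": 0.75, "evidence_ids": ["ev_004"]},
    ],
    "hazards": [
        {"hazard_type": "fall_risk", "severity": "high",
         "confidence": 0.88, "evidence_ids": ["ev_005"]},
    ],
    "control_measures": [
        {"measure": "netting", "status": "missing",
         "confidence": 0.70, "evidence_ids": ["ev_005"]},
    ],
}

findings = safety_findings(sample_safety)
print(findings)
# {'non_compliant_ppe': ['harness'], 'high_hazards': ['fall_risk'], 'missing_controls': ['netting']}
```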

Example Output

Given a photograph of an active construction site with workers:

{
  "job_type": "construction",
  "asset_type": "building",
  "scene_context": {
    "location_hint": "outdoor",
    "weather_light": "day",
    "viewpoint": "ground"
  },
  "summary": {
    "one_liner": "Active multi-story building construction site with scaffolding, multiple workers performing structural work with mixed PPE compliance.",
    "confidence": 0.90
  },
  "progress": {
    "overall_stage": "structure",
    "stage_confidence": 0.88,
    "progress_percent_estimate": 45,
    "progress_confidence": 0.40,
    "milestones_detected": [
      { "name": "Foundation complete", "status": "done", "confidence": 0.85, "evidence_ids": ["ev_001"] },
      { "name": "Structural framing in progress", "status": "in_progress", "confidence": 0.90, "evidence_ids": ["ev_002"] }
    ]
  },
  "safety": {
    "overall_risk_level": "high",
    "ppe": [
      { "role": "worker", "ppe_item": "helmet", "status": "compliant", "confidence": 0.92, "evidence_ids": ["ev_003"] },
      { "role": "worker", "ppe_item": "vest", "status": "compliant", "confidence": 0.90, "evidence_ids": ["ev_003"] },
      { "role": "worker", "ppe_item": "harness", "status": "non_compliant", "confidence": 0.75, "evidence_ids": ["ev_004"] },
      { "role": "worker", "ppe_item": "boots", "status": "compliant", "confidence": 0.80, "evidence_ids": ["ev_003"] }
    ],
    "hazards": [
      { "hazard_type": "fall_risk", "severity": "high", "confidence": 0.88, "evidence_ids": ["ev_005"] },
      { "hazard_type": "unstable_load", "severity": "medium", "confidence": 0.65, "evidence_ids": ["ev_006"] }
    ],
    "control_measures": [
      { "measure": "guardrail", "status": "present", "confidence": 0.85, "evidence_ids": ["ev_007"] },
      { "measure": "netting", "status": "missing", "confidence": 0.70, "evidence_ids": ["ev_005"] }
    ]
  },
  "evidence": [
    { "evidence_id": "ev_001", "source": "image", "description": "Completed concrete foundation visible at ground level" },
    { "evidence_id": "ev_003", "source": "image", "description": "Workers wearing hard hats, high-vis vests, and safety boots on scaffolding" },
    { "evidence_id": "ev_004", "source": "image", "description": "Worker at height without visible safety harness attachment" },
    { "evidence_id": "ev_005", "source": "image", "description": "Open edges on upper floors without safety netting" }
  ]
}

(Truncated for readability -- full output includes all 15 top-level fields)
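
Because findings point to visual proof through `evidence_ids`, it can be useful to confirm that every referenced ID has a matching entry in `evidence`. The sketch below is one possible check, assuming the key names shown in the schema; the function names and the small fragment are hypothetical. Note that a truncated output like the example above would report the IDs whose evidence entries were cut.

```python
def collect_evidence_ids(node):
    """Recursively gather every value listed under an 'evidence_ids' key."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "evidence_ids":
                found.update(value)
            else:
                found |= collect_evidence_ids(value)
    elif isinstance(node, list):
        for item in node:
            found |= collect_evidence_ids(item)
    return found


def dangling_evidence_ids(analysis):
    """Return referenced evidence IDs with no entry in analysis['evidence']."""
    defined = {entry["evidence_id"] for entry in analysis.get("evidence", [])}
    return sorted(collect_evidence_ids(analysis) - defined)


# Reduced, illustrative fragment: ev_006 is referenced but never defined.
fragment = {
    "safety": {
        "hazards": [
            {"hazard_type": "fall_risk", "evidence_ids": ["ev_005"]},
            {"hazard_type": "unstable_load", "evidence_ids": ["ev_006"]},
        ]
    },
    "evidence": [
        {"evidence_id": "ev_005", "source": "image",
         "description": "Open edges on upper floors without safety netting"},
    ],
}

print(dangling_evidence_ids(fragment))  # ['ev_006']
```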

Training Details

Stage 2 (V2 Safety Enhancement)

Parameter | Value
Base model | Horama/Horama_BTP (V1)
Method | LoRA (Parameter-Efficient Fine-Tuning)
Training images | 10,000+ construction site photographs
Focus | Safety inspection, PPE detection, hazard recognition
LoRA rank | r=128 (HPO-optimized, 4x V1 capacity)
LoRA alpha | 256 (2x rank)
LoRA dropout | 0.05 (HPO-optimized)
Epochs | 3
Effective batch size | 8 (batch=2, accumulation=4)
Learning rate | 2.52e-4 (HPO-optimized, cosine schedule)
Warmup | 3% of training steps
Weight decay | 0.0 (HPO-optimized)
Gradient checkpointing | Enabled
Framework | Transformers + PEFT
Hyperparameter search | Bayesian optimization via Optuna
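
To make the batch and schedule figures concrete, here is a small worked sketch. It assumes exactly 10,000 training images (the card only states "10,000+"), a linear warmup, and a cosine decay to zero; the exact schedule shape and step count used in training are not stated in this card, and `lr_at_step` is a hypothetical helper.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2.52e-4, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumption: exactly 10,000 images; values for epochs/batch come from the table.
images, epochs, batch, accumulation = 10_000, 3, 2, 4
effective_batch = batch * accumulation              # 2 * 4 = 8
total_steps = images * epochs // effective_batch    # 30,000 / 8 = 3750
warmup_steps = int(total_steps * 0.03)              # 3% of 3750, rounded down

print(effective_batch, total_steps, warmup_steps)   # 8 3750 112
```

Under these assumptions the learning rate rises from 0 to 2.52e-4 over the first 112 optimizer steps and then decays along a half cosine for the remaining steps.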

Hyperparameter Optimization

V2 hyperparameters were selected through Bayesian optimization (Optuna) searching over:

Hyperparameter | Search space | Optimal value
Learning rate | [1e-5, 5e-4] | 2.52e-4
LoRA rank | {16, 32, 64, 128} | 128
LoRA dropout | {0.0, 0.1, 0.2} | 0.05
Weight decay | {0.0, 0.01, 0.05, 0.1} | 0.0
Gradient accumulation | {4, 8, 16} | 4

V1 vs V2: When to Use Which

Use case | Recommended
Safety audits & PPE compliance | V2 -- significantly better safety detection
Hazard identification | V2 -- trained on diverse hazard scenarios
General progress tracking | Either; both work well
Quality defect detection | Either; both comparable
Resource-constrained deployment | Either; both are merged models with the same architecture and inference footprint (V1 only had lighter training)
Safety-critical applications | V2 -- deeper safety understanding

Intended Uses

Primary use cases:

  • Automated safety compliance auditing from site photographs
  • PPE verification across construction teams
  • Hazard detection and risk level assessment
  • Construction progress reporting
  • Quality control and defect identification
  • Environmental impact documentation

Input requirements:

  • Single construction site image (JPEG, PNG, WebP, BMP)
  • Supports ground-level, drone, and fixed-camera viewpoints
  • Works best with daylight, well-lit images

Limitations

  • Single-image analysis: Analyzes one image at a time; no temporal comparison between images
  • Visible elements only: Cannot detect hidden structural issues or elements behind walls/coverings
  • No sensory data: Cannot measure noise levels, dust concentration, or air quality from static images
  • PPE at distance: Small or distant workers may have lower PPE detection confidence
  • Schema-bound: Output follows the Horama-BTP v1 schema strictly -- custom fields use the domain_fields extension
  • Not a replacement for human inspectors: The model assists and augments human safety inspections but should not be the sole decision-maker for safety-critical assessments

Hardware Requirements

Setup | VRAM / RAM | Precision | Notes
NVIDIA GPU | ~8 GB VRAM | BF16 | Recommended for production
Apple Silicon | ~8 GB RAM | FP32 | Supported via MPS backend
CPU | ~12 GB RAM | FP32 | Functional but slower

License

AGPL-3.0 -- This model can be freely used, modified, and redistributed as long as derivative work remains open-source under the same license.

For commercial or closed-source usage, please contact Horama for a commercial license.

Citation

@misc{horama-btp-v2-2025,
  title   = {Horama-BTP V2: Safety-Enhanced Vision-Language Model for Construction Site Analysis},
  author  = {Horama},
  year    = {2025},
  url     = {https://huggingface.co/Horama/Horama_BTP_v2},
  note    = {Two-stage LoRA fine-tuning from Qwen2.5-VL-3B-Instruct with HPO-optimized safety training}
}

Built by Horama | Construction intelligence, powered by vision AI
