Dak-OCR v1: Khasi Optical Character Recognition model

Dak-OCR is a fine-tuned version of DeepSeek-OCR-2 designed for accurate OCR, document understanding, and handwriting recognition in the Khasi language.

It was trained on the custom Khasi-OCR-36K dataset to reduce hallucination and repetition issues often seen in base multimodal models when working with low-resource languages. The model is designed to preserve document structure and produce clean Markdown output.

Model Highlights

Language Support: Native Khasi (using Latin script with special characters like ï and ñ) and English.
Task: Specialized for "Free OCR" (transcribing document images into Markdown-formatted text, preserving headings, paragraphs, lists, and tables).
Robustness: Built to handle degraded, noisy, and historical scans.
Base Model: unsloth/DeepSeek-OCR-2
Hardware & Performance: Training ran on an NVIDIA A100 SXM (80GB VRAM / 16 vCPU) in native bfloat16 precision for quality and memory efficiency.
LoRA Setup: Adaptation was implemented through LoRA with a rank of 64 (r=64) and lora_alpha=128. LoRA layers were applied broadly to linear components across vision encoders, language layers, attention blocks, and MLP modules to enable strong task specialization.
Vision Processing: To effectively handle large or dense document pages, the model uses dynamic high-resolution multi-patch cropping (crop_mode=True) with a base resolution of 1024 and image size of 768, preventing loss of detail from aggressive downscaling.
Model Precision: The final model is provided in native bfloat16 precision.
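
To make the LoRA numbers above concrete, here is a minimal sketch of what r=64 and lora_alpha=128 imply under the standard LoRA formulation: the low-rank update is scaled by lora_alpha / r, and each adapted linear layer gains r * (d_in + d_out) trainable parameters. The layer dimensions below are hypothetical and chosen only for illustration; they are not the actual DeepSeek-OCR-2 layer sizes, and the sketch does not reproduce the training setup.

```python
# Values taken from the model card.
LORA_RANK = 64
LORA_ALPHA = 128


def lora_scaling(rank: int, alpha: int) -> float:
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer (bias ignored).

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * d_in + d_out * rank


# Hypothetical layer size, for illustration only.
D_IN, D_OUT = 1024, 4096

scaling = lora_scaling(LORA_RANK, LORA_ALPHA)
added = lora_added_params(D_IN, D_OUT, LORA_RANK)
full = D_IN * D_OUT

print(f"scaling factor: {scaling}")
print(f"LoRA params for one {D_IN}x{D_OUT} layer: {added} "
      f"({added / full:.1%} of the full weight matrix)")
```

With alpha set to twice the rank, the adapter update is weighted by a factor of 2 relative to an unscaled low-rank product.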

Performance & Evaluation

The model was evaluated on a mixed set of 40 highly dense Khasi samples containing complex markdown and degraded/noisy scans.

EVALUATION RESULTS

Metric	Score
WER (word error rate)	1.71%
CER (character error rate)	0.91%
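
The card does not say how WER and CER were computed. As a rough sketch of the common definition (edit distance divided by reference length), the snippet below computes both metrics in plain Python. It assumes whitespace tokenization for words and NFC Unicode normalization, so that characters such as ï compare equal whether they are stored composed or decomposed. The sample strings are made up for illustration and are not taken from the evaluation set.

```python
import unicodedata
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _normalize(text: str) -> str:
    # Assumption: NFC normalization before comparison.
    return unicodedata.normalize("NFC", text)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference characters."""
    ref, hyp = _normalize(reference), _normalize(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits / reference words (whitespace split)."""
    ref, hyp = _normalize(reference).split(), _normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Made-up strings, for illustration only.
reference = "ka ñiew ïa ki kot"
hypothesis = "ka niew ïa ki kot"
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"CER: {cer(reference, hypothesis):.2%}")
```

Other evaluation choices, such as stripping Markdown syntax or lowercasing before scoring, would change the numbers, so scores are only comparable when the same preprocessing is used.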

Usage

from unsloth import FastVisionModel
from transformers import AutoModel
import torch

# Load Model
model, tokenizer = FastVisionModel.from_pretrained(
    "toiar/Dak-OCR",
    load_in_4bit = False,
    auto_model = AutoModel,
    trust_remote_code = True,
    torch_dtype = torch.bfloat16,
)
FastVisionModel.for_inference(model)

# Greedy decoding: disable sampling and clear the sampling-only settings
model.generation_config.do_sample = False
model.generation_config.temperature = None
model.generation_config.top_p = None

# Inference
prompt = "<image>\nFree OCR."
image_path = "path/to/your/khasi_document.png"

with torch.no_grad():
    output = model.infer(
        tokenizer,
        prompt=prompt,
        image_file=image_path,
        base_size=1024,
        image_size=768,
        crop_mode=True,
    )

print(output)
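
For more than one page, the inference call above can be wrapped in a small batch helper. The following is a minimal sketch, not part of the model's API: `ocr_directory` and its `ocr_fn` parameter are hypothetical names, and the helper assumes `ocr_fn` returns the transcription as a string. The card prints the result of `model.infer` but does not state its return type, so check that before wiring the two together.

```python
from pathlib import Path
from typing import Callable, List

# Assumption: these are the image formats of interest.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def ocr_directory(image_dir, output_dir, ocr_fn: Callable[[str], str]) -> List[Path]:
    """Run ocr_fn on every image in image_dir and write one .md file per image.

    ocr_fn is any callable that maps an image path to Markdown text.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        text = ocr_fn(str(image_path))
        target = out / f"{image_path.stem}.md"
        # UTF-8 keeps Khasi characters such as ï and ñ intact on disk.
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

With the model loaded as shown earlier, `ocr_fn` could be a small function that calls `model.infer` with the same prompt, `base_size`, `image_size`, and `crop_mode` arguments and returns its result.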