Arabic Question Answering with LoRA Fine-tuning
A fine-tuned T5 model for Arabic Question Answering using Low-Rank Adaptation (LoRA) on the AraT5v2-base-1024 model.
Model Details
Model Description
This model is a fine-tuned version of UBC-NLP/AraT5v2-base-1024 for extractive question answering tasks on Arabic text. The model uses LoRA (Low-Rank Adaptation) for efficient fine-tuning, making it parameter-efficient while maintaining high performance on Arabic QA tasks.
The model takes a question and context as input and generates the answer extracted from the provided context, following the SQuAD-style question answering format.
- Developed by: Diaa Essam
- Model type: Seq2Seq Question Answering
- Language(s) (NLP): Arabic (ar)
- License: MIT
- Finetuned from model: UBC-NLP/AraT5v2-base-1024
Model Sources
- Repository: Kaggle notebook
- Base Model: UBC-NLP/AraT5v2-base-1024
- Dataset: DataScienceUIBK/ArabicaQA
Uses
Direct Use
The model can be directly used for Arabic Question Answering tasks without additional fine-tuning. It's suitable for:
- Extracting answers from Arabic documents and articles
- Building Arabic reading comprehension systems
- Information retrieval from Arabic text
- Arabic chatbots and conversational AI
- Educational applications for Arabic language learning
Downstream Use
The model can be further fine-tuned for:
- Domain-specific QA (medical, legal, technical Arabic text)
- Multi-hop question answering
- Conversational question answering
- Arabic summarization tasks
Out-of-Scope Use
- Non-Arabic text (the model is trained exclusively on Arabic)
- Open-domain question answering without context
- Generating answers not present in the provided context
- Real-time applications requiring sub-millisecond latency without optimization
- Questions in other languages or code-switched text
Bias, Risks, and Limitations
- The model's performance may vary across different Arabic dialects and writing styles
- Answer extraction accuracy depends on context quality and may be lower for informal or dialectal Arabic
- The model may struggle with complex reasoning or multi-hop questions
- Performance depends on the quality and clarity of the provided context
- The model is optimized for extractive QA and may not perform well on generative or abstractive tasks
Recommendations
Users should be aware of the model's limitations regarding:
- Dialectal variations in Arabic text
- Context length (maximum 1024 tokens)
- Extractive behavior: the model extracts answers from context rather than generating knowledge
- Potential biases in the training data
- Need for clear and well-formed questions
How to Get Started with the Model
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration, GenerationConfig
from peft import PeftModel
import torch

# Load tokenizer and base model
MODEL_NAME = "UBC-NLP/AraT5v2-base-1024"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
base_model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "[checkpoint_path]")
model.eval()

# Move the model to the GPU once, if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Create generation config
generation_config = GenerationConfig(
    max_length=128,
    min_length=5,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=3,
    repetition_penalty=1.5,
    length_penalty=0.8,
    do_sample=False
)

def answer_question(question, context):
    """Generate an answer from a question and its context."""
    # Format the input the same way as during training
    input_text = f"question: {question} context: {context}"

    # Tokenize, truncating to the model's 1024-token window
    inputs = tokenizer(
        input_text,
        return_tensors='pt',
        max_length=1024,
        truncation=True
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate the answer with beam search
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            generation_config=generation_config
        )

    # Decode
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
question = "ما هي عاصمة مصر؟"  # "What is the capital of Egypt?"
context = "القاهرة هي عاصمة جمهورية مصر العربية وأكبر مدنها. تقع على ضفاف نهر النيل في شمال مصر."

answer = answer_question(question, context)
print(f"Question: {question}")
print(f"Answer: {answer}")
# Output: القاهرة (Cairo)
```
Training Details
Training Data
The model was trained on an Arabic Question Answering dataset in SQuAD format, containing:
- Question-context-answer triples
- Extractive answers drawn as spans from the provided context
- Diverse topics and domains in Modern Standard Arabic
Training data composition:
- Training samples: Combined train + validation sets (merged for final training)
- Validation samples: 100 samples from test set
- Test samples: Held-out test set for evaluation
- Format: SQuAD-style JSON with questions, contexts, and answers
Training Procedure
Preprocessing
- Text tokenization using AraT5v2 tokenizer
- Input format: "question: {question} context: {context}"
- Target format: "{answer}"
- Maximum source length: 1024 tokens (matching the model's context window)
- Maximum target length: 128 tokens
- Padding to max length with truncation
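The preprocessing steps above can be sketched as follows. This is an illustrative reconstruction, not the released training code: the function names are hypothetical, and the tokenizer is passed in as an argument (any Hugging Face tokenizer, e.g. the AraT5v2 one, works).

```python
def build_source(question: str, context: str) -> str:
    """Format the model input exactly as described above."""
    return f"question: {question} context: {context}"

def preprocess(example: dict, tokenizer,
               max_source_len: int = 1024, max_target_len: int = 128) -> dict:
    """Tokenize one QA example into (input_ids, labels) for seq2seq training."""
    model_inputs = tokenizer(
        build_source(example["question"], example["context"]),
        max_length=max_source_len, padding="max_length", truncation=True,
    )
    labels = tokenizer(
        text_target=example["answer"],
        max_length=max_target_len, padding="max_length", truncation=True,
    )
    # Mask padding in the labels so cross-entropy ignores it (-100 convention)
    model_inputs["labels"] = [
        t if t != tokenizer.pad_token_id else -100 for t in labels["input_ids"]
    ]
    return model_inputs
```

In practice this would be applied over the dataset with `datasets.Dataset.map`.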
Training Hyperparameters
LoRA Configuration:
- LoRA rank (r): 32
- LoRA alpha: 64
- LoRA dropout: 0.1
- Target modules: q, k, v, o (attention layers)
- Bias: none
- Task type: SEQ_2_SEQ_LM
- Trainable parameters: ~1.8% of total model parameters
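The configuration above maps directly onto PEFT's `LoraConfig`; a sketch (field names follow recent peft releases):

```python
from peft import LoraConfig, TaskType

# LoRA configuration matching the hyperparameters listed above
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=32,                                 # LoRA rank
    lora_alpha=64,                        # scaling factor (alpha / r = 2)
    lora_dropout=0.1,
    target_modules=["q", "k", "v", "o"],  # T5 attention projections
    bias="none",
)
```

The base model is then wrapped with `get_peft_model(base_model, lora_config)`; calling `model.print_trainable_parameters()` on the result reports the ~1.8% trainable fraction quoted above.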
Training Arguments:
- Training regime: fp16 mixed precision (when GPU available)
- Learning rate: 1e-4
- Batch size per device: 4 (training and evaluation)
- Gradient accumulation steps: 4
- Effective batch size: 16
- Number of epochs: 5
- Weight decay: 0.01
- Warmup steps: 500
- Optimizer: AdamW
- LR scheduler: Linear
- Label smoothing: None
- Save steps: 100
- Eval steps: 100
- Logging steps: 100
- Save total limit: 10
- Load best model at end: True
- Metric for best model: F1 score
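These arguments correspond to a `Seq2SeqTrainingArguments` instance along these lines. The `output_dir` and the `"f1"` metric key are illustrative (the key must match whatever the `compute_metrics` function reports), and `eval_strategy` is named `evaluation_strategy` in older transformers releases:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="arat5v2-qa-lora",     # illustrative path
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,    # effective batch size: 4 x 4 = 16
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_steps=500,
    lr_scheduler_type="linear",
    fp16=True,                        # mixed precision when a GPU is available
    eval_strategy="steps",
    eval_steps=100,
    save_steps=100,
    logging_steps=100,
    save_total_limit=10,
    load_best_model_at_end=True,
    metric_for_best_model="f1",       # assumes compute_metrics returns an "f1" key
    predict_with_generate=True,
)
```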
Generation Configuration:
- Max length: 128
- Min length: 5
- Num beams: 5
- Early stopping: True
- No repeat ngram size: 3
- Repetition penalty: 1.5
- Length penalty: 0.8
- Sampling: False (deterministic beam search)
Speeds, Sizes, Times
- Training time: Approximately 2-3 hours (with GPU acceleration)
- Model size: Base model + LoRA adapter (~1.2GB total)
- Inference speed: Varies by hardware; optimized for GPU inference with beam search
- Checkpoint frequency: Every 100 steps
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was evaluated on the test split of the Arabic QA dataset, containing unseen question-context-answer triples.
Metrics
The model is evaluated using standard QA metrics:
F1 Score: Token-level F1 score measuring word overlap between predicted and reference answers
- Normalized for Arabic text (removes diacritics, normalizes Alef/Ya variants)
- Primary metric for model selection
Exact Match (EM): Percentage of predictions that exactly match the reference answer after normalization
- Stricter metric requiring perfect match
- Important for applications requiring high precision
BLEU Score: Measures n-gram overlap between prediction and reference
- Captures fluency and word choice quality
- Useful for evaluating answer generation quality
All metrics use Arabic text normalization including:
- Diacritics removal
- Alef variant normalization (إ, أ, آ → ا)
- Ta Marbuta normalization (ة → ه)
- Ya normalization (ى → ي)
- Punctuation removal
- Case normalization
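The normalization and the token-level F1/EM scoring described above can be sketched as follows. This is an illustrative re-implementation (function names are hypothetical), not the exact evaluation script:

```python
import re

def normalize_arabic(text: str) -> str:
    """Apply the normalization steps listed above before scoring."""
    text = re.sub(r"[\u064B-\u0652]", "", text)             # strip diacritics (harakat)
    text = re.sub(r"[\u0625\u0623\u0622]", "\u0627", text)  # إ / أ / آ -> ا
    text = text.replace("\u0629", "\u0647")                 # ة -> ه
    text = text.replace("\u0649", "\u064A")                 # ى -> ي
    text = re.sub(r"[^\w\s]", "", text)                     # drop punctuation
    return text.lower().strip()                             # case-fold any Latin text

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between normalized prediction and reference."""
    pred = normalize_arabic(prediction).split()
    ref = normalize_arabic(reference).split()
    ref_counts = {}
    for t in ref:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in pred:
        if ref_counts.get(t, 0) > 0:
            common += 1
            ref_counts[t] -= 1
    if not pred or not ref or common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, reference: str) -> bool:
    """Strict match after normalization."""
    return normalize_arabic(prediction) == normalize_arabic(reference)
```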
Results
Test Set Performance:
- F1 Score: 0.614
- Exact Match: 0.23
- BLEU: 0.475
The model demonstrates strong performance on Arabic extractive question answering, with particular strengths in:
- Short, factual answers
- Questions about named entities (people, places, organizations)
- When-type and Where-type questions
- Questions with clear context boundaries
Technical Specifications
Model Architecture and Objective
- Base Architecture: T5 (Text-to-Text Transfer Transformer)
- Specific Model: AraT5v2 base with 1024 token context window
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- Task: Seq2Seq Question Answering (Extractive)
- Objective: Generate answer text given question and context
- Parameters: ~220M base parameters + ~8M trainable LoRA parameters
Compute Infrastructure
Hardware
- GPU-accelerated training (CUDA-enabled)
- Optimized for modern GPUs with fp16 support
- Tested on Kaggle GPU environments
Software
- Framework: PyTorch with Transformers and PEFT libraries
- Key Libraries:
- transformers (Hugging Face)
- peft (Parameter-Efficient Fine-Tuning)
- evaluate (HuggingFace Evaluate)
- torch (PyTorch)
- datasets (Hugging Face)
Framework Versions
- PEFT: Latest compatible version
- Transformers: 4.x
- PyTorch: 2.x with CUDA support
- Python: 3.8+
- Evaluate: Latest version
Limitations and Considerations
Known Limitations
- Context Length: Maximum 1024 tokens; longer contexts are truncated
- Extractive Only: Cannot generate answers not present in context
- Single-hop QA: Optimized for single-hop reasoning; may struggle with complex multi-hop questions
- Dialect Sensitivity: Best performance on Modern Standard Arabic (MSA)
- Answer Length: Optimized for short to medium answers (5-128 tokens)
Best Practices
- Provide clear, focused contexts containing the answer
- Ensure questions are well-formed and unambiguous
- Keep contexts under 1024 tokens for best results
- Use Modern Standard Arabic for optimal performance
- Post-process answers for your specific use case
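To enforce the 1024-token recommendation programmatically, a small helper (illustrative, not part of the released code) can check that a formatted input fits the window before truncation silently drops the answer:

```python
def fits_context(tokenizer, question: str, context: str, limit: int = 1024) -> bool:
    """Check whether a formatted QA input fits the model's token window."""
    text = f"question: {question} context: {context}"
    n_tokens = len(tokenizer(text)["input_ids"])
    return n_tokens <= limit
```

Over-long contexts can then be split into overlapping chunks and queried separately.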
Citation
If you use this model, please cite:
BibTeX:
@misc{arabic-qa-arat5-lora,
author = {Diaa Eldin Essam Zaki},
title = {Arabic Question Answering with LoRA Fine-tuning on AraT5v2},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/[model-path]}}
}
APA:
Diaa Eldin Essam Zaki. (2025). Arabic Question Answering with LoRA Fine-tuning on AraT5v2. HuggingFace Model Hub.
Glossary
- QA: Question Answering - the task of automatically answering questions posed in natural language
- LoRA: Low-Rank Adaptation - a parameter-efficient fine-tuning method
- Extractive QA: Question answering where the answer is extracted directly from the context
- SQuAD: Stanford Question Answering Dataset - a popular QA benchmark format
- AraT5: Arabic T5 model pre-trained on large Arabic corpora
- Seq2Seq: Sequence-to-Sequence - models that transform one sequence into another
- Beam Search: Decoding strategy that explores multiple hypotheses to find the best answer
- F1 Score: Harmonic mean of precision and recall at the token level
- Exact Match: Strict metric requiring perfect answer match
Model Card Authors
Diaa Essam
Model Card Contact
Acknowledgments
- Base model: UBC-NLP for AraT5v2
- Framework: HuggingFace Transformers and PEFT teams
- Infrastructure: Kaggle for providing GPU resources
Version History
- v1.0 (2025-01): Initial release with LoRA fine-tuning on Arabic QA dataset