Arabic Question Answering with LoRA Fine-tuning

A T5 model fine-tuned for Arabic question answering using Low-Rank Adaptation (LoRA) on top of UBC-NLP's AraT5v2-base-1024.

Model Details

Model Description

This model is a fine-tuned version of UBC-NLP/AraT5v2-base-1024 for extractive question answering tasks on Arabic text. The model uses LoRA (Low-Rank Adaptation) for efficient fine-tuning, making it parameter-efficient while maintaining high performance on Arabic QA tasks.

The model takes a question and context as input and generates the answer extracted from the provided context, following the SQuAD-style question answering format.

  • Developed by: Diaa Essam
  • Model type: Seq2Seq Question Answering
  • Language(s) (NLP): Arabic (ar)
  • License: MIT
  • Finetuned from model: UBC-NLP/AraT5v2-base-1024

Uses

Direct Use

The model can be directly used for Arabic Question Answering tasks without additional fine-tuning. It's suitable for:

  • Extracting answers from Arabic documents and articles
  • Building Arabic reading comprehension systems
  • Information retrieval from Arabic text
  • Arabic chatbots and conversational AI
  • Educational applications for Arabic language learning

Downstream Use

The model can be further fine-tuned for:

  • Domain-specific QA (medical, legal, technical Arabic text)
  • Multi-hop question answering
  • Conversational question answering
  • Arabic summarization tasks

Out-of-Scope Use

  • Non-Arabic text (the model is trained exclusively on Arabic)
  • Open-domain question answering without context
  • Generating answers not present in the provided context
  • Real-time applications requiring sub-millisecond latency without optimization
  • Questions in other languages or code-switched text

Bias, Risks, and Limitations

  • The model's performance may vary across different Arabic dialects and writing styles
  • Answer extraction accuracy depends on context quality and may be lower for informal or dialectal Arabic
  • The model may struggle with complex reasoning or multi-hop questions
  • Performance depends on the quality and clarity of the provided context
  • The model is optimized for extractive QA and may not perform well on generative or abstractive tasks

Recommendations

Users should be aware of the model's limitations regarding:

  • Dialectal variations in Arabic text
  • Context length (maximum 1024 tokens)
  • The model extracts answers from context; it doesn't generate knowledge
  • Potential biases in the training data
  • Need for clear and well-formed questions

How to Get Started with the Model

from transformers import T5Tokenizer, T5ForConditionalGeneration, GenerationConfig
from peft import PeftModel
import torch

# Load tokenizer and base model
MODEL_NAME = "UBC-NLP/AraT5v2-base-1024"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
base_model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "[checkpoint_path]")
model.eval()

# Move the model to the GPU once at load time instead of on every call
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Create generation config
generation_config = GenerationConfig(
    max_length=128,
    min_length=5,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=3,
    repetition_penalty=1.5,
    length_penalty=0.8,
    do_sample=False
)

def answer_question(question, context):
    """Generate an answer from a question and its context."""
    # Format input exactly as during training
    input_text = f"question: {question} context: {context}"

    # Tokenize, truncating to the model's 1024-token window
    inputs = tokenizer(
        input_text,
        return_tensors='pt',
        max_length=1024,
        truncation=True
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate answer with deterministic beam search
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            generation_config=generation_config
        )

    # Decode, skipping special tokens
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
question = "ู…ุง ู‡ูŠ ุนุงุตู…ุฉ ู…ุตุฑุŸ"
context = "ุงู„ู‚ุงู‡ุฑุฉ ู‡ูŠ ุนุงุตู…ุฉ ุฌู…ู‡ูˆุฑูŠุฉ ู…ุตุฑ ุงู„ุนุฑุจูŠุฉ ูˆุฃูƒุจุฑ ู…ุฏู†ู‡ุง. ุชู‚ุน ุนู„ู‰ ุถูุงู ู†ู‡ุฑ ุงู„ู†ูŠู„ ููŠ ุดู…ุงู„ ู…ุตุฑ."

answer = answer_question(question, context)
print(f"Question: {question}")
print(f"Answer: {answer}")
# Output: ุงู„ู‚ุงู‡ุฑุฉ

Training Details

Training Data

The model was trained on an Arabic Question Answering dataset in SQuAD format, containing:

  • Question-context-answer triples
  • Extractive answer spans drawn from the provided context
  • Diverse topics and domains in Modern Standard Arabic

Training data composition:

  • Training samples: Combined train + validation sets (merged for final training)
  • Validation samples: 100 samples from test set
  • Test samples: Held-out test set for evaluation
  • Format: SQuAD-style JSON with questions, contexts, and answers

Training Procedure

Preprocessing

  • Text tokenization using AraT5v2 tokenizer
  • Input format: "question: {question} context: {context}"
  • Target format: "{answer}"
  • Maximum source length: 1024 tokens (matching model's context window)
  • Maximum target length: 128 tokens
  • Padding to max length with truncation
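The formatting step above amounts to a small helper; a minimal sketch (the function name `format_example` is illustrative, not taken from the training code):

```python
def format_example(question, context, answer):
    """Build the seq2seq source and target strings used during preprocessing."""
    source = f"question: {question} context: {context}"
    target = answer  # the target is the raw answer string
    return source, target

# Example: an Arabic question-context-answer triple
src, tgt = format_example("ما هي عاصمة مصر؟", "القاهرة هي عاصمة مصر.", "القاهرة")
```

Both strings would then be tokenized with the AraT5v2 tokenizer (source capped at 1024 tokens, target at 128).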

Training Hyperparameters

LoRA Configuration:

  • LoRA rank (r): 32
  • LoRA alpha: 64
  • LoRA dropout: 0.1
  • Target modules: q, k, v, o (attention layers)
  • Bias: none
  • Task type: SEQ_2_SEQ_LM
  • Trainable parameters: ~1.8% of total model parameters
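The configuration above maps onto PEFT's `LoraConfig`; a sketch assuming the T5 attention projections are exposed under the module names `q`, `k`, `v`, and `o` listed above:

```python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=32,                                # LoRA rank
    lora_alpha=64,                       # scaling factor
    lora_dropout=0.1,
    target_modules=["q", "k", "v", "o"], # T5 attention projections
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)
```

The adapter is then attached with `get_peft_model(base_model, lora_config)`, which leaves roughly 1.8% of the parameters trainable.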

Training Arguments:

  • Training regime: fp16 mixed precision (when GPU available)
  • Learning rate: 1e-4
  • Batch size per device: 4 (training and evaluation)
  • Gradient accumulation steps: 4
  • Effective batch size: 16
  • Number of epochs: 5
  • Weight decay: 0.01
  • Warmup steps: 500
  • Optimizer: AdamW
  • LR scheduler: Linear
  • Label smoothing: None
  • Save steps: 100
  • Eval steps: 100
  • Logging steps: 100
  • Save total limit: 10
  • Load best model at end: True
  • Metric for best model: F1 Score

Generation Configuration:

  • Max length: 128
  • Min length: 5
  • Num beams: 5
  • Early stopping: True
  • No repeat ngram size: 3
  • Repetition penalty: 1.5
  • Length penalty: 0.8
  • Sampling: False (deterministic beam search)

Speeds, Sizes, Times

  • Training time: Approximately 2-3 hours (with GPU acceleration)
  • Model size: Base model + LoRA adapter (~1.2GB total)
  • Inference speed: Varies by hardware; optimized for GPU inference with beam search
  • Checkpoint frequency: Every 100 steps

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was evaluated on the test split of the Arabic QA dataset, containing unseen question-context-answer triples.

Metrics

The model is evaluated using standard QA metrics:

  • F1 Score: Token-level F1 score measuring word overlap between predicted and reference answers

    • Normalized for Arabic text (removes diacritics, normalizes Alef/Ya variants)
    • Primary metric for model selection
  • Exact Match (EM): Percentage of predictions that exactly match the reference answer after normalization

    • Stricter metric requiring perfect match
    • Important for applications requiring high precision
  • BLEU Score: Measures n-gram overlap between prediction and reference

    • Captures fluency and word choice quality
    • Useful for evaluating answer generation quality
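The token-level F1 described above can be sketched in a few lines (computed here on whitespace tokens, before any Arabic-specific normalization):

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1: harmonic mean of precision and recall over word overlap."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

In practice both strings would first pass through the Arabic normalization described below before scoring.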

All metrics use Arabic text normalization including:

  • Diacritics removal
  • Alef variant normalization (ุฅ, ุฃ, ุข โ†’ ุง)
  • Ta Marbuta normalization (ุฉ โ†’ ู‡)
  • Ya normalization (ู‰ โ†’ ูŠ)
  • Punctuation removal
  • Lowercasing of any embedded Latin-script text (Arabic script has no case)
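A minimal sketch of this normalization pipeline, assuming the standard Unicode range U+064B–U+0652 for Arabic diacritics:

```python
import re

def normalize_arabic(text):
    """Normalize Arabic text before metric computation."""
    # Remove diacritics (tashkeel), U+064B..U+0652
    text = re.sub(r"[\u064B-\u0652]", "", text)
    # Normalize Alef variants (إ, أ, آ) to bare Alef (ا)
    text = re.sub(r"[إأآ]", "ا", text)
    # Normalize Ta Marbuta (ة) to Ha (ه)
    text = text.replace("ة", "ه")
    # Normalize Alef Maqsura (ى) to Ya (ي)
    text = text.replace("ى", "ي")
    # Strip punctuation, keeping letters, digits, and whitespace
    text = re.sub(r"[^\w\s]", "", text)
    return text.strip()
```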

Results

Test Set Performance:

  • F1 Score: 0.614
  • Exact Match: 0.23
  • BLEU: 0.475

The model performs solidly on Arabic extractive question answering for a parameter-efficient fine-tune, with particular strengths in:

  • Short, factual answers
  • Questions about named entities (people, places, organizations)
  • When-type and Where-type questions
  • Questions with clear context boundaries

Technical Specifications

Model Architecture and Objective

  • Base Architecture: T5 (Text-to-Text Transfer Transformer)
  • Specific Model: AraT5v2 base with 1024 token context window
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Task: Seq2Seq Question Answering (Extractive)
  • Objective: Generate answer text given question and context
  • Parameters: ~220M base parameters + ~8M trainable LoRA parameters

Compute Infrastructure

Hardware

  • GPU-accelerated training (CUDA-enabled)
  • Optimized for modern GPUs with fp16 support
  • Tested on Kaggle GPU environments

Software

  • Framework: PyTorch with Transformers and PEFT libraries
  • Key Libraries:
    • transformers (Hugging Face)
    • peft (Parameter-Efficient Fine-Tuning)
    • evaluate (HuggingFace Evaluate)
    • torch (PyTorch)
    • datasets (Hugging Face)

Framework Versions

  • PEFT: Latest compatible version
  • Transformers: 4.x
  • PyTorch: 2.x with CUDA support
  • Python: 3.8+
  • Evaluate: Latest version

Limitations and Considerations

Known Limitations

  1. Context Length: Maximum 1024 tokens; longer contexts are truncated
  2. Extractive Only: Cannot generate answers not present in context
  3. Single-hop QA: Optimized for single-hop reasoning; may struggle with complex multi-hop questions
  4. Dialect Sensitivity: Best performance on Modern Standard Arabic (MSA)
  5. Answer Length: Optimized for short to medium answers (5-128 tokens)

Best Practices

  • Provide clear, focused contexts containing the answer
  • Ensure questions are well-formed and unambiguous
  • Keep contexts under 1024 tokens for best results
  • Use Modern Standard Arabic for optimal performance
  • Post-process answers for your specific use case
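As a rough pre-flight check for the 1024-token limit, input length can be estimated before calling the model; the tokens-per-word ratio below is a crude assumption, and the actual AraT5v2 tokenizer should be used for an exact count:

```python
def likely_truncated(question, context, max_tokens=1024, tokens_per_word=1.3):
    """Rough estimate of whether the formatted input exceeds the context window.

    The tokens-per-word ratio is a crude assumption for subword tokenizers;
    tokenize with the real tokenizer when an exact count matters.
    """
    formatted = f"question: {question} context: {context}"
    return len(formatted.split()) * tokens_per_word > max_tokens
```

When this returns True, the context should be shortened or split before querying, since silent truncation can drop the passage containing the answer.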

Citation

If you use this model, please cite:

BibTeX:

@misc{arabic-qa-arat5-lora,
  author = {Diaa Eldin Essam Zaki},
  title = {Arabic Question Answering with LoRA Fine-tuning on AraT5v2},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/[model-path]}}
}

APA:

Diaa Eldin Essam Zaki. (2025). Arabic Question Answering with LoRA Fine-tuning on AraT5v2. HuggingFace Model Hub.

Glossary

  • QA: Question Answering - the task of automatically answering questions posed in natural language
  • LoRA: Low-Rank Adaptation - a parameter-efficient fine-tuning method
  • Extractive QA: Question answering where the answer is extracted directly from the context
  • SQuAD: Stanford Question Answering Dataset - a popular QA benchmark format
  • AraT5: Arabic T5 model pre-trained on large Arabic corpora
  • Seq2Seq: Sequence-to-Sequence - models that transform one sequence into another
  • Beam Search: Decoding strategy that explores multiple hypotheses to find the best answer
  • F1 Score: Harmonic mean of precision and recall at the token level
  • Exact Match: Strict metric requiring perfect answer match

Model Card Authors

Diaa Essam

Model Card Contact

diaaesam123@gmail.com


Acknowledgments

  • Base model: UBC-NLP for AraT5v2
  • Framework: HuggingFace Transformers and PEFT teams
  • Infrastructure: Kaggle for providing GPU resources

Version History

  • v1.0 (2025-01): Initial release with LoRA fine-tuning on Arabic QA dataset