Arabic Question Answering with LoRA Fine-tuning
A fine-tuned T5 model for Arabic Question Answering using Low-Rank Adaptation (LoRA) on the AraT5v2-base-1024 model.
Model Details
Model Description
This model is a fine-tuned version of UBC-NLP/AraT5v2-base-1024 for extractive question answering tasks on Arabic text. The model uses LoRA (Low-Rank Adaptation) for efficient fine-tuning, making it parameter-efficient while maintaining high performance on Arabic QA tasks.
The model takes a question and context as input and generates the answer extracted from the provided context, following the SQuAD-style question answering format.
- Developed by: Diaa Essam
- Model type: Seq2Seq Question Answering
- Language(s) (NLP): Arabic (ar)
- License: MIT
- Finetuned from model: UBC-NLP/AraT5v2-base-1024
Model Sources
- Repository: Kaggle notebook
- Base Model: UBC-NLP/AraT5v2-base-1024
- Dataset: DataScienceUIBK/ArabicaQA
Uses
Direct Use
The model can be directly used for Arabic Question Answering tasks without additional fine-tuning. It's suitable for:
- Extracting answers from Arabic documents and articles
- Building Arabic reading comprehension systems
- Information retrieval from Arabic text
- Arabic chatbots and conversational AI
- Educational applications for Arabic language learning
Downstream Use
The model can be further fine-tuned for:
- Domain-specific QA (medical, legal, technical Arabic text)
- Multi-hop question answering
- Conversational question answering
- Arabic summarization tasks
Out-of-Scope Use
- Non-Arabic text (the model is trained exclusively on Arabic)
- Open-domain question answering without context
- Generating answers not present in the provided context
- Real-time applications requiring sub-millisecond latency without optimization
- Questions in other languages or code-switched text
Bias, Risks, and Limitations
- The model's performance may vary across different Arabic dialects and writing styles
- Answer extraction accuracy depends on context quality and may be lower for informal or dialectal Arabic
- The model may struggle with complex reasoning or multi-hop questions
- Performance depends on the quality and clarity of the provided context
- The model is optimized for extractive QA and may not perform well on generative or abstractive tasks
Recommendations
Users should be aware of the model's limitations regarding:
- Dialectal variations in Arabic text
- Context length (maximum 1024 tokens)
- Extractive behavior: the model extracts answers from context rather than generating knowledge
- Potential biases in the training data
- Need for clear and well-formed questions
How to Get Started with the Model
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration, GenerationConfig
from peft import PeftModel
import torch

# Load tokenizer and base model
MODEL_NAME = "UBC-NLP/AraT5v2-base-1024"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
base_model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "[checkpoint_path]")
model.eval()

# Move the model to the GPU once, if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Create generation config
generation_config = GenerationConfig(
    max_length=128,
    min_length=5,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=3,
    repetition_penalty=1.5,
    length_penalty=0.8,
    do_sample=False
)

def answer_question(question, context):
    """Generate an answer from a question and its context."""
    # Format the input the same way as during training
    input_text = f"question: {question} context: {context}"

    # Tokenize, truncating to the model's 1024-token window
    inputs = tokenizer(
        input_text,
        return_tensors='pt',
        max_length=1024,
        truncation=True
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate the answer with beam search
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            generation_config=generation_config
        )

    # Decode
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
question = "ما هي عاصمة مصر؟"  # "What is the capital of Egypt?"
context = "القاهرة هي عاصمة جمهورية مصر العربية وأكبر مدنها. تقع على ضفاف نهر النيل في شمال مصر."

answer = answer_question(question, context)
print(f"Question: {question}")
print(f"Answer: {answer}")
# Output: القاهرة (Cairo)
```
Training Details
Training Data
The model was trained on an Arabic Question Answering dataset in SQuAD format, containing:
- Question-context-answer triples
- Extractive answers drawn as spans from the provided context
- Diverse topics and domains in Modern Standard Arabic
Training data composition:
- Training samples: Combined train + validation sets (merged for final training)
- Validation samples: 100 samples from test set
- Test samples: Held-out test set for evaluation
- Format: SQuAD-style JSON with questions, contexts, and answers
Training Procedure
Preprocessing
- Text tokenization using AraT5v2 tokenizer
- Input format: "question: {question} context: {context}"
- Target format: "{answer}"
- Maximum source length: 1024 tokens (matching the model's context window)
- Maximum target length: 128 tokens
- Padding to max length with truncation
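The preprocessing steps above can be sketched as follows. This is an illustrative reconstruction, not the released training code: the function names are hypothetical, and the tokenizer is passed in as an argument (any Hugging Face tokenizer, e.g. the AraT5v2 one, works).

```python
def build_source(question: str, context: str) -> str:
    """Format the model input exactly as described above."""
    return f"question: {question} context: {context}"

def preprocess(example: dict, tokenizer,
               max_source_len: int = 1024, max_target_len: int = 128) -> dict:
    """Tokenize one QA example into (input_ids, labels) for seq2seq training."""
    model_inputs = tokenizer(
        build_source(example["question"], example["context"]),
        max_length=max_source_len, padding="max_length", truncation=True,
    )
    labels = tokenizer(
        text_target=example["answer"],
        max_length=max_target_len, padding="max_length", truncation=True,
    )
    # Mask padding in the labels so cross-entropy ignores it (-100 convention)
    model_inputs["labels"] = [
        t if t != tokenizer.pad_token_id else -100 for t in labels["input_ids"]
    ]
    return model_inputs
```

In practice this would be applied over the dataset with `datasets.Dataset.map`.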
Training Hyperparameters
LoRA Configuration:
- LoRA rank (r): 32
- LoRA alpha: 64
- LoRA dropout: 0.1
- Target modules: q, k, v, o (attention layers)
- Bias: none
- Task type: SEQ_2_SEQ_LM
- Trainable parameters: ~1.8% of total model parameters
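The configuration above maps directly onto PEFT's `LoraConfig`; a sketch (field names follow recent peft releases):

```python
from peft import LoraConfig, TaskType

# LoRA configuration matching the hyperparameters listed above
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=32,                                 # LoRA rank
    lora_alpha=64,                        # scaling factor (alpha / r = 2)
    lora_dropout=0.1,
    target_modules=["q", "k", "v", "o"],  # T5 attention projections
    bias="none",
)
```

The base model is then wrapped with `get_peft_model(base_model, lora_config)`; calling `model.print_trainable_parameters()` on the result reports the ~1.8% trainable fraction quoted above.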
Training Arguments:
- Training regime: fp16 mixed precision (when GPU available)
- Learning rate: 1e-4
- Batch size per device: 4 (training and evaluation)
- Gradient accumulation steps: 4
- Effective batch size: 16
- Number of epochs: 5
- Weight decay: 0.01
- Warmup steps: 500
- Optimizer: AdamW
- LR scheduler: Linear
- Label smoothing: None
- Save steps: 100
- Eval steps: 100
- Logging steps: 100
- Save total limit: 10
- Load best model at end: True
- Metric for best model: F1 score
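These arguments correspond to a `Seq2SeqTrainingArguments` instance along these lines. The `output_dir` and the `"f1"` metric key are illustrative (the key must match whatever the `compute_metrics` function reports), and `eval_strategy` is named `evaluation_strategy` in older transformers releases:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="arat5v2-qa-lora",     # illustrative path
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,    # effective batch size: 4 x 4 = 16
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_steps=500,
    lr_scheduler_type="linear",
    fp16=True,                        # mixed precision when a GPU is available
    eval_strategy="steps",
    eval_steps=100,
    save_steps=100,
    logging_steps=100,
    save_total_limit=10,
    load_best_model_at_end=True,
    metric_for_best_model="f1",       # assumes compute_metrics returns an "f1" key
    predict_with_generate=True,
)
```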
Generation Configuration:
- Max length: 128
- Min length: 5
- Num beams: 5
- Early stopping: True
- No repeat ngram size: 3
- Repetition penalty: 1.5
- Length penalty: 0.8
- Sampling: False (deterministic beam search)
Speeds, Sizes, Times
- Training time: Approximately 2-3 hours (with GPU acceleration)
- Model size: Base model + LoRA adapter (~1.2GB total)
- Inference speed: Varies by hardware; optimized for GPU inference with beam search
- Checkpoint frequency: Every 100 steps
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was evaluated on the test split of the Arabic QA dataset, containing unseen question-context-answer triples.
Metrics
The model is evaluated using standard QA metrics:
F1 Score: Token-level F1 score measuring word overlap between predicted and reference answers
- Normalized for Arabic text (removes diacritics, normalizes Alef/Ya variants)
- Primary metric for model selection
Exact Match (EM): Percentage of predictions that exactly match the reference answer after normalization
- Stricter metric requiring perfect match
- Important for applications requiring high precision
BLEU Score: Measures n-gram overlap between prediction and reference
- Captures fluency and word choice quality
- Useful for evaluating answer generation quality
All metrics use Arabic text normalization including:
- Diacritics removal
- Alef variant normalization (إ, أ, آ → ا)
- Ta Marbuta normalization (ة → ه)
- Ya normalization (ى → ي)
- Punctuation removal
- Case normalization
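The normalization and the token-level F1/EM scoring described above can be sketched as follows. This is an illustrative re-implementation (function names are hypothetical), not the exact evaluation script:

```python
import re

def normalize_arabic(text: str) -> str:
    """Apply the normalization steps listed above before scoring."""
    text = re.sub(r"[\u064B-\u0652]", "", text)             # strip diacritics (harakat)
    text = re.sub(r"[\u0625\u0623\u0622]", "\u0627", text)  # إ / أ / آ -> ا
    text = text.replace("\u0629", "\u0647")                 # ة -> ه
    text = text.replace("\u0649", "\u064A")                 # ى -> ي
    text = re.sub(r"[^\w\s]", "", text)                     # drop punctuation
    return text.lower().strip()                             # case-fold any Latin text

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between normalized prediction and reference."""
    pred = normalize_arabic(prediction).split()
    ref = normalize_arabic(reference).split()
    ref_counts = {}
    for t in ref:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in pred:
        if ref_counts.get(t, 0) > 0:
            common += 1
            ref_counts[t] -= 1
    if not pred or not ref or common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, reference: str) -> bool:
    """Strict match after normalization."""
    return normalize_arabic(prediction) == normalize_arabic(reference)
```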
Results
Test Set Performance:
- F1 Score: 0.614
- Exact Match: 0.23
- BLEU: 0.475
The model demonstrates strong performance on Arabic extractive question answering, with particular strengths in:
- Short, factual answers
- Questions about named entities (people, places, organizations)
- When-type and Where-type questions
- Questions with clear context boundaries
Technical Specifications
Model Architecture and Objective
- Base Architecture: T5 (Text-to-Text Transfer Transformer)
- Specific Model: AraT5v2 base with 1024 token context window
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- Task: Seq2Seq Question Answering (Extractive)
- Objective: Generate answer text given question and context
- Parameters: ~220M base parameters + ~8M trainable LoRA parameters
Compute Infrastructure
Hardware
- GPU-accelerated training (CUDA-enabled)
- Optimized for modern GPUs with fp16 support
- Tested on Kaggle GPU environments
Software
- Framework: PyTorch with Transformers and PEFT libraries
- Key Libraries:
- transformers (Hugging Face)
- peft (Parameter-Efficient Fine-Tuning)
- evaluate (HuggingFace Evaluate)
- torch (PyTorch)
- datasets (Hugging Face)
Framework Versions
- PEFT: Latest compatible version
- Transformers: 4.x
- PyTorch: 2.x with CUDA support
- Python: 3.8+
- Evaluate: Latest version
Limitations and Considerations
Known Limitations
- Context Length: Maximum 1024 tokens; longer contexts are truncated
- Extractive Only: Cannot generate answers not present in context
- Single-hop QA: Optimized for single-hop reasoning; may struggle with complex multi-hop questions
- Dialect Sensitivity: Best performance on Modern Standard Arabic (MSA)
- Answer Length: Optimized for short to medium answers (5-128 tokens)
Best Practices
- Provide clear, focused contexts containing the answer
- Ensure questions are well-formed and unambiguous
- Keep contexts under 1024 tokens for best results
- Use Modern Standard Arabic for optimal performance
- Post-process answers for your specific use case
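To enforce the 1024-token recommendation programmatically, a small helper (illustrative, not part of the released code) can check that a formatted input fits the window before truncation silently drops the answer:

```python
def fits_context(tokenizer, question: str, context: str, limit: int = 1024) -> bool:
    """Check whether a formatted QA input fits the model's token window."""
    text = f"question: {question} context: {context}"
    n_tokens = len(tokenizer(text)["input_ids"])
    return n_tokens <= limit
```

Over-long contexts can then be split into overlapping chunks and queried separately.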
Citation
If you use this model, please cite:
BibTeX:
@misc{arabic-qa-arat5-lora,
author = {Diaa Eldin Essam Zaki},
title = {Arabic Question Answering with LoRA Fine-tuning on AraT5v2},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/[model-path]}}
}
APA:
Diaa Eldin Essam Zaki. (2025). Arabic Question Answering with LoRA Fine-tuning on AraT5v2. HuggingFace Model Hub.
Glossary
- QA: Question Answering - the task of automatically answering questions posed in natural language
- LoRA: Low-Rank Adaptation - a parameter-efficient fine-tuning method
- Extractive QA: Question answering where the answer is extracted directly from the context
- SQuAD: Stanford Question Answering Dataset - a popular QA benchmark format
- AraT5: Arabic T5 model pre-trained on large Arabic corpora
- Seq2Seq: Sequence-to-Sequence - models that transform one sequence into another
- Beam Search: Decoding strategy that explores multiple hypotheses to find the best answer
- F1 Score: Harmonic mean of precision and recall at the token level
- Exact Match: Strict metric requiring perfect answer match
Model Card Authors
Diaa Essam
Model Card Contact
Acknowledgments
- Base model: UBC-NLP for AraT5v2
- Framework: HuggingFace Transformers and PEFT teams
- Infrastructure: Kaggle for providing GPU resources
Version History
- v1.0 (2025-01): Initial release with LoRA fine-tuning on Arabic QA dataset