SentenceTransformer based on Alibaba-NLP/gte-multilingual-base

This is a sentence-transformers model finetuned from Alibaba-NLP/gte-multilingual-base on the silma-arabic-triplets-dataset-v1.0 dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Alibaba-NLP/gte-multilingual-base
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: silma-arabic-triplets-dataset-v1.0

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: PeftModelForFeatureExtraction 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
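
The pipeline is therefore: transformer encoding, CLS-token pooling, then L2 normalization. A minimal sketch of what modules (1) and (2) compute, using a random tensor in place of real transformer output:

import torch
import torch.nn.functional as F

# Stand-in for module (0)'s output: batch of 2 sequences, 10 tokens, 768 dims
token_embeddings = torch.randn(2, 10, 768)

# Module (1): pooling_mode_cls_token=True keeps only the first ([CLS]) token
cls_embeddings = token_embeddings[:, 0]

# Module (2): Normalize() rescales to unit L2 norm, so dot products equal cosine similarities
sentence_embeddings = F.normalize(cls_embeddings, p=2, dim=1)
print(sentence_embeddings.shape)  # torch.Size([2, 768])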

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("RamzyBakir/arabic-gte-multilingual-base-200k-lora", trust_remote_code=True)
# Run inference
sentences = [
    'كيف لي أن أصبح مصمماً جيداً؟',  # "How can I become a good designer?"
    'كيف أكون مصمماً جيداً؟',  # "How do I become a good designer?"
    'كيف هي الحياة كمواطن هندي في فرنسا؟',  # "What is life like as an Indian citizen in France?"
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
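
The same embeddings can drive semantic search directly, continuing with the model loaded above. A minimal sketch with an illustrative query and corpus (the Arabic texts below are examples, not from the training data):

# Illustrative query and corpus (not from the training data)
query = "ما هي عاصمة فرنسا؟"  # "What is the capital of France?"
corpus = [
    "باريس هي عاصمة فرنسا.",  # "Paris is the capital of France."
    "القاهرة أكبر مدينة في مصر.",  # "Cairo is the largest city in Egypt."
]

query_embedding = model.encode([query])
corpus_embeddings = model.encode(corpus)

# Embeddings are L2-normalized, so these cosine similarities rank the corpus
scores = model.similarity(query_embedding, corpus_embeddings)
best = scores.argmax().item()
print(corpus[best], scores[0][best].item())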

Evaluation

Metrics

Triplet

Metric            Value
cosine_accuracy   0.9603
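
The score is a triplet accuracy: the fraction of triplets in which the anchor is closer (by cosine similarity) to the positive than to the negative. A sketch of reproducing it with TripletEvaluator; the dataset Hub id and split below are assumptions, since the card only names the dataset and a revision:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import TripletEvaluator

model = SentenceTransformer(
    "RamzyBakir/arabic-gte-multilingual-base-200k-lora", trust_remote_code=True
)

# Assumed Hub id and split; the card only names the dataset and revision 77f8f6b
eval_dataset = load_dataset("silma-ai/silma-arabic-triplets-dataset-v1.0", split="train")

evaluator = TripletEvaluator(
    anchors=eval_dataset["anchor"],
    positives=eval_dataset["positive"],
    negatives=eval_dataset["negative"],
    name="triplet-eval",
)
results = evaluator(model)
print(results["triplet-eval_cosine_accuracy"])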

Training Details

Training Dataset

silma-arabic-triplets-dataset-v1.0

  • Dataset: silma-arabic-triplets-dataset-v1.0 at 77f8f6b
  • Size: 200,000 training samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    • anchor: string, min 5 / mean 18.69 / max 84 tokens
    • positive: string, min 5 / mean 15.9 / max 62 tokens
    • negative: string, min 5 / mean 16.05 / max 71 tokens
  • Samples:
    • anchor: أول المستجيبين ينظرون إلى سيارة انزلقت من على كتلة ثلجية
      positive: الناس يتفقدون ما حدث لسيارة.
      negative: عربة الكهوف تقود على طول الشاطئ
    • anchor: For young people between 18 and 24, as many as 79% were in favor.
      positive: بالنسبة للشباب الذين تتراوح أعمارهم بين 18 و 24 سنة، فإن ما يصل نسبته إلى 79 ٪ كانو يؤيدونه.
      negative: إن الطلاب هم أكثر مجموعة مربحة ممن تتراوح أعمارهم من 18 إلى 24 عامًا.
    • anchor: So the image that I like to think of, instead of glass ceiling, is the high bridge.
      positive: إذاً ، فالصورة التي أود أن أفكر بها ، بدلاً من السقف الزجاجي (الصورة التحكمية) ، هي صورة الجسر العالي.
      negative: و في حاله الإرتفاع الشاهق للسطح، سترى السقف فقط.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
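
scale 20.0 and cos_sim are the library defaults for this loss, which ranks each anchor's positive above all other in-batch passages. A minimal construction sketch (the base-model id is the one this model was finetuned from):

from sentence_transformers import SentenceTransformer, losses, util

model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)

# In-batch negatives ranking loss; scale=20.0 and cosine similarity
# match the parameters listed above
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)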
    

Evaluation Dataset

silma-arabic-triplets-dataset-v1.0

  • Dataset: silma-arabic-triplets-dataset-v1.0 at 77f8f6b
  • Size: 22,223 evaluation samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    • anchor: string, min 5 / mean 18.78 / max 90 tokens
    • positive: string, min 4 / mean 15.85 / max 63 tokens
    • negative: string, min 5 / mean 15.92 / max 84 tokens
  • Samples:
    • anchor: امرأة ترتدي قميصاً مكتوب عليه اسبانياً في حدث ما
      positive: هم في حدث.
      negative: المرأة راقصة ترتدي قميص التوتو
    • anchor: حكم يرتدي زلاجات و خوذة يمسح أرضية ما يبدو كملعب كرة سلة.
      positive: الحكم كان لديه قدمين
      negative: الحكم لم يكن لديه شيء في يديه
    • anchor: لم يكن ذلك ممتعاً جداً
      positive: لم يكن الأمر لطيفاً حقاً
      negative: كان ذلك قاسياً
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • learning_rate: 0.0002
  • num_train_epochs: 1
  • warmup_ratio: 0.06
  • fp16: True
  • load_best_model_at_end: True
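
A sketch of this configuration with the Sentence Transformers trainer API. The output directory, eval/save step counts, dataset Hub id, and train/eval split are assumptions, and the LoRA adapter setup implied by PeftModelForFeatureExtraction is omitted for brevity:

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)

# Assumed Hub id; the card reports 200,000 train and 22,223 eval triplets,
# so the held-out split below is an assumption as well
dataset = load_dataset("silma-ai/silma-arabic-triplets-dataset-v1.0", split="train")
split = dataset.train_test_split(test_size=22_223, seed=42)

loss = losses.MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="arabic-gte-multilingual-base-200k-lora",  # assumed
    eval_strategy="steps",
    eval_steps=500,  # assumed from the 500-step cadence in the training logs
    save_steps=500,  # save/eval cadences must align for load_best_model_at_end
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=2e-4,
    num_train_epochs=1,
    warmup_ratio=0.06,
    fp16=True,
    load_best_model_at_end=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    loss=loss,
)
trainer.train()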

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 0.0002
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.06
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch   Step   Training Loss   Validation Loss   triplet-eval_cosine_accuracy
0.16     500          0.3917            0.2733                         0.9564
0.32    1000          0.3188            0.2621                         0.9575
0.48    1500          0.3088            0.2537                         0.9594
0.64    2000          0.3046            0.2490                         0.9603
0.8     2500          0.2924            0.2474                         0.9604
0.96    3000          0.2976            0.2456                         0.9603
  • The step-3000 row is the saved checkpoint: load_best_model_at_end was enabled, this row has the lowest validation loss, and its 0.9603 cosine accuracy matches the reported evaluation metric.

Framework Versions

  • Python: 3.11.13
  • Sentence Transformers: 4.1.0
  • Transformers: 4.52.4
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.8.1
  • Datasets: 3.6.0
  • Tokenizers: 0.21.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}