SentenceTransformer based on BAAI/bge-base-en-v1.5
This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: BAAI/bge-base-en-v1.5
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
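For reference, model.similarity scores pairs of embeddings with cosine similarity. A minimal NumPy sketch of that computation (the helper name here is illustrative, not a library API):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product normalized by the two magnitudes; scores fall in [-1, 1]
    # regardless of embedding scale.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))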
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
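The same stack can be assembled by hand with the sentence_transformers.models API. This sketch mirrors the configuration above (a 512-token BERT encoder followed by mean pooling to a 768-dimensional vector); note it rebuilds the architecture from the base checkpoint only, so to get the fine-tuned weights load the model ID directly as shown under Usage below:

from sentence_transformers import SentenceTransformer, models

# Transformer module: BERT encoder with a 512-token window
word_embedding_model = models.Transformer("BAAI/bge-base-en-v1.5", max_seq_length=512)
# Pooling module: mean over token embeddings, one 768-dim vector per input
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])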
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("atx-labs/bge-base-custom-noise-cfr")
# Run inference
sentences = [
'[SUBSECTION c] Limitation of copayment amount to inpatient hospital deductible amount.. The copayment amount for a procedure performed in a year cannot exceed the amount of the inpatient hospital deductible established under section 1813(b) of the Act for that year. [CITATIONS]',
'[SUBSECTION c] Limitation of copayment amount to inpatient hospital deductible amount.. The copayment amount for a procedure performed in a year cannot exceed the amount of the inpatient hospital deductible established under section 1813(b) of the Act for that year. [CITATIONS]',
'[SUBSECTION A] The hospital must provide patient origin data (for example, the number of patients from each zip code from which the hospital draws inpatients) for all inpatient discharges to document the boundaries of its service area.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 1.0000, 0.6866],
# [1.0000, 1.0000, 0.6866],
# [0.6866, 0.6866, 1.0000]])
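Since the model is also intended for semantic search, here is a minimal retrieval sketch building on the model loaded above; the corpus and query strings are illustrative:

# Semantic search sketch: rank corpus passages against a query
corpus = [
    "The copayment amount cannot exceed the inpatient hospital deductible.",
    "The hospital must provide patient origin data for all inpatient discharges.",
]
query = "What is the limit on copayment amounts?"

corpus_embeddings = model.encode(corpus)
query_embedding = model.encode(query)

# model.similarity returns a (1, len(corpus)) tensor of cosine scores
scores = model.similarity(query_embedding, corpus_embeddings)
best = scores.argmax().item()
print(corpus[best])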
Training Details
Training Dataset
Unnamed Dataset
- Size: 24,880 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics based on the first 1000 samples:

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
| details | min: 35 tokens, mean: 108.81 tokens, max: 355 tokens | min: 37 tokens, mean: 112.5 tokens, max: 355 tokens |

- Samples:

| sentence_0 | sentence_1 |
|---|---|
| [SUBSECTION g] Respiratory illness reporting Ongoing reporting.. —(1) The facility must electronically report information on acute respiratory illnesses, including influenza, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)/coronavirus 2019 (COVID-19), and respiratory syncytial virus (RSV). [ITEM i] The report must be in a standardized format and frequency specified by the Secretary. [ITEM ii] To the extent as required by the Secretary, this report must include all of the following data elements: | [SUBSECTION g] Respiratory illness reporting Ongoing reporting.. —(1) The facility must electronically report information on acute respiratory illnesses, including influenza, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)/coronavirus 2019 (COVID-19), and respiratory syncytial virus (RSV). [ITEM i] The report must be in a standardized format and frequency specified by the Secretary. [ITEM ii] To the extent as required by the Secretary, this report must include all of the following data elements: |
| [SUBSECTION d] Duration of Scholarship award.. Subject to the availability of funds for the Scholarship Program, the Secretary will award a participant a scholarship under this part for a period of 1 school year. | [SUBSECTION d] Duration of Scholarship award.. Subject to the availability of funds for the Scholarship Program, the Secretary will award a participant a scholarship under this part for a period of 1 school year. |
| [SUBSECTION b] Only those officers or employees [MASK] by the officer in charge for such purpose may obligate and expend monies from the patient fund. The names of officials so designated shall be provided to the relevant fiscal control office. | [SUBSECTION b] Only those officers or employees specifically designated in writing by the officer in charge for such purpose may obligate and expend monies from the patient fund. The names of officials so designated shall be provided to the relevant fiscal control office. |

- Loss: DenoisingAutoEncoderLoss
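DenoisingAutoEncoderLoss trains the encoder to reconstruct the original text (sentence_1) from its noised counterpart (sentence_0, e.g. with [MASK] substitutions as in the samples above). A minimal training sketch under that assumption, using the Trainer API this card's hyperparameters correspond to; the dataset rows are placeholders, not the exact training script:

from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Placeholder rows: noised text in the first column, original text in the second
train_dataset = Dataset.from_dict({
    "sentence_0": ["Only those officers or employees [MASK] by the officer in charge may obligate monies from the patient fund."],
    "sentence_1": ["Only those officers or employees specifically designated in writing by the officer in charge may obligate monies from the patient fund."],
})

# Decoder weights are tied to the encoder and discarded after training
loss = losses.DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()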
Training Hyperparameters
Non-Default Hyperparameters
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- multi_dataset_batch_sampler: round_robin
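These settings map onto SentenceTransformerTrainingArguments. A sketch covering just the non-default values (output_dir is a placeholder):

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import MultiDatasetBatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # placeholder path
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)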
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: no
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 3
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch_fused
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- hub_revision: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- liger_kernel_config: None
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
- router_mapping: {}
- learning_rate_mapping: {}
Training Logs
| Epoch | Step | Training Loss |
|---|---|---|
| 0.3215 | 500 | 5.9119 |
| 0.6431 | 1000 | 4.6095 |
| 0.9646 | 1500 | 4.0751 |
| 1.2862 | 2000 | 3.7418 |
| 1.6077 | 2500 | 3.5111 |
| 1.9293 | 3000 | 3.3365 |
| 2.2508 | 3500 | 3.1787 |
| 2.5723 | 4000 | 3.0507 |
| 2.8939 | 4500 | 2.9646 |
Framework Versions
- Python: 3.12.6
- Sentence Transformers: 5.2.0
- Transformers: 4.56.0
- PyTorch: 2.8.0+cu129
- Accelerate: 1.10.1
- Datasets: 4.4.1
- Tokenizers: 0.22.0
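To approximate this environment, the core packages can be pinned to the versions above (a sketch; PyTorch 2.8.0 should be installed separately with the CUDA build matching your platform, here cu129):

pip install sentence-transformers==5.2.0 transformers==4.56.0 accelerate==1.10.1 datasets==4.4.1 tokenizers==0.22.0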
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
DenoisingAutoEncoderLoss
@inproceedings{wang-2021-TSDAE,
title = "TSDAE: Using Transformer-based Sequential Denoising Auto-Encoderfor Unsupervised Sentence Embedding Learning",
author = "Wang, Kexin and Reimers, Nils and Gurevych, Iryna",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
pages = "671--688",
url = "https://arxiv.org/abs/2104.06979",
}