# QVAC Cross-Platform 1-Bit and 2-Bit LoRA Adapters
Fine-tuned bitnet LoRA adapters trained using qvac-rnd-fabric-llm-bitnet - the first truly cross-platform bitnet inference and LoRA fine-tuning framework for Large Language Models. These adapters work on any GPU (Adreno, Mali, Apple Silicon, AMD, Intel, NVIDIA) using Vulkan and Metal backends.
## ⚠️ Important Disclaimer
These adapters are domain-specific and intended for biomedical Q&A tasks only.
The LoRA adapters were fine-tuned on PubMedQA biomedical data using a structured
`Q: ... A:` prompt format. They are not general-purpose conversational models.

What to expect with off-topic prompts: if you provide casual or unrelated input, the model will not crash, but it will produce nonsensical or hallucinated biomedical-sounding text. This is expected behavior: the adapter has shifted the model's output distribution toward medical literature, so it will attempt to generate biomedical content regardless of the input.
For best results:
- Use the structured format: `"Q: <your biomedical question>\nA:"`
- Keep prompts within the biomedical/clinical domain
- Use the recommended temperature settings (0.3-0.5 for factual answers)
This model is a research artifact and must NOT be used for actual medical diagnosis, treatment decisions, or clinical advice. The outputs may contain inaccuracies, hallucinations, or contradictory statements. Always consult qualified healthcare professionals for medical guidance.
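The structured prompt format above can be built with a small helper. This is a hypothetical sketch (`format_biomed_prompt` is not part of the release); it uses `printf` so the question and the `A:` marker are separated by a real newline rather than a literal `\n`, whose interpretation would otherwise depend on the CLI's escape handling.

```shell
# Hypothetical helper: wraps a question in the "Q: ... A:" format
# the adapters were trained on.
format_biomed_prompt() {
  # printf expands \n in its format string into a real newline;
  # a literal "\n" inside a double-quoted argument would not be expanded.
  printf 'Q: %s\nA:' "$1"
}

format_biomed_prompt "Does vitamin D supplementation prevent fractures?"
```

The resulting string can then be passed to the inference binary via `-p "$(format_biomed_prompt "your question")"`.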
## Available Adapters
| Adapter | Size | Base Model |
|---|---|---|
| Bitnet_b1_58-xl TQ1_0 LoRA Adapter | 30 MB | bitnet_b1_58-xl-TQ1_0 |
| Bitnet_b1_58-xl TQ2_0 LoRA Adapter | 30 MB | bitnet_b1_58-xl-TQ2_0 |
## Empowering the Community with Open Resources
To accelerate development and innovation, Tether is publicly releasing:
- Research documentation, benchmarks, datasets, and evaluation scripts
- qvac-rnd-fabric-llm-bitnet source code and builds
- qvac-fabric-llm.cpp
## Quick Start Guide
### Direct Inference
Use the adapter directly, without merging it into the base model.
#### Step 1: Download Platform-Specific Binary
Please download the latest binaries from the release page.
#### Step 2: Download Base Model & Adapter
Choose your model and download both base model and adapter:
```shell
# Create directories
mkdir -p models adapters

# Download the base model and the matching adapter into the paths
# used by the commands below
wget https://huggingface.co/qvac/fabric-llm-finetune-bitnet/resolve/main/1bitLLM-bitnet_b1_58-xl-tq1_0.gguf -O models/base.gguf
wget https://huggingface.co/qvac/fabric-llm-finetune-bitnet/resolve/main/tq1_0-biomed-trained-adapter.gguf -O adapters/adapter.gguf
```
Note: use a base model and an adapter with matching quantization. The adapters are stored in FP16, but each one must be used with the base model it was trained against.
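To guard against mixing quantizations, a quick filename check can catch mismatches before launch. This is a hypothetical sketch (`same_quant` is not part of the release); it only compares the `TQ1_0`/`TQ2_0` tags embedded in the filenames, which follows the naming convention of the files above.

```shell
# Hypothetical sanity check: confirm the base model and adapter filenames
# advertise the same quantization tag (tq1_0 or tq2_0) before running.
same_quant() {
  base_tag=$(echo "$1" | grep -oi 'tq[12]_0' | head -n 1 | tr '[:upper:]' '[:lower:]')
  lora_tag=$(echo "$2" | grep -oi 'tq[12]_0' | head -n 1 | tr '[:upper:]' '[:lower:]')
  # Fail if either tag is missing or the tags differ
  [ -n "$base_tag" ] && [ "$base_tag" = "$lora_tag" ]
}

if same_quant "1bitLLM-bitnet_b1_58-xl-tq1_0.gguf" "tq1_0-biomed-trained-adapter.gguf"; then
  echo "quantization tags match"
fi
```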
#### Step 3: Run Inference with Adapter
```shell
# Interactive chat mode
./bin/llama-cli \
  -m models/base.gguf \
  --lora adapters/adapter.gguf \
  -ngl 999 \
  -c 2048 \
  --temp 0.7 \
  -p "Q: Does vitamin D supplementation prevent fractures?\nA:"
```

```shell
# Single prompt mode
./bin/llama-cli \
  -m models/base.gguf \
  --lora adapters/adapter.gguf \
  -ngl 999 \
  -p "Explain the mechanism of action for beta-blockers in treating hypertension."
```
Expected output:

```text
Q: Does vitamin D supplementation prevent fractures?
A: Yes. Rationale: Meta-analysis of randomized controlled trials shows that
vitamin D supplementation, particularly when combined with calcium, significantly
reduces the risk of hip fractures and other non-vertebral fractures in elderly
populations...
```
### Custom Temperature & Sampling
Fine-tune the generation parameters for your use case. (Inline comments cannot follow a backslash line continuation in shell, so the flags are explained here: `--temp` controls randomness, lower is more focused and better for medical answers; `--top-p` enables nucleus sampling; `--top-k` enables top-k sampling; `-n` caps the number of generated tokens.)

```shell
./bin/llama-cli \
  -m models/base.gguf \
  --lora adapters/adapter.gguf \
  -ngl 999 \
  --temp 0.3 \
  --top-p 0.9 \
  --top-k 40 \
  --repeat-penalty 1.1 \
  -n 512 \
  -p "Your prompt"
```
Recommended settings for biomedical Q&A:
- Temperature 0.3-0.5: deterministic, factual answers
- Temperature 0.7-0.9: more creative explanations
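The temperature recommendation above can be encoded as a small helper when scripting runs. This is a hypothetical sketch (`pick_temp` is not part of the release); the specific values 0.4 and 0.8 are illustrative picks from within the recommended ranges.

```shell
# Hypothetical helper: choose a sampling temperature by task type.
pick_temp() {
  case "$1" in
    factual)  echo 0.4 ;;  # within the 0.3-0.5 range for deterministic answers
    creative) echo 0.8 ;;  # within the 0.7-0.9 range for explanations
    *)        echo 0.7 ;;  # fallback for unspecified task types
  esac
}

# Usage: ./bin/llama-cli ... --temp "$(pick_temp factual)" ...
pick_temp factual
```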
### Batch Processing
Process multiple prompts from a file:

```shell
# Create prompts file
cat > prompts.txt << 'EOF'
Q: Does vitamin D supplementation prevent fractures?
Q: Is aspirin effective for primary prevention of cardiovascular disease?
Q: Do statins reduce mortality in patients with heart failure?
EOF

# Process all prompts. Use read -r so backslashes are kept literally,
# and $'\n' so a real newline separates the question from "A:"
# (a plain "\n" inside double quotes is not expanded by the shell).
while IFS= read -r prompt; do
  echo "=== Processing: $prompt ==="
  ./bin/llama-cli \
    -m models/base.gguf \
    --lora adapters/adapter.gguf \
    -ngl 999 \
    --temp 0.4 \
    -p "$prompt"$'\n'"A:"
  echo ""
done < prompts.txt
```
## Command Line Reference
### Essential Flags

| Flag | Description | Example | Default |
|---|---|---|---|
| `-m` | Base model path (required) | `-m model.gguf` | - |
| `--lora` | LoRA adapter path | `--lora adapter.gguf` | none |
| `-ngl` | GPU layers (999 = all) | `-ngl 999` | 0 |
| `-c` | Context size | `-c 2048` | 512 |
| `-p` | Prompt text | `-p "Question"` | - |
| `--temp` | Temperature (0-2) | `--temp 0.7` | 0.8 |
| `-n` | Max tokens to generate | `-n 512` | -1 |
| `-b` | Batch size | `-b 512` | 512 |
| `--flash-attn` | Flash attention | `--flash-attn off` | on |
### Mobile-Specific Flags
For Android/iOS with limited memory, use partial GPU offload (`-ngl 99`), a smaller context (`-c 512`), smaller batch sizes (`-b 128`, `-ub 128`), and disable flash attention on Vulkan (`-fa off`):

```shell
./bin/llama-cli \
  -m model.gguf \
  --lora adapter.gguf \
  -ngl 99 \
  -c 512 \
  -b 128 \
  -fa off \
  -ub 128
```
## Cross-Platform Compatibility
### Supported Platforms
These adapters work identically across:

| Platform | Hardware | Backend | Status |
|---|---|---|---|
| Android | Qualcomm Adreno, ARM Mali | Vulkan | Supported |
| iOS | Apple A-series | Metal | Supported |
| macOS | Apple M1/M2/M3/M4 | Metal | Supported |
| Linux | AMD, Intel, NVIDIA | Vulkan | Supported |
| Windows | AMD, Intel, NVIDIA | Vulkan | Supported |
| CPU | Any x86_64, ARM64 | CPU | Fallback |
### No Conversion Needed
Unlike traditional frameworks:
- No need to convert between different frameworks
- No platform-specific model formats
- No separate training for each device
- Train once, run everywhere!
## Additional Resources
### Documentation
Binary usage and documentation
### Platform-Specific Guides
### Community
## Troubleshooting
### Common Issues

1. "DeviceLost" error on Android/Adreno:

```shell
# Use a smaller batch size and disable flash attention
./bin/llama-cli -m model.gguf --lora adapter.gguf -ngl 99 -c 512 -b 128 -ub 128 -fa off
```

2. Out-of-memory (OOM) errors:

```shell
# Reduce the context size or use a smaller model
./bin/llama-cli -m model.gguf --lora adapter.gguf -ngl 50 -c 512
```

3. Slow inference on mobile:

```shell
# Offload fewer layers to the GPU
./bin/llama-cli -m model.gguf --lora adapter.gguf -ngl 20
```

4. Adapter not loading:

```shell
# Verify the adapter file exists and matches the model architecture
ls -lh adapters/
./bin/llama-cli -m model.gguf --lora adapter.gguf --verbose
```
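When an adapter fails to load, a truncated or corrupted download is a common cause. GGUF files begin with the 4-byte ASCII magic `GGUF`, so a header check can catch bad files early. This is a minimal sketch (`is_gguf` is a hypothetical helper, not part of the release); it validates only the magic bytes, not the tensor contents.

```shell
# Hypothetical helper: check whether a file starts with the GGUF magic bytes.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Demonstration with a synthetic file.
# Real usage: is_gguf adapters/adapter.gguf
printf 'GGUFxxxx' > /tmp/fake.gguf
if is_gguf /tmp/fake.gguf; then
  echo "header looks like GGUF"
fi
```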
## Citation
If you use these adapters in your research, please cite:

```bibtex
@misc{qvac-bitnet-finetune,
  title={LoRA Fine-Tuning BitNet b1.58 LLMs on Heterogeneous Edge GPUs via QVAC Fabric},
  author={Subash SN and Akshay Nambiar and Milan Gritta and Zhen Cong Chen and Arsalan Anwari and Gianni and Amril Nurman},
  howpublished={Hugging Face Blog},
  year={2026}
}
```
## License
- LoRA Adapters: Apache 2.0 License
- Base Models:
- Training Framework (qvac-fabric-llm-bitnet): Apache 2.0 License
## Acknowledgments
- bitnet.cpp - foundation inference engine by Microsoft, based on llama.cpp by Georgi Gerganov
- LoRA - parameter-efficient fine-tuning method (Hu et al., 2021)
- PubMedQA - biomedical dataset source (Jin et al., 2019)
- Microsoft Team - base models
- Hardware vendors who provided testing devices
Making LLM fine-tuning accessible to everyone, everywhere

From smartphones to datacenters • No vendor lock-in • Privacy-preserving

Star the qvac-fabric-llm-bitnet-finetune repo if you find it useful!