Finetune 35B Model
The finetuning notebook is written for the 8B model, not the 35B.
When I change the model name to the 35B model, it throws an error because the layers are not mapped to devices correctly.
Could you provide a working device map, or update the finetuning notebook to support the 35B model?
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
This error occurs because the tensors involved are on different GPUs: some are on cuda:0 (the first GPU) and others are on cuda:1 (the second GPU). PyTorch requires all tensors that take part in a single operation to be on the same device.
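To confirm that the model really is split across GPUs, it can help to look at how its modules were placed. The sketch below assumes the model was loaded with a `device_map`, in which case Transformers normally exposes the placement as a `model.hf_device_map` dictionary. The helper name `summarize_device_map` and the example map are hypothetical and only illustrate the idea.

```python
from collections import Counter


def summarize_device_map(device_map):
    """Count how many modules are assigned to each device."""
    return Counter(str(device) for device in device_map.values())


# Hypothetical placement for illustration; with a real model, pass
# model.hf_device_map instead (assumes it was loaded with a device_map).
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "lm_head": 1,
}

summary = summarize_device_map(example_map)
print(dict(summary))
```

If the summary lists more than one device, inputs and layers can end up on different GPUs, which is consistent with the error above.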
To solve this problem, we need to make sure that all tensors and the model are on the same GPU.
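import torch
from transformers import AutoModelForCausalLM

MODEL_NAME = "CohereForAI/aya-23-35b"

# Determine the GPU to use
gpu_id = 0  # Use the first GPU (cuda:0); set to 1 for the second
device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    torch.cuda.set_device(device)  # Set the active GPU

# quantization_config and attn_implementation are assumed to be defined
# earlier in the finetuning notebook.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quantization_config,
    attn_implementation=attn_implementation,
    torch_dtype=torch.bfloat16,
    device_map=device,  # <<<== Load every layer onto the selected device
)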
These changes resolve the device mismatch error by ensuring that all operations run on the same GPU, provided the model fits in that GPU's memory.
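Whether a single GPU can hold the model depends on the precision of the weights. As a rough back-of-the-envelope estimate, assuming about 35 billion parameters and counting weights only (no activations, gradients, optimizer state, or KV cache), the memory needed can be sketched as follows. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 35e9  # assumption: roughly 35 billion parameters

for label, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Actual usage during finetuning will be higher than these figures, so treat them as a lower bound when choosing a GPU.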
If you have more than one GPU in your system and you are still experiencing problems, you can check the available GPUs and their usage status using the "nvidia-smi" command. You can also control which GPUs are visible to Python by setting the CUDA_VISIBLE_DEVICES environment variable.
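A minimal sketch of the CUDA_VISIBLE_DEVICES approach from within Python, assuming the variable is set before the first CUDA call (safest is before importing torch). The helper name `restrict_visible_gpus` is hypothetical.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Expose only the given physical GPU indices to CUDA in this process."""
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Must run before CUDA is initialized, so call it before importing torch.
restrict_visible_gpus([1])

# From here on, physical GPU 1 is the only visible device and is addressed
# as cuda:0 inside this process, e.g.:
# import torch
# print(torch.cuda.device_count())
```

The same effect can be had from the shell by setting the variable when launching the script or notebook server.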