Instructions to use CohereLabs/aya-23-35B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use CohereLabs/aya-23-35B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="CohereLabs/aya-23-35B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-23-35B")
model = AutoModelForCausalLM.from_pretrained("CohereLabs/aya-23-35B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use CohereLabs/aya-23-35B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "CohereLabs/aya-23-35B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "CohereLabs/aya-23-35B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
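
The same request can be sent from Python instead of curl. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are illustrative, and the response parsing assumes the server returns the usual OpenAI-style `choices[0].message.content` shape. For the SGLang server described further down, the base URL would use port 30000 instead of 8000.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, user_message, timeout=60):
    """Send the request and return the first choice's text.

    Assumes an OpenAI-style response body; requires a running server.
    """
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the request does not contact the server.
request = build_chat_request(
    "http://localhost:8000",
    "CohereLabs/aya-23-35B",
    "What is the capital of France?",
)
print(request.full_url)
```

Calling `chat("http://localhost:8000", "CohereLabs/aya-23-35B", "What is the capital of France?")` sends the request once the server is up.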

Use Docker

docker model run hf.co/CohereLabs/aya-23-35B

SGLang

How to use CohereLabs/aya-23-35B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "CohereLabs/aya-23-35B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "CohereLabs/aya-23-35B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "CohereLabs/aya-23-35B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "CohereLabs/aya-23-35B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use CohereLabs/aya-23-35B with Docker Model Runner:
```
docker model run hf.co/CohereLabs/aya-23-35B
```

Finetune 35B Model

by amitbcp - opened Jun 1, 2024

Discussion

amitbcp

Jun 1, 2024

The notebook is for finetuning the 8B model, not the 35B model.
When I change the model name, it throws an error because the device map is not set up correctly.

Please provide the device mapping or, if possible, update the finetuning notebook.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

NovaYear

Jul 24, 2024

This error occurs because the tensors involved are on different GPUs: some are on cuda:0 (the first GPU) and others are on cuda:1 (the second GPU). PyTorch expects all tensors used in a single operation to be on the same device.
To solve this, make sure that the model and all input tensors are on the same GPU.

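import torch
from transformers import AutoModelForCausalLM

MODEL_NAME = "CohereForAI/aya-23-35b"

# Determine the GPU to use
gpu_id = 0  # Use the first GPU (cuda:0); set to 1 for the second
device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    torch.cuda.set_device(device)  # Set the active GPU

# quantization_config and attn_implementation are assumed to be defined as in the finetuning notebook
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quantization_config,
    attn_implementation=attn_implementation,
    torch_dtype=torch.bfloat16,
    device_map=device,  # <<<== Place the whole model on the active GPU
)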

These changes should resolve the device mismatch error by ensuring that all operations run on the same GPU, provided the quantized model fits in that GPU's memory.
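
To confirm where the layers actually ended up, one option is to inspect the placement that Transformers records in `model.hf_device_map` when a model is loaded with a `device_map`. The sketch below groups module names by device; the helper `group_modules_by_device` and the example map are hypothetical, standing in for the real attribute so the snippet runs without loading the model.

```python
from collections import defaultdict


def group_modules_by_device(device_map):
    """Group module names by the device they were placed on."""
    groups = defaultdict(list)
    for module_name, device in device_map.items():
        groups[str(device)].append(module_name)
    return dict(groups)


# Hypothetical placement, shaped like `model.hf_device_map` after loading
# across two GPUs (module names are illustrative only).
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "model.norm": 1,
}

placement = group_modules_by_device(example_map)
spans_multiple_devices = len(placement) > 1
print(placement)
print("spans multiple devices:", spans_multiple_devices)
```

With a real model, passing `model.hf_device_map` instead of `example_map` shows whether the weights were split across more than one device.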
If you have more than one GPU in your system and you are still experiencing problems, you can check the available GPUs and their usage status using the "nvidia-smi" command. You can also control which GPUs are visible to Python by setting the CUDA_VISIBLE_DEVICES environment variable.
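
As a rough sketch of that last suggestion, the variable can be set from Python as long as it happens before CUDA is initialised (in practice, before `torch` is imported). The helper names `restrict_visible_gpus` and `gpu_memory_report` are illustrative, not part of any library, and the `nvidia-smi` call is skipped when the tool is not installed.

```python
import os
import shutil
import subprocess


def restrict_visible_gpus(gpu_ids):
    """Expose only the given physical GPU indices to this process.

    Must run before CUDA is initialised (in practice, before importing torch).
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


def gpu_memory_report():
    """Return nvidia-smi's per-GPU memory usage as CSV text, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,memory.used,memory.total", "--format=csv"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout if result.returncode == 0 else None


visible = restrict_visible_gpus([0])
print("CUDA_VISIBLE_DEVICES =", visible)
print(gpu_memory_report() or "nvidia-smi not available")
```

With only one GPU visible, that GPU appears inside the process as cuda:0, so nothing can be placed on a second device.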

alexrs changed discussion status to closed Jun 12, 2025
