Instructions to use google/gemma-3n-E4B-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use google/gemma-3n-E4B-it with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("image-text-to-text", model="google/gemma-3n-E4B-it") messages = [ { "role": "user", "content": [ {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"}, {"type": "text", "text": "What animal is on the candy?"} ] }, ] pipe(text=messages)# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("google/gemma-3n-E4B-it") model = AutoModelForImageTextToText.from_pretrained("google/gemma-3n-E4B-it") messages = [ { "role": "user", "content": [ {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"}, {"type": "text", "text": "What animal is on the candy?"} ] }, ] inputs = processor.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Inference
- HuggingChat
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use google/gemma-3n-E4B-it with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "google/gemma-3n-E4B-it" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "google/gemma-3n-E4B-it", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in one sentence." }, { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } } ] } ] }'Use Docker
docker model run hf.co/google/gemma-3n-E4B-it
- SGLang
How to use google/gemma-3n-E4B-it with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "google/gemma-3n-E4B-it" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "google/gemma-3n-E4B-it", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in one sentence." }, { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } } ] } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "google/gemma-3n-E4B-it" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "google/gemma-3n-E4B-it", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in one sentence." }, { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } } ] } ] }' - Docker Model Runner
How to use google/gemma-3n-E4B-it with Docker Model Runner:
docker model run hf.co/google/gemma-3n-E4B-it
Non-text output?
Hi team, I noticed in the model description that the output tokens are text only. However, in the Gemma3n video here there is a live demo where Gemma3n behaves like a voice assistant with voice responses.
Is that an additional module that does text to speech (separate from the model) or does the model natively support audio output tokens?
Hi @sidhusmart ,
Welcome to Gemma family of open source models, thanks for reaching out to us - The Gemma 3n models have multimodality capabilities, these 3n models are capable of taking inputs like text, image, audio and produce the result in the text for the give input based the user query/prompt - Analyzing the image, answering the user question..etc.
The input for the 3n models:
Text string, such as a question, a prompt, or a document to be summarized
Images, normalized to 256x256, 512x512, or 768x768 resolution and encoded to 256 tokens each
Audio data encoded to 6.25 tokens per second from a single channel
Total input context of 32K tokens
The output from the 3n models:
Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
Total output length up to 32K tokens, subtracting the request input tokens
Please feel free to reach out to me for any additional assistance.
Thanks.