---
license: gemma
language:
- en
pipeline_tag: text-generation
tags:
- litert
- litert-lm
- gemma
- agent
- tool-calling
- function-calling
- multimodal
- on-device
library_name: litert-lm
---

# Agent Gemma 3n E2B - Tool Calling Edition

A specialized version of **Gemma 3n E2B** optimized for **on-device tool/function calling** with LiteRT-LM. While Google's standard LiteRT-LM models focus on general text generation, this model is designed specifically for agentic workflows with advanced tool calling capabilities.

## Why This Model?

Google's official LiteRT-LM releases provide excellent on-device inference but don't include built-in tool calling support. This model bridges that gap with:

- ✅ **Native tool/function calling** via Jinja templates
- ✅ **Multimodal support** (text, vision, audio)
- ✅ **On-device optimized** - no cloud API required
- ✅ **INT4 quantized** - efficient memory usage
- ✅ **Production ready** - tested and validated

It is well suited to building AI agents that need to interact with external tools, APIs, or functions while running entirely on-device.

## Model Details

- **Base Model**: Gemma 3n E2B
- **Format**: LiteRT-LM v1.4.0
- **Quantization**: INT4
- **Size**: ~3.2 GB
- **Tokenizer**: SentencePiece
- **Capabilities**:
  - Advanced tool/function calling
  - Multi-turn conversations with tool interactions
  - Vision processing (images)
  - Audio processing
  - Streaming responses

## Tool Calling Example

The model uses a Jinja template that supports OpenAI-style function calling:

```python
from litert_lm import Engine, Conversation

# Load the model
engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="cpu")
conversation = Conversation.create(engine)

# Define tools the model can use
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    },
    {
        "name": "search_web",
        "description": "Search the internet for information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    }
]

# Have a conversation with tool calling
message = {
    "role": "user",
    "content": "What's the weather in San Francisco and latest news about AI?"
}

response = conversation.send_message(message, tools=tools)
print(response)
```

### Example Output

The model generates structured tool calls:

```
call:get_weather{location:San Francisco,unit:celsius}
call:search_web{query:latest AI news}
```

You then execute the functions and send the results back:

```python
# Execute tools (your implementation)
weather = get_weather("San Francisco", "celsius")
news = search_web("latest AI news")

# Send tool responses back
tool_response = {
    "role": "tool",
    "content": [
        {
            "name": "get_weather",
            "response": {"temperature": 18, "condition": "partly cloudy"}
        },
        {
            "name": "search_web",
            "response": {"results": ["OpenAI releases GPT-5...", "..."]}
        }
    ]
}

final_response = conversation.send_message(tool_response)
print(final_response)
# "The weather in San Francisco is 18°C and partly cloudy.
#  In AI news, OpenAI has released GPT-5..."
```
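The `call:` lines above arrive as plain text, so your application needs to parse them before dispatching. A minimal parsing sketch, assuming the exact comma/colon serialization shown in the example output (the template may emit variations, so treat the pattern as a starting point rather than a spec):

```python
import re

# Matches lines like: call:get_weather{location:San Francisco,unit:celsius}
# NOTE: assumes the simple comma/colon serialization shown above; adjust the
# pattern if your template emits JSON-style arguments instead.
CALL_PATTERN = re.compile(r"call:(\w+)\{([^}]*)\}")

def parse_tool_calls(text: str) -> list[dict]:
    """Extract structured (name, arguments) pairs from raw model output."""
    calls = []
    for name, raw_args in CALL_PATTERN.findall(text):
        args = {}
        for pair in raw_args.split(","):
            if ":" in pair:
                key, value = pair.split(":", 1)
                args[key.strip()] = value.strip()
        calls.append({"name": name, "arguments": args})
    return calls

# Example:
# parse_tool_calls("call:get_weather{location:San Francisco,unit:celsius}")
# -> [{"name": "get_weather",
#      "arguments": {"location": "San Francisco", "unit": "celsius"}}]
```

The same parser is reused by the chat-loop helpers sketched later in this card.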
## Advanced Features

### Multi-Modal Tool Calling

Combine vision, audio, and tool calling:

```python
message = {
    "role": "user",
    "content": [
        {"type": "image", "data": image_bytes},
        {"type": "text", "text": "What's in this image? Search for more info about it."}
    ]
}

response = conversation.send_message(message, tools=[search_tool])
# Model can see the image AND call search functions
```

### Streaming Tool Calls

Get tool calls as they're generated:

```python
def on_token(token):
    if "call:" in token:  # detect the start of a tool call in the stream
        print("Tool being called...")
    print(token, end="", flush=True)

conversation.send_message_async(message, tools=tools, callback=on_token)
```

### Nested Tool Execution

The model can chain tool calls:

```python
# User: "Book me a flight to Tokyo and reserve a hotel"
# Model: calls check_flights() → calls book_hotel() → confirms both
```

## Performance

Benchmarked on CPU (no GPU acceleration):

- **Prefill Speed**: 21.20 tokens/sec
- **Decode Speed**: 11.44 tokens/sec
- **Time to First Token**: ~1.6 s
- **Cold Start**: ~4.7 s
- **Tool Call Latency**: ~100-200 ms additional

GPU acceleration provides a 3-5x speedup on supported hardware.

## Installation & Usage

### Requirements

1. **LiteRT-LM Runtime** - build from source:

   ```bash
   git clone https://github.com/google-ai-edge/LiteRT.git
   cd LiteRT/LiteRT-LM
   bazel build -c opt //runtime/engine:litert_lm_main
   ```

2. **Supported Platforms**: Linux (clang), macOS, Android

### Quick Start

```bash
# Download model
wget https://huggingface.co/kontextdev/agent-gemma/resolve/main/gemma-3n-E2B-it-agent-fixed.litertlm

# Run with a simple prompt
./bazel-bin/runtime/engine/litert_lm_main \
  --model_path=gemma-3n-E2B-it-agent-fixed.litertlm \
  --backend=cpu \
  --input_prompt="Hello, I need help with some tasks"

# Run with GPU (if available)
./bazel-bin/runtime/engine/litert_lm_main \
  --model_path=gemma-3n-E2B-it-agent-fixed.litertlm \
  --backend=gpu \
  --input_prompt="What can you help me with?"
```

### Python API (Recommended)

```python
from litert_lm import Engine, Conversation, SessionConfig

# Initialize
engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="gpu")

# Configure session
config = SessionConfig(
    max_tokens=2048,
    temperature=0.7,
    top_p=0.9
)

# Start conversation
conversation = Conversation.create(engine, config)

# Define your tools (schemas as shown above)
tools = [...]

# Chat with tool calling; has_tool_calls, extract_calls, and execute_tools
# are your own helpers - see the sketch below
while True:
    user_input = input("You: ")

    response = conversation.send_message(
        {"role": "user", "content": user_input},
        tools=tools
    )

    # Handle tool calls if present
    if has_tool_calls(response):
        results = execute_tools(extract_calls(response))
        response = conversation.send_message({
            "role": "tool",
            "content": results
        })

    print(f"Agent: {response['content']}")
```
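The loop above leaves `has_tool_calls`, `extract_calls`, and `execute_tools` to your application; none of them ship with the runtime. A minimal sketch of these hypothetical helpers, reusing `parse_tool_calls` from the earlier sketch and a plain dict of callables (the stub implementations are placeholders for your real tools, and the response-unwrapping logic is an assumption about the binding's return type):

```python
# Map tool names to your real implementations; stubs shown for illustration
TOOL_REGISTRY = {
    "get_weather": lambda location, unit="celsius": {"temperature": 18,
                                                     "condition": "partly cloudy"},
    "search_web": lambda query: {"results": ["..."]},
}

def _text_of(response) -> str:
    """Model output as plain text (bindings may return a dict or a string)."""
    return response["content"] if isinstance(response, dict) else response

def has_tool_calls(response) -> bool:
    """True if the output contains at least one call:... line."""
    return bool(parse_tool_calls(_text_of(response)))

def extract_calls(response) -> list[dict]:
    """Structured (name, arguments) pairs from the raw model output."""
    return parse_tool_calls(_text_of(response))

def execute_tools(calls: list[dict]) -> list[dict]:
    """Run each requested tool and package results in the tool-role format."""
    results = []
    for call in calls:
        func = TOOL_REGISTRY.get(call["name"])
        if func is None:
            results.append({"name": call["name"],
                            "response": {"error": "unknown tool"}})
            continue
        try:
            results.append({"name": call["name"],
                            "response": func(**call["arguments"])})
        except Exception as exc:  # surface failures to the model
            results.append({"name": call["name"],
                            "response": {"error": str(exc)}})
    return results
```

Returning structured `{"error": ...}` responses instead of raising keeps the conversation alive and lets the model recover from a bad call.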
## Tool Call Format

The model uses this format for tool interactions:

**Function Declaration** (system/developer role):

```
developer
{
  "name": "function_name",
  "description": "What it does",
  "parameters": {...}
}
```

**Function Call** (assistant):

```
call:function_name{arg1:value1,arg2:value2}
```

**Function Response** (tool role):

```
response:function_name{result:value}
```

## Use Cases

### Personal AI Assistant
- Calendar management
- Email sending
- Web searching
- File operations

### IoT & Smart Home
- Device control
- Sensor monitoring
- Automation workflows
- Voice commands

### Development Tools
- Code generation with API calls
- Database queries
- Deployment automation
- Testing & debugging

### Business Applications
- CRM integration
- Data analysis
- Report generation
- Customer support

## Model Architecture

Built on Gemma 3n E2B with 9 optimized components:

```
Section 0: LlmMetadata (Agent Jinja template)
Section 1: SentencePiece Tokenizer
Section 2: TFLite Embedder
Section 3: TFLite Per-Layer Embedder
Section 4: TFLite Audio Encoder (HW accelerated)
Section 5: TFLite End-of-Audio Detector
Section 6: TFLite Vision Adapter
Section 7: TFLite Vision Encoder
Section 8: TFLite Prefill/Decode (INT4)
```

All components are optimized for on-device inference with hardware acceleration support.

## Comparison

| Feature | Standard Gemma LiteRT-LM | This Model |
|---------|--------------------------|------------|
| Text Generation | ✅ | ✅ |
| Tool Calling | ❌ | ✅ |
| Multimodal | ✅ | ✅ |
| Streaming | ✅ | ✅ |
| On-Device | ✅ | ✅ |
| Jinja Templates | Basic | Advanced agent template |
| INT4 Quantization | ✅ | ✅ |

## Limitations

- **Tool Execution**: The model generates tool calls but doesn't execute them - you implement the actual functions
- **Context Window**: Limited to 4096 tokens (configurable)
- **Streaming Tool Calls**: Partial tool calls may need buffering
- **Hardware Requirements**: Minimum 4 GB RAM recommended
- **GPU Acceleration**: On systems without a supported GPU, inference falls back to CPU

## Tips for Best Results

1. **Clear Tool Descriptions**: Provide detailed function descriptions
2. **Schema Validation**: Validate tool call arguments before execution (see the sketch after this list)
3. **Error Handling**: Handle malformed tool calls gracefully
4. **Context Management**: Keep conversation history concise
5. **Temperature**: Use 0.7-0.9 for creative tasks, 0.3-0.5 for precise tool calls
6. **Batching**: Process multiple tool calls in parallel when possible
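A minimal sketch combining tips 2 and 3: checking each parsed call against the tool's declared JSON-Schema-style `parameters` before execution. It hand-rolls only the checks the schemas above actually use (required keys, declared properties, enum values); for richer schemas, a dedicated validator such as the `jsonschema` package is the safer choice:

```python
def validate_call(call: dict, tools: list[dict]) -> str | None:
    """Return an error message if the call is malformed, else None."""
    schema = next((t for t in tools if t["name"] == call["name"]), None)
    if schema is None:
        return f"unknown tool: {call['name']}"

    params = schema["parameters"]
    args = call["arguments"]

    # Required keys must be present
    for key in params.get("required", []):
        if key not in args:
            return f"missing required argument: {key}"

    # Arguments must be declared, and enum values must match
    for key, value in args.items():
        prop = params["properties"].get(key)
        if prop is None:
            return f"unexpected argument: {key}"
        if "enum" in prop and value not in prop["enum"]:
            return f"invalid value for {key}: {value}"
    return None

# Reject bad calls before execution and report the problem to the model:
# error = validate_call(call, tools)
# if error:
#     results.append({"name": call["name"], "response": {"error": error}})
```

Feeding the error string back as the tool response gives the model a chance to correct the call instead of failing silently.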
## License

This model inherits the [Gemma license](https://ai.google.dev/gemma/terms) from the base model.

## Citation

```bibtex
@misc{agent-gemma-litertlm,
  title={Agent Gemma 3n E2B - Tool Calling Edition},
  author={kontextdev},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/kontextdev/agent-gemma}}
}
```

## Links

- [LiteRT-LM GitHub](https://github.com/google-ai-edge/LiteRT/tree/main/LiteRT-LM)
- [Gemma Model Family](https://ai.google.dev/gemma)
- [LiteRT Documentation](https://ai.google.dev/edge/litert)
- [Tool Calling Guide](https://ai.google.dev/gemma/docs/function-calling)

## Support

For issues or questions:

- Open an issue on [GitHub](https://github.com/google-ai-edge/LiteRT/issues)
- Check the [LiteRT-LM docs](https://ai.google.dev/edge/litert/inference)
- Community forum: [Google AI Edge](https://discuss.ai.google.dev/)

---

Built with ❤️ for the on-device AI community