Instructions to use QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix")
model = AutoModelForCausalLM.from_pretrained("QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker
docker model run hf.co/QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix
- SGLang
How to use QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
- Docker Model Runner
How to use QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix with Docker Model Runner:
docker model run hf.co/QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix
Accessing LLM, response without <think> start tag
When accessing the LLM, the response comes back without the <think> start tag, for example:
======================request=====================
hello
======================response====================
Hmm, the user just sent "hello" - that's a pretty minimal opening.
First impression: they're probably testing the waters or starting a casual conversation. Could be a new user unsure how to engage, or someone busy just dropping a quick greeting. No context about their needs yet.
I should keep it warm but open-ended. A simple mirrored greeting feels right - "Hello!" matches their casual tone. Then add an invitation to guide them: "How can I assist you today?" gives them an easy way to steer the conversation without pressure.
...Wait, is there any chance they need urgent help? Unlikely with just "hello," but I'll keep the tone ready for anything. Better not overcomplicate this - they'll share more if they want to.
End with the glasses emoji - it's friendly but professional. Keeps the door wide open for whatever comes next.Hello! How can I assist you today? 😊
https://github.com/vllm-project/vllm/issues/31319
One can use the deepseek_r1 reasoning parser as a temporary fix.
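For anyone who cannot change the server flags, the same idea can be approximated on the client. The sketch below is not vLLM's parser; it is a hypothetical helper (`split_reasoning`) that assumes the raw output still contains a closing `</think>` tag even when the opening one is missing, which the example response above does not confirm.

```python
from typing import Optional, Tuple

THINK_START = "<think>"
THINK_END = "</think>"


def split_reasoning(text: str) -> Tuple[Optional[str], str]:
    """Split raw model output into (reasoning, content).

    Hypothetical client-side helper: assumes the reasoning part,
    if present, ends with a closing </think> tag.
    """
    # Drop the start tag if the server happened to emit it.
    stripped = text.lstrip()
    if stripped.startswith(THINK_START):
        text = stripped[len(THINK_START):]

    head, sep, tail = text.partition(THINK_END)
    if not sep:
        # No end tag found: treat the whole output as the answer.
        return None, text.strip()
    return head.strip(), tail.strip()


if __name__ == "__main__":
    raw = "The user just sent hello.</think>Hello! How can I assist you today?"
    reasoning, content = split_reasoning(raw)
    print("reasoning:", reasoning)
    print("content:", content)
```

Treating output with no end tag as plain content is a design choice made here for illustration; a server-side parser may decide differently.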
cool, thank you
Thanks! But deepseek_r1 works for thinking mode:
"chat_template_kwargs": {
"enable_thinking": true
}
But with "enable_thinking": false, the output is all reasoning, not content.
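To compare the two modes side by side, a request like the one described can be sent from Python. This is a minimal sketch using only the standard library; the URL assumes the default port from the `vllm serve` example earlier on this page, and `build_payload` and `chat` are illustrative names, not part of any library.

```python
import json
import urllib.request

# Assumed endpoint: default port of a local `vllm serve` instance.
API_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix"


def build_payload(prompt: str, enable_thinking: bool) -> dict:
    """Build a chat completion request body with the thinking toggle."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        # Forwarded to the chat template, as in the snippet above.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def chat(prompt: str, enable_thinking: bool) -> dict:
    """POST the request to the server and return the decoded JSON reply."""
    body = json.dumps(build_payload(prompt, enable_thinking)).encode("utf-8")
    request = urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `chat("hello", True)` and `chat("hello", False)` against a running server and printing `["choices"][0]["message"]` from each reply shows which fields the reasoning parser filled in for each mode.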
Yes, I made my own fix for the reasoning and the tool parser; not sure if they have been fixed in the repo yet. But this model is great, the best quality so far. Not a single tool call error, running in vLLM.
It is a fantastic quant. Better than the AWQ. I get no thinking output at all though, but I know it's thinking due to the time it takes. How did you resolve it?