Instructions to use QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix")
model = AutoModelForCausalLM.from_pretrained("QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix

SGLang

How to use QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix with Docker Model Runner:
```
docker model run hf.co/QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix
```

Accessing LLM, response without <think> start tag

by sudage - opened Dec 29, 2025

Discussion

sudage

Dec 29, 2025

Accessing LLM, response without the <think> start tag, for example:

======================request=====================
hello
======================response====================
Hmm, the user just sent "hello" - that's a pretty minimal opening.

First impression: they're probably testing the waters or starting a casual conversation. Could be a new user unsure how to engage, or someone busy just dropping a quick greeting. No context about their needs yet.

I should keep it warm but open-ended. A simple mirrored greeting feels right - "Hello!" matches their casual tone. Then add an invitation to guide them: "How can I assist you today?" gives them an easy way to steer the conversation without pressure.

...Wait, is there any chance they need urgent help? Unlikely with just "hello," but I'll keep the tone ready for anything. Better not overcomplicate this - they'll share more if they want to.

End with the glasses emoji - it's friendly but professional. Keeps the door wide open for whatever comes next.</think>Hello! How can I assist you today? 😊

tclf90

QuantTrio org Dec 29, 2025

https://github.com/vllm-project/vllm/issues/31319

One can use the deepseek_r1 reasoning parser as a temporary fix.
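
For readers who cannot change the server configuration, here is a rough client-side sketch of the same idea: treat everything before the first `</think>` as reasoning, even when the opening `<think>` tag never appears in the response. The function name `split_reasoning` and the sample string are hypothetical, and this is not a copy of how vLLM's parser is implemented.

```python
from typing import Optional, Tuple

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_reasoning(text: str) -> Tuple[Optional[str], str]:
    """Split raw model output into (reasoning, answer).

    Assumption: the reasoning ends at the first "</think>", and the opening
    "<think>" may be missing from the response text.
    """
    head, sep, tail = text.partition(THINK_CLOSE)
    if not sep:
        # No closing tag found: nothing to separate, so return the text as the answer.
        return None, text.strip()
    # Remove an opening tag if the model did emit one.
    reasoning = head.replace(THINK_OPEN, "", 1).strip()
    return reasoning, tail.strip()


# Hypothetical raw output with no opening tag.
raw = "The user said hello, so greet them back.</think>Hello! How can I assist you today?"
reasoning, answer = split_reasoning(raw)
print(answer)
```

Returning the whole text as the answer when no `</think>` is present is a design choice of this sketch; a server-side parser may decide that case differently.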

sudage

Dec 29, 2025

cool, thank you

i92bacam

Dec 30, 2025

Thanks! But deepseek_r1 only works for thinking mode:
"chat_template_kwargs": {
    "enable_thinking": true
}

With "enable_thinking": false, the output is all reasoning, not content.
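
A minimal sketch of how a client might cope with this symptom until the parser is fixed: build the request body with `chat_template_kwargs`, and when thinking is disabled and `content` comes back empty, fall back to the reasoning field. The helper names `build_payload` and `extract_answer` are hypothetical, and the response field name `reasoning_content` is an assumption that may differ between server versions.

```python
import json
from typing import Any, Dict


def build_payload(prompt: str, enable_thinking: bool) -> Dict[str, Any]:
    """Build an OpenAI-compatible chat request body with the thinking switch set."""
    return {
        "model": "QuantTrio/GLM-4.7-GPTQ-Int4-Int8Mix",
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def extract_answer(message: Dict[str, Any], enable_thinking: bool) -> str:
    """Return the user-facing text from one response message.

    Assumption: the server puts parsed reasoning in "reasoning_content".
    """
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content") or ""
    if not enable_thinking and not content.strip():
        # Thinking was disabled, so text filed under reasoning is really the answer.
        return reasoning.strip()
    return content.strip()


print(json.dumps(build_payload("hello", enable_thinking=False), indent=2))
```

The fallback only applies when thinking is disabled, so genuine reasoning is never shown as the answer in thinking mode.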

chriswritescode

Jan 7

Yes, I made my own fix for the reasoning parser and the tool parser; not sure if they have been fixed in the repo yet. But this model is great, the best quality so far. Not a single tool call error. Running in vLLM.

darkstar3537

Feb 4

> Yes, I made my own fix for the reasoning parser and the tool parser; not sure if they have been fixed in the repo yet. But this model is great, the best quality so far. Not a single tool call error. Running in vLLM.

It is a fantastic quant. Better than the AWQ. I get no thinking output at all, though I know it's thinking because of the time it takes. How did you resolve it?
