smol-IQ2_KS on a barely sufficient system

#11
by Hansi2024 - opened

I had some struggles getting smol-IQ2_KS to run on my gaming rig (9800X3D, RTX 5090, 256 GB RAM). After my first unsuccessful tries, I realized that my Arch Linux system used zswap as its swap backend. As that is a compressed pool in RAM, it was not very useful here ;-)
Switching back to a file-based swap was a good idea, but here too I learned that the conventional way did not work because my disk uses btrfs, so I had to use
`btrfs filesystem mkswapfile ...`. Next I switched my GUI to the AMD iGPU, which freed about 900 MB of VRAM on the NVIDIA card. With the start command:

```shell
/build/bin/llama-server \
  --alias ling \
  --model /home/user/LLMMODELS2/llm_gguf/ubergarm/Ling-1T-GGUF/smol-IQ2_KS/Ling-1T-smol-IQ2_KS-00001-of-00006.gguf \
  --ctx-size 32768 \
  -fa -fmoe -ger \
  -ctk q8_0 \
  -ctv q8_0 \
  -ub 4096 \
  -b 4096 \
  -ngl 99 \
  -ot "blk.(4|5).ffn_.*=CUDA0" \
  -ot exps=CPU \
  --parallel 1 \
  --threads 8 \
  --host 0.0.0.0 \
  --port 8888 \
  --no-mmap \
  --no-display-prompt
```

Now 30.65 GB VRAM, 247 GB RAM, and 56 GB swap are in use. I try to start only the programs I need (Wayland/Hyprland, a shell for ik_llama, Firefox for the internal ik_llama GUI and/or SillyTavern). This barely works, now getting 6.5 t/s. That's enough for playing around.
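For anyone else on btrfs, the swapfile step mentioned above can be sketched roughly like this (the path and size here are examples, not the exact values used; `btrfs filesystem mkswapfile` requires btrfs-progs 6.1 or newer and root):

```shell
# A plain dd/fallocate swapfile does not work on btrfs; the btrfs-aware
# tool creates the file with the right attributes (NOCOW, no compression).
sudo btrfs filesystem mkswapfile --size 56G /swapfile
sudo swapon /swapfile

# Verify the swap is active:
swapon --show
```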

Heya again, @Hansi2024 !

Sweet, you're getting a taste of the big models on your rig! Great job freeing up as much RAM/VRAM as possible to load these big quants!

One suggestion I have is to avoid using any kind of swapping, to avoid excessive writes to your nvme/ssd drive. By default ik/llama.cpp uses the mmap() feature, which allows the GGUF files to remain on disk and be accessed read-only at run-time, and the Linux file/page cache will juggle any weights that don't fit into available RAM.

So just by removing --no-mmap and not pre-allocating the space it should be okay. I call this the "troll rig" method when the weights don't fit into RAM+VRAM. It will heat up an nvme drive and can do ~5GB/s, but at least it is read only and no write wear!

You may be able to run this quant with --no-mmap and fit the entire thing into RAM+VRAM though, if you drop the batch sizes to free up some VRAM (just remove `-ub 4096 -b 4096`; the default values are `-ub 512 -b 2048`), and try to offload one more layer, e.g. `...blk.(4|5|6)...`. You might also have to drop `--ctx-size` to 8192 just to test what you can fit.
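Putting those suggestions together, the adjusted invocation might look something like this (same model path as the original command; whether the extra layer and reduced context actually fit is something to test, not a guarantee):

```shell
/build/bin/llama-server \
  --alias ling \
  --model /home/user/LLMMODELS2/llm_gguf/ubergarm/Ling-1T-GGUF/smol-IQ2_KS/Ling-1T-smol-IQ2_KS-00001-of-00006.gguf \
  --ctx-size 8192 \
  -fa -fmoe -ger \
  -ctk q8_0 -ctv q8_0 \
  -ngl 99 \
  -ot "blk.(4|5|6).ffn_.*=CUDA0" \
  -ot exps=CPU \
  --parallel 1 --threads 8 \
  --host 0.0.0.0 --port 8888 \
  --no-mmap \
  --no-display-prompt
# note: -ub/-b removed so the defaults (-ub 512 -b 2048) apply,
# one more FFN layer offloaded to CUDA0, context dropped to 8192
```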

Anyway, keep it up and eventually you'll have a collection of commands to run big models squeezed perfectly onto your rig for max performance.

Oh, finally: you can use llama-sweep-bench to test both PP (prompt processing, aka "prefill") and TG (token generation, aka decode) using basically the same command as your llama-server, but replace it with `llama-sweep-bench --warmup-batch -n 64 .....` (the rest of your command)
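As an example, reusing the flags from the server command above (server-only options like --alias, --host, and --port dropped; adjust the model path to your setup):

```shell
/build/bin/llama-sweep-bench --warmup-batch -n 64 \
  --model /home/user/LLMMODELS2/llm_gguf/ubergarm/Ling-1T-GGUF/smol-IQ2_KS/Ling-1T-smol-IQ2_KS-00001-of-00006.gguf \
  --ctx-size 32768 \
  -fa -fmoe -ger \
  -ctk q8_0 -ctv q8_0 \
  -ub 4096 -b 4096 \
  -ngl 99 \
  -ot "blk.(4|5).ffn_.*=CUDA0" \
  -ot exps=CPU \
  --threads 8
```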

Finally, I believe the most recent version of ik_llama.cpp no longer needs -fa, as it is likely on by default when possible, I think. (Hard to keep up with the changes, hah.)

cheers!

@ubergarm , thank you again. I learned a lot today, again. I didn't know there was no performance penalty when using mmap(). I just noticed that when I use --no-mmap, the initial model loading speed is very slow (0.5 GB/s) as opposed to 5 GB/s. I felt it ran a little faster with --no-mmap during inference, but in hindsight, it must have been a placebo effect.

@Hansi2024
Thank you for sharing your parameters and t/s. I have a Ryzen 9600X + RTX 5070 Ti system, and I get 4 t/s running DeepSeek. If I run Ling 1T, I will probably get 3 t/s.

Hi ubergarm,
thanks for your extended explanations. I thought I understood a little bit more:

> So just by removing --no-mmap and not pre-allocating the space it should be okay. I call this the "troll rig" method when the weights don't fit into RAM+VRAM. It will heat up an nvme drive and can do ~5GB/s, but at least it is read only and no write wear!

But removing --no-mmap and freeing up more VRAM like you suggested leads to this: now 31.4 GB VRAM and no RAM are used, and I mean really no RAM at all, the whole machine uses 10 GB now.
So everything (except the parts loaded into VRAM) must be read from storage.
The astonishing thing is that I still get 5 t/s using the web interface, the same speed as when I squeezed everything into RAM.
What is the point of offloading to RAM when there is no difference? Can I expect the same behaviour from the bigger quants? Confusing ... ;-)

Greetings

Moin,
grumpfl ... what is memory? ;-)
Of course it uses the RAM; my confusion came from the fact that I used `free` in the terminal, which did not show the changes. I found this explanation:

"When you call malloc it in turn requests memory from the kernel (via sbrk or mmap) and the OS just casually gives it the memory, without actually allocating it for the process. This is an optimistic strategy; in effect the OS "hopes" the process will never even use the memory.When the process eventually writes (or reads) from the memory, it faults and the OS says "ok FINE, if you insist" and actually allocates the memory.You can see this by gradually writing to the memory:char *mem = malloc(PAGE_SIZE * 100);for (i = 0; i < 100; ++i) { getchar(); mem[PAGE_SIZE * i] = 42;}One side effect of this is that you can easily allocate more memory with malloc than the system has. By writing to it you will eventually hit a limit and your process wil be killed."

nvtop shows the slow filling of the ram.
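A quick way to see this overcommit behaviour on your own box, using nothing beyond the standard Linux /proc interface:

```shell
# vm.overcommit_memory: 0 = heuristic overcommit (the optimistic
# default described above), 1 = always overcommit, 2 = strict accounting
cat /proc/sys/vm/overcommit_memory

# Committed_AS (total promised memory) can exceed MemTotal under overcommit
grep -E '^MemTotal|^Committed_AS' /proc/meminfo
```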

greetings

Yeah, you figured it out: memory Linux uses for the mmap() file cache shows up differently than allocated RAM. I like to use btop to visualize it all easily!
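If you want numbers rather than graphs: the page cache holding the mmap()ed weights lands in the `buff/cache` column of `free` rather than `used`, and in the `Cached` field of /proc/meminfo:

```shell
# mmap()ed model weights count toward "buff/cache", not "used"
free -h

# The same information in raw form:
grep -E '^Cached|^MemAvailable' /proc/meminfo
```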
