---
license: apache-2.0
language:
- en
base_model:
- black-forest-labs/FLUX.2-dev
pipeline_tag: text-to-image
tags:
- Flux2D
- gguf
- text-to-image
- comfyUI
- comfyUI-workflow
---
# Simplified t2i Workflow for Flux2D (also a Workaround for broken MultiGPU Nodes)
**Workaround on the fly.**
Due to recent ComfyUI updates, the "large" DisTorch2MultiGPU nodes are no longer functional. This workflow has therefore been modified to use the older "small" MultiGPU nodes from the first release.
Ideally, the workflow would run on the DisTorch2MultiGPUv2 nodes to control VRAM/RAM allocation, preventing VRAM overflow and the resulting swapping. Since these nodes are currently broken by the latest ComfyUI updates, the older MultiGPUv1 nodes are used as a fallback (at up to 10-15% lower inference speed compared to DisTorch2).
🔗 GitHub (right-click to open in new tab)
pollockjj/ComfyUI-MultiGPU
---

[Flux2 t2i.json](Flux2-t2i.json)
---
## Model Loading & GPU|CPU Distribution
**UnetLoaderGGUFMultiGPU** and **VAELoaderMultiGPU** are assigned directly to `cuda:0`.
**ClipLoaderGGUFMultiGPU** is fixed to `cpu` (CPU offloading).
Fixing the devices in this way prevents noticeable VRAM swapping.
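Conceptually, the MultiGPU loaders pin each model to a fixed device instead of letting ComfyUI shuffle weights around. A minimal PyTorch sketch of that idea (stand-in modules only, not the node implementation):

```python
import torch.nn as nn

# Stand-in modules only (assumption); the real loaders stream GGUF weights.
unet = nn.Linear(8, 8)  # stand-in for the FLUX.2 UNet
vae = nn.Linear(8, 8)   # stand-in for the VAE
clip = nn.Linear(8, 8)  # stand-in for the Mistral text encoder

# Fixed device assignment, mirroring the workflow (requires a CUDA GPU):
unet.to("cuda:0")  # UnetLoaderGGUFMultiGPU -> cuda:0
vae.to("cuda:0")   # VAELoaderMultiGPU      -> cuda:0
clip.to("cpu")     # ClipLoaderGGUFMultiGPU -> cpu (offloaded)
```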
**Test System: RTX 3090 (24GB VRAM) + 32GB RAM**
- RAM usage: ~65-77%
- VRAM usage: ~22-23GB
- Virtual VRAM: not used
## Quick Reference: FLUX.2 + Mistral-3-Small GGUF
| VRAM | FLUX.2 UNet | Mistral Text Encoder (CPU) | Notes |
|------|-------------|----------------------------|-------|
| **24GB** | Q8_0 (35GB) | Q8_K (29GB) | Best quality setup |
| **16GB** | Q5_K_M (24.1GB) - try Q6_K (27.4GB) | Q6_K (19.3GB) or Q4_K_M (14.3GB) | Balanced quality |
| **12GB** | Q4_K_M (20.1GB) | Q4_K_M (14.3GB) or Q3_K_M (11.5GB) | Speed priority |
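As a rough plausibility check for the sizes above: GGUF file size scales with parameter count × bits-per-weight. A small sketch (assuming ~32B parameters for the FLUX.2 UNet and ~24B for Mistral; bpw values are typical llama.cpp averages, so real files land somewhat higher due to metadata and mixed-precision tensors):

```python
# Rough GGUF size estimate in GB: billions of params * bits-per-weight / 8.
# bpw values are typical llama.cpp averages (assumption), not exact figures.
def gguf_size_gb(params_billion: float, bpw: float) -> float:
    return params_billion * bpw / 8

print(gguf_size_gb(32, 8.50))  # FLUX.2 UNet Q8_0   -> ~34.0GB (listed: 35GB)
print(gguf_size_gb(32, 4.85))  # FLUX.2 UNet Q4_K_M -> ~19.4GB (listed: 20.1GB)
print(gguf_size_gb(24, 4.85))  # Mistral 24B Q4_K_M -> ~14.6GB (listed: 14.3GB)
```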
🔗 HF (right-click to open in new tab)
city96/FLUX.2-dev-gguf
unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF
### Active Configuration
**VRAM (24GB GPU):**
- ~18-20GB: **UNet:** `flux2-dev-Q8_0.gguf` (Alternative: `FP8-Mixed.safetensors`)
- ~1-1.5GB: **VAE:** `flux2-vae.safetensors`
- ~1-2GB: Overhead
- **= ~22-23GB total**
**RAM (32GB CPU):**
- ~25GB: **Text Encoder:** `Mistral-Small-3.2-24B-Instruct-2506-UD-Q8_K_XL.gguf` (Alternative: `fp8.safetensors`)
🔗 HF (right-click to open in new tab)
Comfy-Org/flux2-dev
---
## Memory Management
**RAMCleanup Node (active):**
- **Required:** On first run
- **Optional:** for resolutions around 1MP (e.g. 832×1216px ≈ 1.0MP)
- **Required:** for resolutions >1MP, especially from ≥2MP onwards (see Performance section)
**Settings:**
- Clean File Cache: ✅
- Clean Processes: ✅
- Clean DLLs
- Retry attempts: 3
- Runs between VAE-Decode and Save
🔗 GitHub (right-click to open in new tab)
LAOGOU-666/Comfyui-Memory_Cleanup
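Whether the cleanup pass is actually needed on a given machine can be judged by watching RAM headroom during a run. A minimal monitoring sketch using `psutil` (an assumption; it is not part of the workflow or its nodes):

```python
import psutil  # pip install psutil

# Print current system RAM usage; run this alongside ComfyUI.
vm = psutil.virtual_memory()
print(f"RAM used: {vm.percent:.0f}% "
      f"({vm.used / 2**30:.1f} / {vm.total / 2**30:.1f} GiB)")
# ~65-77% on the 32GB test system is normal; if usage stays pinned near
# 100%, keep RAMCleanup active or switch to a smaller text-encoder quant.
```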
---
## run_nvidia_gpu.bat
```batch
.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --use-sage-attention --fast
```
**Important:**
- ❌ **Remove** `--lowvram`, `--disable-smart-memory`, or similar flags
- ⚠️ **No VRAM/memory flags!** Let ComfyUI breathe.
The details of this memory management are handled intelligently by
ComfyUI → city96's GGUF nodes → the MultiGPU nodes built on top of them.
---
### SageAttention Setup
**With GGUF:** The acceleration from SageAttention is significantly lower than with FP8 models and may sometimes not take effect at all, since GGUF formats are primarily designed for CPU-optimized inference and do not fully leverage SageAttention's GPU kernels.
→ Disable `--use-sage-attention`. Instead, use `--fast` (standard PyTorch optimization) and rely on the internal optimizations of the GGUF nodes' backend.
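With that change, the launch line from `run_nvidia_gpu.bat` above becomes:

```batch
.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --fast
```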
**Check Triton installation (Windows):**
```batch
.\python_embeded\python.exe -m pip show triton
```
If not installed:
```batch
.\python_embeded\python.exe -m pip install triton-windows
```
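To confirm that the module also resolves inside the embedded interpreter (not just in pip's metadata):

```batch
.\python_embeded\python.exe -c "import triton; print(triton.__version__)"
```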
**GPU-specific SageAttention versions:**
| GPU Series | Version | Reason |
|-----------|---------|--------|
| **RTX 30xx (Ampere)** | 1.0.6 | Version 2.x offers no performance improvement |
| **RTX 40xx (Ada)** | 2.2.0 | Primarily optimized for these or newer architectures |
**Installation:**
```batch
# RTX 30xx
.\python_embeded\python.exe -m pip install sageattention==1.0.6
# RTX 40xx
.\python_embeded\python.exe -m pip install sageattention==2.2.0
```
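A quick import check verifies the install against the embedded interpreter (assuming the package imports as `sageattention`):

```batch
.\python_embeded\python.exe -c "import sageattention; print('sageattention OK')"
```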
💡 **Tip:** New to installing SageAttention 2.2.0?
🔗 GitHub (right-click to open in new tab)
Check out this 🔧 Installation Guide: SageAttention + Triton for ComfyUI
---
## Performance
**Test Setup:**
Guidance: 4 | Steps: 20 (Production: 30-40 steps)
Based on ~80 runs at different resolutions.
**First Run:** Initial loading of the required layers into VRAM/RAM, plus the first inference. Subsequent inferences are significantly faster because the memory management is already initialized. For exact timings, partial-loading details, etc., see the console output / screenshots.
---
### FP8 Format
**First Run**
- 832×1216px: ~380-400s