| Format | Description |
|---|---|
| Safetensors | Safe, fast tensor storage format (by Hugging Face) |
| GGUF | Georgi Gerganov's universal format from llama.cpp (can mix tensors of various precisions) |
| PT | PyTorch checkpoint format |
- **Legacy quantizations** (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1): simpler, faster methods, but with higher quantization error than newer types.
- **K-quantizations** (Q2_K, Q3_K, Q4_K, Q5_K, Q6_K): introduced in llama.cpp PR #1684; use super-blocks for smarter bit allocation, reducing quantization error.
- **I-quantizations** (IQ2_XXS, IQ3_S, etc.): state of the art at low bit widths; use lookup tables for improved accuracy, but can be slower on older hardware.
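To make the per-block idea concrete, here is a minimal Python sketch of Q8_0-style quantization (an illustration of the principle, not llama.cpp's actual code): values are split into blocks of 32, and each block stores one float scale plus 32 int8 quants.

```python
import numpy as np

def quantize_q8_0(x: np.ndarray, block_size: int = 32):
    """Toy Q8_0-style quantizer: one scale per block + int8 values."""
    x = x.reshape(-1, block_size)
    # scale so the largest magnitude in each block maps to 127
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid division by zero on all-zero blocks
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.randn(64).astype(np.float32)
q, s = quantize_q8_0(x)
err = np.abs(dequantize(q, s).reshape(-1) - x).max()  # small round-trip error
```

K- and I-quants refine this scheme (super-blocks, non-linear lookup tables), but the storage pattern — quantized values plus per-block metadata — is the same.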
- **GGUF Q8_0**: very close to FP16 (perplexity 7.4933), indicating minimal accuracy loss.
- **GGUF Q4_K_M**: slightly higher perplexity (7.5692), still usable for most tasks.
- **UD (Unsloth Dynamic) IQ4** quants (reported ~4x performance boost on Blackwell) or **IQ4_NL** may be a better choice than Q4_K_M.
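Perplexity, the metric behind these comparisons, is just the exponent of the mean per-token negative log-likelihood; lower means the model assigns higher probability to the reference text. A small illustrative computation (the token probabilities here are made up):

```python
import math

def perplexity(token_probs):
    """PPL = exp(mean negative log-likelihood over tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# hypothetical per-token probabilities assigned by a model to some text
probs = [0.25, 0.10, 0.50, 0.05]
ppl = perplexity(probs)  # a quantized model yields slightly worse probs, so slightly higher PPL
```

The Q8_0 vs. Q4_K_M gap above (7.4933 vs. 7.5692) is small on this scale, which is why 4-bit quants remain practical.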
| Model | Maker |
|---|---|
| Gemma 3 | Google |
| Nemotron | NVIDIA |
| Llama 3 | Meta |
| DeepSeek | DeepSeek |
| Qwen | Alibaba |
| Mistral | Mistral AI |
by Stability AI https://stability.ai
by Black Forest Labs https://bfl.ai
Self-attention Transformer as a text encoder
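The self-attention step at the heart of such a Transformer text encoder can be sketched in a few lines of Python/NumPy (single head, random weights, no masking — a toy illustration of the mechanism only):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head.
    x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len) similarities
    return softmax(scores) @ V               # each token mixes in all tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                 # 5 tokens, d_model = 16
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)          # shape (5, 8)
```

Because every token attends to every other token, the encoder builds context-aware embeddings of the whole prompt in one pass.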
curl http://localhost/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-3-4b-q4.gguf",
"messages": [
{
"role": "system",
"content": "Answer briefly in Czech"
},
{
"role": "user",
"content": "Who is Albert Einstein?"
}
],
"temperature": 0.7,
"max_tokens": -1,
"stream": false
}'
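The same request can be issued from Python using only the standard library; the payload mirrors the curl example above (adjust the host/port and model name to match your local server):

```python
import json
import urllib.request

payload = {
    "model": "gemma-3-4b-q4.gguf",
    "messages": [
        {"role": "system", "content": "Answer briefly in Czech"},
        {"role": "user", "content": "Who is Albert Einstein?"},
    ],
    "temperature": 0.7,
    "max_tokens": -1,  # -1 = no token limit on some local servers
    "stream": False,
}

def chat(url="http://localhost/v1/chat/completions"):
    """POST the payload to an OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Any client speaking the OpenAI Chat Completions format works here, which is the point of exposing local models behind this API shape.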