==== Quantization ====
{{:nn:gguf.png|}}
^ Quantization Type ^ Bits per Weight ^ Description ^
| F64 | 64 | 64-bit IEEE 754 double-precision floating point. High precision, large memory footprint. |
| F32 | 32 | 32-bit IEEE 754 single-precision floating point. Standard for training, high memory usage. |
| F16 | 16 | 16-bit IEEE 754 half-precision floating point. Balances precision and efficiency. |
| I64 | 64 | 64-bit integer. Used for specific metadata or computations. |
| I32 | 32 | 32-bit integer. Less common for weights. |
| I16 | 16 | 16-bit integer. Rarely used for model weights. |
| I8 | 8 | 8-bit integer. Used in some quantization schemes. |
| Q8_0 | 8 | 8-bit quantization, 32 weights per block, weight = q * block_scale, legacy, close to F16 accuracy. |
| Q8_1 | 8 | 8-bit quantization with block minimum, weight = q * block_scale + block_minimum, legacy. |
| Q8_K | 8 | 8-bit quantization, 256 weights per block, used for intermediate results. |
| Q6_K | 6.5625 | 6-bit quantization, super-blocks with 16 blocks of 16 weights, weight = q * block_scale (8-bit). |
| Q5_0 | 5 | 5-bit quantization, 32 weights per block, legacy. |
| Q5_1 | 5 | 5-bit quantization with block minimum, legacy. |
| Q5_K | 5.5 | 5-bit quantization, super-blocks with 8 blocks of 32 weights, weight = q * block_scale (6-bit) + block_min (6-bit). |
| Q4_0 | 4 | 4-bit quantization, 32 weights per block, legacy. |
| Q4_1 | 4 | 4-bit quantization with block minimum, legacy. |
| Q4_K | 4.5 | 4-bit quantization, super-blocks with 8 blocks of 32 weights, weight = q * block_scale (6-bit) + block_min (6-bit). |
| Q3_K | 3.4375 | 3-bit quantization, super-blocks with 16 blocks of 16 weights, weight = q * block_scale (6-bit). |
| Q2_K | 2.625 | 2-bit quantization, super-blocks with 16 blocks of 16 weights, weight = q * block_scale (4-bit) + block_min (4-bit). |
| IQ4_NL | 4.5 | 4-bit integer quantization, non-linear, blocks of 32 weights. |
| IQ4_XS | 4.25 | 4-bit integer quantization, extra small, super-blocks with 256 weights. |
| IQ3_S | 3.44 | 3-bit integer quantization, small. |
| IQ3_XXS | 3.06 | 3-bit integer quantization, extra small. |
| IQ2_S | 2.5 | 2-bit integer quantization, small. |
| IQ2_XS | 2.31 | 2-bit integer quantization, extra small. |
| IQ2_XXS | 2.06 | 2-bit integer quantization, extra small. |
| IQ1_S | 1.56 | 1-bit integer quantization, small. |
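The two block formulas in the table (weight = q * block_scale for the symmetric types, weight = q * block_scale + block_minimum for the "_1" types) can be sketched in plain Python. This is an illustrative sketch only, not the exact GGUF bit packing, and the function names are hypothetical:

```python
def quantize_q8_0(weights, block_size=32):
    """Symmetric block quantization (Q8_0-style sketch): each block
    stores int8 codes plus one scale; weight = q * block_scale."""
    qs, scales = [], []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale = max(abs(w) for w in block) / 127.0 or 1.0  # avoid /0
        scales.append(scale)
        qs.append([max(-127, min(127, round(w / scale))) for w in block])
    return qs, scales

def dequantize_q8_0(qs, scales):
    return [q * s for block, s in zip(qs, scales) for q in block]

def quantize_q4_1(weights, block_size=32):
    """Affine block quantization (Q4_1-style sketch): each block stores
    4-bit codes, a scale, and the block minimum;
    weight = q * block_scale + block_minimum."""
    qs, scales, mins = [], [], []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        lo, hi = min(block), max(block)
        scale = (hi - lo) / 15.0 or 1.0  # 4 bits -> 16 levels
        qs.append([max(0, min(15, round((w - lo) / scale))) for w in block])
        scales.append(scale)
        mins.append(lo)
    return qs, scales, mins

def dequantize_q4_1(qs, scales, mins):
    return [q * s + m
            for block, s, m in zip(qs, scales, mins)
            for q in block]
```

Storing the block minimum lets the 4-bit codes cover an asymmetric value range, which is why the "_1" and "_K" variants cost a little extra per block but round-trip skewed weight distributions more accurately.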
  
=== Approx. loss ===
nn/index.txt · Last modified: 2025/11/26 11:51 by Jan Forman