This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revision | |||
| nn:index [2025/11/26 11:51] – [Quantization] Jan Forman | nn:index [2025/11/26 11:51] (current) – [Quantization] Jan Forman | ||
|---|---|---|---|
| Line 8: | Line 8: | ||
| ==== Quantization ==== | ==== Quantization ==== | ||
| {{: | {{: | ||
| - | |||
| - | ^ Quantization Type ^ Bits per Weight ^ Description ^ | ||
| - | | F64 | 64 | 64-bit IEEE 754 double-precision floating-point. High precision, large memory footprint. | | ||
| - | | F32 | 32 | 32-bit IEEE 754 single-precision floating-point. Standard for training, high memory usage. | | ||
| - | | F16 | 16 | 16-bit IEEE 754 half-precision floating-point. Balances precision and efficiency. | | ||
| - | | I64 | 64 | 64-bit integer. Used for specific metadata or computations. | | ||
| - | | I32 | 32 | 32-bit integer. Less common for weights. | | ||
| - | | I16 | 16 | 16-bit integer. Rarely used for model weights. | | ||
| - | | I8 | 8 | 8-bit integer. Used in some quantization schemes. | | ||
| - | | Q8_0 | 8 | 8-bit quantization, | ||
| - | | Q8_1 | 8 | 8-bit quantization with block minimum, weight = q * block_scale + block_minimum, | ||
| - | | Q8_K | 8 | 8-bit quantization, | ||
| - | | Q6_K | 6.5625 | 6-bit quantization, | ||
| - | | Q5_0 | 5 | 5-bit quantization, | ||
| - | | Q5_1 | 5 | 5-bit quantization with block minimum, legacy. | | ||
| - | | Q5_K | 5.5 | 5-bit quantization, | ||
| - | | Q4_0 | 4 | 4-bit quantization, | ||
| - | | Q4_1 | 4 | 4-bit quantization with block minimum, legacy. | | ||
| - | | Q4_K | 4.5 | 4-bit quantization, | ||
| - | | Q3_K | 3.4375 | 3-bit quantization, | ||
| - | | Q2_K | 2.625 | 2-bit quantization, | ||
| - | | IQ4_NL | 4 | 4-bit integer quantization, | ||
| - | | Q2_K | 4 | 4-bit integer quantization, | ||
| - | | I16_NL | 3 | 3-bit integer quantization, | ||
| - | | I8_NL | 3 | 3-bit integer quantization, | ||
| - | | Q4_K | 4 | 4-bit quantization, | ||
| - | | Q2_K | 2 | 2-bit integer quantization, | ||
| - | | I8_NL | 3 | 3-bit integer quantization, | ||
| - | | I4_NL | 2 | 2-bit integer quantization, | ||
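The Q8_1 row gives the reconstruction rule `weight = q * block_scale + block_minimum`. The sketch below illustrates that rule on a single block. It is a minimal illustration, not the storage layout of any particular library: `dequantize_block` and `quantize_block` are hypothetical helper names, and the way `block_scale` and `block_minimum` are chosen here (from the block's min and max) is an assumption made for the example.

```python
from typing import List, Tuple


def dequantize_block(q: List[int], block_scale: float,
                     block_minimum: float = 0.0) -> List[float]:
    """Apply weight = q * block_scale + block_minimum to every value in a block.

    With block_minimum left at 0.0 this reduces to a scale-only scheme.
    """
    return [qi * block_scale + block_minimum for qi in q]


def quantize_block(weights: List[float],
                   bits: int = 8) -> Tuple[List[int], float, float]:
    """Hypothetical helper: map one block of floats to unsigned `bits`-bit ints.

    Assumption for this sketch: the block minimum is the smallest weight and
    the scale spreads the block's value range over the available levels.
    """
    levels = (1 << bits) - 1              # e.g. 255 for 8 bits
    block_minimum = min(weights)
    span = max(weights) - block_minimum
    block_scale = span / levels if span > 0 else 1.0
    q = [round((w - block_minimum) / block_scale) for w in weights]
    return q, block_scale, block_minimum


if __name__ == "__main__":
    block = [-1.0, 0.0, 0.5, 1.0]
    q, scale, minimum = quantize_block(block, bits=8)
    restored = dequantize_block(q, scale, minimum)
    for original, approx in zip(block, restored):
        print(f"{original:+.4f} -> {approx:+.4f}")
```

Fewer bits mean fewer levels per block and therefore a larger `block_scale`, which is where the rounding error of the lower-bit types comes from: each restored weight can be off by up to half a scale step.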
| === Approx. loss === | === Approx. loss === | ||