What is EXL2?
ExLlamaV2's quantization format. It offers mixed-precision quantization at fractional average bitrates (e.g., 3.0 to 6.0 bits per weight) and is often the highest-quality option for a given VRAM budget on NVIDIA GPUs.
Full Explanation
EXL2 is the native quantization format for ExLlamaV2, a high-performance inference engine for NVIDIA GPUs. Unlike GGUF's fixed quantization levels (Q4, Q5, Q8), EXL2 supports arbitrary bits-per-weight from 2.0 to 8.0 in 0.05-bit increments, letting you precisely fill available VRAM. A 12 GB GPU can load a 13B model at exactly 5.8 bits per weight — the highest quality that fits — rather than rounding to Q5_K_M. ExLlamaV2 is NVIDIA-only and generally faster than llama.cpp on CUDA hardware.
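The VRAM-filling arithmetic above can be sketched in a few lines of Python. This is an illustration, not part of any tool: the 3.2 GB overhead figure is an assumption standing in for the CUDA context, KV cache, and activations, and the real number varies with context length.

```python
def max_bpw(params_billion: float, vram_gb: float,
            overhead_gb: float = 3.2, step: float = 0.05) -> float:
    """Highest bits-per-weight (in 0.05-bit steps) whose weights fit in VRAM.

    Rough illustration only. overhead_gb is an assumed budget for the
    CUDA context, KV cache, and activations; it grows with context length.
    """
    budget = (vram_gb - overhead_gb) * 1024**3   # bytes left for weights
    best, bpw = 0.0, 2.0
    while bpw <= 8.0:                            # EXL2 range: 2.0 to 8.0 bpw
        if params_billion * 1e9 * bpw / 8 <= budget:
            best = bpw
        bpw = round(bpw + step, 2)
    return best

print(max_bpw(13, 12))   # a 13B model on a 12 GB GPU
```

Under these assumptions a 13B model on a 12 GB card lands at roughly 5.8 bpw, in line with the example above, whereas GGUF would force a choice between fixed Q5 and Q6 variants.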
Why It Matters for Local AI
EXL2 is the format of choice for NVIDIA GPU users who want maximum quality within their VRAM budget. If you have an RTX 5070 with 12 GB of VRAM, running a 13B model at 5.8 bpw in EXL2 beats the equivalent GGUF Q5_K_M in both speed and output quality on most benchmarks.
Related Terms
Quantization
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (near-lossless), FP16 = full precision. Fewer bits mean less VRAM is required, at the cost of a slight quality reduction.
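As a toy illustration of what "reducing numeric precision" means, here is symmetric 4-bit quantization in plain Python. This is deliberately minimal; real quantizers such as GGUF's k-quants work per-group with separate scales and offsets.

```python
def quantize_4bit(values):
    """Symmetric 4-bit quantization: map each float to an int in [-7, 7]."""
    scale = max(abs(v) for v in values) / 7.0    # one scale for the whole tensor
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    """Recover approximate floats; rounding error is at most scale / 2."""
    return [x * scale for x in q]

weights = [0.82, -0.15, 0.03, -0.61]             # pretend FP16 weights
q, s = quantize_4bit(weights)
approx = dequantize(q, s)
# 4 bits per weight instead of 16: a 4x memory saving,
# paid for with a small rounding error on every weight
```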
GGUF
The standard file format for quantized LLMs used by llama.cpp and Ollama. Replaces the older GGML format. Stores model weights and metadata in a single portable file.
AWQ
Activation-Aware Weight Quantization, a 4-bit quantization method that identifies and preserves the most important weights and often beats GGUF Q4 in quality. Primarily used with vLLM and Hugging Face Transformers.
VRAM
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
CUDA
NVIDIA's proprietary parallel computing platform and the de facto industry standard for AI/ML. Nearly every AI framework (PyTorch, Ollama, ComfyUI) supports CUDA natively, and new features typically land there first.