What is AWQ?
Activation-aware Weight Quantization — a 4-bit quantization method that often beats GGUF Q4 in quality by identifying and protecting the most important weights. Primarily used with vLLM and HuggingFace Transformers.
Full Explanation
AWQ (Activation-aware Weight Quantization) analyzes activation magnitudes to find the small fraction of weights, roughly 1%, that contributes most to model quality. Instead of keeping those weights at full precision, it scales the salient channels up before quantization so they lose less accuracy, then folds the inverse scale into the preceding operation; every weight still ends up as a 4-bit integer. The resulting models are comparable to GGUF Q4_K_M in size but often score higher on reasoning benchmarks. AWQ models are distributed as SafeTensors files on HuggingFace and are used primarily with vLLM, TGI, and LMDeploy rather than llama.cpp.
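AWQ's core move is to rescale salient input channels before quantizing rather than storing them at higher precision. The effect can be illustrated with a toy NumPy sketch under simplified assumptions (symmetric round-to-nearest quantization, one scale per output row, synthetic weights and a made-up activation profile — not the real AWQ algorithm):

```python
import numpy as np

def rtn_quantize(w, bits=4):
    # Plain round-to-nearest quantization, one scale per output row.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.round(w / scale) * scale  # dequantized, for error measurement

def awq_style_quantize(w, act_mag, alpha=0.5):
    # Scale salient input channels up before quantizing, then fold the
    # inverse scale back (in a real model it is fused into the prior layer).
    s = act_mag ** alpha
    s = s / s.mean()            # keep scales centered around 1
    return rtn_quantize(w * s) / s

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128))           # (out_features, in_features)
act_mag = np.abs(rng.normal(size=128)) + 0.1
act_mag[:4] *= 50                         # a few salient input channels
x = rng.normal(size=(16, 128)) * act_mag  # inputs match those magnitudes

err_plain = np.linalg.norm(x @ w.T - x @ rtn_quantize(w).T)
err_awq = np.linalg.norm(x @ w.T - x @ awq_style_quantize(w, act_mag).T)
print(f"output error, plain RTN: {err_plain:.1f}")
print(f"output error, AWQ-style: {err_awq:.1f}")
```

Because the inputs are large exactly where the weights were scaled up (and thus quantized more finely), the AWQ-style output error comes out lower than plain round-to-nearest on the same matrix.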
Why It Matters for Local AI
If you're building a production local inference server on Linux with an NVIDIA GPU, AWQ + vLLM is often the highest-throughput option, outperforming GGUF in batch scenarios. For single-user interactive chat, GGUF with llama.cpp is simpler. Choose AWQ when serving multiple concurrent users.
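As a sketch of that serving setup (the model ID and flags are illustrative; check the vLLM documentation for your installed version), an AWQ checkpoint from HuggingFace can be served via vLLM's OpenAI-compatible server:

```shell
# Serve an AWQ-quantized model with vLLM (requires an NVIDIA GPU).
# The model ID is an example; any AWQ SafeTensors repo on HuggingFace works.
pip install vllm
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --max-model-len 4096
```

Clients then talk to it like any OpenAI-style endpoint, which is what makes it a drop-in choice for multi-user serving.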
Related Terms
Quantization→
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits mean less VRAM required, with a slight quality reduction.
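The VRAM arithmetic behind those labels is simple: bytes per weight = bits / 8. Real files add some overhead for scales and metadata (GGUF Q4_K_M, for instance, averages closer to ~4.8 bits per weight), so treat this as a lower bound:

```python
def weights_size_gb(params_billion, bits_per_weight):
    # size in GB = parameters * (bits / 8 bits-per-byte), with 1e9 cancelling
    return params_billion * bits_per_weight / 8

# Rough weight sizes for a 7B-parameter model at each precision
for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"7B at {label}: {weights_size_gb(7, bits):.1f} GB")
```

This is why a 7B model that needs a 16 GB card at FP16 fits comfortably on an 8 GB card at Q4.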
GGUF→
The standard file format for quantized LLMs used by llama.cpp and Ollama. Replaces the older GGML format. Stores model weights and metadata in a single portable file.
EXL2→
ExLlamaV2's quantization format — offers mixed-precision quantization (e.g., 3.0 to 6.0 bits per weight) and is often the highest-quality option for a given VRAM budget on NVIDIA GPUs.
CUDA→
NVIDIA's proprietary parallel computing platform and the de facto industry standard for AI/ML. Nearly every AI framework (PyTorch, Ollama, ComfyUI) supports CUDA natively, and usually before any alternative backend.
VRAM→
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
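A back-of-the-envelope fit check follows from the same bits-per-weight arithmetic. The fixed overhead figure here is an assumption for illustration; real KV-cache usage grows with context length and batch size:

```python
def fits_in_vram(params_billion, bits_per_weight, vram_gb, overhead_gb=2.0):
    # Weights plus a rough fixed allowance for KV cache and activations.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb <= vram_gb

print(fits_in_vram(7, 4, 8))    # 3.5 GB of Q4 weights fits an 8 GB card
print(fits_in_vram(13, 16, 24)) # 26 GB of FP16 weights alone exceeds 24 GB
```

When the check fails, the model still runs, but the spillover to system RAM over PCIe is what causes the dramatic slowdown described above.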