What is AWQ?
Activation-aware Weight Quantization — a 4-bit quantization method that often beats GGUF Q4 in quality by identifying and protecting the most important weights. Primarily used with vLLM and HuggingFace Transformers.
Full Explanation
AWQ (Activation-aware Weight Quantization) analyzes activation magnitudes to find the small fraction of weights, roughly 1%, that contributes most to model quality. Instead of keeping those weights at full precision, it scales the salient channels up before quantization so they lose less accuracy, then folds the inverse scale into the preceding operation; every weight still ends up as a 4-bit integer. The resulting models are comparable to GGUF Q4_K_M in size but often score higher on reasoning benchmarks. AWQ models are distributed as SafeTensors files on HuggingFace and are used primarily with vLLM, TGI, and LMDeploy rather than llama.cpp.
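AWQ's core move is to rescale salient input channels before quantizing rather than storing them at higher precision. The effect can be illustrated with a toy NumPy sketch under simplified assumptions (symmetric round-to-nearest quantization, one scale per output row, synthetic weights and a made-up activation profile — not the real AWQ algorithm):

```python
import numpy as np

def rtn_quantize(w, bits=4):
    # Plain round-to-nearest quantization, one scale per output row.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.round(w / scale) * scale  # dequantized, for error measurement

def awq_style_quantize(w, act_mag, alpha=0.5):
    # Scale salient input channels up before quantizing, then fold the
    # inverse scale back (in a real model it is fused into the prior layer).
    s = act_mag ** alpha
    s = s / s.mean()            # keep scales centered around 1
    return rtn_quantize(w * s) / s

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128))           # (out_features, in_features)
act_mag = np.abs(rng.normal(size=128)) + 0.1
act_mag[:4] *= 50                         # a few salient input channels
x = rng.normal(size=(16, 128)) * act_mag  # inputs match those magnitudes

err_plain = np.linalg.norm(x @ w.T - x @ rtn_quantize(w).T)
err_awq = np.linalg.norm(x @ w.T - x @ awq_style_quantize(w, act_mag).T)
print(f"output error, plain RTN: {err_plain:.1f}")
print(f"output error, AWQ-style: {err_awq:.1f}")
```

Because the inputs are large exactly where the weights were scaled up (and thus quantized more finely), the AWQ-style output error comes out lower than plain round-to-nearest on the same matrix.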
Why It Matters for Local AI
If you're building a production local inference server on Linux with an NVIDIA GPU, AWQ + vLLM is often the highest-throughput option, outperforming GGUF in batch scenarios. For single-user interactive chat, GGUF with llama.cpp is simpler. Choose AWQ when serving multiple concurrent users.
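As a sketch of that serving setup (the model ID and flags are illustrative; check the vLLM documentation for your installed version), an AWQ checkpoint from HuggingFace can be served via vLLM's OpenAI-compatible server:

```shell
# Serve an AWQ-quantized model with vLLM (requires an NVIDIA GPU).
# The model ID is an example; any AWQ SafeTensors repo on HuggingFace works.
pip install vllm
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --max-model-len 4096
```

Clients then talk to it like any OpenAI-style endpoint, which is what makes it a drop-in choice for multi-user serving.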
Related Terms
Quantization→
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits mean less VRAM required, with a slight quality reduction.
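The VRAM arithmetic behind those labels is simple: bytes per weight = bits / 8. Real files add some overhead for scales and metadata (GGUF Q4_K_M, for instance, averages closer to ~4.8 bits per weight), so treat this as a lower bound:

```python
def weights_size_gb(params_billion, bits_per_weight):
    # size in GB = parameters * (bits / 8 bits-per-byte), with 1e9 cancelling
    return params_billion * bits_per_weight / 8

# Rough weight sizes for a 7B-parameter model at each precision
for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"7B at {label}: {weights_size_gb(7, bits):.1f} GB")
```

This is why a 7B model that needs a 16 GB card at FP16 fits comfortably on an 8 GB card at Q4.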
GGUF→
The standard file format for quantized LLMs used by llama.cpp and Ollama. Replaces the older GGML format. Stores model weights and metadata in a single portable file.
EXL2→
ExLlamaV2's quantization format — offers mixed-precision quantization (e.g., 3.0 to 6.0 bits per weight) and is often the highest-quality option for a given VRAM budget on NVIDIA GPUs.
CUDA→
NVIDIA's proprietary parallel computing platform and the de facto industry standard for AI/ML. Nearly every AI framework (PyTorch, Ollama, ComfyUI) supports CUDA natively, and usually before any alternative backend.
VRAM→
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
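A back-of-the-envelope fit check follows from the same bits-per-weight arithmetic. The fixed overhead figure here is an assumption for illustration; real KV-cache usage grows with context length and batch size:

```python
def fits_in_vram(params_billion, bits_per_weight, vram_gb, overhead_gb=2.0):
    # Weights plus a rough fixed allowance for KV cache and activations.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb <= vram_gb

print(fits_in_vram(7, 4, 8))    # 3.5 GB of Q4 weights fits an 8 GB card
print(fits_in_vram(13, 16, 24)) # 26 GB of FP16 weights alone exceeds 24 GB
```

When the check fails, the model still runs, but the spillover to system RAM over PCIe is what causes the dramatic slowdown described above.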