What Is a Context Window?
The maximum amount of text (in tokens) a model can "see" at once. Larger context = more document history, longer conversations, bigger code files — but requires more VRAM.
Full Explanation
A model's context window is the maximum number of tokens it can attend to at once, counting both your input prompt and the output it generates. Llama 3.1 supports up to 128K tokens (roughly 96,000 words). Every additional token in the context increases VRAM consumption, because the KV cache grows with it: running a 128K-token context can require significantly more memory than the model weights alone, often doubling total VRAM usage. Most local AI setups cap context at 4K–8K tokens to keep memory consumption manageable.
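As a rough illustration of how context length drives memory, the sketch below estimates KV-cache size from a model's attention dimensions. The figures used (32 layers, 8 KV heads, head dimension 128, FP16 cache) are assumed Llama-3.1-8B-style values chosen for the example, not measured numbers.

    # Rough KV-cache estimate: 2 (keys + values) x layers x kv_heads
    # x head_dim x context_length x bytes_per_element.
    # Model dimensions below are assumed Llama-3.1-8B-style values.
    def kv_cache_bytes(context_len: int,
                       n_layers: int = 32,
                       n_kv_heads: int = 8,
                       head_dim: int = 128,
                       bytes_per_elem: int = 2) -> int:  # 2 bytes = FP16 cache
        return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

    for ctx in (4_096, 8_192, 32_768, 131_072):
        print(f"{ctx:>7} tokens -> ~{kv_cache_bytes(ctx) / 1024**3:.1f} GiB of KV cache")

Under these assumptions the cache grows from about 0.5 GiB at 4K tokens to roughly 16 GiB at 128K, which is why a long context can dwarf the quantized weights themselves.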
Why It Matters for Local AI
For coding assistants working on large codebases, a 32K+ context window is transformative; for simple Q&A, 4K is usually sufficient. Check whether your hardware supports your target context size: a 13B model at 32K context can require 24+ GB of VRAM, putting it out of reach of 12 GB cards.
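If you run GGUF models through llama-cpp-python, the context size is fixed when the model is loaded, which is where this trade-off is made concrete. A minimal sketch, assuming a local Llama 3.1 GGUF file (the path is a hypothetical example):

    from llama_cpp import Llama

    # n_ctx sets the context window; llama.cpp reserves KV-cache memory
    # for the whole window at load time. The path is a hypothetical example.
    llm = Llama(
        model_path="./llama-3.1-8b-instruct.Q4_K_M.gguf",
        n_ctx=8192,        # target context window in tokens
        n_gpu_layers=-1,   # offload all layers to the GPU if they fit
    )

    out = llm("Summarize the context window trade-off in one sentence.",
              max_tokens=64)
    print(out["choices"][0]["text"])

If the model plus its KV cache does not fit in VRAM, reduce n_ctx or pick a smaller quantization until it does.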
Hardware Relevant to Context Window
Mini PC · 24 GB unified memory · 273 GB/s memory bandwidth
GPU · 12 GB VRAM · 672 GB/s memory bandwidth
Related Terms
VRAM
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Quantization
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits mean less VRAM is required, at a slight cost in quality (see the rough size sketch after this list).
KV Cache
Key-Value Cache — stores intermediate attention computations so the model doesn't re-process earlier context on each new token. Larger context = larger KV cache = more VRAM needed.
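As a back-of-the-envelope complement to the quantization entry above, the sketch below estimates the weights-only footprint at different precisions. The bytes-per-parameter figures (Q4 ≈ 0.5, Q8 ≈ 1, FP16 = 2) are approximations; real GGUF files vary because layers are quantized at mixed precisions, and the KV cache comes on top of these numbers.

    # Approximate weights-only footprint at different quantization levels.
    # Bytes-per-parameter values are rough averages, not exact GGUF sizes.
    BYTES_PER_PARAM = {"Q4": 0.5, "Q8": 1.0, "FP16": 2.0}

    for params_billions in (7, 13, 70):
        for quant, bpp in BYTES_PER_PARAM.items():
            gib = params_billions * 1e9 * bpp / 1024**3
            print(f"{params_billions:>3}B model at {quant:>4}: ~{gib:6.1f} GiB of weights")

This weights figure is the one to compare against your card's VRAM before adding the context window's KV cache on top.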