What is RAG?
Retrieval-Augmented Generation — a technique that lets an LLM answer questions using external documents by fetching relevant chunks at query time instead of relying on training data alone.
Full Explanation
RAG (Retrieval-Augmented Generation) combines a vector database with an LLM to enable grounded, document-aware responses without fine-tuning. Documents are first split into chunks, converted into vectors by an embedding model, and stored in the vector database. When a query arrives, the most relevant chunks are retrieved from the vector store and injected into the prompt as context, and the LLM generates an answer based on those chunks rather than solely on its training data. Common local RAG stacks pair Ollama for inference with tools like Chroma, LanceDB, or pgvector for retrieval.
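A minimal sketch of that flow in Python, assuming the chromadb and ollama packages are installed, an Ollama server is running, and a chat model and embedding model (llama3 and nomic-embed-text here, purely as examples) have been pulled. The documents, collection name, and prompt wording are illustrative, not a fixed recipe.

```python
# Minimal local RAG sketch (illustrative): index, retrieve, generate.
# Assumes `pip install chromadb ollama`, a running Ollama server, and
# `ollama pull llama3` / `ollama pull nomic-embed-text` (example models).
import chromadb
import ollama

# 1. Index: embed each document chunk and store the vectors in Chroma.
docs = [
    "Our VPN setup guide: install WireGuard, then import the office config.",
    "Expense policy: receipts are required for purchases above 50 EUR.",
]
client = chromadb.Client()
collection = client.create_collection("handbook")
for i, doc in enumerate(docs):
    emb = ollama.embeddings(model="nomic-embed-text", prompt=doc)["embedding"]
    collection.add(ids=[str(i)], embeddings=[emb], documents=[doc])

# 2. Retrieve: embed the query and pull the most similar chunks.
query = "Do I need a receipt for a 30 EUR purchase?"
q_emb = ollama.embeddings(model="nomic-embed-text", prompt=query)["embedding"]
hits = collection.query(query_embeddings=[q_emb], n_results=2)
context = "\n".join(hits["documents"][0])

# 3. Generate: inject the retrieved chunks into the prompt as context.
answer = ollama.chat(
    model="llama3",
    messages=[{
        "role": "user",
        "content": f"Answer using only this context:\n{context}\n\nQuestion: {query}",
    }],
)
print(answer["message"]["content"])
```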
Why It Matters for Local AI
RAG is the primary use case for local AI in enterprise and productivity contexts — querying internal documents, codebases, or knowledge bases privately. Context window size matters here: a longer context window (32K tokens or more) lets you inject more retrieved chunks without truncation, improving answer quality on complex document sets.
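As a rough illustration of that budget, the sketch below packs relevance-sorted chunks into a context window using a crude four-characters-per-token estimate. The pack_chunks helper and the reserved-token figure are illustrative assumptions, not part of any particular library; real token counts depend on the model's tokenizer.

```python
# Rough sketch: fit as many retrieved chunks as possible into a context budget.
def pack_chunks(chunks: list[str], context_window: int, reserved: int = 1024) -> list[str]:
    """Keep chunks (already sorted by relevance) until the token budget is spent.

    `reserved` leaves room for the system prompt, the question, and the answer.
    Token counts use a crude ~4 characters-per-token estimate.
    """
    budget = context_window - reserved
    packed, used = [], 0
    for chunk in chunks:
        tokens = len(chunk) // 4 + 1  # rough estimate, tokenizer-dependent
        if used + tokens > budget:
            break
        packed.append(chunk)
        used += tokens
    return packed

# A 32K-token window leaves far more room for chunks than a 4K window.
print(len(pack_chunks(["x" * 2000] * 100, context_window=32768)))  # ~63 chunks
print(len(pack_chunks(["x" * 2000] * 100, context_window=4096)))   # ~6 chunks
```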
Hardware Relevant to RAG
Mini PC · 24 GB unified memory · 273 GB/s memory bandwidth
GPU · 12 GB VRAM · 672 GB/s memory bandwidth
Related Terms
Context Window
The maximum amount of text (in tokens) a model can "see" at once. Larger context = more document history, longer conversations, bigger code files — but requires more VRAM.
Embedding Model
A model that converts text into numerical vectors for similarity search. Required for RAG pipelines. Much smaller and faster than chat LLMs — runs comfortably on CPU.
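A small sketch of what similarity search looks like once text is embedded, again assuming an Ollama server with an embedding model pulled (nomic-embed-text here as an example); the embed and cosine helpers are illustrative, not a standard API.

```python
# Sketch: similarity search over embeddings (any embedding model works the same way).
import numpy as np
import ollama

def embed(text: str) -> np.ndarray:
    # Convert text into a numerical vector using a local embedding model.
    return np.array(ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: 1.0 means identical direction, ~0 means unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embed("How do I reset my password?")
doc_a = embed("Password reset: click 'Forgot password' on the login page.")
doc_b = embed("The cafeteria serves lunch from 11:30 to 14:00.")

# The on-topic document scores higher, so it would be retrieved first.
print(cosine(query, doc_a), cosine(query, doc_b))
```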
Ollama
Free open-source tool for running LLMs locally on macOS, Linux, and Windows. Download a model with a single command. No cloud account required. Supports Llama, Mistral, Qwen, Phi, and more.
KV Cache
Key-Value Cache — stores intermediate attention computations so the model doesn't re-process earlier context on each new token. Larger context = larger KV cache = more VRAM needed.
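A back-of-the-envelope sketch of how KV cache size grows with context length. The layer, head, and dimension figures below are assumptions typical of a Llama-3-8B-class model in FP16, and the formula ignores framework overhead, so treat the outputs as rough estimates.

```python
# Rough KV cache size estimate: 2 (keys + values) x layers x KV heads x head dim
# x bytes per value x tokens in context. Defaults approximate a Llama-3-8B-style
# model in FP16; real numbers vary by model and runtime.
def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len

for tokens in (4096, 8192, 32768):
    print(f"{tokens:>6} tokens -> {kv_cache_bytes(tokens) / 2**30:.1f} GiB of KV cache")
# 4K of context needs ~0.5 GiB; 32K needs ~4 GiB on top of the model weights.
```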