What Is an Embedding Model?
A model that converts text into numerical vectors for similarity search. Required for RAG pipelines. Much smaller and faster than chat LLMs — runs comfortably on CPU.
Full Explanation
Embedding models transform text into high-dimensional numerical vectors that capture semantic meaning. Similar texts produce similar vectors, enabling semantic search over document collections. In a RAG pipeline, every document chunk is converted to an embedding and stored in a vector database; at query time, the query is embedded the same way and the nearest vectors are retrieved. Popular local embedding models include nomic-embed-text and mxbai-embed-large, both available via Ollama and requiring less than 1 GB of memory.
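The sketch below shows that retrieval loop end to end, as a minimal illustration rather than a prescribed setup: it assumes an Ollama server on localhost:11434 with nomic-embed-text already pulled, uses Ollama's documented /api/embeddings REST endpoint, and stands in a plain in-memory list for a real vector database.

```python
# Minimal RAG retrieval sketch against a local Ollama embedding model.
# Assumes `ollama pull nomic-embed-text` has been run and the server is up.
import math
import requests

OLLAMA_URL = "http://localhost:11434/api/embeddings"

def embed(text: str) -> list[float]:
    """Convert text to a vector via the local embedding model."""
    resp = requests.post(OLLAMA_URL, json={"model": "nomic-embed-text", "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# "Index" a few document chunks; a real pipeline would use a vector database.
chunks = [
    "The RTX 5070 has 12 GB of GDDR7 VRAM.",
    "Ollama runs large language models locally.",
    "Embedding models map text to vectors for semantic search.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# At query time: embed the query, then rank chunks by similarity.
query_vec = embed("How much VRAM does the 5070 have?")
ranked = sorted(((c, cosine(query_vec, v)) for c, v in index),
                key=lambda t: t[1], reverse=True)
for chunk, score in ranked:
    print(f"{score:.3f}  {chunk}")
```

The top-ranked chunk is what a RAG pipeline would paste into the chat model's prompt; everything above the query step runs once at indexing time.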
Why It Matters for Local AI
Embedding models are always-on background services in a RAG setup. Because they're tiny compared to chat models, they can run on CPU without a dedicated GPU — making any mini PC a viable RAG server. An RTX 5070 system can run embeddings on CPU while the GPU handles chat inference simultaneously.
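As a sketch of that CPU/GPU split, the snippet below pins an embedding model to the CPU using the sentence-transformers library; both the library choice and the all-MiniLM-L6-v2 model name are illustrative assumptions, not something the setup above requires.

```python
# Sketch: force the embedding model onto the CPU so the GPU stays free
# for chat inference. Assumes `pip install sentence-transformers`;
# all-MiniLM-L6-v2 is an illustrative small model (~90 MB on disk).
from sentence_transformers import SentenceTransformer

# device="cpu" pins this model to the CPU regardless of available GPUs.
encoder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

vectors = encoder.encode([
    "chunk one of the document",
    "chunk two of the document",
])
print(vectors.shape)  # (2, 384): two chunks, 384-dimensional embeddings
```

With the encoder held on CPU, a chat model served on the GPU (for example via Ollama) runs unaffected, which is the concurrent arrangement described above.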
Hardware Relevant to Embedding Model
Mini PC · 16 GB unified memory · 120 GB/s memory bandwidth
Mini PC · 16 GB unified memory · 68 GB/s memory bandwidth
Related Terms
RAG
Retrieval-Augmented Generation — a technique that lets an LLM answer questions using external documents by fetching relevant chunks at query time instead of relying on training data alone.
Context Window
The maximum amount of text (in tokens) a model can "see" at once. Larger context = more document history, longer conversations, bigger code files — but requires more VRAM.
Ollama
Free open-source tool for running LLMs locally on macOS, Linux, and Windows. Download a model with a single command. No cloud account required. Supports Llama, Mistral, Qwen, Phi, and more.
CPU Inference
Running LLMs on the CPU rather than a GPU. Works on any hardware, no special drivers needed. Limited to ~8–12 t/s on 7B models — fine for background tasks, slow for interactive use.