What is Multimodal?
Models that process both text and images (and sometimes audio or video). Examples: LLaVA, Qwen-VL, Gemma 3. They require additional VRAM for the vision encoder on top of the language model.
Full Explanation
Multimodal LLMs combine a language model with a vision encoder (typically a CLIP or SigLIP variant) that converts images into token embeddings the language model can process. This adds 0.5–2 GB of VRAM overhead on top of the base model. A 7B multimodal model like LLaVA-1.6 requires roughly 6–7 GB of VRAM to run at Q4 — fitting comfortably on a 12 GB GPU but tight on 8 GB cards. Most multimodal models are supported natively in Ollama and llama.cpp.
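As a rough sketch of that arithmetic (the vision-encoder and runtime overheads below are assumed round numbers, not measured figures), you can estimate total VRAM for a quantized multimodal model like this:

```python
def estimate_multimodal_vram_gb(
    params_billions: float,
    bits_per_weight: int = 4,          # Q4 quantization
    vision_overhead_gb: float = 1.5,   # assumed CLIP/SigLIP encoder + projector
    runtime_overhead_gb: float = 1.0,  # assumed KV cache and runtime buffers
) -> float:
    """Back-of-the-envelope VRAM estimate for a quantized multimodal model."""
    weights_gb = params_billions * bits_per_weight / 8  # GB per billion parameters
    return weights_gb + vision_overhead_gb + runtime_overhead_gb

# A 7B model at Q4: 7 * 4 / 8 = 3.5 GB of weights,
# plus ~1.5 GB vision encoder and ~1 GB runtime overhead ≈ 6 GB total.
print(f"{estimate_multimodal_vram_gb(7):.1f} GB")
```

Plugging in 7B parameters at 4 bits lands at roughly 6 GB, consistent with the LLaVA-1.6 figure above.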
Why It Matters for Local AI
Multimodal models enable use cases like analyzing screenshots, extracting data from photos of documents, describing images for accessibility tools, or building local alternatives to GPT-4o Vision. For GPU buyers, the practical advice is: if you plan to use vision models, add 2 GB to your minimum VRAM estimate.
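To make one of these use cases concrete, here is a minimal sketch of asking a local vision model to describe a screenshot through Ollama's HTTP API. The model name ("llava"), file path, and prompt are placeholders; the request shape follows Ollama's /api/generate endpoint, which accepts base64-encoded images.

```python
import base64
import json
import urllib.request

# Read a local screenshot and base64-encode it for the API.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "llava",                      # any multimodal model you have pulled into Ollama
    "prompt": "Describe this screenshot.",
    "images": [image_b64],                 # Ollama accepts base64-encoded images here
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```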
Hardware Relevant to Multimodal
GPU · 12 GB VRAM · 672 GB/s memory bandwidth
GPU · 16 GB VRAM · 960 GB/s memory bandwidth
Mini PC · 24 GB unified memory · 273 GB/s memory bandwidth
Related Terms
VRAM→
Video RAM — dedicated memory on a GPU. Determines the maximum model size you can run with full GPU acceleration. Once a model exceeds VRAM, it spills to system RAM over the slow PCIe bus.
Ollama→
Free open-source tool for running LLMs locally on macOS, Linux, and Windows. Download a model with a single command. No cloud account required. Supports Llama, Mistral, Qwen, Phi, and more.
Quantization→
Compressing a model by reducing numeric precision. Q4 = 4-bit (smallest, fastest), Q8 = 8-bit (balanced), FP16 = full precision. Fewer bits = less VRAM required, with a slight quality reduction.
Context Window→
The maximum amount of text (in tokens) a model can "see" at once. Larger context = more document history, longer conversations, bigger code files — but requires more VRAM.