Local LLM RAG Setup on Consumer Hardware
Retrieval-Augmented Generation (RAG) lets your local LLM answer questions from your own documents — PDFs, notes, codebases, support tickets — without sending data to the cloud. This guide walks through a complete local RAG pipeline on a single Mac or PC, using open-source tools and models that fit in 8–32 GB of RAM.
What Is Local RAG and Why Run It on Consumer Hardware?
RAG pipelines retrieve relevant document chunks before the LLM generates a response. Instead of training the model on your data, you feed it the most relevant context at query time. This gives accurate, citeable answers without expensive fine-tuning.
Running RAG locally means:
- Data never leaves your machine — critical for legal, medical, or proprietary documents
- No API costs — a one-time hardware investment with zero per-query fees
- Offline operation — works in air-gapped environments or when internet is unreliable
- Complete control — you choose the embedding model, the vector store, and the LLM
Consumer hardware (Mac Mini with 16 GB, a mid-range PC with 32 GB, or even a laptop) can handle document collections of hundreds to thousands of pages with sub-second retrieval times.
Pipeline Overview — Four Components
Every local RAG system has the same four building blocks:
| Component | Role | Local Options |
|---|---|---|
| Document loader | Extract text from PDFs, Markdown, HTML, Word, etc. | Unstructured, pypdf, langchain_community loaders (LlamaParse is a hosted service, so it sends documents off the machine) |
| Embedding model | Convert text chunks into vector representations | BGE-small, all-MiniLM-L6-v2, nomic-embed-text |
| Vector store | Store and similarity-search embeddings | Chroma, FAISS, LanceDB |
| LLM | Generate answers from retrieved context | Llama 3.1 8B, Mistral 7B, Phi-4, Qwen 2.5 7B |
All four run on a single machine. The embedding model and vector store together use less than 2 GB of RAM. The LLM is the main consumer — choose a 7B or 8B parameter model for smooth performance on 16–32 GB systems.
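As a rough sanity check on that sizing advice, the memory a quantized model needs can be approximated from its parameter count. The sketch below is back-of-envelope arithmetic, not a measurement: the 4.8 bits per weight for a 4-bit quantization and the 1.5 GB allowance for the KV cache and runtime are assumptions, and `estimate_model_ram_gb` is a hypothetical helper name.

```python
def estimate_model_ram_gb(params_billion: float,
                          bits_per_weight: float = 4.8,
                          overhead_gb: float = 1.5) -> float:
    """Rough RAM needed to run a quantized model, in GB (10^9 bytes).

    bits_per_weight and overhead_gb are assumed values; adjust them to
    match the quantization and context length you actually use.
    """
    # params_billion * 1e9 weights * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


for label, size in [("3.8B", 3.8), ("8B", 8.0), ("14B", 14.0)]:
    print(f"{label}: ~{estimate_model_ram_gb(size):.1f} GB")
```

Whatever the estimate says, leave headroom for the operating system, the embedding model, and the vector store.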
Step 1 — Install the Foundation: Ollama + Embeddings
Start by installing Ollama, which serves both the LLM and the embedding model. It runs as a local API on port 11434.
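Once Ollama is installed and the models are pulled (for example with `ollama pull llama3.1:8b` and `ollama pull nomic-embed-text`), a quick way to confirm the local API is up is to ask it which models it has. This is a minimal sketch using only the standard library; it assumes the default address on port 11434 and Ollama's `/api/tags` endpoint, and `list_local_models` is a hypothetical helper name.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address, per the text above


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 2.0):
    """Return the names of installed models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):  # connection refused, timeout, or a non-JSON reply
        return None
    return [model["name"] for model in payload.get("models", [])]


models = list_local_models()
if models is None:
    print("Ollama is not reachable; start it with `ollama serve`.")
else:
    print("Installed models:", models)
```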
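If you have less than 16 GB of RAM, use a smaller model such as Phi-4-mini (3.8B, phi4-mini in the Ollama library) instead of llama3.1:8b. qwen2.5:7b is another option, although at 7B parameters it is only slightly lighter than the 8B model. With 4-bit quantization, these models fit on 8 GB machines.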
Test that embeddings work:
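One way to run this check without any extra packages is to call Ollama's HTTP API directly. The sketch below assumes the `nomic-embed-text` model has been pulled and uses the `/api/embeddings` endpoint; the `embed` helper and the sample sentence are illustrative, not part of the original guide.

```python
import json
import urllib.request


def embed(text: str, model: str = "nomic-embed-text",
          base_url: str = "http://localhost:11434", timeout: float = 30.0):
    """Request one embedding vector from Ollama's /api/embeddings endpoint."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["embedding"]


try:
    vector = embed("Local RAG keeps documents on this machine.")
    print(f"Embedding dimension: {len(vector)}")
except (OSError, KeyError, ValueError) as exc:
    # Server not running, model not pulled, or an unexpected response body
    print(f"Embedding request failed: {exc}")
```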
Expected output with nomic-embed-text: Embedding dimension: 768. Other embedding models report a different dimension. If Ollama isn't running, start it with ollama serve in a terminal.
Step 2 — Load and Chunk Your Documents
Documents must be split into chunks small enough to fit the LLM's context window. A good default is 512 characters with 128-character overlap — enough to preserve paragraph continuity without wasting context on redundant boundaries.
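A loading and chunking step along these lines might look like the following sketch. It assumes the `langchain-community`, `langchain-text-splitters`, and `pypdf` packages are installed; `load_and_chunk` is a hypothetical helper, and the choice of `PyPDFLoader` as the per-file loader is an assumption.

```python
def load_and_chunk(folder: str, chunk_size: int = 512, chunk_overlap: int = 128):
    """Load every PDF under `folder` and split the pages into overlapping chunks."""
    # Imported inside the function so this file still loads before the
    # packages are installed.
    from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # Walk the folder recursively and parse each PDF (one Document per page).
    loader = DirectoryLoader(folder, glob="**/*.pdf", loader_cls=PyPDFLoader)
    pages = loader.load()

    # Sizes are measured in characters, matching the defaults discussed above.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    return splitter.split_documents(pages)
```

Calling `load_and_chunk("./docs")`, where `./docs` stands in for your own folder, returns a list of chunk documents that keep their source metadata.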
The DirectoryLoader processes all PDFs in a folder. For Markdown or HTML files, use TextLoader or UnstructuredHTMLLoader instead. The recursive splitter preserves paragraph boundaries when possible — cleaner chunks produce better retrieval.
Memory note: A 500-page PDF generates roughly 2,000–3,000 chunks at 512 chars each. Storing embeddings for this collection takes about 50–80 MB in Chroma. Even a 10,000-page corpus fits within 2 GB of disk.
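The chunk count can be sanity-checked with simple arithmetic. The sketch below treats chunking as a fixed sliding window, which only approximates a recursive splitter, and assumes about 2,000 characters per page and 768-dimensional float32 vectors. It counts raw vectors only; the larger on-disk figure presumably also covers the stored chunk text, metadata, and index structures.

```python
import math


def estimate_chunks(total_chars: int, chunk_size: int = 512, overlap: int = 128) -> int:
    """Approximate chunk count for a sliding window with the given overlap."""
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    stride = chunk_size - overlap  # new characters contributed by each extra chunk
    return 1 + math.ceil((total_chars - chunk_size) / stride)


def raw_vector_mb(n_chunks: int, dim: int = 768, bytes_per_float: int = 4) -> float:
    """Size of the embedding vectors alone, in MB (10^6 bytes)."""
    return n_chunks * dim * bytes_per_float / 1e6


pages, chars_per_page = 500, 2000  # chars_per_page is an assumed average
n_chunks = estimate_chunks(pages * chars_per_page)
print(f"{n_chunks} chunks, {raw_vector_mb(n_chunks):.1f} MB of raw vectors")
```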
Step 3 — Store Embeddings in a Local Vector Database
Chroma is the simplest vector store for local setups — zero-config, persist-to-disk, and runs entirely in-process. No separate server process needed.
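A build-once, reuse-afterwards pattern could be sketched as follows. It assumes the `langchain-chroma` and `langchain-ollama` packages; the `./chroma_db` path and both helper names are placeholders. Treating any non-empty directory as an existing index is a simplification.

```python
from pathlib import Path


def index_exists(persist_dir: str) -> bool:
    """True if the persist directory already holds files from an earlier run."""
    path = Path(persist_dir)
    return path.is_dir() and any(path.iterdir())


def build_or_load_store(chunks, persist_dir: str = "./chroma_db",
                        embed_model: str = "nomic-embed-text"):
    """Reuse a persisted Chroma index if present, otherwise embed and persist."""
    # Imported here so the helper above works without these packages.
    from langchain_chroma import Chroma
    from langchain_ollama import OllamaEmbeddings

    embeddings = OllamaEmbeddings(model=embed_model)
    if index_exists(persist_dir):
        # Later runs: open the existing index without re-embedding anything.
        return Chroma(persist_directory=persist_dir, embedding_function=embeddings)
    # First run: embed every chunk and write the index to disk.
    return Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_dir,
    )
```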
First run takes a few minutes depending on document count — the embedding model processes each chunk sequentially. Subsequent runs are instant: Chroma loads the persisted index from disk.
Test the retrieval:
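A small helper for eyeballing results might look like this. It assumes `store` is a LangChain vector store such as the Chroma index built earlier, which exposes `similarity_search_with_score`; `show_matches` is a hypothetical name.

```python
def show_matches(store, query: str, k: int = 3):
    """Print the top-k chunks with their scores and return the raw results."""
    results = store.similarity_search_with_score(query, k=k)
    for doc, score in results:
        source = doc.metadata.get("source", "unknown")
        preview = doc.page_content[:80].replace("\n", " ")
        print(f"{score:.3f}  {source}  {preview}")
    return results
```

Run it with a question you know the answer to, so you can judge whether the right passages come back.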
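Chroma returns a distance rather than a similarity, so lower scores mean closer matches. Scores below 0.5 indicate strong matches in this setup. If results seem irrelevant, try a smaller chunk size (256 chars) or a larger overlap; note that the default 128-character overlap is already 25% of a 512-character chunk.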
Step 4 — Wire the LLM to Answer with Retrieved Context
The final piece: feed retrieved chunks into the LLM prompt and let it answer with citations. A simple RAG prompt template:
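One possible shape for this step is sketched below. The prompt wording beyond its first sentence, the bracketed citation format, and the `answer` helper are assumptions; `retriever` and `llm` are expected to be LangChain objects that expose `invoke`.

```python
PROMPT = """Answer using ONLY the context below. Cite the source of each claim in [brackets].

Context:
{context}

Question: {question}
Answer:"""


def format_docs(docs) -> str:
    """Join chunks into one context string, tagging each with its source file."""
    parts = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        parts.append(f"[{source}]\n{doc.page_content}")
    return "\n\n".join(parts)


def answer(question: str, retriever, llm) -> str:
    """Retrieve chunks, build the prompt, and return the model's reply text."""
    docs = retriever.invoke(question)
    prompt = PROMPT.format(context=format_docs(docs), question=question)
    reply = llm.invoke(prompt)
    # Chat models return a message object; plain LLMs return a string.
    return getattr(reply, "content", reply)


# Typical wiring (assumes langchain-ollama is installed and `store` is a Chroma index):
#   from langchain_ollama import ChatOllama
#   retriever = store.as_retriever(search_kwargs={"k": 3})
#   llm = ChatOllama(model="llama3.1:8b", temperature=0.1)
#   print(answer("What is the refund policy?", retriever, llm))
```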
Key parameters:
- k=3: retrieve 3 document chunks. Increasing to 5 improves coverage but consumes more context window
- temperature=0.1: low temperature keeps answers factual and grounded
- format_docs: adds source attribution so you can verify where each claim came from
On a 16 GB Mac Mini with llama3.1:8b, end-to-end query time (retrieval + generation) is typically 3–8 seconds per question. The LLM step dominates — chunk retrieval completes in under 200 ms.
Performance Tuning — What Works on 8 GB, 16 GB, and 32 GB
Not all consumer hardware is equal. Here are tested configurations:
| Hardware | LLM | Query speed | Max document pages | Notes |
|---|---|---|---|---|
| 8 GB Mac / PC | Phi-4-mini 3.8B (Q4_K_M) | 2–4 sec | ~500 | Use light embedding (all-MiniLM-L6-v2). Keep Ollama running in server mode. |
| 16 GB Mac Mini | Llama 3.1 8B (Q4_K_M) | 3–8 sec | ~3,000 | Sweet spot. nomic-embed-text works well. Serves 1–2 concurrent users. |
| 32 GB PC | Qwen 2.5 14B or Mistral NeMo 12B | 5–12 sec | ~10,000 | Can run a 14B model comfortably. Add sentence-transformers for bulk embedding. |
For batch processing (indexing thousands of pages), pre-compute embeddings offline. The embedding model processes ~50–100 chunks per second on CPU. Larger document collections benefit from a one-time batch run before the first query.
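A batch run can be as simple as feeding the store fixed-size slices and printing progress. In this sketch, `store` is assumed to be a LangChain vector store exposing `add_documents`; the batch size of 200 and the helper names are illustrative choices.

```python
def batched(items, size: int):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def index_in_batches(store, chunks, batch_size: int = 200) -> int:
    """Add chunks to the vector store batch by batch; return how many were indexed."""
    done = 0
    for batch in batched(chunks, batch_size):
        store.add_documents(batch)  # embeds and stores this batch
        done += len(batch)
        print(f"indexed {done}/{len(chunks)}")
    return done


def estimated_seconds(n_chunks: int, chunks_per_second: float) -> float:
    """Indexing time implied by a given embedding throughput."""
    return n_chunks / chunks_per_second
```

At the throughput quoted above, `estimated_seconds(3000, 50)` gives 60 seconds for a 3,000-chunk collection at the slow end of the range.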
Common Pitfalls and How to Avoid Them
Chunks are too large or too small
A chunk of 1,024 characters may contain two unrelated topics, confusing the LLM. A chunk of 128 characters loses context. Stick to 512 with 128 overlap — tested across PDFs, Markdown, and HTML.
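To get a feel for how chunk size changes the number of pieces, a plain character-window splitter is enough for experimenting. This is a simplified stand-in, not the recursive splitter used earlier: it ignores paragraph boundaries, and the placeholder text is synthetic.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 128):
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "word " * 1000  # 5,000 characters of placeholder text
for size in (128, 512, 1024):
    pieces = chunk_text(sample, chunk_size=size, overlap=size // 4)
    print(f"chunk_size={size}: {len(pieces)} chunks")
```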
Retrieval returns irrelevant results
Switch from plain similarity search to the MMR (Maximum Marginal Relevance) retriever, which still ranks by similarity but also diversifies the returned chunks:
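With a LangChain vector store, the switch is a change of `search_type`. The sketch below assumes `store` is the Chroma index from earlier; the `fetch_k` and `lambda_mult` values are illustrative starting points, and `make_mmr_retriever` is a hypothetical helper.

```python
def make_mmr_retriever(store, k: int = 3, fetch_k: int = 20, lambda_mult: float = 0.5):
    """Build a retriever that fetches `fetch_k` candidates, then keeps `k` diverse ones."""
    return store.as_retriever(
        search_type="mmr",
        search_kwargs={
            "k": k,                      # chunks passed on to the LLM
            "fetch_k": fetch_k,          # candidate pool size before re-ranking
            "lambda_mult": lambda_mult,  # 1.0 favors relevance, 0.0 favors diversity
        },
    )
```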
MMR prevents three near-identical chunks from crowding out other relevant sections.
The LLM ignores the context and hallucinates
Strengthen the system prompt. Replace "Answer using ONLY the context below" with explicit formatting instructions and a refusal directive. If the model still hallucinates, switch to one that follows instructions more reliably, such as Phi-4 or Qwen 2.5 Instruct.
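A stricter prompt might look like the sketch below. The exact rules and the refusal sentence are illustrative wording, not a tested recipe; a fixed refusal string is used so that a refusal can be detected in code.

```python
REFUSAL = "I cannot answer that from the provided documents."

STRICT_PROMPT = """You are a careful assistant answering from the provided context only.

Rules:
1. Use only facts stated in the context. Do not rely on prior knowledge.
2. After each claim, cite its source in [brackets].
3. If the context does not contain the answer, reply exactly: {refusal}

Context:
{context}

Question: {question}
Answer:"""


def build_strict_prompt(context: str, question: str) -> str:
    """Fill the strict template with the retrieved context and the user question."""
    return STRICT_PROMPT.format(context=context, question=question, refusal=REFUSAL)


def is_refusal(reply: str) -> bool:
    """True when the model declined using the agreed refusal sentence."""
    return reply.strip() == REFUSAL
```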
Memory pressure on 8 GB machines
Close other applications. Run a smaller model, for example ollama run phi4-mini, instead of the 8B model. Reduce chunk count by filtering low-value documents (boilerplate, headers, page numbers) during the load step.
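Filtering can be a simple heuristic applied before chunking. The sketch below drops very short fragments and lines that are only a page number; the 40-character threshold and the regular expression are assumptions to tune against your own documents, and each document is expected to expose a `page_content` attribute as LangChain documents do.

```python
import re

# Matches text such as "12", "Page 3", "3 / 10", or "page 3 of 10".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s*(of|/)\s*\d+)?\s*$", re.IGNORECASE)


def is_low_value(text: str, min_chars: int = 40) -> bool:
    """Heuristic: flag empty, very short, or page-number-only text."""
    stripped = text.strip()
    return len(stripped) < min_chars or bool(PAGE_NUMBER.match(stripped))


def filter_docs(docs, min_chars: int = 40):
    """Keep only documents whose page_content passes the heuristic."""
    return [doc for doc in docs if not is_low_value(doc.page_content, min_chars)]
```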