Local LLM RAG Setup on Consumer Hardware

Retrieval-Augmented Generation (RAG) lets your local LLM answer questions from your own documents — PDFs, notes, codebases, support tickets — without sending data to the cloud. This guide walks through a complete local RAG pipeline on a single Mac or PC, using open-source tools and models that fit in 8–32 GB of RAM.

Free · Last tested: 2026-06-28 · Audience: Developers, data teams

What Is Local RAG and Why Run It on Consumer Hardware?

RAG pipelines retrieve relevant document chunks before the LLM generates a response. Instead of training the model on your data, you feed it the most relevant context at query time. This gives accurate, citeable answers without expensive fine-tuning.

Running RAG locally means your documents and queries never leave your machine.

Consumer hardware (Mac Mini with 16 GB, a mid-range PC with 32 GB, or even a laptop) can handle document collections of hundreds to thousands of pages with sub-second retrieval times.

Pipeline Overview — Four Components

Every local RAG system has the same four building blocks:

| Component | Role | Local Options |
| --- | --- | --- |
| Document loader | Extract text from PDFs, Markdown, HTML, Word, etc. | Unstructured, pypdf, langchain_community loaders; LlamaParse (a hosted API, so documents leave the machine) |
| Embedding model | Convert text chunks into vector representations | BGE-small, all-MiniLM-L6-v2, nomic-embed-text |
| Vector store | Store and similarity-search embeddings | Chroma, FAISS, LanceDB |
| LLM | Generate answers from retrieved context | Llama 3.1 8B, Mistral 7B, Phi-4, Qwen 2.5 7B |

All four run on a single machine. The embedding model and vector store together use less than 2 GB of RAM. The LLM is the main consumer — choose a 7B or 8B parameter model for smooth performance on 16–32 GB systems.
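As a rough check on the RAM budget above, a model's footprint can be approximated from its parameter count and quantization width. The sketch below is a back-of-the-envelope estimate, not a measurement: the function name, the 4.5 bits-per-weight figure for 4-bit quantization, and the 1.5 GB allowance for KV cache and runtime are all assumptions chosen for illustration.

```python
def estimate_llm_ram_gb(params_billion: float,
                        bits_per_weight: float = 4.5,
                        overhead_gb: float = 1.5) -> float:
    """Approximate resident memory for a quantized LLM, in GB (10^9 bytes).

    bits_per_weight and overhead_gb are illustrative assumptions, not measured values.
    """
    # billions of parameters * bytes per parameter = gigabytes of weights
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)


for label, size in [("3.8B", 3.8), ("8B", 8.0), ("14B", 14.0)]:
    print(f"{label}: ~{estimate_llm_ram_gb(size)} GB")
```

Under these assumptions an 8B model lands near 6 GB, which is why it fits a 16 GB machine with room left for the operating system, the embedding model, and the vector store.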

Step 1 — Install the Foundation: Ollama + Embeddings

Start by installing Ollama, which serves both the LLM and the embedding model. It runs as a local API on port 11434.

    # Install Ollama (macOS / Linux)
    curl -fsSL https://ollama.com/install.sh | sh

    # Pull a compact LLM for RAG (tested on a 16 GB Mac Mini with fluent output)
    ollama pull llama3.1:8b

    # Pull the embedding model: nomic-embed-text is fast and small
    # (about 137M parameters, roughly a 274 MB download)
    ollama pull nomic-embed-text

    # Verify both models are ready
    ollama list

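If you have less than 16 GB of RAM, use phi4-mini (3.8B) instead of llama3.1:8b. At 4-bit quantization it needs roughly 2.5 GB and runs comfortably on 8 GB machines. qwen2.5:7b is another option, but at roughly 4.7 GB it is only slightly smaller than llama3.1:8b and leaves little headroom on 8 GB.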

Test that embeddings work:

    curl -s http://localhost:11434/api/embeddings \
      -d '{"model": "nomic-embed-text", "prompt": "What is RAG?"}' | \
      python3 -c "import json,sys; d=json.load(sys.stdin); print(f'Embedding dimension: {len(d[\"embedding\"])}')"

Expected output: Embedding dimension: 768. If Ollama isn't running, start it with ollama serve in a terminal.
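The same check can be made from Python with only the standard library. This is a minimal sketch that assumes the endpoint and response shape used in the curl command above (`/api/embeddings`, a `prompt` field in the request, an `embedding` list in the reply); the helper names are illustrative, not part of any Ollama client.

```python
import json
import urllib.request

# Same endpoint as the curl test; assumes Ollama is listening on its default port.
OLLAMA_URL = "http://localhost:11434/api/embeddings"


def build_embedding_request(text: str, model: str = "nomic-embed-text") -> urllib.request.Request:
    """Build the POST request for one embedding without sending it."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, model: str = "nomic-embed-text", timeout: float = 30.0) -> list[float]:
    """Send the request to a running Ollama server and return the embedding vector."""
    request = build_embedding_request(text, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["embedding"]
```

With the server running, `len(embed("What is RAG?"))` should report the same dimension as the curl test.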

Step 2 — Load and Chunk Your Documents

Documents must be split into chunks small enough that several of them fit in the LLM's context window alongside the question. A good default is 512 characters with a 128-character overlap: enough to preserve continuity across chunk boundaries without spending much context on repeated text.

    pip install langchain langchain-community chromadb pypdf

    # document_loader.py: load PDFs from a directory and split them into chunks
    from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Load every PDF in ./docs/ (PyPDFLoader returns one Document per page)
    loader = DirectoryLoader("./docs/", glob="*.pdf", loader_cls=PyPDFLoader)
    docs = loader.load()

    # Split on paragraph, line, and sentence boundaries before falling back to characters
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=128,
        separators=["\n\n", "\n", ".", " ", ""]
    )
    chunks = splitter.split_documents(docs)

    print(f"Loaded {len(docs)} pages → {len(chunks)} chunks")

The DirectoryLoader processes all PDFs in a folder. For Markdown or HTML files, use TextLoader or UnstructuredHTMLLoader instead. The recursive splitter preserves paragraph boundaries when possible — cleaner chunks produce better retrieval.
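To make the size and overlap parameters concrete, here is a standard-library sketch of plain sliding-window chunking. It is a simplification and not how RecursiveCharacterTextSplitter works internally: that splitter breaks on separators first, so its chunks usually end at paragraph or sentence boundaries, whereas these windows cut at fixed offsets. The function name is illustrative.

```python
def sliding_window_chunks(text: str, chunk_size: int = 512, overlap: int = 128) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # each chunk starts this many characters after the previous one
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "x" * 1200
print([len(piece) for piece in sliding_window_chunks(sample)])  # [512, 512, 432]
```

With the defaults, each chunk starts 384 characters after the previous one, so a quarter of every chunk repeats the tail of its predecessor.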

Memory note: A 500-page PDF generates roughly 2,000–3,000 chunks at 512 chars each. Storing embeddings for this collection takes about 50–80 MB in Chroma. Even a 10,000-page corpus fits within 2 GB of disk.
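The chunk count in the memory note can be sanity-checked with a little arithmetic. The sketch below assumes about 2,000 characters per page and float32 vectors of dimension 768; both are assumptions for illustration, and the result covers raw vectors only. A vector store also keeps the chunk text, metadata, and its search index on disk, so the real footprint is larger than this figure.

```python
import math


def estimate_index(pages: int, chars_per_page: int = 2000, chunk_size: int = 512,
                   overlap: int = 128, dim: int = 768) -> tuple[int, float]:
    """Return (chunk_count, raw_vector_megabytes) for a corpus of `pages` pages.

    chars_per_page is an assumed average; measure your own documents for a real figure.
    """
    total_chars = pages * chars_per_page
    stride = chunk_size - overlap  # characters of new text per chunk
    chunk_count = max(1, math.ceil((total_chars - overlap) / stride))
    vector_mb = chunk_count * dim * 4 / 1e6  # float32 = 4 bytes per dimension
    return chunk_count, round(vector_mb, 1)


for pages in (500, 10_000):
    count, megabytes = estimate_index(pages)
    print(f"{pages} pages -> {count} chunks, ~{megabytes} MB of raw vectors")
```

Under these assumptions a 500-page corpus yields about 2,600 chunks, which sits inside the 2,000–3,000 range quoted above.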

Step 3 — Store Embeddings in a Local Vector Database

Chroma is the simplest vector store for local setups — zero-config, persist-to-disk, and runs entirely in-process. No separate server process needed.

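    # embed_and_store.py
    from langchain_community.embeddings import OllamaEmbeddings
    from langchain_community.vectorstores import Chroma

    # Reuse the chunks built in Step 2 (assumes that script is saved as document_loader.py)
    from document_loader import chunks

    embeddings = OllamaEmbeddings(model="nomic-embed-text")

    # Embed every chunk and write the index to ./chroma_db/.
    # Chroma 0.4+ persists automatically, so no explicit persist() call is needed.
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory="./chroma_db/"
    )

    print(f"Stored {vectorstore._collection.count()} embeddings in chroma_db/")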

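The first run takes a few minutes, depending on document count, because the embedding model processes each chunk sequentially. Later sessions can skip this step: reopen the store with Chroma(persist_directory="./chroma_db/", embedding_function=embeddings), which loads the persisted index from disk. Calling Chroma.from_documents again would re-embed every chunk and add duplicates to the collection.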

Test the retrieval:

    results = vectorstore.similarity_search_with_score(
        "What is the training budget?", k=3
    )
    for doc, score in results:
        print(f"Score: {score:.3f} | {doc.page_content[:80]}...")

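similarity_search_with_score returns a distance, so lower scores mean closer matches. As a rule of thumb, scores below 0.5 indicate strong matches, though the scale depends on the embedding model and the store's distance metric. If results seem irrelevant, try a smaller chunk size (256 characters) or raise the overlap above the default 25% (128 of 512 characters).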
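To see what a score threshold means in terms of similarity, it helps to relate distance to cosine similarity. The sketch below assumes two things the guide does not state: that the store uses a squared Euclidean (L2) distance, which is Chroma's default, and that the embeddings are normalized to unit length. Under those assumptions, distance = 2 − 2 × cosine similarity. The vectors are made-up toy values.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def squared_l2(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distance_to_cosine(distance: float) -> float:
    """Convert a squared-L2 distance between unit vectors to cosine similarity."""
    return 1 - distance / 2


# Toy 3-dimensional vectors standing in for real 768-dimensional embeddings.
a = normalize([0.9, 0.1, 0.3])
b = normalize([0.8, 0.2, 0.4])
print(f"squared L2: {squared_l2(a, b):.4f}, 2 - 2*cos: {2 - 2 * cosine_similarity(a, b):.4f}")
```

Under these assumptions a distance of 0.5 corresponds to a cosine similarity of 0.75. If the vectors are not normalized or the collection uses another metric, the conversion does not apply and thresholds need to be calibrated on real queries.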

Step 4 — Wire the LLM to Answer with Retrieved Context

The final piece: feed retrieved chunks into the LLM prompt and let it answer with citations. A simple RAG prompt template:

    # rag_query.py
    from langchain_community.chat_models import ChatOllama
    from langchain_community.embeddings import OllamaEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain.prompts import ChatPromptTemplate
    from langchain.schema.runnable import RunnablePassthrough

    # Reopen the index persisted in Step 3
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    vectorstore = Chroma(persist_directory="./chroma_db/", embedding_function=embeddings)

    llm = ChatOllama(model="llama3.1:8b", temperature=0.1)

    template = """You are a precise assistant. Answer the question using ONLY the context below.
    If the context doesn't contain enough information, say "I don't have enough information to answer this."
    Cite the source document for each claim.

    Context:
    {context}

    Question: {question}

    Answer:"""

    prompt = ChatPromptTemplate.from_template(template)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

    def format_docs(docs):
        # Prefix each chunk with its source file so the LLM can cite it
        return "\n\n---\n\n".join(
            f"[Source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
            for d in docs
        )

    # Retrieve chunks, fill the prompt, then generate
    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
    )

    answer = rag_chain.invoke("What is the total project budget?")
    print(answer.content)

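Key parameters in this script: temperature=0.1 keeps the output close to deterministic, and k=3 sets how many chunks the retriever passes to the LLM.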

On a 16 GB Mac Mini with llama3.1:8b, end-to-end query time (retrieval + generation) is typically 3–8 seconds per question. The LLM step dominates — chunk retrieval completes in under 200 ms.
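To verify the split between retrieval and generation time on your own machine, each stage can be timed separately. This is a minimal sketch: `fake_retrieve` and `fake_generate` are hypothetical stand-ins so the example runs without a model, and would be replaced by calls into the real retriever and LLM.

```python
import time
from typing import Any, Callable


def timed(stage: Callable[..., Any], *args: Any) -> tuple[Any, float]:
    """Run one pipeline stage and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = stage(*args)
    return result, time.perf_counter() - start


# Hypothetical stand-ins for the retriever and the LLM call.
def fake_retrieve(question: str) -> list[str]:
    return ["chunk about budgets"]


def fake_generate(question: str, context: list[str]) -> str:
    return f"Answer based on {len(context)} chunk(s)."


question = "What is the total project budget?"
docs, t_retrieve = timed(fake_retrieve, question)
answer, t_generate = timed(fake_generate, question, docs)
print(f"retrieval: {t_retrieve * 1000:.1f} ms, generation: {t_generate * 1000:.1f} ms")
```

Timing the stages separately shows whether a slow query is worth fixing in the vector store or in the model choice.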

Performance Tuning — What Works on 8 GB, 16 GB, and 32 GB

Not all consumer hardware is equal. Here are tested configurations:

| Hardware | LLM | Query speed | Max document pages | Notes |
| --- | --- | --- | --- | --- |
| 8 GB Mac / PC | Phi-4-mini 3.8B (Q4_K_M) | 2–4 sec | ~500 | Use a light embedding model (all-MiniLM-L6-v2). Keep Ollama running in server mode. |
| 16 GB Mac Mini | Llama 3.1 8B (Q4_K_M) | 3–8 sec | ~3,000 | Sweet spot. nomic-embed-text works well. Serves 1–2 concurrent users. |
| 32 GB PC | Qwen 2.5 14B or Mistral NeMo 12B | 5–12 sec | ~10,000 | Can run a 14B model comfortably. Add sentence-transformers for bulk embedding. |

For batch processing (indexing thousands of pages), pre-compute embeddings offline. The embedding model processes ~50–100 chunks per second on CPU. Larger document collections benefit from a one-time batch run before the first query.
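The throughput figure above translates directly into an indexing-time estimate, which helps decide whether a batch run needs to be scheduled. The sketch below uses the quoted 50–100 chunks per second as its default rates; the helper names and the batch size of 64 are illustrative assumptions.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[str], batch_size: int = 64) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def indexing_time_range_s(chunk_count: int, slow_rate: float = 50.0,
                          fast_rate: float = 100.0) -> tuple[float, float]:
    """Return (best_case_seconds, worst_case_seconds) at the given chunks-per-second rates."""
    return chunk_count / fast_rate, chunk_count / slow_rate


best, worst = indexing_time_range_s(50_000)
print(f"50,000 chunks: {best / 60:.0f} to {worst / 60:.0f} minutes")
print([len(batch) for batch in batched(list("abcdefg"), batch_size=3)])
```

Processing in batches also makes it easy to checkpoint progress, so an interrupted run does not have to start over.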

Common Pitfalls and How to Avoid Them

Chunks are too large or too small

A chunk of 1,024 characters may contain two unrelated topics, confusing the LLM. A chunk of 128 characters loses context. Stick to 512 with 128 overlap — tested across PDFs, Markdown, and HTML.

Retrieval returns irrelevant results

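When the top results are near-duplicates of each other, other useful passages get pushed out. Switch from plain similarity search to the MMR (Maximal Marginal Relevance) retriever, which diversifies the returned chunks: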

    retriever = vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={"k": 3, "fetch_k": 10, "lambda_mult": 0.5}
    )

MMR prevents three near-identical chunks from crowding out other relevant sections.
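The selection rule behind MMR is short enough to sketch with the standard library. Each step picks the candidate that maximizes lambda_mult × (similarity to the query) − (1 − lambda_mult) × (highest similarity to anything already selected). This is an illustrative implementation, not LangChain's code, and it skips the fetch_k pre-filtering step. The 2-D vectors are toy data built from angles.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec: list[float], doc_vecs: list[list[float]],
               k: int = 3, lambda_mult: float = 0.5) -> list[int]:
    """Return the indices of k documents chosen by Maximal Marginal Relevance."""
    candidates = list(range(len(doc_vecs)))
    selected: list[int] = []
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize similarity to the closest already-selected document
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


def unit(degrees: float) -> list[float]:
    """Toy 2-D unit vector at the given angle."""
    r = math.radians(degrees)
    return [math.cos(r), math.sin(r)]


query = unit(0)
docs = [unit(10), unit(12), unit(14), unit(-40)]  # three near-duplicates and one distinct chunk
plain_top3 = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)[:3]
print("plain similarity:", plain_top3)   # [0, 1, 2]
print("MMR:", mmr_select(query, docs))   # [0, 3, 1]
```

Plain similarity returns the three near-duplicates, while MMR swaps one of them for the distinct chunk. Raising lambda_mult toward 1.0 shifts the balance back toward pure relevance.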

The LLM ignores the context and hallucinates

Strengthen the system prompt. Expand the single "ONLY the context below" instruction into explicit rules covering output format, citations, and refusal. If the model still hallucinates, switch to one that follows instructions more reliably, such as Phi-4 or Qwen 2.5 Instruct.
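One way to tighten the prompt is to turn the instructions into numbered rules and fence the context with explicit delimiters. The template below is a sketch rather than a tested recipe: the rule wording and the `<context>` tags are assumptions, while the refusal sentence is the one used in the Step 4 template.

```python
STRICT_TEMPLATE = """You are a precise assistant.

Rules:
1. Use ONLY the text between the <context> tags.
2. After every claim, cite its source in the form [Source: file name].
3. If the context does not contain the answer, reply exactly:
   "I don't have enough information to answer this."
4. Do not use prior knowledge and do not guess.

<context>
{context}
</context>

Question: {question}
Answer:"""


def build_prompt(context: str, question: str) -> str:
    """Fill the strict template with retrieved context and the user's question."""
    return STRICT_TEMPLATE.format(context=context, question=question)


print(build_prompt("[Source: budget.pdf]\nThe budget is 10.", "What is the budget?"))
```

Because the template keeps the same {context} and {question} placeholders, it can stand in for the Step 4 template without other changes to the chain.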

Memory pressure on 8 GB machines

Close other applications. Run the 3.8B model (ollama run phi4-mini) instead of the 8B model. Reduce the chunk count by filtering out low-value content (boilerplate, headers, page numbers) during the load step.
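Filtering can be as simple as dropping chunks that are page numbers or contain too little text. The heuristic below is a sketch: the regular expression and the 30-letter threshold are assumptions that should be tuned against your own documents.

```python
import re

# Matches strings such as "17", "Page 3", "Page 3 of 12", or "3/12".
PAGE_NUMBER = re.compile(r"^(page\s+)?\d+(\s*(of|/)\s*\d+)?$", re.IGNORECASE)


def is_low_value(text: str, min_letters: int = 30) -> bool:
    """Flag chunks that are page numbers or contain very few alphabetic characters."""
    stripped = text.strip()
    if PAGE_NUMBER.match(stripped):
        return True
    letters = sum(ch.isalpha() for ch in stripped)
    return letters < min_letters


def filter_chunks(texts: list[str]) -> list[str]:
    """Keep only the chunks worth embedding."""
    return [t for t in texts if not is_low_value(t)]


samples = [
    "Page 3 of 12",
    "   17  ",
    "The training budget for 2025 is allocated across three departments.",
    "----- CONFIDENTIAL -----",
]
print(filter_chunks(samples))
```

With LangChain documents, the same check can be applied to each chunk's page_content before the chunks are passed to the vector store.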