Build a Local LLM RAG Pipeline for Document Analysis Without Cloud APIs
A fully local RAG pipeline built with Ollama and ChromaDB can ingest PDFs, markdown files, and code repositories — and let you query them with natural language without sending a single byte to the cloud. Here's exactly how we built and tested one.
Why Build a Local RAG Pipeline?
Retrieval-Augmented Generation (RAG) lets you attach your own documents to an LLM's context window — ask questions about internal docs, research papers, or codebases and get answers grounded in your actual data rather than the model's training cutoff.
Cloud-based RAG (OpenAI + Pinecone, Claude + LangChain) works well but has three problems for anyone handling sensitive or proprietary information:
- Data leaves your machine. Even with enterprise agreements, document contents are transmitted to third-party APIs.
- Recurring costs. Every query costs tokens + vector search fees, and a serious document library adds up fast.
- Latency and rate limits. Cloud APIs introduce network hops, and heavy query volumes hit per-minute caps.
A local RAG pipeline solves all three. With Ollama running models like Llama 3 or Mistral and ChromaDB storing embeddings on disk, your documents never leave your machine and every query costs electricity only.
If you already have a local LLM deployment running from our earlier budget deployment guide, adding RAG is the natural next step. If you're setting up a multi-user environment, see our small teams deployment guide for the infrastructure layer.
What You'll Need
The entire pipeline runs on commodity hardware. We tested on a 2023 Mac Mini with 16 GB RAM — the same spec from our budget deployment guide.
| Component | Choice | Why |
|---|---|---|
| LLM | Ollama + Llama 3.1 8B (Q4_K_M) | Fast on CPU/limited GPU, good English comprehension for retrieval tasks |
| Embedding model | nomic-embed-text (via Ollama) | Works natively with Ollama, no separate API setup. 768-dim vectors |
| Vector database | ChromaDB (persistent client) | Zero-config, stores on disk, Python-native. Perfect for single-machine setups |
| Document loader | langchain-community + PyMuPDF | Reads PDF, Markdown, plain text, and code files with metadata extraction |
| Text splitter | RecursiveCharacterTextSplitter | Chunks by paragraph boundaries with 500-token overlap for context continuity |
Recommended minimum: 8 GB RAM. With 16 GB you can run a 7-8B model alongside ChromaDB and a document set of ~10,000 pages. With less than 8 GB, drop to Mistral 7B Q3_K_M or a 3B model for the LLM step.
Step 1: Install the Stack
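We assume Ollama is already installed. If not, install it by following the official instructions for your platform, then pull the two models from the table above (Llama 3.1 8B and nomic-embed-text).
Then install the Python dependencies for the RAG pipeline with pip: the packages named in the table above (chromadb, langchain-community, and PyMuPDF), plus sentence-transformers.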
The sentence-transformers package is what LangChain's Hugging Face embedding classes rely on. If all embeddings come from nomic-embed-text through Ollama, as in this guide, the pipeline should work without it, so treat it as optional rather than required.
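Before moving on, it can help to confirm the stack is importable. The following is a minimal sketch, not part of the original pipeline: the module list is an assumption based on the components named above (`fitz` is the import name PyMuPDF installs), and the helper names are hypothetical.

```python
import importlib.util
import shutil

# Assumed import names for the packages this guide mentions.
# "fitz" is the module that the PyMuPDF package provides.
REQUIRED_MODULES = ["chromadb", "langchain_community", "fitz", "sentence_transformers"]


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def ollama_on_path():
    """Return True if an `ollama` executable is visible on PATH."""
    return shutil.which("ollama") is not None


if __name__ == "__main__":
    print("missing Python modules:", missing_modules(REQUIRED_MODULES) or "none")
    print("ollama CLI found:", ollama_on_path())
```

An empty "missing" list and a found CLI only show that the pieces are installed; they do not confirm that the models have been pulled.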
Step 2: Ingest Documents
The ingestion script walks a directory, loads supported files, splits them into chunks, and stores embeddings in ChromaDB. Here's the core logic:
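A minimal sketch of that flow, under stated assumptions: it calls Ollama's `/api/embeddings` endpoint directly instead of going through LangChain, and it swaps RecursiveCharacterTextSplitter for a simple fixed-size character window. The chunk size, overlap, file suffix list, and all function names are illustrative placeholders, not the values used in our tests.

```python
import json
import urllib.request
from pathlib import Path

OLLAMA_URL = "http://localhost:11434"   # Ollama's default local address
EMBED_MODEL = "nomic-embed-text"
SUPPORTED = {".pdf", ".md", ".txt", ".py"}   # assumption: extend as needed


def split_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping character windows (sizes are placeholders)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        piece = text[start:start + chunk_size]
        if piece.strip():                # skip empty or whitespace-only windows
            chunks.append(piece)
    return chunks


def load_text(path):
    """Read one file; PDFs go through PyMuPDF, which is imported lazily."""
    if path.suffix.lower() == ".pdf":
        import fitz  # PyMuPDF
        with fitz.open(str(path)) as doc:
            return "\n".join(page.get_text() for page in doc)
    return path.read_text(encoding="utf-8", errors="ignore")


def embed(text):
    """Request one embedding vector from the local Ollama server."""
    body = json.dumps({"model": EMBED_MODEL, "prompt": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]


def ingest(root, collection, embed_fn=embed):
    """Walk root, chunk every supported file, and add the chunks to collection."""
    total = 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            continue
        chunks = split_text(load_text(path))
        if not chunks:
            continue
        collection.add(
            ids=[f"{path}:{i}" for i in range(len(chunks))],
            documents=chunks,
            embeddings=[embed_fn(chunk) for chunk in chunks],
            metadatas=[{"source": str(path), "chunk": i} for i in range(len(chunks))],
        )
        total += len(chunks)
    return total

# Usage, assuming chromadb is installed and Ollama is running:
#   import chromadb
#   client = chromadb.PersistentClient(path="./chroma_db")
#   collection = client.get_or_create_collection("docs")
#   print(ingest("./documents", collection))
```

Passing `embed_fn` as a parameter keeps the walk-and-chunk logic testable without a running Ollama server.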
Performance note: On a Mac Mini M2 (16 GB), ingesting a 200-page PDF takes roughly 40 seconds for extraction and 90 seconds for embedding. The nomic-embed-text model runs entirely on CPU — no GPU acceleration needed for this step.
Step 3: Query with RAG
Once documents are ingested, querying is a two-step pipeline: retrieve the most relevant chunks, then feed them to the LLM as context.
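The two steps can be sketched as follows. This is an illustration under assumptions rather than the exact script we ran: it talks to Ollama's REST endpoints (`/api/embeddings` and `/api/generate`) directly, the `llama3.1:8b` tag and the prompt wording are our choices for the example, and `build_prompt` and `answer` are hypothetical helper names. It expects a ChromaDB collection whose chunks carry a `source` metadata field.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"   # Ollama's default local address
EMBED_MODEL = "nomic-embed-text"
CHAT_MODEL = "llama3.1:8b"              # assumed tag for Llama 3.1 8B


def _post(path, payload):
    """POST a JSON payload to the local Ollama server and decode the reply."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def embed(text):
    """Embed the query with the same model used at ingestion time."""
    return _post("/api/embeddings", {"model": EMBED_MODEL, "prompt": text})["embedding"]


def generate(prompt):
    """Run one non-streaming completion and return the generated text."""
    reply = _post("/api/generate", {"model": CHAT_MODEL, "prompt": prompt, "stream": False})
    return reply["response"]


def build_prompt(question, chunks, sources):
    """Number each retrieved chunk so the model can cite it."""
    context = "\n\n".join(
        f"[{i + 1}] ({src})\n{chunk}"
        for i, (chunk, src) in enumerate(zip(chunks, sources))
    )
    return (
        "Answer the question using only the numbered context below. "
        "Cite sources by number. If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer(question, collection, k=4, embed_fn=embed, generate_fn=generate):
    """Step 1: retrieve the k nearest chunks. Step 2: ask the LLM with them as context."""
    hits = collection.query(query_embeddings=[embed_fn(question)], n_results=k)
    chunks = hits["documents"][0]
    sources = [(meta or {}).get("source", "unknown") for meta in hits["metadatas"][0]]
    return generate_fn(build_prompt(question, chunks, sources)), sources
```

Embedding the query with the same model that embedded the documents matters: vectors from different models are not comparable.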
The k=4 parameter retrieves the four most semantically similar chunks. For narrow factual questions, k=2 is faster and sufficient. For broad research queries, k=6 improves recall at the cost of context window.
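One way to keep a larger k from crowding the context window is to cap the retrieved text at a fixed budget. The sketch below is an assumption-laden illustration: it uses character count as a rough stand-in for tokens, and `fit_to_budget` is a hypothetical helper, not part of ChromaDB or Ollama.

```python
def fit_to_budget(chunks, max_chars):
    """Keep chunks in ranked order until the next one would exceed max_chars."""
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break                      # stop at the first chunk that does not fit
        kept.append(chunk)
        used += len(chunk)
    return kept
```

Because retrieval results arrive sorted by similarity, truncating from the end drops the least relevant chunks first.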
Tested result: On a 400-page compliance manual (3.2 MB PDF), query "What are the penalties for unauthorized data access?" returned an accurate answer with citations in 6.2 seconds — all local, all without internet.
Production Considerations
Moving from a single-user script to a shared pipeline for your team requires a few additions:
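- Watch the chunk count. 10,000 chunks × 768-dim float32 vectors is about 31 MB of raw vector data (10,000 × 768 × 4 bytes), and the on-disk footprint grows further once the index, chunk text, and metadata are stored alongside it. ChromaDB handles this fine, but if you exceed 100,000 chunks, consider SQLite-backed storage or migrating to LanceDB.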
- Re-ingestion strategy. ChromaDB doesn't deduplicate. Re-running the ingest script adds duplicates. Either wrap it in a dedup check (hash each document's content) or clear and rebuild on document updates.
- Cold start. The first query after a reboot takes 15-30 seconds while Ollama loads the model into memory. Subsequent queries skip that load and pay only for retrieval and generation. Keep Ollama running as a background service to avoid this.
- Model selection. Llama 3.1 8B is a good balance of speed and quality for English document analysis. For code-heavy document sets, DeepSeek Coder V2 (via Ollama) produces more accurate answers on technical content. Retrieval quality itself depends on the embedding model, not on the LLM.
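The dedup check from the re-ingestion point can be sketched as below. This is one possible approach, not the only one: it hashes each chunk rather than each whole document, uses the hash as the ChromaDB ID, and skips IDs the collection already holds. `chunk_id` and `add_new_chunks` are hypothetical names, and `embed_fn` stands for whatever embedding call your ingest script uses.

```python
import hashlib


def chunk_id(text):
    """Derive a stable ID from chunk content, so identical text maps to one ID."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def add_new_chunks(collection, chunks, embed_fn, source):
    """Add only chunks whose content hash is not yet stored; return how many were added."""
    unique = {chunk_id(text): text for text in chunks}   # collapse repeats in this batch
    if not unique:
        return 0
    # Collection.get returns only the IDs that exist, so this is the "already stored" set.
    existing = set(collection.get(ids=list(unique))["ids"])
    new = {cid: text for cid, text in unique.items() if cid not in existing}
    if new:
        collection.add(
            ids=list(new),
            documents=list(new.values()),
            embeddings=[embed_fn(text) for text in new.values()],
            metadatas=[{"source": source} for _ in new],
        )
    return len(new)
```

Content-derived IDs also mean unchanged chunks are never re-embedded, which is where most of the ingestion time goes.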
If you're deploying this for a team of 3-10 users, our small teams deployment guide covers Open WebUI integration — it has built-in RAG support through its document upload feature, though it uses a simpler retrieval mechanism than a custom ChromaDB pipeline.
When Not to Use Local RAG
Local RAG isn't the right answer for every document analysis task. Here's where the tradeoffs cut against it:
- Multilingual retrieval. Nomic-embed-text and most local embedding models are English-optimized. For Chinese, Japanese, or Arabic documents, cloud embedding APIs (text-embedding-3-large) still outperform local alternatives by 15-25% on recall.
- Real-time document updates. Re-indexing every file on change requires a file watcher + incremental embedding pipeline. ChromaDB can add individual documents after initial indexing, but there's no built-in change detection.
- Very large corpora (>50 GB). Once your document collection exceeds about 500,000 chunks, local ChromaDB query latency increases noticeably (2-4 seconds per retrieval). At that scale, a cloud vector database with GPU indexing makes more sense.
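If you do want incremental updates locally, the change detection that ChromaDB lacks can be approximated with a hash manifest. This is a minimal sketch under assumptions: it polls the directory instead of using a file watcher, and `scan` and `diff_manifest` are hypothetical helpers.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Hash a file's bytes so content changes are detected even if mtime is unreliable."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def scan(root):
    """Map every file under root to its content hash."""
    return {str(p): file_digest(p) for p in sorted(Path(root).rglob("*")) if p.is_file()}


def diff_manifest(old, new):
    """Compare two manifests and return (added, changed, removed) path lists."""
    added = sorted(p for p in new if p not in old)
    changed = sorted(p for p in new if p in old and old[p] != new[p])
    removed = sorted(p for p in old if p not in new)
    return added, changed, removed
```

Store the manifest as JSON between runs. For each changed or removed path, delete its old chunks (for example with `collection.delete(where={"source": path})`, assuming chunks carry a `source` metadata field), then re-ingest the added and changed files.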