Building a RAG System: There's No Recipe, But Here's a Map

Andros Fenollosa's "From Zero to a RAG System: Successes and Failures" is worth your time if you've ever faced the problem of making an LLM answer questions about your documents - not the web, not its training data, but a specific corpus you own and control.

That problem is what RAG - Retrieval-Augmented Generation - actually solves, and it's worth being precise about what that means before diving in. When you ask a standard LLM a question, it answers from its training weights: everything it learned during training, frozen at the cutoff date, with no access to your data. RAG changes that by inserting a retrieval step: before the model generates a response, the system searches your document corpus, pulls the most relevant chunks, and hands them to the model along with the question. The model then generates its answer grounded in those documents rather than guessing from weights. This is also why fine-tuning isn't the same thing - training the model on your data still doesn't let it cite the source document that grounded a specific response, and it doesn't eliminate hallucination. RAG does neither perfectly, but it gets you closer to both.
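That retrieve-then-generate loop is small enough to sketch. Everything below is illustrative: `embed` is a toy bag-of-words stand-in for a real embedding model, and the function names are invented for this sketch, not taken from any library.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # The retrieval step: rank corpus chunks against the query, keep the top k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Ground the model in retrieved text instead of its training weights.
    context = "\n---\n".join(retrieve(query, chunks))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")
```

A production system swaps the toy `embed` for a real embedding model and the sorted list for a vector store, but the shape of the loop is the same.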

It's also why large context windows don't make RAG obsolete, despite the argument you'll hear. You might be able to stuff something like To Kill a Mockingbird or The Silmarillion into a modern context window. You cannot stuff 451GB of proprietary engineering documents into one. Not yet, at least. At that scale, retrieval is still the architecture.

The Problem Fenollosa Was Solving

The requirement: build an internal chat tool that could answer natural-language questions against nearly a decade's worth of company project files - 1TB of unstructured documents on Azure, spanning PDFs, technical reports, simulation outputs, corrupt spreadsheets, and everything in between, with no meaningful organization beyond folder hierarchy. Confidentiality ruled out external APIs and any external access; the entire stack had to run locally.

He'd never built a RAG system before, and he says so upfront. That honesty is part of what makes the article interesting. It's not a tutorial built backward from a working system - it's a record of what he actually experienced.

What Broke, and What Fixed It

  • Memory. The first attempt - naive full-load processing with LlamaIndex - exhausted RAM within minutes. The fix was aggressive file filtering before the indexer touched anything: video, images, executables, simulation outputs, backups, and email files were all candidates for exclusion. He says the file count dropped by 54%. Your indexer doesn't know what's junk. You have to tell it first. That means knowing what's junk.
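The filtering step is mundane code, which is part of the point: it's a judgment call encoded as an exclusion list. A minimal sketch (the suffix list below is hypothetical, standing in for the categories the article names; the real list only comes from knowing your corpus):

```python
from pathlib import Path

# Hypothetical exclusion list mirroring the article's categories:
# video, images, executables, backups, email files.
EXCLUDE_SUFFIXES = {".mp4", ".avi", ".png", ".jpg", ".exe", ".dll",
                    ".bak", ".pst", ".msg"}

def indexable_files(root: str) -> list[Path]:
    """Walk the tree and keep only files worth handing to the indexer."""
    return [p for p in Path(root).rglob("*")
            if p.is_file() and p.suffix.lower() not in EXCLUDE_SUFFIXES]
```

Running this before indexing, rather than letting the pipeline discover junk mid-run, is what keeps memory flat.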

  • Scale. JSON-backed indexing didn't survive 451GB; every restart meant reprocessing from scratch. The fix was ChromaDB (good!) as a dedicated vector store on top of SQLite (...what?), with batches of 150 files, explicit garbage collection between batches, and checkpointing so interruptions don't cost days of progress. The right data structure for "trying this out" is not the right data structure for production.
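The batch-plus-checkpoint pattern is generic enough to sketch without ChromaDB specifics. Here `index_batch` is a hypothetical callable standing in for whatever writes embeddings into the store, and the checkpoint format is an assumption of this sketch, not the article's:

```python
import gc
import json
from pathlib import Path

BATCH_SIZE = 150  # the batch size reported in the article
CHECKPOINT = Path("checkpoint.json")

def load_done() -> set[str]:
    # Files already indexed in a previous (possibly interrupted) run.
    return set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()

def index_in_batches(files: list[str], index_batch) -> None:
    """index_batch(batch) is a stand-in for the call that embeds and
    stores a batch of files (e.g. a vector-store add); hypothetical."""
    done = load_done()
    todo = [f for f in files if f not in done]
    for i in range(0, len(todo), BATCH_SIZE):
        batch = todo[i:i + BATCH_SIZE]
        index_batch(batch)
        done.update(batch)
        # Persist progress after every batch so a crash costs minutes, not days.
        CHECKPOINT.write_text(json.dumps(sorted(done)))
        gc.collect()  # don't trust the allocator to release memory between batches
```

Restarting after an interruption is then just calling `index_in_batches` again with the same file list: anything already checkpointed is skipped.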

  • Hardware. GPU time was the real bottleneck - 500MB of documents takes 4–5 hours to embed on CPU. He rented an NVIDIA RTX 4000 SFF Ada on Hetzner for €184 total and ran the full indexing job over two to three weeks, ending with 738,470 vectors in a 54GB index. This is a wall most RAG implementers are going to hit.
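The arithmetic behind the GPU rental is worth spelling out, using only the article's own figures:

```python
# Back-of-envelope from the article's numbers: 500MB of documents takes
# 4-5 hours to embed on CPU; the GPU run finished in two to three weeks.
corpus_gb = 451
cpu_hours_per_gb = 4.5 / 0.5               # ~9 CPU-hours per GB (midpoint of 4-5h)
cpu_total_days = corpus_gb * cpu_hours_per_gb / 24
gpu_total_days = 2.5 * 7                   # midpoint of the reported GPU run

print(f"CPU estimate: ~{cpu_total_days:.0f} days")   # ~169 days
print(f"GPU actual:   ~{gpu_total_days:.0f} days, roughly "
      f"{cpu_total_days / gpu_total_days:.0f}x faster")
```

On those numbers, CPU-only indexing would have taken roughly half a year; €184 of GPU time collapsed that by about an order of magnitude.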

  • Disk. The production server had 100GB - not enough for the original documents alongside everything else. The fix here seems obvious: keep the vector index locally, and serve the original documents on demand from Azure Blob Storage via SAS token links. The index and the source documents don't have to live together.

The Lessons That Transfer

Regardless of your stack, these hold:

  • Filter aggressively before indexing. Your pipeline will process whatever you give it. Repeat: your pipeline will process whatever you give it.
  • Batch with explicit memory management between batches. Don't trust the memory manager, or the database transaction manager, at scale. This is good data science in any event.
  • Use a real vector database once your corpus gets serious. JSON-backed indexes are for prototypes.
  • Build in per-file error tolerance. One corrupt document shouldn't abort a batch, unless the presumption is that no corrupt documents are in your dataset.
  • Checkpoint everything. Multi-day runs will be... hold on, someone called, I lost my train of thought. Oh! Interrupted. Runs will be interrupted.
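The error-tolerance point deserves code, because the naive loop gets it wrong by default: one unhandled exception kills the batch. A sketch, where `process_file` is a hypothetical callable that parses and embeds a single file:

```python
import logging

logger = logging.getLogger("indexer")

def process_batch(batch: list[str], process_file) -> list[str]:
    """Index each file independently so one corrupt document
    can't abort the whole batch. process_file is hypothetical."""
    failures = []
    for path in batch:
        try:
            process_file(path)
        except Exception:
            # Log with traceback and move on; revisit failures later.
            logger.exception("skipping unreadable file: %s", path)
            failures.append(path)
    return failures
```

Returning the failure list (instead of swallowing it) is what turns "error tolerance" into a triage queue rather than silent data loss.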

The Gap the Article Leaves Open

The article doesn't address chunking strategy - how documents are split before embedding - and that's a significant omission, because chunking is often where retrieval quality is actually won or lost. A practitioner in a Hacker News discussion independently named chunking as their hardest problem on a similar build. Chunk too large and you blur the semantic signal; chunk too small and you lose the context that makes a passage meaningful when retrieved in isolation. It's highly corpus-dependent and there's no universal answer.
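Even the naive baseline makes the tradeoff concrete: a fixed-size chunker with overlap, where `size` and `overlap` are exactly the corpus-dependent knobs for which there's no universal answer. A sketch:

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Naive fixed-size chunker with overlap -- a baseline, not a
    recommendation. Larger size blurs the semantic signal; smaller
    size strips the surrounding context a chunk needs to be useful."""
    words = text.split()
    if len(words) <= size:
        return [" ".join(words)]
    step = size - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + size]))
        if i + size >= len(words):
            break
    return chunks
```

The overlap exists so a sentence straddling a boundary appears whole in at least one chunk; it buys retrieval quality at the cost of index size.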

Anthropic's Contextual Retrieval post is the best single reference on why chunking matters and on one concrete approach: prepending chunk-specific context before embedding, combined with hybrid BM25/semantic search, reduced failed retrievals by 49% in their testing - 67% with reranking added.
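The hybrid-search half of that approach needs a way to merge two ranked lists whose raw scores aren't comparable. Reciprocal rank fusion is one standard way to do it; the sketch below is a generic illustration, not Anthropic's implementation:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge ranked result lists (e.g. one from
    BM25, one from embedding search) by summing 1/(k + rank) per document,
    so lists with incompatible raw scores can still be combined."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: rrf_fuse([bm25_results, vector_results]) -> fused ranking
```

The `k=60` constant dampens how much any single list's top ranks dominate; 60 is a conventional default, and like chunk size, it's a knob rather than a law.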

There's no complete recipe for RAG - the right answers depend too heavily on corpus size, format chaos, confidentiality constraints, and what "fast" means to your users. A RAG recipe is like a suit pattern: it exists, but it has to be cut for the body wearing it. What we have instead are practitioner records. Fenollosa did the work, documented it honestly, and that's worth more than most "comprehensive guides" that treat every decision as obvious.
