Steve Jobs told Stanford's class of 2005 something that's stuck with me: "You can't connect the dots looking forward; you can only connect them looking backwards."
He was talking about dropping out of college and auditing a calligraphy class, a seemingly useless decision that later became the foundation for the Mac's beautiful typography.
I had my own dot-connecting moment this week. The unlikely connection? A hobby project building an auto-trading algorithm six months ago just solved a critical problem in my legal RAG system.
Here's the story.
The Problem: High-Stakes Legal Questions, Zero Tolerance for Error
A property management company came to me with a familiar pain point: their team was drowning in lease document questions.
Some questions are straightforward—when does this lease expire? But others require real interpretation, and finding the answer takes significant time:
- What's the CAM breakdown for this specific property?
- Can the tenant sublease under these terms?
- Who's responsible for roof repairs when an HVAC vendor caused the damage?
These questions require precise answers. One wrong interpretation could cost tens of thousands of dollars—or worse, a lawsuit.
The obvious solution? Build an AI-powered document retrieval system. Feed it the leases, let people ask questions in plain English, get accurate answers instantly.
Simple in theory. Harder than it looks.
Why Existing Solutions Fall Short
Solutions already exist for this problem. Gemini has RAG agents. SharePoint has built-in document intelligence. ChatGPT lets you drop the relevant file in manually—as long as it doesn't exceed the PDF size limit.
But here's where those generic systems fall short: they're not optimized for messy real-world documents, they require you to supply the right document yourself, and they aren't context-aware.
Every commercial lease has been signed and scanned at some point. Many have handwritten amendments, initialed changes, and annotated clauses. The pages are slightly crooked. The scan quality varies. Generic OCR—the technology most systems use to extract text from PDFs—struggles with anything that isn't crisp, typed text.
If your text extraction is garbage, your retrieval is garbage. Hallucinations become inevitable.
The first step in building a reliable system wasn't AI at all. It was spending days on document processing—cleaning scanned PDFs, handling handwritten sections, and producing structured, accurate text files. Without clean extraction, everything downstream falls apart.
The Chip Index: Solving the Context Problem
Once I had clean text, I faced the next challenge: chunking.
Here's how RAG systems work at a basic level: you take a document, break it into smaller pieces (chunks), and convert each chunk into a mathematical representation called an embedding. When a user asks a question, you convert their question into the same type of embedding and find the chunks with the closest mathematical "meaning."
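The chunk-embed-retrieve flow above can be sketched in a few lines. The `embed` function here is a toy bag-of-characters stand-in for a real embedding model (in practice you'd call an embedding API), but the shape of the pipeline—chunk, embed, compare by cosine similarity—is the same:

```python
import math

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model (normally an API call).
    # Toy bag-of-characters vector, just to make the sketch runnable.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity: closer to 1.0 means closer in "meaning".
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def chunk(document: str, size: int = 200) -> list[str]:
    # Naive fixed-size chunking; real systems split on clause/section boundaries.
    return [document[i : i + size] for i in range(0, len(document), size)]

def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    # Embed the question, rank chunks by similarity, return the top k.
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

Swap in a real embedding model and a vector database and this is, structurally, most of a basic RAG retriever.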
The problem? Chunks embedded in isolation lose context.
Think about it. A chunk from page 47 of a lease might say: "The tenant shall be responsible for all repairs to HVAC equipment excluding damage caused by third-party vendors."
Great information. But which lease? Which property? Which tenant? When a property management company has 100 leases, that isolated chunk is nearly useless.
The traditional solution is metadata—tagging each chunk with information like property name, tenant, and lease dates. But that requires reliable metadata already baked into the files or manual tagging by users. Neither was realistic for this use case.
So I designed what I call a chip index.
Instead of relying on external metadata, I embed key identifiers directly into the beginning of each chunk. Before that HVAC clause, the chunk now reads: "Property: Riverside Plaza | Tenant: ABC Corp | Lease Start: 2022-01-15 | Lease End: 2027-01-14 | The tenant shall be responsible for..."
Every chunk carries its own context. The embedding model captures not just the meaning of the clause, but the identity of the lease it belongs to.
This architecture accepts files as they come—no renaming documents, no reliance on existing metadata. Upload a lease, and the system extracts the chips automatically; every chunk becomes self-identifying.
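The chip-prefixing step itself is simple once the identifiers are extracted. A minimal sketch (the field names mirror the HVAC example above; the extraction step that produces the `chips` dict is omitted and would be its own pipeline):

```python
def format_chips(chips: dict[str, str]) -> str:
    # Render extracted identifiers as a chip prefix, e.g.
    # "Property: Riverside Plaza | Tenant: ABC Corp | ..."
    return " | ".join(f"{key}: {value}" for key, value in chips.items())

def chip_chunks(chunks: list[str], chips: dict[str, str]) -> list[str]:
    # Prepend the same chip prefix to every chunk from one lease,
    # so each chunk carries its own identity into the embedding.
    prefix = format_chips(chips)
    return [f"{prefix} | {text}" for text in chunks]

chips = {
    "Property": "Riverside Plaza",
    "Tenant": "ABC Corp",
    "Lease Start": "2022-01-15",
    "Lease End": "2027-01-14",
}
clauses = ["The tenant shall be responsible for all repairs to HVAC equipment..."]
print(chip_chunks(clauses, chips)[0])
```

The chipped chunks—not the raw ones—are what get embedded, so the lease's identity is baked into the vector itself.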
The Breakthrough That Came From Nowhere
With clean text extraction and chip indexing in place, I ran my initial tests.
The results were... inconsistent. Sometimes perfect. Sometimes wildly off. Not good enough for high-stakes legal questions.
I started thinking about what else I could tune. Chunk length? Embedding model? Number of chunks returned? Minimum similarity threshold?
Then I remembered something completely unrelated.
Six months earlier, I'd built an auto-trading algorithm as a hobby project. Nothing serious—just experimenting with parameters that would determine when to buy or sell based on market signals.
The testing process was painful. I'd pick a set of parameters, run it on paper accounts for a few days, and try to guess if it was actually working. There was no way to validate anything at scale.
Then it clicked: I had access to historical market data via APIs. Instead of waiting days to test one configuration, I could run my algorithm against years of historical data in minutes.
And then the bigger insight: what if I didn't just test my guessed parameters, but every possible combination of parameters?
I turned the hardcoded values into variables, built a stepping system to iterate through every variation, and let it run. I'd accidentally stumbled into building an optimization algorithm—a way to systematically discover the best configuration instead of guessing.
At the time, it felt like a fun distraction. The project never made me any money.
But sitting there, staring at my inconsistent RAG results, the parallel hit me.
Building Validation Data for AI
The auto-trading optimization worked because I had a clear success metric: returns on historical trades. I could run 10,000 configurations and objectively measure which one performed best.
For the RAG system, I needed the same thing: ground truth data to validate against.
I built a test dataset. First, I manually created 20 question-answer pairs from the 100 leases. Real questions a property manager might ask, with verified correct answers pulled directly from the source documents.
Then I used AI to generate 80 more question-answer pairs scattered throughout the dataset—varied enough to stress-test the system, but validated against the actual lease content.
100 questions. 100 known-correct answers. Now I had something to optimize against.
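Scoring a configuration against that ground truth can be sketched as a simple loop. Here `answer_fn` is a hypothetical stand-in for one configured instance of the system, returning `None` when it abstains; exact string matching is a simplification—real answer comparison needs fuzzier matching or an LLM judge:

```python
def evaluate(qa_pairs: list[tuple[str, str]], answer_fn) -> dict[str, int]:
    # Score one system configuration against the ground-truth QA set.
    # answer_fn(question) returns the system's answer, or None for "I don't know".
    correct = abstained = wrong = 0
    for question, expected in qa_pairs:
        got = answer_fn(question)
        if got is None:
            abstained += 1
        elif got.strip().lower() == expected.strip().lower():
            correct += 1
        else:
            wrong += 1
    return {"correct": correct, "abstained": abstained, "wrong": wrong}
```

The key design choice: abstentions are counted separately from wrong answers, because for this use case a wrong answer is far more costly than no answer.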
The Parameter Landscape
With validation data in place, I defined the parameters to tune:
- Chunk length: How big should each piece of text be?
- Embedding model: Which model converts text to mathematical representations?
- Chip prepend + append: How much identifying context to add to each chunk?
- Chip quantity: How many key identifiers per chunk?
- Chunks returned: How many relevant pieces to retrieve per question?
- Similarity threshold: How confident does the match need to be before returning it?
This isn't the complete RAG tuning landscape—it's a starting point. The fun part about this problem is that you can always invent new parameters to test.
I built the optimization framework and let it run through the parameter space.
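At its core, that framework is plain grid search: enumerate every combination of parameter values and keep the one that scores best on the validation set. A sketch with illustrative (not actual) parameter values:

```python
import itertools

# Illustrative values only—the real grid and its ranges were tuned to the corpus.
param_grid = {
    "chunk_length": [200, 400, 800],
    "chips_per_chunk": [2, 4, 6],
    "chunks_returned": [3, 5, 10],
    "similarity_threshold": [0.70, 0.80, 0.90],
}

def grid_search(score_fn):
    # score_fn(config) -> number of correct answers on the validation set.
    # Try every combination; keep the best-scoring configuration.
    best_config, best_score = None, float("-inf")
    keys = list(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        config = dict(zip(keys, values))
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Exhaustive search gets expensive fast—this toy grid is already 81 combinations—which is exactly why having a cheap, automated success metric matters.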
The Results: Zero Hallucinations
Out of 100 test questions, the optimized model returned:
- 87 correct answers — accurate information pulled from the right lease, properly contextualized
- 13 appropriate "I don't know" responses — the system recognized (via a skeptical RAG prompt I'd designed) that the available information couldn't reliably answer the question
In other words: zero hallucinations.
Not "low hallucination rate." Zero. The system either gave the right answer or admitted it didn't have enough information to answer confidently. A wrong answer is worse than no answer—the model was calibrated to know its own limitations.
The true test is always real-world usage—synthetic validation data can only take you so far. But going from inconsistent results to zero hallucinations on 100 diverse test questions gave me confidence the architecture was sound.
The Dots Only Connect Backwards
I thought the auto-trading project was a waste of time.
It never made me money. The algorithm never went live. From a practical standpoint, those few months of tinkering produced nothing.
But that project built something I didn't realize I was building: an intuition for optimization.
The mental model of "I have a black box of data, a set of tunable parameters, and a way to measure success—now let me systematically find the optimal configuration" became second nature.
When I hit the wall with the RAG system, that intuition was just sitting there, waiting to be applied.