Everything you need to know about production-ready RAG systems: Part 3
November 28, 2025
In Part 1 and Part 2, we covered the retrieval and augmentation phases of RAG.
Series
- Everything you need to know about production-ready RAG systems: Part 1
- Everything you need to know about production-ready RAG systems: Part 2
In this part, we will talk about the generation part of RAG, along with ways to evaluate a RAG setup.
Generation

By this point in the pipeline, we’ve already turned everything into vectors: we have an embedding for the user’s query and embeddings for all our document chunks. Using similarity search, we’ve fetched the nearest neighbor document chunks, the pieces of text that are most likely to be relevant to the question. These retrieved chunks become the context window for the next stage: generation.
In the generation phase, the retrieved text chunks, along with the original user query, are fed into a language model as part of a prompt. The model uses this information to generate a coherent response to the user's question, typically through a chat interface. Because the LLM can draw on the most current and authoritative information available, its responses are more accurate and contextually relevant: it has access both to the user's query and to the relevant data fetched during retrieval.
Choosing the Model: Local vs API
Here, as an engineer, you have to make a decision: what type of LLM are you going to use?
There are two core questions you should ask yourself:
- Do you want to run it locally?
- If yes, how much compute and memory can you realistically dedicate to it?
If you want the best possible performance with minimal infra hassle, you’ll likely start with a hosted API such as GPT or Claude. The trade-offs:
- Pros: strong model quality, easy to integrate, no infra to maintain.
- Cons: you’re sending data to a third-party API (the data is leaving your system), and you pay per token.
If you want to run models locally (or in your own VPC), the second question becomes more important:
- What hardware do you have (CPU/GPU, RAM)?
- What latency and throughput do you need?
- What’s your budget for infra?
Your answer to these will determine whether you use a small/medium open-weight model locally, or a large hosted model via API.
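To make this decision concrete, here is a minimal sketch of how the choice plays out in code, assuming an OpenAI-compatible endpoint on both sides (a hosted API, or a local server such as vLLM or Ollama). The model names and URL below are placeholders, not recommendations:

```python
# A minimal sketch: the same client code can talk to a hosted API or a local
# OpenAI-compatible server (e.g. vLLM or Ollama). Model names and the URL are
# placeholders -- swap in whatever you actually run.
from openai import OpenAI

USE_LOCAL = False

if USE_LOCAL:
    # Local / in-VPC deployment exposing an OpenAI-compatible endpoint.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
    model = "my-open-weight-model"   # hypothetical local model name
else:
    # Hosted API: reads OPENAI_API_KEY from the environment, pay per token.
    client = OpenAI()
    model = "gpt-4o-mini"            # example hosted model

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Hello from my RAG pipeline"}],
)
print(response.choices[0].message.content)
```

The nice part of keeping the interface identical is that swapping between local and hosted models becomes a configuration change rather than a rewrite.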
Why Prompting Matters in RAG
Writing effective prompts matters just as much in RAG: without effective prompting, even the most relevant retrieved information might not be used correctly by the LLM.
The key reasons prompting matters in RAG are:
- Avoiding Hallucinations: a well-crafted prompt explicitly instructs the LLM to only use the information provided in the context, preventing it from fabricating details or drawing upon its general knowledge when the context is insufficient.
- Generating Custom Outputs: prompting allows you to define the desired behavior, personality, tone, and style of the LLM’s response. Want a friendly, concise answer or a detailed, formal one? Prompting makes it possible.
- Setting Up Constraints and Rules: you can use prompts to enforce specific rules, such as prohibiting the mention of certain topics or requiring a particular format (like bullet points or JSON).
Almost every generation prompt in a production RAG system is some combination of these 4 layers:
1. Role / System message
"Who are you?"
[System]
You are a domain expert assistant that answers questions strictly using the provided context.
If the answer is not clearly supported by the context, you MUST say:
"I don’t know based on the provided documents."
Do not use outside knowledge or make assumptions.
This is your guardrail layer. It's your best defense against hallucinations.
2. Task instructions
"What should you do right now?"
[Instructions]
Your task is to answer the user’s question using ONLY the context provided below.
- If the context contains the answer, respond with a clear and concise explanation.
- If the context partially answers the question, explicitly state the limitations.
- If the answer cannot be found in the context, say:
"I don’t know based on the provided documents."
- Do not invent APIs, code, or facts that are not supported by the context.
- Keep the answer under 200 words unless the user explicitly asks for more detail.
3. Context
Retrieved chunks / "Here is what you’re allowed to see."
Labeling each chunk makes it easy to cite later if needed.
[Context]
You will now receive a set of document excerpts ("chunks").
Each chunk is marked like this: [Doc i]
[Doc 1]
{{chunk_1}}
[Doc 2]
{{chunk_2}}
[Doc 3]
{{chunk_3}}
...
4. User query
The actual question or request.
[User Question]
{{user_query}}
It's even better if you provide one or two examples so the LLM understands the expected behavior. Once a good prompt is in place, the LLM produces the required answers, and the RAG pipeline now works end-to-end on our documents and use case.
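As a rough illustration, here is one way to assemble these four layers in code. The wording mirrors the templates above, and the function and variable names are just illustrative:

```python
# A rough sketch of assembling the four prompt layers into chat messages.
# `chunks` is whatever your retriever returned; adapt the wording to your use case.

SYSTEM = (
    "You are a domain expert assistant that answers questions strictly using "
    "the provided context. If the answer is not clearly supported by the context, "
    'you MUST say: "I don\'t know based on the provided documents." '
    "Do not use outside knowledge or make assumptions."
)

INSTRUCTIONS = (
    "Your task is to answer the user's question using ONLY the context provided below.\n"
    "- If the context partially answers the question, explicitly state the limitations.\n"
    "- Do not invent APIs, code, or facts that are not supported by the context.\n"
    "- Keep the answer under 200 words unless the user explicitly asks for more detail."
)


def build_messages(user_query: str, chunks: list[str]) -> list[dict]:
    # Label each chunk as [Doc i] so the model can cite its sources later.
    context = "\n\n".join(f"[Doc {i + 1}]\n{chunk}" for i, chunk in enumerate(chunks))
    user_content = (
        f"{INSTRUCTIONS}\n\n[Context]\n{context}\n\n[User Question]\n{user_query}"
    )
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user_content},
    ]

# Usage: pass build_messages("How do I rotate API keys?", retrieved_chunks)
# to the chat-completion call from the earlier snippet.
```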
You can see some of the common prompting mistakes in the image below:

But we also need a way to test out this RAG setup of ours.
RAG Evaluation pipeline
Evaluating a RAG pipeline helps you understand how it is actually performing. It involves multiple steps that teams currently follow to evaluate their enterprise RAG systems.
Step 1: Build a small, high-quality test dataset
Before metrics, you need data:
- Collect a set of questions your system should handle.
- For each question, store:
  - Ground-truth answer
  - Relevant source passages/document IDs
Even 30–100 good examples are enough to start. This becomes your RAG test set.
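For illustration, the test set can be as simple as a JSONL file. The field names and the example record below are hypothetical:

```python
# One way to store the test set: a JSONL file where each line holds a question,
# its ground-truth answer, and the IDs of the passages that support it.
import json

test_cases = [
    {
        "question": "What is the refund window for annual plans?",
        "ground_truth": "Annual plans can be refunded within 30 days of purchase.",
        "relevant_doc_ids": ["billing-policy-003"],
    },
    # ... 30-100 of these is enough to start
]

with open("rag_test_set.jsonl", "w") as f:
    for case in test_cases:
        f.write(json.dumps(case) + "\n")
```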
Step 2: Evaluate retrieval – "Did we fetch the right context?"
Before you judge the final answer, you want to know: “Is my retriever even giving the model the right stuff to work with?”
So, for each question in the test set:
- Run it through the retriever (but don’t call the LLM yet).
- Collect the top-k chunks that come back.
- Look at those chunks and ask: if I were the model, could I answer this question from these chunks?
There are two ways to do this:
- Manual check (great in the early stages): skim a subset of questions and their retrieved chunks and judge them yourself:
  - "Yes, this chunk clearly answers the question."
  - "No, this is just boilerplate/irrelevant."
- Automatic check (LLM-as-judge): use another model to score relevance. For each (question, chunk) pair, you can ask: "Given this question and this chunk, how relevant is this chunk to answering the question? Score from 1 (not relevant) to 5 (highly relevant)."
This step isolates retrieval from generation:
- If retrieval is weak, the LLM is basically guessing, no matter how good your prompt or model is.
- If retrieval looks good but the final answers are still bad, that’s a strong signal that the problem is in the generation step (prompt design, model choice, or output constraints), not in the retriever.
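As a rough sketch, here is how you might automate the document-level check, assuming your test set stores relevant document IDs and a hypothetical `retrieve()` function that returns chunks tagged with a `doc_id` attribute:

```python
# Hit rate / recall@k sketch: for each test question, retrieve the top-k chunks
# and check whether any known relevant document shows up among them.
# `retrieve(question, k)` is a placeholder for your own retriever.
import json

def evaluate_retrieval(test_set_path: str, retrieve, k: int = 5) -> float:
    cases = [json.loads(line) for line in open(test_set_path)]
    hits = 0
    for case in cases:
        retrieved = retrieve(case["question"], k=k)
        retrieved_ids = {chunk.doc_id for chunk in retrieved}
        if retrieved_ids & set(case["relevant_doc_ids"]):
            hits += 1
    # Fraction of questions where retrieval surfaced at least one relevant document.
    return hits / len(cases)

# recall_at_5 = evaluate_retrieval("rag_test_set.jsonl", retrieve=my_retriever, k=5)
```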
Step 3: Evaluate generation – "Is the answer grounded & correct?"
Now we plug in the LLM and evaluate the answers themselves.
For each question:
- Run full RAG: query → retrieval → generation.
- Get the final answer.
- Compare answer against:
- The context (for groundedness).
- The expected answer (if you have one).
Key dimensions:
- Faithfulness / groundedness - Does the answer stick to the provided context, or is it making stuff up?
- Relevance - Does it actually answer the question?
- Correctness - If you have expected answers, how close is it?
- Completeness - Does it cover all important parts of the query (not just one slice)?
Again, you can:
- Do this manually at first.
- Then automate with an evaluator model (LLM-as-judge), which you prompt like:
"Given the question, the context, and the answer, rate:
- Faithfulness: 1–5
- Relevance: 1–5
- Completeness: 1–5
And explain briefly why."
You can store these scores and average them across the dataset, or use frameworks built for evaluation, such as RAGAS.
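For illustration, a single LLM-as-judge call might look like the sketch below. The judge prompt, scoring scale, and JSON output format are assumptions you'd tune, or replace entirely with an evaluation framework's built-in evaluators:

```python
# A rough LLM-as-judge sketch for one (question, context, answer) triple.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Given the question, the context, and the answer, rate each of
the following from 1 to 5 and explain briefly why:
- faithfulness (does the answer stick to the context?)
- relevance (does it answer the question?)
- completeness (does it cover all important parts?)
Respond as JSON: {{"faithfulness": n, "relevance": n, "completeness": n, "why": "..."}}

Question: {question}
Context: {context}
Answer: {answer}"""

def judge(question: str, context: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
        response_format={"type": "json_object"},  # ask for machine-readable scores
    )
    return json.loads(response.choices[0].message.content)
```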
Step 4: Measure formatting & contract compliance
For production apps, it’s not enough that the text is correct; it must also:
- Return valid JSON when you ask for JSON.
- Respect length constraints ("< 200 words").
- Use the right tone or structure (e.g., always bullet points).
So you add checks like:
- "Is the output valid JSON?" → yes/no
- "Does it contain all required fields?" → yes/no
- "Does it exceed X tokens?" → yes/no
These can be fully automated with simple scripts.
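A minimal sketch of such a script, with the required fields and word limit as example values:

```python
# Simple, fully automated contract checks on the model's raw output.
import json

def check_output(raw: str, required_fields=("answer", "sources"), max_words=200) -> dict:
    results = {"valid_json": False, "has_required_fields": False, "within_length": False}
    try:
        parsed = json.loads(raw)
        results["valid_json"] = True
        results["has_required_fields"] = all(f in parsed for f in required_fields)
        answer_text = str(parsed.get("answer", ""))
    except json.JSONDecodeError:
        answer_text = raw  # fall back to checking the raw text length
    results["within_length"] = len(answer_text.split()) <= max_words
    return results

# check_output('{"answer": "...", "sources": ["Doc 1"]}')
# -> {'valid_json': True, 'has_required_fields': True, 'within_length': True}
```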
Step 5: Close the loop and compare different configurations
Now the fun part: you can experiment and compare setups:
- Embedding models: model A vs model B
- Chunk sizes: 500 tokens vs 1000 tokens
- Top-k: k=3 vs k=10
- LLMs: GPT-4.x vs open-source model
- Prompts: strict vs more relaxed
For each configuration, run:
- Retrieval eval → get relevance/recall metrics.
- Generation eval → get groundedness/correctness/completeness.
- Formatting checks → pass/fail.
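A sketch of what that comparison harness might look like, with `run_rag_eval` standing in for the retrieval, generation, and formatting evaluations from the previous steps:

```python
# Sweep over configurations and collect metrics side by side.
from itertools import product

def compare_configs(run_rag_eval):
    """`run_rag_eval(config)` is a placeholder for your own pipeline: it runs the
    retrieval, generation, and formatting checks on the test set and returns a
    dict of averaged scores for that configuration."""
    results = []
    for chunk_size, k in product([500, 1000], [3, 10]):
        config = {"chunk_size": chunk_size, "top_k": k}
        metrics = run_rag_eval(config)
        results.append({**config, **metrics})
    # Small leaderboard, sorted by whichever metric matters most to you.
    return sorted(results, key=lambda r: r.get("faithfulness", 0), reverse=True)
```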
In practice, you don’t have to build all these evaluation metrics from scratch. There are dedicated frameworks such as RAGAS, TruLens, LangSmith, and Arize Phoenix that use LLM-as-judge style evaluators to score both retrieval quality (context relevance/recall) and generation quality (faithfulness, answer relevance, hallucinations). RAGAS is one of the most popular RAG-specific libraries.
There are multiple evaluation frameworks out there, along with numerous metrics; you can read more about them here: https://www.pinecone.io/learn/series/vector-databases-in-production-for-busy-engineers/rag-evaluation/
This brings us to the end of the 3-part RAG series.
I hope you now have a clear picture of how RAG works in production systems: from ingestion and chunking, to retrieval and augmentation, to generation and evaluation.
I’d love to hear how you do it in your company, or if there’s anything you think I missed.
Thank you so much for reading.
Take care, you wonderful people, and see you in the next one.