Everything you need to know about production-ready RAG systems: Part 3

By Pradyumna Chippigiri

November 28, 2025


In Part 1 and Part 2, we covered the retrieval and augmentation phases of RAG.



In this part, we will talk about the generation phase of RAG, along with ways to evaluate a RAG setup.

Generation

Generation phase


By this point in the pipeline, we’ve already turned everything into vectors: we have an embedding for the user’s query and embeddings for all our document chunks. Using similarity search, we’ve fetched the nearest neighbor document chunks, the pieces of text that are most likely to be relevant to the question. These retrieved chunks become the context window for the next stage: generation.


In the generation phase, the retrieved text chunks, along with the original user query, are fed into a language model with a prompt. The model uses this combined input to generate a coherent response to the user’s question through a chat interface. Because it can draw on the most current and authoritative information retrieved from your documents, rather than relying only on its training data, the LLM’s responses become more accurate and contextually relevant.
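To make this concrete, here is a minimal sketch of the generation step in Python. It assumes a retrieve(query, k) helper from the retrieval phase and the OpenAI Python SDK; both the helper and the model name are placeholders you would swap for your own stack.

# Minimal generation step: build a prompt from the retrieved chunks and call the LLM.
# Assumes a `retrieve(query, k)` helper from the retrieval phase that returns text chunks.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_answer(query: str, k: int = 3) -> str:
    chunks = retrieve(query, k)  # hypothetical retriever from Part 2
    context = "\n\n".join(f"[Doc {i + 1}]\n{chunk}" for i, chunk in enumerate(chunks))

    messages = [
        {"role": "system", "content": "Answer strictly using the provided context. "
                                      "If the context does not contain the answer, say you don't know."},
        {"role": "user", "content": f"[Context]\n{context}\n\n[User Question]\n{query}"},
    ]
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content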

Choosing the Model: Local vs API

Here, as an engineer, you have to make a decision: what type of LLM are you going to use?


There are two core questions you should ask yourself:


  1. Do you want to run it locally?
  2. If yes, how much compute and memory can you realistically dedicate to it?

If you want the best possible performance with minimal infra hassle, you’ll likely start with a hosted API such as GPT or Claude. The trade-offs:

  • You get access to the strongest models with no serving infrastructure to build or maintain.
  • You pay per token, so costs grow with traffic and context size.
  • Your queries and retrieved context leave your environment, which matters for sensitive or regulated data.


If you want to run models locally (or in your own VPC), the second question becomes more important:

  • How much GPU memory can you realistically dedicate to serving the model?
  • What latency and throughput do you need at peak load?
  • Do you have the engineering bandwidth to own the serving stack (quantization, batching, upgrades)?


Your answer to these will determine whether you use a small/medium open-weight model locally, or a large hosted model via API.
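One common way to keep this decision reversible is to hide the backend behind a single function. A rough sketch, assuming the OpenAI SDK for the hosted path and Hugging Face transformers for the local path (the model names are placeholders):

# One interface, two backends: a hosted API model or a local open-weight model.
from openai import OpenAI
from transformers import pipeline

USE_LOCAL = False  # flip based on your compute, latency, and privacy constraints

if USE_LOCAL:
    # Placeholder model name; pick one that fits your GPU memory budget.
    local_llm = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
else:
    client = OpenAI()

def complete(prompt: str) -> str:
    if USE_LOCAL:
        out = local_llm(prompt, max_new_tokens=512, return_full_text=False)
        return out[0]["generated_text"]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content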

Why Prompting Matters in RAG

Writing effective prompts is just as important in RAG: without effective prompting, even the most relevant retrieved information might not be used correctly by the LLM.


The key reasons prompting matters in RAG are:


  1. Avoiding Hallucinations: a well-crafted prompt explicitly instructs the LLM to only use the information provided in the context, preventing it from fabricating details or drawing upon its general knowledge when the context is insufficient.
  2. Generating Custom Outputs: prompting allows you to define the desired behavior, personality, tone, and style of the LLM’s response. Want a friendly, concise answer or a detailed, formal one? Prompting makes it possible.
  3. Setting Up Constraints and Rules: you can use prompts to enforce specific rules, such as prohibiting the mention of certain topics or requiring a particular format (like bullet points or JSON).

Almost every generation prompt in a production RAG system is some combination of these 4 layers:

1. Role / System message

"Who are you?"

[System]
You are a domain expert assistant that answers questions strictly using the provided context.
If the answer is not clearly supported by the context, you MUST say:
"I don’t know based on the provided documents."
Do not use outside knowledge or make assumptions.

This is your guardrail layer. It’s your best defense against hallucinations.

2. Task instructions

"What should you do right now?"

[Instructions]
Your task is to answer the user’s question using ONLY the context provided below.
 
- If the context contains the answer, respond with a clear and concise explanation.
- If the context partially answers the question, explicitly state the limitations.
- If the answer cannot be found in the context, say:
  "I don’t know based on the provided documents."
- Do not invent APIs, code, or facts that are not supported by the context.
- Keep the answer under 200 words unless the user explicitly asks for more detail.

3. Context

Retrieved chunks / "Here is what you’re allowed to see."


Labeling each chunk (e.g. [Doc 1]) also makes it easy to cite the source later if needed.

[Context]
You will now receive a set of document excerpts ("chunks").
Each chunk is marked like this: [Doc i]
 
[Doc 1]
{{chunk_1}}
 
[Doc 2]
{{chunk_2}}
 
[Doc 3]
{{chunk_3}}
 
...

4. User query

The actual question or request.

[User Question]
{{user_query}}

It is even better if you provide one or two examples so the LLM understands the expected behavior. Once a good prompt is in place, the LLM produces the required answers, and our RAG pipeline is now fully tailored to our documents and use case.
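To tie the four layers together, here is a minimal sketch of how the final prompt could be assembled into chat messages; the template strings mirror the layers above, while the helper name and chunk format are illustrative.

# Assemble the four layers (system, instructions, context, user question) into chat messages.
SYSTEM = (
    "You are a domain expert assistant that answers questions strictly using the provided context. "
    "If the answer is not clearly supported by the context, say: "
    '"I don\'t know based on the provided documents."'
)

INSTRUCTIONS = (
    "Answer the user's question using ONLY the context below. "
    "State the limitations if the context is only partially relevant. "
    "Keep the answer under 200 words unless asked for more detail."
)

def build_messages(chunks: list[str], user_query: str) -> list[dict]:
    context = "\n\n".join(f"[Doc {i + 1}]\n{chunk}" for i, chunk in enumerate(chunks))
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": (
            f"[Instructions]\n{INSTRUCTIONS}\n\n"
            f"[Context]\n{context}\n\n"
            f"[User Question]\n{user_query}"
        )},
    ]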


You can see some of the common prompting mistakes in the image below:

Prompting mistakes


But we also need a way to test out this RAG setup of ours.

RAG Evaluation pipeline

Evaluating a RAG pipeline helps us understand how well our RAG is actually performing. It involves multiple steps that teams in industry currently follow to evaluate their enterprise RAG systems.

Step 1: Build a small, high-quality test dataset

Before metrics, you need data:

  • a set of realistic user questions for your domain,
  • the document(s) or chunk(s) that should answer each question, and
  • ideally, a reference answer for each question.

Even 30–100 good examples are enough to start. This becomes your RAG test set.
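For instance, the test set can start as a plain list of records; the field names and sample questions below are made up purely for illustration.

# A tiny RAG test set: question, expected answer, and the docs/chunks that should be retrieved.
test_set = [
    {
        "question": "What is the maximum file size supported by the upload API?",
        "expected_answer": "Uploads are limited to 50 MB per file.",
        "relevant_doc_ids": ["upload-api.md#limits"],
    },
    {
        "question": "How do I rotate an API key?",
        "expected_answer": "Generate a new key in the dashboard, then revoke the old one.",
        "relevant_doc_ids": ["auth.md#key-rotation"],
    },
    # ... 30-100 examples is enough to start
]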

Step 2: Evaluate retrieval – "Did we fetch the right context?"

Before you judge the final answer, you want to know: “Is my retriever even giving the model the right stuff to work with?”


So, for each question in the test set:

  1. Run it through the retriever (but don’t call the LLM yet).
  2. Collect the top-k chunks that come back.
  3. Look at those chunks and ask:
    • Do they actually contain the information needed to answer the question?
    • Is the most relevant chunk ranked near the top, or buried at the bottom of the top-k?

There are two ways to do this:

  • Manually: a human looks at the retrieved chunks and judges whether they answer the question (fine for a small test set).
  • Automatically: an LLM-as-judge rates how relevant each chunk is to the question, or you compute metrics like hit rate and recall@k against the known relevant documents (sketched below).

This step isolates retrieval from generation: if the right chunks never show up here, no amount of prompt engineering or a better LLM will fix the final answer.
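A rough sketch of the automated variant, computing hit rate and recall@k against the test set from Step 1. It assumes the retriever returns chunk dicts carrying a doc_id, which is an assumption about your retrieval code.

# Retrieval-only evaluation: did the relevant documents show up in the top-k results?
def evaluate_retrieval(test_set: list[dict], k: int = 5) -> dict:
    hits, recalls = [], []
    for example in test_set:
        retrieved = retrieve(example["question"], k)              # hypothetical retriever
        retrieved_ids = {chunk["doc_id"] for chunk in retrieved}  # assumes chunks carry a doc_id
        relevant_ids = set(example["relevant_doc_ids"])

        found = relevant_ids & retrieved_ids
        hits.append(1.0 if found else 0.0)               # hit rate: any relevant doc in top-k
        recalls.append(len(found) / len(relevant_ids))   # recall@k: fraction of relevant docs retrieved

    n = len(test_set)
    return {"hit_rate": sum(hits) / n, f"recall@{k}": sum(recalls) / n}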

Step 3: Evaluate generation – "Is the answer grounded & correct?"

Now we plug in the LLM and evaluate the answers themselves.


For each question:

  1. Run full RAG: query → retrieval → generation.
  2. Get the final answer.
  3. Compare answer against:
    • The context (for groundedness).
    • The expected answer (if you have one).

Key dimensions:

  • Groundedness / faithfulness: is every claim in the answer supported by the retrieved context?
  • Answer relevance: does the answer actually address the question that was asked?
  • Correctness: does the answer agree with the expected answer, when you have one?

Again, you can:

  • score a sample of answers manually, or
  • use an LLM-as-judge with a rubric to grade each answer automatically.

You can store these scores and average them across the dataset, or use frameworks built for evaluation, such as RAGAS.
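As one illustration, a minimal LLM-as-judge groundedness check might look like this; the judge prompt and the 1-5 scale are just one possible rubric, not a standard.

# LLM-as-judge: grade whether an answer is grounded in the retrieved context.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a RAG answer for groundedness.

Context:
{context}

Question: {question}
Answer: {answer}

Return JSON: {{"groundedness": <1-5>, "reason": "<one sentence>"}}
Score 5 if every claim is supported by the context, 1 if the answer is mostly unsupported."""

def judge_groundedness(question: str, context: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, question=question, answer=answer)}],
        response_format={"type": "json_object"},  # ask the model to return JSON
    )
    return json.loads(response.choices[0].message.content)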

Step 4: Measure formatting & contract compliance

For production apps, it’s not enough that the text is correct; it must also:

  • follow the output format your application expects (e.g. valid JSON, markdown, or citation markers like [Doc 2]),
  • stay within length limits, and
  • respect any content rules you have set.

So you add checks like:

  • does the response parse as valid JSON and contain the required fields?
  • does the answer include at least one citation marker?
  • is the answer under the length limit?

These can be fully automated with simple scripts.
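For example, a simple contract check for a pipeline that is expected to return JSON with an answer and citations; the required fields and the [Doc N] citation pattern are illustrative choices.

# Automated contract checks: valid JSON, required fields, citations, and length.
import json
import re

def check_contract(raw_response: str, max_words: int = 200) -> dict:
    results = {"valid_json": False, "has_required_fields": False,
               "has_citation": False, "within_length": False}
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return results
    if not isinstance(payload, dict):
        return results
    results["valid_json"] = True

    results["has_required_fields"] = {"answer", "citations"} <= payload.keys()
    answer = payload.get("answer", "")
    results["has_citation"] = bool(payload.get("citations")) or bool(re.search(r"\[Doc \d+\]", answer))
    results["within_length"] = len(answer.split()) <= max_words
    return results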

Step 5: Close the loop and compare different configurations

Now the fun part: you can experiment and compare setups:

  • different chunk sizes and overlaps,
  • different embedding models,
  • different values of top-k,
  • different prompts, and
  • different LLMs (local vs hosted).

For each configuration, run the same test set end to end, collect the retrieval, generation, and formatting scores, and compare them side by side, as sketched below.
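A sketch of that comparison loop, reusing the judge from Step 3; run_rag is a hypothetical entry point that takes a question plus configuration knobs and returns the retrieved context and the final answer.

# Compare RAG configurations by running the same test set through each one.
configs = [
    {"name": "baseline", "chunk_size": 512, "top_k": 3, "model": "gpt-4o-mini"},
    {"name": "bigger-context", "chunk_size": 1024, "top_k": 5, "model": "gpt-4o-mini"},
]

results = []
for config in configs:
    scores = []
    for example in test_set:
        # Hypothetical entry point: returns (retrieved_context, final_answer) for this configuration.
        context, answer = run_rag(example["question"], **config)
        verdict = judge_groundedness(example["question"], context, answer)
        scores.append(verdict["groundedness"])
    results.append({**config, "avg_groundedness": sum(scores) / len(scores)})

for row in results:
    print(row)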

In practice, you don’t have to build all these evaluation metrics from scratch. There are dedicated frameworks such as RAGAS, TruLens, LangSmith, and Arize Phoenix that use LLM-as-judge style evaluators to score both retrieval quality (context relevance/recall) and generation quality (faithfulness, answer relevance, hallucinations). RAGAS is one of the most popular RAG-specific libraries.


There are multiple evaluation frameworks out there, along with numerous metrics; you can read about them here: https://www.pinecone.io/learn/series/vector-databases-in-production-for-busy-engineers/rag-evaluation/


This brings us to the end of the 3-part RAG series.


I hope you now have a clear picture of how RAG works in production systems: from ingestion and chunking, to retrieval and augmentation, to generation and evaluation.


I’d love to hear how you do it in your company, or if there’s anything you think I missed.


Thank you so much for reading.


Take care, you wonderful people, and see you in the next one.