Everything you need to know about production-ready RAG systems: Part 2

By Pradyumna Chippigiri

November 23, 2025



In Part 1 of this series, we covered data ingestion and chunking: how to get raw data into the system and split it into pieces using different chunking strategies.


RAG pipeline overview


In this part, we’ll cover the last two core pieces of the RAG pipeline:

  1. Embedding generation – turning text into vectors that carry meaning.
  2. Retrieval – using those vectors to fetch the right chunks at query time.

In Part 3, we will talk about the generation step in RAG, along with different strategies for evaluating the whole RAG pipeline.

Embeddings: turning text into meaningful vectors

In simple terms, embeddings are numerical representations of data. These representations are not random; they carry meaning, capturing the semantic similarity between data points. For instance, the words “shoes” and “boots” are related in a way that “watch” is not.


Embeddings are not just for text; they can be applied to images, audio, or even graph data as well.


There are multiple embedding techniques; they could fill an entire series on their own. But for now, to understand what makes a good embedding, these are the two key points:


Semantic similarity


Among the embedding model families, sentence transformers are one such family: they transform entire sentences into dense vector embeddings. Traditional embedding models worked at the word level, whereas sentence transformers work at the sentence level.
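As a quick illustration, here is a minimal sketch using the open-source sentence-transformers library (the model name all-MiniLM-L6-v2 is just one example choice):

```python
from sentence_transformers import SentenceTransformer, util

# Load a small open-source sentence transformer (model choice is just an example)
model = SentenceTransformer("all-MiniLM-L6-v2")

words = ["shoes", "boots", "watch"]
embeddings = model.encode(words, convert_to_tensor=True)

# Cosine similarity: "shoes" vs "boots" should score noticeably higher than "shoes" vs "watch"
print(util.cos_sim(embeddings[0], embeddings[1]).item())
print(util.cos_sim(embeddings[0], embeddings[2]).item())
```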


Some popular sentence transformers are:

1. Closed(-ish) models via API

These are the ones you call over an API:

You can’t download the model or fine-tune it locally. You just send text → get back embeddings/vectors.


Think of them as black boxes, but convenient ones.
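To make the “black box over an API” idea concrete, here is a rough sketch using OpenAI’s Python SDK (the model name text-embedding-3-small is just one of the available options):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Send text, get back a vector; the model itself stays behind the API
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Running shoes for trail running",
)
vector = response.data[0].embedding
print(len(vector))  # dimensionality of the returned embedding
```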

2. Open(-weight) models you can run yourself

These are models where you can download the weights and run them locally or on your own cloud:

With these kinds of models you have more control (privacy, cost, custom infra).


You can refer to this Hugging Face link to see the top-ranked sentence transformers → Sentence Transformers Leaderboard


Embedding models from providers like Gemini and OpenAI can take in long paragraphs when computing vectors because they support large input contexts. Some of them are also multimodal and can handle images, tables, and other data types.


How do you decide which embedding model to use for your RAG use case? To summarize, these are the factors to consider:


Coming back to RAG: now that we have chunked the documents using the best chunking strategy for the use case, we compute an embedding for each chunk.


But where do we store these embeddings of the document chunks?

Vector Databases

Well, you may say we can store these in vector databases, and that’s right. But there are so many vector databases on the market; which one would you go with? Each of them has its own pros and use cases.


In fact, even if you pretend vector databases don’t exist, a RAG system is still totally doable with just:


Vector index diagram


No Vector DB Approach
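Here is a minimal sketch of the no-vector-DB approach: keep the chunk embeddings in a plain NumPy array and run a brute-force similarity search (the chunks, query, and model name are made up for illustration):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical chunks produced by the chunking step from Part 1
chunks = [
    "Refund policy: items can be returned within 30 days.",
    "Shipping usually takes 3-5 business days.",
    "Our support team is available 24/7 via chat.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(chunks, normalize_embeddings=True)    # (num_chunks, dim)

query = "How long do I have to return a product?"
query_embedding = model.encode([query], normalize_embeddings=True)  # (1, dim)

# With normalized vectors, cosine similarity is just a dot product
scores = doc_embeddings @ query_embedding[0]
for i in np.argsort(-scores)[:2]:  # top-2 chunks
    print(round(float(scores[i]), 3), chunks[i])
```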


Vector DB Approach
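And here is the same idea with an actual vector database; a minimal sketch using Chroma’s in-memory client (the collection name and documents are placeholders):

```python
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to persist to disk
collection = client.create_collection(name="docs")

# Chroma embeds the documents with its default embedding function
# (you can also pass in your own precomputed embeddings)
collection.add(
    documents=[
        "Refund policy: items can be returned within 30 days.",
        "Shipping usually takes 3-5 business days.",
    ],
    ids=["chunk-1", "chunk-2"],
)

results = collection.query(query_texts=["How long can I return a product?"], n_results=1)
print(results["documents"][0])
```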


Some of the vector DBs available on the market are shown below (Chroma belongs here too; I missed including it in the image):

Vector databases landscape


Why do vector databases need indexing?


Every time a user asks a question, we encode that query into an embedding vector on the fly. To answer the question, we need to find the most semantically similar vectors (chunks) from our stored corpus. If we naively compare the query embedding with every stored embedding one by one, it becomes very slow as the number of chunks grows.


Vector indexes solve this by organizing embeddings in a special data structure (HNSW graphs, IVFFlat, etc.) so that we can jump directly to the most promising neighbors instead of scanning everything. That’s what makes similarity search fast enough to be usable in production RAG systems.
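As a rough sketch of what an index like HNSW does, here is what building and querying one looks like with the faiss library (the vectors are random placeholders; in a real pipeline they would come from your embedding model):

```python
import numpy as np
import faiss

dim = 384  # embedding dimensionality (depends on the embedding model)
doc_embeddings = np.random.rand(10_000, dim).astype("float32")  # placeholder vectors

# HNSW graph index: 32 is the number of neighbors per node in the graph (M)
index = faiss.IndexHNSWFlat(dim, 32)
index.add(doc_embeddings)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)  # hops through the graph instead of scanning all 10k vectors
print(ids[0])
```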


I can cover the different indexing techniques in a separate article. The better the indexing strategy, the faster and more accurate retrieval becomes. All of these databases ship with several indexing strategies by default, from which we can choose.


For clarity, the query embeddings are not stored anywhere; they are just calculated on the fly. Only the document chunks’ embeddings are stored in the vector DB.


The same embedding model must be used for both the documents and the queries, so that both end up in the same vector space and are directly comparable.

Retrieval

Retrieval diagram


Now in the retrieval step:


  1. The user sends a query.
  2. You embed the query into a vector.
  3. The system compares this query embedding against the stored document chunk embeddings (via the vector DB’s index).
  4. It retrieves the top-k chunks (k is a parameter we set in our code) whose embeddings are most similar to the query, using metrics like the following (see the small sketch after this list):
    • cosine similarity
    • Euclidean distance
    • or other distance metrics supported by the vector store
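
To make the metrics concrete, here is a tiny sketch of both (the vectors are toy values, not real embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Higher means more similar (1.0 = same direction)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Lower means more similar (0.0 = identical vectors)
    return float(np.linalg.norm(a - b))

query_vec = np.array([0.1, 0.8, 0.3])
chunk_vec = np.array([0.2, 0.7, 0.4])

print(cosine_similarity(query_vec, chunk_vec))
print(euclidean_distance(query_vec, chunk_vec))
```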

These retrieved chunks are considered to be the most relevant context for the user’s question and are then passed into the generation step, which we’ll cover in Part 3, along with how to evaluate whether this whole RAG setup is actually doing a good job.


I will also try to talk about the different indexing strategies used in these vector databases.


Hope you liked this article.


If you did, please shower some love and subscribe to my weekly newsletter!


Have a nice day.


See you in the next one.