Building Blocks of a RAG Pipeline

September 15, 2024 5 minute read

Over the past year I’ve watched “RAG” go from a niche research acronym to something that comes up in every engineering discussion involving LLMs. If you’ve been meaning to understand what it actually is — not the hand-wavy version, but the mechanics — this post is for you.

I want to use this as a foundation post for a series I’m planning on AI Pipelines and applied LLM work. A lot of what I’ll write going forward builds on these ideas, so it’s worth getting the basics right.

Why RAG Exists

Large language models have a knowledge cutoff. Ask GPT-4 about something that happened last month and it either hallucinates or tells you it doesn’t know. More practically, ask it about your company’s internal documentation and it definitely doesn’t know.

The naive fix is fine-tuning — train the model on your data. The problems: fine-tuning is expensive, slow, and static. Every time your data changes you need to re-train.

Retrieval-Augmented Generation (RAG) is the other approach. Instead of baking knowledge into the model weights, you retrieve relevant context at query time and feed it to the model as part of the prompt. The model’s job becomes: “given this context, answer this question.”

This is powerful because:

Your knowledge base can be updated without touching the model
You can trace where the answer came from (the retrieved chunks)
It works with any LLM — even smaller, cheaper models improve dramatically with good context

The Five Components

A RAG pipeline has five logical pieces. Let me walk through each one.

1. Document Ingestion

Before you can retrieve anything, you need to get your documents into a format the pipeline can use. This typically means:

Loading: Reading PDFs, web pages, markdown files, database exports, etc.
Chunking: Splitting long documents into smaller pieces
Metadata extraction: Keeping track of source, date, author

The chunking step is more art than science. Too small and individual chunks lose context. Too large and you’re paying for tokens you don’t need and diluting relevance.

A simple recursive character splitter is a reasonable starting point:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_documents(docs)

The chunk_overlap matters — it ensures that ideas that span chunk boundaries aren’t silently cut off.

2. Embedding

Each chunk gets converted into a dense vector — a list of floating point numbers that represents the semantic meaning of the text. Semantically similar text ends up close together in this vector space.

The embedding model does this translation. Common choices in late 2024:

OpenAI text-embedding-3-small — cheap, good quality, 1536 dimensions
OpenAI text-embedding-3-large — higher quality, 3072 dimensions
sentence-transformers/all-MiniLM-L6-v2 — local, fast, smaller (384 dims)
Cohere embed-v3 — strong multilingual support

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed a single string
vector = embeddings.embed_query("What is a transformer architecture?")
print(len(vector))  # 1536

3. Vector Store

The vectors need to live somewhere searchable. A vector store indexes embeddings so you can find the nearest neighbors to a query vector quickly — usually using HNSW (Hierarchical Navigable Small World) or FAISS under the hood.

Options range from in-memory (good for prototyping) to fully managed cloud databases:

Store	Good for
ChromaDB	Local development, small datasets
FAISS	In-process, large-scale offline search
Qdrant	Self-hosted or cloud, production use
Pinecone	Fully managed, serverless option
pgvector	You already have Postgres

For a quick local setup with ChromaDB:

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_db"
)

4. Retrieval

At query time, the user’s question gets embedded using the same model, and you find the k most similar chunks in the vector store.

query = "How does attention mechanism work in transformers?"
results = vectorstore.similarity_search(query, k=4)

for doc in results:
    print(doc.metadata)
    print(doc.page_content[:200])
    print("---")

This is where a lot of the tuning happens in practice. Raw cosine similarity isn’t always what you want. You might want:

MMR (Maximal Marginal Relevance) — balance relevance with diversity so you don’t retrieve 4 chunks that all say the same thing
Hybrid search — combine dense (semantic) with sparse (keyword/BM25) retrieval
Reranking — use a second model to re-score the retrieved chunks

5. Generation

The retrieved chunks get assembled into a prompt along with the user’s question, and sent to an LLM:

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

template = """Use the following context to answer the question.
If you don't know the answer based on the context, say so — don't make things up.

Context:
{context}

Question: {question}

Answer:"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=template
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    chain_type_kwargs={"prompt": prompt}
)

response = qa_chain.invoke({"query": "How does attention work?"})
print(response["result"])

Putting It All Together

Here’s a minimal end-to-end pipeline — load some documents, embed them, store them, and query:

import os
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# 1. Load documents
loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=TextLoader)
docs = loader.load()
print(f"Loaded {len(docs)} documents")

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(docs)
print(f"Split into {len(chunks)} chunks")

# 3. Embed + store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# 4. Build retrieval chain
template = """Answer based on the context below. Say "I don't know" if the context
doesn't contain the answer.

Context: {context}
Question: {question}
Answer:"""

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    chain_type_kwargs={
        "prompt": PromptTemplate(
            template=template,
            input_variables=["context", "question"]
        )
    }
)

# 5. Query
while True:
    question = input("\nAsk a question (or 'quit'): ")
    if question.lower() == "quit":
        break
    result = qa.invoke({"query": question})
    print(f"\n{result['result']}")

What This Doesn’t Cover (Yet)

This is the happy path. Production RAG introduces a lot of complexity I want to address in separate posts:

Evaluation — how do you know if your RAG is actually good? RAGAS, LlamaIndex eval frameworks
Chunking strategies — semantic chunking, parent-child chunking, document hierarchy
Hybrid search — combining dense and sparse retrieval
Reranking — Cohere Rerank, cross-encoder models
Agentic retrieval — query decomposition, iterative retrieval

The pipeline I’ve shown here is functional but naive. It works well enough to prototype on. The gap between this prototype and a production system is where most of the interesting engineering lives, and that’s what I’ll be exploring in upcoming posts.

Prerequisites

If you want to run the code above:

pip install langchain langchain-openai langchain-chroma chromadb
export OPENAI_API_KEY="your-key-here"

Put some markdown or text files in a ./docs directory and you’re off.

Share on

X Facebook LinkedIn Bluesky

Deepan Seeralan

Building Blocks of a RAG Pipeline

Why RAG Exists

The Five Components

1. Document Ingestion

2. Embedding

3. Vector Store

4. Retrieval

5. Generation

Putting It All Together

What This Doesn’t Cover (Yet)

Prerequisites

Share on

Comments

You May Also Enjoy

OKF: Google’s Markdown-Based Knowledge Format for AI Agents

Building a Document Q&A Service: From Prototype to Production

Understanding Snowflake ID and its uses

Pinecone Goes Serverless: What Actually Changes