A Complete Guide to Building a Basic (Naive) RAG

Introduction

RAG (Retrieval‑Augmented Generation) lets an LLM answer questions using your documents. In this first post of the series we'll build a basic / naive RAG: index some files, retrieve the most relevant chunks for a query, then augment the LLM prompt with those chunks to get a grounded answer.

We'll use Python, ChromaDB for the vector store, OpenAI's text-embedding-3-small model for embeddings, and an OpenAI chat model for generation. Code snippets are compact and runnable, with room for expansion.

What you'll build

A small script that:

Loads your docs (TXT)
Splits them into chunks
Embeds + stores them in Chroma
Retrieves top‑k chunks for a question
Asks an LLM to answer using only those chunks

Prerequisites

Python 3.9+
A code editor (VS Code recommended)
An OpenAI API key in a .env file as OPENAI_API_KEY=...
A few plain-text (.txt) documents to query

Tip: Start with a tidy news_articles/ folder of .txt files. The loader in this post only reads .txt, so convert PDFs to text first.

0) Project scaffold

mkdir rag-basic && cd rag-basic
python -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows
# .venv\Scripts\activate

Create folders and starter files:

rag-basic/
├─ .env
├─ news_articles/
│  ├─ article1.txt
│  └─ article2.txt
└─ app.py

.env:

OPENAI_API_KEY=sk-...

1) Install dependencies

Keep it minimal for a basic RAG:

# run this inside the virtual environment activated in step 0
pip install python-dotenv openai chromadb

(Optionally for plots later in the series: pip install umap-learn matplotlib numpy.)

2) The minimal RAG pipeline (index → retrieve → generate)

2.1 Imports, env, and clients

Open app.py and paste the snippets in each step.

# app.py
from dotenv import load_dotenv
from openai import OpenAI
from chromadb.utils import embedding_functions
import chromadb
import os
 
load_dotenv()
 
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
assert OPENAI_API_KEY, "Set OPENAI_API_KEY in your .env"
 
# Embedding function (OpenAI)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=OPENAI_API_KEY,
    model_name="text-embedding-3-small",
)
 
# Chroma client (persistent on disk)
chroma_client = chromadb.PersistentClient(path="chroma_db")
 
# One collection per project/task
COLLECTION_NAME = "document_qa_collection"
collection = chroma_client.get_or_create_collection(
    name=COLLECTION_NAME,
    embedding_function=openai_ef
)
 
# OpenAI chat client
oa = OpenAI(api_key=OPENAI_API_KEY)

2.2 Load your documents

Support TXT files by adding this to app.py:

# app.py
def load_documents(directory: str):
    print(f"Loading documents from {directory} ...")
    docs = []
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            with open(os.path.join(directory, filename), "r", encoding="utf-8") as f:
                docs.append({"id": filename, "text": f.read()})
    print(f"Loaded {len(docs)} documents")
    return docs
 
documents = load_documents("./news_articles")

2.3 Split text into overlapping chunks

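Splitting each document into overlapping chunks keeps every piece small enough to embed and retrieve precisely, while the overlap preserves context that would otherwise be cut at chunk boundaries.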

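def split_text(text: str, chunk_size=1000, chunk_overlap=200):
    # The overlap must be smaller than the chunk size, or the window never advances
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    chunks, start = [], 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # this window reached the end; avoid a redundant tail chunk
        # step forward but overlap a little to keep context
        start = end - chunk_overlap
    return chunks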
 
# Expand docs into chunked docs with unique IDs
chunked_docs = []
for doc in documents:
    chunks = split_text(doc["text"])
    for i, chunk in enumerate(chunks, start=1):
        chunked_docs.append({
            "id": f"{doc['id']}_chunk{i}",
            "text": chunk,
            "source": doc["id"],
            "chunk_index": i
        })
 
print(f"Split into {len(chunked_docs)} chunks")
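To make the overlap concrete, here is a small sketch that computes only the (start, end) character offsets a sliding window like this would produce. The helper name chunk_spans is hypothetical (it is not part of app.py), and it assumes the window stops once it reaches the end of the text.

```python
def chunk_spans(text_len: int, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Return (start, end) character offsets for overlapping chunks.

    Hypothetical helper for illustration only; it mirrors a sliding-window
    splitter but works on lengths so the arithmetic is easy to inspect.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    spans, start = [], 0
    while start < text_len:
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end == text_len:
            break  # the window reached the end of the text
        start = end - chunk_overlap  # step back by the overlap
    return spans


if __name__ == "__main__":
    # A 2,500-character document with the defaults used above
    for start, end in chunk_spans(2500):
        print(f"chars {start}-{end} ({end - start} long)")
```

With the defaults, each chunk starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.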

2.4 Create embeddings and index into Chroma

Embed each chunk and upsert into the collection:

def get_openai_embedding(text: str):
    res = oa.embeddings.create(model="text-embedding-3-small", input=text)
    return res.data[0].embedding
 
# Upsert all chunks with precomputed embeddings
for d in chunked_docs:
    emb = get_openai_embedding(d["text"])
    collection.upsert(
        ids=[d["id"]],
        documents=[d["text"]],
        metadatas=[{"source": d["source"], "chunk_index": d["chunk_index"]}],
        embeddings=[emb],
    )
 
print("Indexing complete ✅")
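Embedding one chunk per request means one HTTP round trip per chunk. The OpenAI embeddings endpoint also accepts a list of strings as input, so a possible refinement is to send chunks in batches. The sketch below runs only the batching logic. The helper batched and the batch size of 100 are illustrative assumptions, and the commented lines show how it might plug into the indexing loop, assuming the response items come back in input order (each item also carries an index field you can check).

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements.

    Hypothetical helper, not part of Chroma or the OpenAI SDK.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# How this might be used with the objects defined in app.py (not run here):
#
# for batch in batched(chunked_docs, 100):
#     res = oa.embeddings.create(
#         model="text-embedding-3-small",
#         input=[d["text"] for d in batch],
#     )
#     collection.upsert(
#         ids=[d["id"] for d in batch],
#         documents=[d["text"] for d in batch],
#         metadatas=[{"source": d["source"], "chunk_index": d["chunk_index"]} for d in batch],
#         embeddings=[item.embedding for item in res.data],
#     )

if __name__ == "__main__":
    print(list(batched(["a", "b", "c", "d", "e"], 2)))
```

For a handful of articles the per-chunk loop is fine; batching mostly pays off once you index hundreds of chunks.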

2.5 Retrieve the most relevant chunks

def query_documents(question: str, n_results=4):
    print("Retrieving relevant chunks ...")
    res = collection.query(
        query_texts=[question],
        n_results=n_results,
        # ids are always returned, so "ids" is not a valid include value
        include=["documents", "metadatas", "distances"]
    )
    # Flatten results from Chroma's list-of-lists shape (one inner list per query)
    docs = res["documents"][0]
    metas = res["metadatas"][0]
    dists = res["distances"][0]

    for i, (meta, dist) in enumerate(zip(metas, dists), start=1):
        print(f"{i}. {meta['source']} (chunk {meta['chunk_index']})  distance={dist:.4f}")

    return docs

Why distances? It's useful to see how "close" each chunk is to the query in vector space (lower is closer).
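As a rough illustration of what such a distance measures, the sketch below computes two common ones by hand on tiny made-up vectors. As far as I know, Chroma collections default to squared L2 distance unless configured otherwise, so treat the exact metric as an assumption and check your collection's settings. The function names are hypothetical.

```python
import math


def squared_l2(a, b):
    """Sum of squared coordinate differences (0 means identical vectors)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity (0 means the vectors point the same way)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    query = [1.0, 0.0]
    near = [0.8, 0.6]   # toy "chunk" vector pointing roughly the same way
    far = [0.0, 1.0]    # toy "chunk" vector orthogonal to the query
    for name, vec in [("near", near), ("far", far)]:
        print(name, round(squared_l2(query, vec), 4), round(cosine_distance(query, vec), 4))
```

Under both metrics the near vector scores lower than the far one, which is the ordering the printed distances reflect.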

2.6 Generate a final answer with the retrieved context

This is the "G" in RAG: pass the retrieved snippets to the chat model.

def generate_response(question: str, retrieved_chunks: list[str]):
    context = "\n\n".join(retrieved_chunks)
    system = (
        "You are an assistant for question-answering tasks. "
        "Use the provided context to answer the user succinctly. "
        "If the answer is not in the context, say you don't know."
    )
    user = f"Context:\n{context}\n\nQuestion:\n{question}"
 
    completion = oa.chat.completions.create(
        model="gpt-4o-mini",  # swap for your preferred model
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=0.2,
    )
    return completion.choices[0].message.content.strip()
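If you want the model to be able to point at where an answer came from, one option is to label each chunk with its source before joining. The sketch below is an assumption layered on top of the tutorial's code: format_context is a hypothetical helper, and it expects the metadata dictionaries (with source and chunk_index keys) stored during indexing, which query_documents would need to return alongside the text.

```python
def format_context(chunks, metadatas):
    """Join chunks into one context string, prefixing each with its source.

    Hypothetical helper; `metadatas` is assumed to hold dicts with
    "source" and "chunk_index" keys, matching what was stored at indexing time.
    """
    if len(chunks) != len(metadatas):
        raise ValueError("chunks and metadatas must have the same length")
    blocks = []
    for number, (chunk, meta) in enumerate(zip(chunks, metadatas), start=1):
        header = f"[{number}] {meta['source']} (chunk {meta['chunk_index']})"
        blocks.append(f"{header}\n{chunk.strip()}")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demo_chunks = ["First passage.", "Second passage."]
    demo_metas = [
        {"source": "article1.txt", "chunk_index": 1},
        {"source": "article2.txt", "chunk_index": 3},
    ]
    print(format_context(demo_chunks, demo_metas))
```

The returned string could replace the plain joined context in the user message, and the system prompt could then ask the model to cite the bracketed numbers.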

2.7 Wire it up (ask your first question)

if __name__ == "__main__":
    question = "Tell me about AI replacing TV writers in the strike."
    top_chunks = query_documents(question, n_results=4)
    answer = generate_response(question, top_chunks)
    print("\n--- Answer ---")
    print(answer)

Run it

python app.py

You should see the retrieved chunks listed (source file, chunk number, and distance), followed by a short answer grounded in your documents.

3) What "naive RAG" is (and why it's enough for now)

Naive (basic) RAG has three phases:

Indexing — clean, chunk, embed, store
Retrieval — embed the question and perform a similarity search
Generation — stuff retrieved text + question into a prompt for the LLM

It's a great starting point: the pipeline is clear, fast to build, and works well for small/clean corpora.
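To see the three phases without any API calls, here is a toy, in-memory sketch. It swaps the real embedding model for a simple word-count vector and stops at building the prompt instead of calling an LLM, so it illustrates the flow only, not the quality of a real system. All names in it are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks):
    """Indexing: embed every chunk and keep (chunk, vector) pairs in memory."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=2):
    """Retrieval: embed the question and return the k most similar chunks."""
    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine_similarity(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Generation (input side): pack retrieved text and the question into a prompt."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion:\n{question}"


if __name__ == "__main__":
    index = build_index([
        "Writers went on strike over pay and the use of AI in scripts.",
        "The central bank left interest rates unchanged this month.",
        "A new telescope captured images of a distant galaxy.",
    ])
    question = "Why did the writers strike?"
    top = retrieve(index, question, k=1)
    print(build_prompt(question, top))
```

In the real pipeline, Chroma plays the role of build_index and retrieve, and the chat model consumes the prompt.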

4) Try these quick experiments

Change chunk_size (e.g., 800 vs 1200) and chunk_overlap (50–250). How does answer quality change?
Ask a question you know isn't covered by your docs — does the model say "I don't know"?
Raise n_results to 6–8 and watch for noise versus coverage.
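Before re-indexing for each experiment, it can help to see how the settings change the number of chunks (and therefore the number of embedding calls). The sketch below estimates the count for a text of a given length, assuming a sliding-window splitter that stops once a window reaches the end of the text. count_chunks is a hypothetical helper.

```python
import math


def count_chunks(text_len: int, chunk_size: int, chunk_overlap: int) -> int:
    """Number of sliding-window chunks for a text of `text_len` characters.

    Hypothetical helper: assumes each window starts `chunk_size - chunk_overlap`
    characters after the previous one and the last window ends at the text's end.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    if text_len <= 0:
        return 0
    if text_len <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    # One chunk covers the first chunk_size characters; each extra chunk adds `step`
    return 1 + math.ceil((text_len - chunk_size) / step)


if __name__ == "__main__":
    # Compare a few settings on a 10,000-character document
    for size, overlap in [(800, 50), (1000, 200), (1200, 250)]:
        print(size, overlap, count_chunks(10_000, size, overlap))
```

Smaller chunks or larger overlaps both raise the count, which means more embedding requests and more candidates competing for the top-k slots.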

Full minimal script (for copy/paste)

from dotenv import load_dotenv
from openai import OpenAI
from chromadb.utils import embedding_functions
import chromadb
import os
 
load_dotenv()
 
# Load env variables
openai_key = os.getenv("OPENAI_API_KEY")
 
# Embedding function that vectorizes text before it is stored in Chroma
openai_ef = embedding_functions.OpenAIEmbeddingFunction(api_key=openai_key, model_name="text-embedding-3-small")
 
#Initialize chroma client with persistent storage
chromadb_client = chromadb.PersistentClient(path="chroma_db")
 
# Create (or reuse) the collection that stores the vectorized chunks
collection_name = "document_qa_collection"
collection = chromadb_client.get_or_create_collection(name=collection_name, embedding_function=openai_ef)
 
# Create the OpenAI client
openai_client = OpenAI(api_key=openai_key)
 
# Load the .txt documents from a directory
def load_documents(directory):
    print(f"Loading documents from {directory}")
    documents = []
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            # Read as UTF-8 so the result does not depend on the platform default
            with open(os.path.join(directory, filename), "r", encoding="utf-8") as f:
                documents.append({"id": filename, "text": f.read()})
    return documents
 
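# Split the text into overlapping chunks
def split_text(text, chunk_size=1000, chunk_overlap=200):
    # The overlap must be smaller than the chunk size, or the window never advances
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # this window reached the end; avoid a redundant tail chunk
        start = end - chunk_overlap
    return chunks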
 
# Load the documents
documents = load_documents("./news_articles")
print("Number of documents loaded:", len(documents))
 
# Split documents into chunks with unique IDs
print("==== Splitting docs into chunks ====")
chunked_documents = []
for doc in documents:
    chunks = split_text(doc["text"])
    for i, chunk in enumerate(chunks, start=1):
        chunked_documents.append({"id": f"{doc['id']}_chunk{i}", "text": chunk})
 
print(f"Split documents into {len(chunked_documents)} chunks")
 
# Generate an embedding for one chunk using OpenAI
def get_openai_embedding(text):
    response = openai_client.embeddings.create(input=text, model="text-embedding-3-small")
    return response.data[0].embedding
 
# Generate embeddings for the document chunks
print("==== Generating embeddings... ====")
for doc in chunked_documents:
    # Store the embedding alongside each chunk
    doc["embedding"] = get_openai_embedding(doc["text"])
 
# Add the documents with their embeddings to the collection
for doc in chunked_documents:
    collection.upsert(
        documents=[doc["text"]],
        ids=[doc["id"]],
        embeddings=[doc["embedding"]]
    )
 
# Query function to get the most relevant chunks
def query_documents(question, n_results=3):
    print("==== Querying documents... ====")
    # Chroma embeds the question with the collection's embedding function
    results = collection.query(query_texts=[question], n_results=n_results)
 
    # Flatten Chroma's list-of-lists result (one inner list per query)
    relevant_chunks = [doc for sublist in results["documents"] for doc in sublist]
    print("==== Returning relevant chunks ====")
    return relevant_chunks
 
# Generate a response from OpenAI using the retrieved chunks as context
def generate_response(question, relevant_chunks):
    context = "\n\n".join(relevant_chunks)
    system = (
        "You are an assistant for question-answering tasks. Use the following pieces of "
        "retrieved context to answer the question. If you don't know the answer, say that you "
        "don't know. Use three sentences maximum and keep the answer concise."
    )
    user = "Context:\n" + context + "\n\nQuestion:\n" + question
 
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # same model as step 2.6
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
 
    # Return the answer text rather than the whole message object
    return response.choices[0].message.content
 
question = "Tell me about AI replacing TV writers in the strike."
relevant_chunks = query_documents(question)
response = generate_response(question, relevant_chunks)
print(response)

5) Where this "Basic RAG" falls short (and what's next)

Naive RAG is intentionally simple. As your corpus grows, you'll notice:

Sometimes the retrieved chunks are close but not quite what you need.
The LLM might ignore key context (or hallucinate).
Large corpora need better indexing, reranking, and smarter query expansion.

In the next post of this series, we'll harden retrieval with practical upgrades:

Query expansion (generated answers & multi‑query)
Reranking (to filter noise)
Prompt packing patterns that keep answers grounded

Cheatsheet summary

Index: chunk → embed → upsert
Retrieve: embed query → similarity search → top‑k chunks
Generate: pack context + question → LLM → concise grounded answer

That's it — you've got a working Basic RAG. Tweak chunk sizes, add your own docs, and see how far it gets you. Then let's level it up in Part 2.
