
LangChain RAG Tutorial: Build Retrieval-Augmented Generation from Scratch

  • Writer: Leanware Editorial Team

Introduction to RAG & LangChain


What is RAG (Retrieval-Augmented Generation)?

Retrieval-Augmented Generation (RAG) is a hybrid approach that combines the strengths of language models and external knowledge retrieval. Instead of relying solely on a model's internal training data, RAG dynamically retrieves relevant documents from a knowledge base and incorporates them into the generation process. This improves accuracy, keeps answers up to date, and reduces hallucinations. RAG became prominent through research by Facebook AI and has since been adopted in enterprise NLP workflows.


Why Use LangChain for RAG?

LangChain offers a modular, developer-friendly framework to build complex RAG pipelines. It abstracts the complexity of integrating LLMs, vector databases, prompt templates, and tools into a unified interface. LangChain supports many vector stores like FAISS, Chroma, and Pinecone, along with a rich ecosystem of document loaders, agents, and memory modules. Compared to LlamaIndex, LangChain excels in customizability and agent-driven orchestration.


Key Use Cases & Benefits

  • AI-powered chatbots that provide accurate, up-to-date responses from internal documentation

  • Legal and HR assistants capable of parsing policy documents and answering compliance queries

  • Customer support systems with tailored responses based on product manuals

  • Research assistants that retrieve and summarize academic papers


Prerequisites & Setup


Environment Setup & Dependencies

You’ll need Python 3.10+, pip, and optionally a GPU (for faster embedding generation). Set up a virtual environment using virtualenv or conda to manage dependencies.
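For example, with the built-in venv module (conda works equally well):

python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate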


Installing LangChain, Embeddings, Vector Stores

pip install langchain openai chromadb faiss-cpu tiktoken python-dotenv


  • langchain: Core framework

  • openai: LLM and embedding provider

  • chromadb or faiss-cpu: Vector store

  • tiktoken: Tokenizer used by OpenAI models
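python-dotenv (also installed above) lets you keep your API key in the .env file from the project layout instead of in code. A minimal sketch, assuming the standard OPENAI_API_KEY variable name:

# .env
# OPENAI_API_KEY=...

# main.py / ingest.py
from dotenv import load_dotenv

load_dotenv()  # reads .env so the OpenAI client and LangChain can pick up OPENAI_API_KEY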


Project Structure & File Layout

rag-tutorial/
├── main.py
├── ingest.py
├── .env
├── data/
│   └── docs/
├── outputs/
└── utils/
    └── helpers.py


This layout keeps ingestion, retrieval, and application logic modular.


Data Ingestion & Indexing

Loading Documents (PDFs, Word, Text, etc.)

LangChain provides document loaders for various formats. For example:

from langchain.document_loaders import DirectoryLoader, PyPDFLoader

loader = DirectoryLoader("./data/docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()


Under the hood, loaders rely on parsing libraries such as pdfminer, python-docx, and unstructured.
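If your knowledge base mixes formats, you can point separate loaders at the same directory and merge the results. A small sketch using loaders that ship with LangChain (Docx2txtLoader requires the docx2txt package):

from langchain.document_loaders import DirectoryLoader, Docx2txtLoader, TextLoader

word_docs = DirectoryLoader("./data/docs", glob="**/*.docx", loader_cls=Docx2txtLoader).load()
text_docs = DirectoryLoader("./data/docs", glob="**/*.txt", loader_cls=TextLoader).load()

documents = documents + word_docs + text_docs  # combine with the PDFs loaded above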


Text Splitting / Chunking Strategies

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_documents(documents)


Chunking improves context alignment and retrieval precision.
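Character counts only approximate what the model actually sees. If you prefer chunk sizes measured in tokens, RecursiveCharacterTextSplitter has a tiktoken-based constructor; a minimal sketch:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size / chunk_overlap are counted in tokens rather than characters
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500,
    chunk_overlap=50,
)
docs = token_splitter.split_documents(documents)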


Building Embeddings & Vector Store

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embedding = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embedding, persist_directory="./outputs")


Other embedding providers include HuggingFace, Cohere, and BAAI/bge.
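Swapping providers usually only changes the embedding object. A sketch using HuggingFaceEmbeddings (requires the sentence-transformers package; the model name below is a common default, not a requirement):

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(docs, embedding, persist_directory="./outputs")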


Metadata & Document Tags

Add source info to each document chunk:

for doc in docs:
    doc.metadata["source"] = "employee_policy.pdf"


This supports filtering and source citation later.


Retrieval + Generation Pipeline


Retriever Interfaces & Search

retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 4})
retrieved_docs = retriever.get_relevant_documents("What is the leave policy?")

You can also use Max Marginal Relevance (MMR) or filtered retrieval.
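Both variants are configured through as_retriever. A short sketch, assuming the Chroma store built earlier and the source metadata added above:

# MMR: trade off relevance against diversity among the returned chunks
mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20},
)

# Filtered retrieval: restrict the search to chunks whose metadata matches
filtered_retriever = vectorstore.as_retriever(
    search_kwargs={"k": 4, "filter": {"source": "employee_policy.pdf"}},
)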


Prompt Composition & LLM Input

from langchain.chains import RetrievalQA

# llm is defined in the next section (ChatOpenAI)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
response = qa_chain.run("Summarize our leave policy")

For more control, use StuffDocumentsChain or MapReduceDocumentsChain.
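Rather than wiring those document chains by hand, the same behavior is reachable through the chain_type argument, and the default "stuff" chain accepts a custom prompt via chain_type_kwargs. A hedged sketch (the prompt wording is illustrative):

from langchain.prompts import PromptTemplate

# Map-reduce: summarize each retrieved chunk, then combine the summaries
mr_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="map_reduce", retriever=retriever)

# Stuff chain with a custom, grounded prompt
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Answer the question using only the context below.\n\n"
        "{context}\n\nQuestion: {question}\nAnswer:"
    ),
)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
)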


Putting It Together: The RAG Chain

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

query = "How do performance reviews work?"
answer = qa_chain.run(query)
print(answer)


Advanced Features & Extensions

Chat Memory / Conversation History

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

This enables multi-turn conversations.
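To use the memory with retrieval, RetrievalQA is typically swapped for ConversationalRetrievalChain, which rewrites follow-up questions using the chat history. A minimal sketch:

from langchain.chains import ConversationalRetrievalChain

chat_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)

print(chat_chain.run("What is the leave policy?"))
print(chat_chain.run("Does it also apply to contractors?"))  # follow-up resolved via chat_history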


Agent-Based Retrieval (Multi-Hop, Adaptive Queries)

LangChain agents can trigger multiple tools, including retrievers. For complex questions, agents plan and sequence retrieval steps.
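One way to sketch this is to expose the QA chain as a tool and let a standard agent decide when to call it (the tool name and description below are illustrative):

from langchain.agents import AgentType, Tool, initialize_agent

tools = [
    Tool(
        name="hr_policy_search",
        func=qa_chain.run,
        description="Answers questions about internal HR policy documents.",
    )
]

agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
agent.run("Compare the leave policy with the remote work policy.")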


Returning Sources & Citations

qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, return_source_documents=True)

result = qa_chain("Summarize the leave policy")
print(result["result"])
print(result["source_documents"])


Fine-Tuning & Custom Retrieval Logic

Apply metadata filters, hybrid (keyword + vector) search, or even reranking with custom logic.
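As one hedged example of hybrid search, keyword and vector retrieval can be combined with an ensemble retriever (BM25Retriever needs the rank_bm25 package):

from langchain.retrievers import BM25Retriever, EnsembleRetriever

keyword_retriever = BM25Retriever.from_documents(docs)
keyword_retriever.k = 4

hybrid_retriever = EnsembleRetriever(
    retrievers=[keyword_retriever, vectorstore.as_retriever()],
    weights=[0.5, 0.5],  # relative weighting of keyword vs. vector scores; tune per corpus
)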


Real-World Example: Building a RAG Chatbot


Sample Use Case & Problem Statement

An HR team wants a chatbot that answers employee questions based on internal documents (PDFs).


Code Walkthrough

llm = ChatOpenAI()
retriever = vectorstore.as_retriever()
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

query = "What is our maternity leave policy?"
response = qa.run(query)
print(response)


Deploying with an API / Frontend (e.g. FastAPI, Streamlit)

FastAPI Example:

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/rag")
async def rag_endpoint(request: Request):
    body = await request.json()
    answer = qa.run(body["query"])
    return {"answer": answer}


Evaluation, Metrics & Debugging Tips

  • Use precision@k to evaluate retrieval accuracy (a quick sketch follows this list)

  • Compare generated answers to gold references (F1 or ROUGE)

  • Log tokens, sources, and model outputs for each run
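A rough precision@k sketch, assuming you maintain a small hand-labeled set of queries mapped to the source files that should be retrieved (the labels below are hypothetical):

labeled_queries = {
    "What is the leave policy?": {"employee_policy.pdf"},
}

k = 4
for query, relevant_sources in labeled_queries.items():
    hits = retriever.get_relevant_documents(query)[:k]
    retrieved_sources = [d.metadata.get("source") for d in hits]
    precision = sum(s in relevant_sources for s in retrieved_sources) / k
    print(f"{query!r}: precision@{k} = {precision:.2f}")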


Best Practices, Pitfalls & Optimization



Chunk Size, Overlap & Split Strategy

Choosing how to break your documents into chunks is critical. Larger chunks (e.g., 500 tokens) give the model more context in a single pass, which can improve answer accuracy. However, if chunks are too large, they risk being truncated due to token limits, especially during multi-document retrieval. Adding an overlap (e.g., 50 tokens) helps preserve sentence continuity across chunks. The goal is to balance context richness with retrieval precision. Start with 500-token chunks and a 10–15% overlap as a baseline, then tune based on your use case.


Retriever Efficiency & Caching

To reduce latency and avoid redundant computation, enable caching. LangChain supports caching LLM responses (so identical prompts are not re-sent to the API) as well as caching embeddings (so re-ingesting unchanged documents costs nothing). This is especially useful in applications where similar queries are repeated or where agents use step-based reasoning that makes many retrieval calls. Caching reduces load on your vector store and embedding provider and speeds up overall response times.
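A minimal sketch of both layers, assuming an in-memory cache for LLM responses and an on-disk store for embeddings (exact import paths vary a bit across LangChain versions):

from langchain.cache import InMemoryCache
from langchain.embeddings import CacheBackedEmbeddings, OpenAIEmbeddings
from langchain.globals import set_llm_cache
from langchain.storage import LocalFileStore

# Cache LLM responses so identical prompts are not re-sent to the API
set_llm_cache(InMemoryCache())

# Cache embeddings on disk so re-embedding unchanged documents is free
store = LocalFileStore("./outputs/embedding_cache")
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    OpenAIEmbeddings(), store, namespace="openai"
)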


Prompt Engineering Techniques

The quality of your final answer depends heavily on how you prompt the LLM. You can improve performance with techniques like:

  • ReAct: a prompting style that blends reasoning and acting steps.

  • Few-shot examples: include 1–2 examples in the prompt to show the model how to respond.

  • Chain-of-thought: ask the model to explain its reasoning step by step.

These strategies help models stay grounded and handle more complex queries, especially when combined with retrieved documents.
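For instance, a few-shot, grounded prompt can be dropped into the stuff chain shown earlier via chain_type_kwargs (the example Q&A pair is purely illustrative):

from langchain.prompts import PromptTemplate

few_shot_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Answer using only the context. Think step by step, then give a short answer.\n\n"
        "Example:\n"
        "Q: How many vacation days do new hires get?\n"
        "A: The policy states 20 days per year, so the answer is 20 days.\n\n"
        "Context:\n{context}\n\nQ: {question}\nA:"
    ),
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": few_shot_prompt},
)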


Avoiding Hallucinations & Ensuring Source Quality

LLMs can sometimes make up information ("hallucinate") if the retrieved documents are vague, irrelevant, or low-quality. To prevent this:

  • Always ground responses directly in the retrieved text.

  • Filter out noisy or outdated documents during ingestion.

  • Validate document metadata (e.g., source, timestamp, author) to ensure reliability.

Additionally, instruct the model to cite or highlight evidence from the context when forming a response.


Future Directions & Advanced Research

Dynamic / Parametric RAG

Researchers are exploring dynamic retrieval models that learn how and when to retrieve relevant documents during training. These approaches, often called parametric RAG, fold retrieval into the model's parameters, so the LLM learns to fetch supporting evidence as part of its training process rather than relying solely on external retrievers.


Multi-modal RAG (images, tables, etc.)

RAG is expanding beyond text. Work on document visual question answering (DocVQA) and multi-modal RAG systems can retrieve and reason over structured data like tables, scanned documents, and even images. In LangChain, you can combine OCR pipelines (for extracting text from images) with text embeddings to support this use case, enabling richer context for answering questions.


RAG in Production at Scale

When moving from prototype to production, you need scalable infrastructure. LangChain offers LangServe, a tool to package your RAG pipelines into deployable APIs (sketched after the list below). For performance:

  • Use batch retrieval to minimize API calls.

  • Run inference on GPUs for faster responses.

  • Distribute workloads across sharded vector databases (e.g., FAISS, Weaviate, Pinecone) for speed and resilience.
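A hedged sketch of the LangServe side, assuming a LangChain version in which chains expose the Runnable interface so add_routes can serve them:

from fastapi import FastAPI
from langserve import add_routes

app = FastAPI(title="RAG API")

# Exposes POST /rag/invoke, /rag/batch, and /rag/stream endpoints for the chain
add_routes(app, qa_chain, path="/rag")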


Conclusion & Next Steps

Recap & Key Takeaways

LangChain offers a composable and flexible toolkit to build RAG systems. You’ve now learned the core steps: ingesting documents, splitting them into optimized chunks, generating embeddings, retrieving relevant context, and generating grounded responses using LLMs.


What to Learn Next (LangSmith, Agents, etc.)


  • LangSmith: A powerful tool for debugging and observing LLM applications, helping you track performance and test prompt variations.

  • LangChain Agents: For tasks that require multi-step decision-making, agents can retrieve, plan, and act using tools.

  • LangServe: Helps you deploy LangChain pipelines as production-grade APIs, with scalable and maintainable infrastructure.


You can consult with our team to evaluate your project needs and identify the most effective approach.


FAQs

What is LangChain used for?

LangChain is a framework to build LLM-powered apps like chatbots, agents, and RAG pipelines with modular components.

Is RAG better than fine-tuning?

RAG is cheaper and easier to update than fine-tuning. It’s ideal for fact-based tasks and domain-specific knowledge.

Can I build a chatbot using LangChain and RAG?

Yes. Combine document loaders, vector stores, and LLM chains to build a production-grade chatbot.

What are the best vector stores for LangChain RAG?

Chroma (lightweight), FAISS (local), and Pinecone (cloud-native). Choose based on scale and latency needs.


How much does it cost to run a LangChain RAG system per 1000 queries?

Depends on LLM (GPT-4 vs. Mistral), vector store (local vs. Pinecone), and query size. Estimate: $0.10–$1.50 per 1000 queries.

What are the most common LangChain RAG errors, and how to fix them?

  • Embedding mismatch: ensure documents and queries use the same model

  • Memory overflow: reduce chunk size or model context

  • Bad results: tune retrieval parameters or prompt

How does LangChain RAG performance compare to GPT-4 with 128k context?

RAG often outperforms for domain-specific tasks due to explicit grounding. Also cheaper than context stuffing.


Can LangChain RAG work with local LLMs like Llama 2/3 or Mistral?

Yes. Use Ollama or LM Studio to run local models and configure LangChain with HuggingFace or OpenAI-compatible APIs.



