LangChain RAG Tutorial: Build Retrieval-Augmented Generation from Scratch
- Leanware Editorial Team
Introduction to RAG & LangChain
What is RAG (Retrieval-Augmented Generation)?
Retrieval-Augmented Generation (RAG) is a hybrid approach that combines the strengths of language models and external knowledge retrieval. Instead of relying solely on a model's internal training data, RAG dynamically retrieves relevant documents from a knowledge base and incorporates them into the generation process. This improves accuracy, keeps information up to date, and reduces hallucinations. RAG became prominent through research by Facebook AI and has since been adopted in enterprise NLP workflows.
Why Use LangChain for RAG?
LangChain offers a modular, developer-friendly framework to build complex RAG pipelines. It abstracts the complexity of integrating LLMs, vector databases, prompt templates, and tools into a unified interface. LangChain supports many vector stores like FAISS, Chroma, and Pinecone, along with a rich ecosystem of document loaders, agents, and memory modules. Compared to LlamaIndex, LangChain excels in customizability and agent-driven orchestration.
Key Use Cases & Benefits
AI-powered chatbots that provide accurate, up-to-date responses from internal documentation
Legal and HR assistants capable of parsing policy documents and answering compliance queries
Customer support systems with tailored responses based on product manuals
Research assistants that retrieve and summarize academic papers
Prerequisites & Setup
Environment Setup & Dependencies
You’ll need Python 3.10+, pip, and optionally a GPU (for faster embedding generation). Set up a virtual environment using virtualenv or conda to manage dependencies.
Installing LangChain, Embeddings, Vector Stores
pip install langchain openai chromadb faiss-cpu tiktoken python-dotenv pypdf
langchain: Core framework
openai: LLM and embedding provider
chromadb or faiss-cpu: Vector store
tiktoken: Tokenizer used by OpenAI models
pypdf: PDF parser used by PyPDFLoader in the ingestion step
Project Structure & File Layout
rag-tutorial/
├── main.py
├── ingest.py
├── .env
├── data/
│ └── docs/
├── outputs/
└── utils/
└── helpers.py
This layout keeps ingestion, retrieval, and application logic modular.
Data Ingestion & Indexing
Loading Documents (PDFs, Word, Text, etc.)
LangChain provides document loaders for various formats. For example:
from langchain.document_loaders import DirectoryLoader, PyPDFLoader
loader = DirectoryLoader("./data/docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
Under the hood, loaders rely on parsing libraries such as pypdf, pdfminer.six, python-docx, and unstructured.
Text Splitting / Chunking Strategies
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_documents(documents)
Chunking improves context alignment and retrieval precision.
Building Embeddings & Vector Store
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
embedding = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embedding, persist_directory="./outputs")
Other embedding providers include HuggingFace, Cohere, and BAAI/bge.
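For example, a local alternative is a sketch like the following; it assumes the sentence-transformers package is installed, and the BAAI/bge-small-en-v1.5 model name is just an illustrative choice:
from langchain.embeddings import HuggingFaceEmbeddings
# Runs locally via sentence-transformers; no API key required
embedding = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
vectorstore = Chroma.from_documents(docs, embedding, persist_directory="./outputs")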
Metadata & Document Tags
Add source info to each document chunk:
for doc in docs:
    doc.metadata["source"] = "employee_policy.pdf"
This supports filtering and source citation later.
Retrieval + Generation Pipeline
Retriever Interfaces & Search
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 4})
retrieved_docs = retriever.get_relevant_documents("What is the leave policy?")
You can also use Max Marginal Relevance (MMR) or filtered retrieval.
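For example, an MMR retriever over the same vector store might look like this (a sketch; fetch_k is the size of the candidate pool before diversification):
# MMR balances relevance with diversity among the returned chunks
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20},
)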
Prompt Composition & LLM Input
from langchain.chains import RetrievalQA
# llm is instantiated in the next section (ChatOpenAI)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
response = qa_chain.run("Summarize our leave policy")
For more control, use StuffDocumentsChain or MapReduceDocumentsChain.
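For example, the combine-documents strategy can be switched through the chain_type argument (a sketch; "map_reduce" summarizes each retrieved chunk before a final combining pass):
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="map_reduce",  # alternatives: "stuff" (default), "refine", "map_rerank"
)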
Putting It Together: The RAG Chain
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name="gpt-3.5-turbo")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
query = "How do performance reviews work?"
answer = qa_chain.run(query)
print(answer)
Advanced Features & Extensions
Chat Memory / Conversation History
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
This enables multi-turn conversations.
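A minimal sketch wiring this memory into LangChain's ConversationalRetrievalChain (the follow-up question is illustrative):
from langchain.chains import ConversationalRetrievalChain
chat_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,  # chat_history is read from and written to memory on each turn
)
result = chat_chain({"question": "How many days of leave do new employees get?"})
print(result["answer"])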
Agent-Based Retrieval (Multi-Hop, Adaptive Queries)
LangChain agents can trigger multiple tools, including retrievers. For complex questions, agents plan and sequence retrieval steps.
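A sketch of a retriever exposed as an agent tool is shown below; the tool name and description are placeholders, and import paths can differ slightly between LangChain versions:
from langchain.agents import initialize_agent, AgentType
from langchain.tools.retriever import create_retriever_tool
hr_tool = create_retriever_tool(
    retriever,
    name="hr_policy_search",  # hypothetical tool name
    description="Searches internal HR policy documents.",
)
agent = initialize_agent(
    tools=[hr_tool],
    llm=llm,
    agent=AgentType.OPENAI_FUNCTIONS,  # the LLM decides when and how often to retrieve
)
agent.run("Compare the leave policy with the remote work policy.")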
Returning Sources & Citations
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, return_source_documents=True)
result = qa_chain("Summarize the leave policy")
print(result["result"])
print(result["source_documents"])
Fine-Tuning & Custom Retrieval Logic
Apply metadata filters, hybrid (keyword + vector) search, or even reranking with custom logic.
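For example, a metadata-filtered retriever using the source tag added during ingestion (a sketch for Chroma; filter syntax varies by vector store):
# Only search chunks that came from the HR policy PDF
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 4, "filter": {"source": "employee_policy.pdf"}},
)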
Real-World Example: Building a RAG Chatbot
Sample Use Case & Problem Statement
An HR team wants a chatbot that answers employee questions based on internal documents (PDFs).
Code Walkthrough
llm = ChatOpenAI()
retriever = vectorstore.as_retriever()
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
query = "What is our maternity leave policy?"
response = qa.run(query)
print(response)
Deploying with an API / Frontend (e.g. FastAPI, Streamlit)
FastAPI Example:
from fastapi import FastAPI, Request
app = FastAPI()
@app.post("/rag")
async def rag_endpoint(request: Request):
    body = await request.json()
    answer = qa.run(body["query"])
    return {"answer": answer}
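Streamlit Example (a minimal sketch reusing the same qa chain; save it as app.py and run with streamlit run app.py):
import streamlit as st
st.title("HR Policy Assistant")
query = st.text_input("Ask a question about company policy")
if query:
    with st.spinner("Searching documents..."):
        st.write(qa.run(query))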
Evaluation, Metrics & Debugging Tips
Use precision@k to evaluate retrieval accuracy (a small sketch follows this list)
Compare generated answers to gold references (F1 or ROUGE)
Log tokens, sources, and model outputs for each run
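A small precision@k sketch over a hand-labelled query set (the labels dict is hypothetical, and a chunk counts as relevant when its source metadata matches an expected document):
# Hypothetical gold labels: query -> set of relevant source files
labels = {"What is the leave policy?": {"employee_policy.pdf"}}
def precision_at_k(retriever, labels, k=4):
    scores = []
    for query, relevant_sources in labels.items():
        docs = retriever.get_relevant_documents(query)[:k]
        hits = sum(doc.metadata.get("source") in relevant_sources for doc in docs)
        scores.append(hits / k)
    return sum(scores) / len(scores)
print(precision_at_k(retriever, labels))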
Best Practices, Pitfalls & Optimization

Chunk Size, Overlap & Split Strategy
Choosing how to break your documents into chunks is critical. Larger chunks (e.g., 500 tokens) give the model more context in a single pass, which can improve answer accuracy. However, if chunks are too large, they risk being truncated due to token limits, especially during multi-document retrieval. Adding an overlap (e.g., 50 tokens) helps preserve sentence continuity across chunks. Note that RecursiveCharacterTextSplitter measures chunk_size in characters by default, so use its tiktoken-based constructor if you want to count tokens. The goal is to balance context richness with retrieval precision. Start with 500-token chunks and a 10–15% overlap as a baseline, then tune based on your use case.
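If you want the earlier splitter to count tokens instead of characters, a sketch using its tiktoken-based constructor looks like this (tiktoken is already in the install list above):
from langchain.text_splitter import RecursiveCharacterTextSplitter
# chunk_size and chunk_overlap are now measured in tokens, not characters
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=500,
    chunk_overlap=50,  # roughly 10% overlap
)
docs = splitter.split_documents(documents)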
Retriever Efficiency & Caching
To reduce latency and avoid redundant computation, enable caching. LangChain supports caching LLM responses (via set_llm_cache) and caching embeddings (via CacheBackedEmbeddings), so repeated or near-identical requests do not trigger the same expensive calls twice. This is especially useful in applications where similar queries are repeated or agents make many retrieval calls during step-based reasoning. Caching reduces load on your embedding provider and vector store and speeds up overall response times.
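A sketch of both layers of caching (exact import paths vary a little across LangChain versions; the cache directory is an arbitrary choice):
from langchain.globals import set_llm_cache
from langchain.cache import InMemoryCache
from langchain.embeddings import CacheBackedEmbeddings, OpenAIEmbeddings
from langchain.storage import LocalFileStore
# Identical LLM calls are served from an in-memory cache
set_llm_cache(InMemoryCache())
# Embeddings are cached on disk, so re-embedding the same text is free
store = LocalFileStore("./outputs/embedding_cache")
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    OpenAIEmbeddings(), store, namespace="openai"
)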
Prompt Engineering Techniques
The quality of your final answer heavily depends on how you prompt the LLM. You can improve performance with techniques like:
ReAct: a prompting style that blends reasoning and acting steps
Few-shot examples: include 1–2 examples in the prompt to show the model how to respond
Chain-of-thought: ask the model to explain its reasoning step by step
These strategies help models stay grounded and handle more complex queries, especially when combined with retrieved documents.
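For example, a grounded prompt template passed to RetrievalQA (a sketch; the template wording is illustrative and is used by the default "stuff" chain):
from langchain.prompts import PromptTemplate
template = """Answer the question using only the context below.
If the answer is not in the context, say you don't know.
Context: {context}
Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
)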
Avoiding Hallucinations & Ensuring Source Quality
LLMs can sometimes make up information ("hallucinate") if the retrieved documents are vague, irrelevant, or low-quality. To prevent this:
Always ground responses directly in the retrieved text
Filter out noisy or outdated documents during ingestion (see the sketch below)
Validate document metadata (e.g., source, timestamp, author) to ensure reliability
Additionally, instruct the model to cite or highlight evidence from the context when forming a response.
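For the ingestion-side filter, a sketch like the following works if each chunk's metadata carries a last_updated field (not set by default, so treat it as an assumption):
from datetime import datetime
# Drop chunks whose source document is older than the cutoff (hypothetical metadata field)
cutoff = datetime(2023, 1, 1)
fresh_docs = [
    doc for doc in docs
    if datetime.fromisoformat(doc.metadata.get("last_updated", "1970-01-01")) >= cutoff
]
vectorstore = Chroma.from_documents(fresh_docs, embedding, persist_directory="./outputs")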
Future Directions & Advanced Research
Dynamic / Parametric RAG
Researchers are exploring dynamic retrieval models that learn how and when to retrieve relevant documents during training. These approaches, often called parametric RAG, blend retrieval into the model's parameters, so the LLM learns to fetch supporting evidence as part of training rather than relying on an external retriever.
Multi-modal RAG (images, tables, etc.)
RAG is expanding beyond text. New systems like DocVQA and multi-modal RAG can retrieve and reason over structured data like tables, scanned documents, and even images. In LangChain, you can combine OCR pipelines (for extracting text from images) with text embeddings to support this use case, enabling richer context for answering questions.
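A sketch of one such pipeline using pytesseract for OCR (pytesseract and Pillow are assumptions here and are not part of the install list above; the image path is illustrative):
from PIL import Image
import pytesseract
from langchain.schema import Document
# Extract text from a scanned page, then index it like any other document
image_text = pytesseract.image_to_string(Image.open("./data/docs/scanned_policy.png"))
image_doc = Document(page_content=image_text, metadata={"source": "scanned_policy.png"})
vectorstore.add_documents(splitter.split_documents([image_doc]))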
RAG in Production at Scale
When moving from prototype to production, you need scalable infrastructure. LangChain offers LangServe, a tool to package your RAG pipelines into deployable APIs (a minimal sketch follows the list below). For performance:
Use batch retrieval to minimize API calls
Run inference on GPUs for faster responses
Distribute workloads across sharded vector databases (e.g., FAISS, Weaviate, Pinecone) for speed and resilience
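A minimal LangServe sketch (it assumes the langserve package is installed; the route path and module name are illustrative):
from fastapi import FastAPI
from langserve import add_routes
app = FastAPI(title="RAG API")
add_routes(app, qa_chain, path="/rag")  # exposes /rag/invoke, /rag/stream, etc.
# Run with: uvicorn server:app --reload  (assuming this file is server.py)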
Conclusion & Next Steps
Recap & Key Takeaways
LangChain offers a composable and flexible toolkit to build RAG systems. You’ve now learned the core steps: ingesting documents, splitting them into optimized chunks, generating embeddings, retrieving relevant context, and generating grounded responses using LLMs.
What to Learn Next (LangSmith, Agents, etc.)
LangSmith: A powerful tool for debugging and observing LLM applications, helping you track performance and test prompt variations.
LangChain Agents: For tasks that require multi-step decision-making, agents can retrieve, plan, and act using tools.
LangServe: Helps you deploy LangChain pipelines as production-grade APIs, with scalable and maintainable infrastructure.
You can consult with our team to evaluate your project needs and identify the most effective approach.
FAQs
What is LangChain used for?
LangChain is a framework to build LLM-powered apps like chatbots, agents, and RAG pipelines with modular components.
Is RAG better than fine-tuning?
RAG is cheaper and easier to update than fine-tuning. It’s ideal for fact-based tasks and domain-specific knowledge.
Can I build a chatbot using LangChain and RAG?
Yes. Combine document loaders, vector stores, and LLM chains to build a production-grade chatbot.
What are the best vector stores for LangChain RAG?
Chroma (lightweight), FAISS (local), and Pinecone (cloud-native). Choose based on scale and latency needs.
How much does it cost to run a LangChain RAG system per 1000 queries?
Depends on LLM (GPT-4 vs. Mistral), vector store (local vs. Pinecone), and query size. Estimate: $0.10–$1.50 per 1000 queries.
What are the most common LangChain RAG errors, and how to fix them?
Embedding mismatch: ensure documents and queries use the same model
Memory overflow: reduce chunk size or model context
Bad results: tune retrieval parameters or prompt
How does LangChain RAG performance compare to GPT-4 with 128k context?
RAG often outperforms for domain-specific tasks due to explicit grounding. Also cheaper than context stuffing.
Can LangChain RAG work with local LLMs like Llama 2/3 or Mistral?
Yes. Use Ollama or LM Studio to run local models and configure LangChain with HuggingFace or OpenAI-compatible APIs.