Build a Private RAG App with Ollama, LangChain, and Qdrant

How to Build a Private ChatGPT for Your Documents with Ollama

RAG lets your AI app answer questions from your own documents. This guide builds a beginner-friendly mental model of a private RAG app using Ollama, LangChain, and Qdrant, with everything running locally.

⏱️ Time to Complete

Around 25-40 minutes for a basic local version.

🎯 What you’ll achieve / learn

  • Understand what RAG is and why it matters
  • Learn the role of Ollama, LangChain, and Qdrant
  • Build a simple local document question-answering flow
  • Know where embeddings, chunks, vector databases, and chat models fit
  • Avoid common beginner mistakes in local RAG apps

[Figure: Ollama RAG architecture]

🧠 What is RAG?

RAG means Retrieval-Augmented Generation.

Normal chat:

User question -> LLM -> Answer

RAG chat:

User question -> Search your documents -> Send relevant context to LLM -> Answer

That means the model does not need to memorize everything. It retrieves the right document chunks first, then writes an answer using that context.
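
The two flows differ only in whether a retrieval step runs before the model is called. Here is a minimal sketch of that difference. The `llm` and `retrieve` arguments are hypothetical stand-in callables for illustration, not real Ollama or Qdrant calls:

```python
from typing import Callable, List


def answer_plain(question: str, llm: Callable[[str], str]) -> str:
    """Normal chat: the question goes straight to the model."""
    return llm(question)


def answer_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> str:
    """RAG chat: fetch relevant chunks first, then ask the model with that context."""
    chunks = retrieve(question)
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)


# Stand-ins for illustration only: a real app would call Ollama and Qdrant here.
def fake_llm(prompt: str) -> str:
    return f"(model saw {len(prompt)} characters)"


def fake_retrieve(question: str) -> List[str]:
    return ["To rotate credentials, open the dashboard and generate a new API token."]


if __name__ == "__main__":
    question = "How do I reset my API key?"
    print(answer_plain(question, fake_llm))
    print(answer_rag(question, fake_retrieve, fake_llm))
```

In the RAG version the model receives a longer prompt, because the retrieved chunk travels with the question.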

This is useful for:

  • Company docs
  • Personal notes
  • PDFs
  • Code documentation
  • Internal policies
  • Support knowledge bases
  • Private research

🧩 The stack

For this beginner setup:

  • Ollama runs the local chat model and embedding model
  • LangChain helps connect documents, retrievers, prompts, and the model
  • Qdrant stores vectors so you can search by meaning
  • Python glues everything together

You could also swap in Chroma, Milvus, or Weaviate as the vector database, or LlamaIndex as the framework. Qdrant is a good pick because it is production-friendly yet still approachable for beginners.

[Figure: Ollama, LangChain, and Qdrant stack responsibilities]

⚙️ Step 1: Install the tools

Install Ollama:

https://ollama.com/download

Pull a chat model:

ollama pull gemma4

Pull an embedding model:

ollama pull embeddinggemma

Install Python packages:

pip install langchain langchain-ollama langchain-qdrant qdrant-client

Start Qdrant with Docker:

docker run -p 6333:6333 qdrant/qdrant

If you do not have Docker yet, install it from docker.com.
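
Before moving on, it can help to confirm that both services respond. The sketch below uses only the standard library. It assumes Qdrant listens on port 6333 (as in the docker command above) and that Ollama uses its default port 11434. The helper name `is_up` is made up for this example:

```python
import urllib.request


def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses
        # (urllib's URLError and HTTPError both derive from OSError).
        return False


if __name__ == "__main__":
    # Assumed default local endpoints; adjust if you changed the ports.
    checks = {
        "Qdrant": "http://localhost:6333/",
        "Ollama": "http://localhost:11434/api/tags",
    }
    for name, url in checks.items():
        print(f"{name}: {'running' if is_up(url) else 'not reachable'}")
```

If either service shows as not reachable, fix that first. Every later step depends on both.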

📄 Step 2: Prepare your documents

Create a folder:

docs/

Add a few .txt or .md files first. Start simple before adding PDFs, HTML, or large messy documents.

Example:

docs/company-faq.md
docs/api-notes.md
docs/install-guide.md

Beginner mistake: adding 10,000 files on day one. Start with five small documents and verify your pipeline works.
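
Loading a handful of text files needs nothing beyond the standard library. The sketch below is one possible shape. The function name `load_documents` and the `source`/`text` keys are choices made for this example, not part of LangChain:

```python
from pathlib import Path
from typing import Dict, List


def load_documents(folder: str = "docs") -> List[Dict[str, str]]:
    """Read every .txt and .md file under `folder`, keeping the filename as metadata."""
    documents = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in {".txt", ".md"}:
            documents.append(
                {
                    "source": path.name,
                    "text": path.read_text(encoding="utf-8"),
                }
            )
    return documents


if __name__ == "__main__":
    for doc in load_documents("docs"):
        print(f"{doc['source']}: {len(doc['text'])} characters")
```

Printing the character counts is a quick sanity check that the files were read and none of them is unexpectedly empty.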

[Figure: Private RAG document ingestion pipeline]

✂️ Step 3: Split documents into chunks

An LLM has a limited context window, so it cannot take in all of your documents on every request. RAG splits files into smaller pieces called chunks.

Good chunks are:

  • Large enough to contain useful meaning
  • Small enough to fit into the prompt
  • Overlapped slightly so important context is not cut off

Common beginner setting:

chunk_size: 800-1200 characters
chunk_overlap: 100-200 characters

You can tune this later.
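
To make the two settings concrete, here is a minimal character-based splitter. It is a sketch for illustration only (the name `split_text` is made up here). It cuts at fixed positions and ignores sentence and paragraph boundaries:

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> List[str]:
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `chunk_overlap` characters, so text near a
    boundary appears in both neighbouring chunks.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij" * 250  # 2,500 characters
    chunks = split_text(sample)
    print([len(chunk) for chunk in chunks])  # [1000, 1000, 800]
```

LangChain ships its own text splitters, such as `RecursiveCharacterTextSplitter`, which also try to break on paragraph and sentence boundaries. Those are usually a better fit once the basic idea is clear.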

[Figure: Private Ollama RAG workflow]

🧬 Step 4: Create embeddings

An embedding is a numeric representation of text meaning.

When you embed a document chunk, it becomes searchable by similarity. That means a user can ask:

How do I reset my API key?

And your app can find a chunk that says:

To rotate credentials, open the dashboard and generate a new API token.

Even though the wording is different, the meaning is close.

With Ollama, embeddings can be generated locally using an embedding model.
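
"Close in meaning" is usually measured with cosine similarity between two vectors. The sketch below shows the calculation on made-up three-dimensional vectors. The numbers are invented for illustration, and real embedding models return vectors with hundreds of dimensions:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Invented vectors standing in for real embeddings.
question = [0.9, 0.1, 0.2]     # "How do I reset my API key?"
credentials = [0.8, 0.2, 0.3]  # "To rotate credentials, ..."
unrelated = [0.1, 0.9, 0.1]    # a chunk about something else

if __name__ == "__main__":
    print(round(cosine_similarity(question, credentials), 3))
    print(round(cosine_similarity(question, unrelated), 3))
```

The first score comes out much higher than the second, which is the signal a vector database ranks by. With the `langchain-ollama` package, a call such as `OllamaEmbeddings(model="embeddinggemma").embed_query(text)` should return a real vector of this kind for a piece of text.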

🗃️ Step 5: Store vectors in Qdrant

Qdrant stores:

  • The vector embedding
  • The original text chunk
  • Metadata like filename, page, title, or section

Metadata matters because users often ask follow-up questions like:

  • "Which file said that?"
  • "Show me the source."
  • "Was this from the install guide or FAQ?"

Always store source metadata if you want trustworthy answers.
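
One way to picture what gets stored is a list of records, each with an ID, a vector, and a payload. That mirrors the general shape of a Qdrant point. The payload field names below (`text`, `source`, `chunk_index`) are assumptions for this sketch. LangChain's Qdrant integration uses its own payload layout, so treat this as a mental model and not the exact stored format:

```python
import uuid
from typing import Any, Dict, List


def build_records(
    source: str, chunks: List[str], vectors: List[List[float]]
) -> List[Dict[str, Any]]:
    """Pair each chunk with its vector and source metadata."""
    if len(chunks) != len(vectors):
        raise ValueError("need exactly one vector per chunk")
    records = []
    for index, (chunk, vector) in enumerate(zip(chunks, vectors)):
        records.append(
            {
                # Deterministic ID derived from file name and position, so
                # upserting the same chunk again reuses the same ID.
                "id": str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#{index}")),
                "vector": vector,
                "payload": {
                    "text": chunk,
                    "source": source,
                    "chunk_index": index,
                },
            }
        )
    return records


if __name__ == "__main__":
    records = build_records("install-guide.md", ["Step one", "Step two"], [[0.1, 0.2], [0.3, 0.4]])
    for record in records:
        print(record["id"], record["payload"]["source"], record["payload"]["chunk_index"])
```

Because the source and chunk index travel with every vector, a later "Which file said that?" can be answered straight from the search result.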

🔎 Step 6: Retrieve context for a question

When a user asks a question:

  1. Convert the question into an embedding
  2. Search Qdrant for similar document chunks
  3. Return the top matches
  4. Pass those chunks to Ollama as context

This is the retrieval part of RAG.
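
The first three steps can be tried without any services running. The sketch below replaces the embedding model with a toy bag-of-words counter (`toy_embed`, a stand-in invented for this example) and searches a plain Python list where a real app would query Qdrant. Unlike a real embedding, the toy version only matches identical words, not meaning:

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Dict[str, int]:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Embed the question, score every chunk, and return the top k matches."""
    query_vector = toy_embed(question)
    scored = [(similarity(query_vector, toy_embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


if __name__ == "__main__":
    chunks = [
        "To rotate credentials, open the dashboard and generate a new API token.",
        "Install the app with the setup script in the install guide.",
        "Support is available on weekdays.",
    ]
    for score, chunk in retrieve("How do I install the app?", chunks):
        print(f"{score:.2f}  {chunk}")
```

Scoring every chunk works for a few documents. A vector database such as Qdrant uses an index so the search stays fast as the collection grows.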

[Figure: Private RAG question answering flow]

🤖 Step 7: Generate the answer with Ollama

The prompt should tell the model to answer only from retrieved context.

Example:

You are a helpful assistant. Use only the provided context.
If the answer is not in the context, say you do not know.

Context:
{retrieved_chunks}

Question:
{user_question}

This reduces hallucination. It does not eliminate it completely, but it gives the model a much better source of truth.
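
Filling the template is plain string formatting. In the sketch below, the function name `build_prompt` and the `[1]`, `[2]` numbering of chunks are additions for this example. Numbering is one simple way to let the model point back at a specific chunk:

```python
from typing import List

PROMPT_TEMPLATE = """You are a helpful assistant. Use only the provided context.
If the answer is not in the context, say you do not know.

Context:
{retrieved_chunks}

Question:
{user_question}"""


def build_prompt(chunks: List[str], question: str) -> str:
    """Fill the template, numbering chunks so the model can refer to them."""
    numbered = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(retrieved_chunks=numbered, user_question=question)


if __name__ == "__main__":
    print(
        build_prompt(
            ["To rotate credentials, open the dashboard and generate a new API token."],
            "How do I reset my API key?",
        )
    )
```

Printing the final prompt while debugging is worthwhile. Many "the model ignored my documents" problems turn out to be an empty or irrelevant context section.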

[Figure: Private RAG answer guardrails]

🧪 Minimal Python shape

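This is the high-level shape of the question-answering side, not a full production app. It assumes the "docs" collection in Qdrant already holds your embedded chunks:

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore

# Local models pulled in Step 1
embeddings = OllamaEmbeddings(model="embeddinggemma")
llm = ChatOllama(model="gemma4")

# Connect to a collection that already contains embedded chunks
vector_store = QdrantVectorStore.from_existing_collection(
    embedding=embeddings,
    collection_name="docs",
    url="http://localhost:6333",
)

question = "How do I install the app?"

# Retrieve the 4 chunks most similar to the question
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
docs = retriever.invoke(question)

# Build a prompt that restricts the model to the retrieved context
context = "\n\n".join(doc.page_content for doc in docs)
prompt = (
    "You are a helpful assistant. Use only the provided context.\n"
    "If the answer is not in the context, say you do not know.\n\n"
    f"Context:\n{context}\n\n"
    f"Question:\n{question}"
)

answer = llm.invoke(prompt)
print(answer.content)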

Use the official LangChain Ollama integration and Qdrant documentation when turning this into a real app.
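
The query code reads from a collection, so something has to fill that collection first. The sketch below is one possible ingestion script, using the model name, URL, and collection name from this guide. The helper names `make_chunks` and `ingest` and the metadata keys are made up for this example, and the fixed-position chunking is deliberately simple. Check the `QdrantVectorStore.from_documents` parameters against the current `langchain-qdrant` documentation before relying on it:

```python
from pathlib import Path
from typing import Dict, List, Tuple


def make_chunks(
    folder: str, chunk_size: int = 1000, chunk_overlap: int = 150
) -> List[Tuple[str, Dict[str, object]]]:
    """Read .txt/.md files and cut them into overlapping (text, metadata) pairs."""
    step = chunk_size - chunk_overlap
    pairs = []
    for path in sorted(Path(folder).glob("*")):
        if not path.is_file() or path.suffix.lower() not in {".txt", ".md"}:
            continue
        text = path.read_text(encoding="utf-8")
        for index, start in enumerate(range(0, len(text), step)):
            metadata = {"source": path.name, "chunk_index": index}
            pairs.append((text[start:start + chunk_size], metadata))
            if start + chunk_size >= len(text):
                break  # this chunk already reaches the end of the file
    return pairs


def ingest(folder: str = "docs", collection_name: str = "docs") -> int:
    """Embed every chunk with Ollama and store it in Qdrant; return the chunk count."""
    # Imported here so make_chunks() can be tried without the packages installed.
    from langchain_core.documents import Document
    from langchain_ollama import OllamaEmbeddings
    from langchain_qdrant import QdrantVectorStore

    documents = [
        Document(page_content=text, metadata=metadata)
        for text, metadata in make_chunks(folder)
    ]
    QdrantVectorStore.from_documents(
        documents,
        embedding=OllamaEmbeddings(model="embeddinggemma"),
        url="http://localhost:6333",
        collection_name=collection_name,
    )
    return len(documents)


if __name__ == "__main__":
    print(f"Stored {ingest()} chunks")
```

Since no stable IDs are passed, running the script twice will most likely store the same chunks twice. While experimenting, clearing the collection between runs avoids duplicate search results.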

✅ Beginner checklist

  • Ollama is installed
  • Chat model is pulled
  • Embedding model is pulled
  • Qdrant is running
  • Documents are small and clean
  • Chunks include source metadata
  • Retrieval returns relevant chunks
  • Prompt tells the model to use only context
  • App shows sources with answers
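
For the last item, showing sources can be as simple as appending the filenames of the retrieved chunks to the answer. The sketch below assumes each retrieved item is a dict with a `source` key. That structure is an assumption for this example. With LangChain, the same information would normally sit in each document's `metadata`:

```python
from typing import Dict, List


def format_answer(answer: str, retrieved: List[Dict[str, str]]) -> str:
    """Append a de-duplicated list of source filenames to the model's answer."""
    sources = []
    for item in retrieved:
        source = item.get("source", "unknown")
        if source not in sources:
            sources.append(source)  # keep first-seen order
    lines = [answer.strip(), "", "Sources:"]
    lines += [f"- {source}" for source in sources]
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        format_answer(
            "Run the setup script described in the install guide.",
            [{"source": "install-guide.md"}, {"source": "company-faq.md"}],
        )
    )
```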
