RAG Knowledge Base (Internal or External)
Retrieval-Augmented Generation stack for grounding an LLM in your company's docs, PDFs, and data sources with accurate citations.
The Stack
LlamaIndex
— RAG orchestration framework. LlamaIndex provides the full retrieval pipeline out of the box: document loaders, node parsers, embedding models, query engines, and multi-step query rewriting. Faster to wire up than building from primitives.
Alternatives: langchain, dspy
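To see what LlamaIndex wires together for you, here is a toy end-to-end pipeline (load → chunk → embed → index → retrieve) in plain Python. The bag-of-words "embedding" is a deliberate stand-in for a real embedding model, and every name here is illustrative — none of it is LlamaIndex API.

```python
# Toy sketch of the pipeline LlamaIndex assembles: load -> chunk -> embed ->
# index -> retrieve. Names and the word-count "embedding" are illustrative
# stand-ins, not real LlamaIndex or OpenAI calls.
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Stand-in embedding: a term-frequency vector. A real stack would call
    # an embedding model (e.g. text-embedding-3-small) here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(doc: str, size: int = 8) -> list[str]:
    # Fixed-size word chunks; real node parsers split on tokens/structure.
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(docs: list[str]) -> list[tuple[str, Counter]]:
    return [(c, embed(c)) for d in docs for c in chunk(d)]

def retrieve(index, query: str, k: int = 2) -> list[str]:
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

index = build_index([
    "Qdrant stores dense vectors and filters by payload.",
    "LlamaIndex orchestrates loaders parsers and query engines.",
])
print(retrieve(index, "which component stores vectors", k=1))
```

Swapping each toy function for a production component (Unstructured for loading, an embedding model, Qdrant for storage) is exactly the wiring LlamaIndex handles.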
Qdrant
— Vector store. Qdrant's hybrid search (dense + sparse BM25 fusion) reduces the hallucination rate from retrieval misses by 30–50% compared to pure semantic search. Fully open source and self-hostable.
Alternatives: pinecone, weaviate, pgvector, milvus, chroma
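The usual way dense and sparse rankings are merged is Reciprocal Rank Fusion (RRF). Qdrant can fuse server-side, so you would not normally write this yourself; the pure-Python sketch below (with made-up doc IDs) just shows the arithmetic.

```python
# Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)).
# k=60 is the conventional damping constant. Doc IDs are illustrative.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # semantic nearest neighbours
sparse = ["doc_c", "doc_a", "doc_d"]   # BM25 keyword matches
print(rrf([dense, sparse]))  # → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Note how `doc_a` wins by appearing high in both lists, which is the property that rescues queries where pure semantic search misses an exact keyword.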
OpenAI
— Embedding + generation. text-embedding-3-large produces some of the highest-quality English embeddings available, at 3,072 dimensions; GPT-4o handles synthesis of the retrieved chunks.
Alternatives: anthropic, cohere, deepseek-api, qwen-api, zhipu-ai
Unstructured
— Document ingestion. Unstructured parses PDFs, HTML, DOCX, PPTX, and images (via OCR) into clean text chunks, preserving table structure that naive text splitters destroy.
Alternatives: firecrawl
Cohere
— Reranking (optional). Cohere Rerank-3 re-scores the top-k retrieved chunks by semantic relevance before injection, filtering noise and cutting context window waste by 40–60%.
Alternatives: voyage-ai, jina-ai
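The context-window saving comes from trimming after the rerank: keep only the top few chunks above a relevance floor. The scores below are hand-written stand-ins for what a reranker like Rerank-3 would return; the trimming logic is what this sketch demonstrates, and the threshold values are assumptions.

```python
# Keep only the highest-scoring chunks above a relevance floor. Scores are
# fabricated stand-ins for reranker output; top_n/min_score are illustrative.
def trim_context(scored_chunks: list[tuple[str, float]],
                 top_n: int = 3, min_score: float = 0.5) -> list[str]:
    kept = sorted(scored_chunks, key=lambda c: c[1], reverse=True)[:top_n]
    return [text for text, score in kept if score >= min_score]

retrieved = [("chunk about pricing", 0.91),
             ("chunk about onboarding", 0.12),
             ("chunk about refunds", 0.77),
             ("boilerplate footer", 0.05)]
print(trim_context(retrieved, top_n=2))
# → ['chunk about pricing', 'chunk about refunds']
```

Four retrieved chunks shrink to two before prompt injection, which is where the 40–60% context saving the section cites comes from.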
LangSmith
— Retrieval tracing and evaluation (optional). LangSmith records every query, the retrieved nodes, the prompt, and the output. Its evals framework lets you run automated RAGAS-style correctness benchmarks against curated datasets to catch regressions.
Alternatives: langfuse, braintrust, opik
Ragas
— RAG evaluation metrics (optional). RAGAS computes faithfulness, answer relevancy, context precision, and context recall without human labels — critical for catching retrieval drift after re-indexing.
Alternatives: deepeval, trulens
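Of those metrics, context precision is the easiest to see in isolation: it rewards ranking relevant chunks near the top. In RAGAS an LLM judge produces the per-chunk relevance verdicts; here the 0/1 verdicts are supplied by hand so only the metric's arithmetic is shown.

```python
# Context precision: mean of precision@k taken at the positions that hold
# relevant chunks. The 0/1 relevance verdicts are hand-supplied here; RAGAS
# derives them with an LLM judge.
def context_precision(relevant: list[int]) -> float:
    score, hits = 0.0, 0
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            score += hits / k      # precision@k at this relevant position
    return score / hits if hits else 0.0

print(context_precision([1, 0, 1, 0]))  # → 0.8333... = (1/1 + 2/3) / 2
```

The same ranking with the relevant chunks pushed down, e.g. `[0, 0, 1, 1]`, scores lower — which is how the metric catches retrieval drift after a re-index reshuffles rankings.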
Firecrawl
— Web and documentation crawling (optional). Firecrawl turns any documentation site or web app into clean Markdown chunks ready for indexing — it handles JavaScript-rendered pages that simple HTTP scrapers miss.
Supabase
— Metadata and user data store (optional). Supabase Postgres stores document metadata, access control lists, and user queries; its pgvector extension can also serve as a lightweight secondary vector index for simpler setups.
Gotchas
- ⚠️ Chunk size is the single biggest lever — 512 tokens is a safe default but optimal size varies wildly by document type. Always benchmark retrieval precision before shipping.
- ⚠️ text-embedding-3-large costs roughly 6.5x as much as text-embedding-3-small for a ~15% quality gain — use small for high-volume production ingestion and large only for re-ranking experiments.
- ⚠️ Qdrant's HNSW index must be rebuilt on major schema changes — plan downtime or maintain a blue/green collection during re-indexing.
- ⚠️ Context stuffing with retrieved chunks eats the model's context budget and drives cost. GPT-4o's 128k context is not free: at $5/1M input tokens, 100k input tokens per query is $0.50/query.
- ⚠️ Without access control, RAG can leak documents between tenants — implement payload filters in Qdrant or row-level security in Supabase before multi-tenant launch.
- ⚠️ Hallucination in RAG is rarely eliminated, only reduced — always display source citations so users can verify answers.
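For the multi-tenant gotcha above, isolation in Qdrant is a payload filter attached to every search. The sketch below builds the filter in Qdrant's JSON filter shape as a plain dict; the payload field name `tenant_id` and the placeholder query vector are assumptions of this sketch.

```python
# Build a Qdrant payload filter restricting search to one tenant. The field
# name `tenant_id` is an assumption; use whatever key your ingestion pipeline
# writes into each point's payload.
def tenant_filter(tenant_id: str) -> dict:
    return {"must": [{"key": "tenant_id", "match": {"value": tenant_id}}]}

# Every search request carries the filter, so one tenant can never retrieve
# another tenant's chunks no matter how similar the vectors are.
search_body = {
    "vector": [0.1, 0.2, 0.3],          # placeholder query embedding
    "limit": 5,
    "filter": tenant_filter("acme-corp"),
}
print(search_body["filter"])
```

The crucial discipline is that the filter is applied server-side in the vector store, not by post-filtering results in application code, where a bug silently leaks documents.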
Related Stacks
Customer-Facing AI Chatbot SaaS
Production stack for shipping a multi-tenant AI chatbot with streaming, memory, guardrails, and usage-based billing.
LLM Production Observability and Evaluation
Stack for monitoring LLM applications in production: tracing every call, evaluating output quality, catching model drift, and controlling costs.
Building Your Own AI Coding Assistant Product
Stack for shipping a custom AI coding assistant — code completion, chat, code search, and agentic refactoring — as a standalone product or IDE plugin.