TL;DR
A RAG (Retrieval-Augmented Generation) chatbot searches a knowledge base for relevant fragments before the LLM generates an answer. Key pieces: chunking, embeddings, a vector store (e.g. pgvector in Branchly) and prompts with cited sources. Below: the pipeline from PDF/FAQ to a Next.js API. This is the technical companion to articles about knowledge base content, not a guide to writing FAQ copy.
Who this is for
- Product teams deploying chatbots on corporate sites or in SaaS
- Developers looking beyond “paste the entire terms of service into the prompt”
- Companies with hundreds of documentation pages where fine-tuning makes no economic sense
- CTOs weighing inference cost vs answer quality
RAG vs prompt stuffing — why embeddings
| Approach | Limit | Cost | Knowledge updates |
|---|---|---|---|
| Full FAQ in system prompt | Context window (~128k tokens, used inefficiently) | High on every question | Manual prompt edits |
| Fine-tuning | Expensive, slow iteration | One-time + retrain | Retrain |
| RAG + embeddings | Scales to thousands of chunks | Embed once + cheap query | Re-index chunks |
An embedding is a vector (e.g. 1536 dimensions) that represents the meaning of a paragraph. The user question is embedded the same way, and retrieval returns the chunks whose vectors are nearest by cosine similarity rather than those that share keywords.
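To make "nearest by cosine" concrete, here is a minimal in-memory sketch. The names `cosineSimilarity`, `topK` and the three-dimensional toy vectors are illustrative assumptions; in production the database performs this ranking.

```typescript
// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

type Embedded = { id: string; embedding: number[] };

// Brute-force nearest neighbours: score every chunk, keep the k best.
function topK(query: number[], chunks: Embedded[], k: number): { id: string; score: number }[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy vectors with hypothetical values; real embeddings have e.g. 1536 dimensions.
const toyChunks: Embedded[] = [
  { id: 'pricing', embedding: [1, 0, 0] },
  { id: 'refunds', embedding: [0, 1, 0] },
  { id: 'pricing-faq', embedding: [0.9, 0.1, 0] },
];
const nearest = topK([1, 0.05, 0], toyChunks, 2);
```

Both pricing chunks rank ahead of the refunds chunk because their vectors point in nearly the same direction as the query, regardless of shared words.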
Deployment pipeline — 6 steps
[Documents] → chunking → embedding API → [vector store]
↓
[User question] → embedding → top-k retrieval → prompt + LLM → answer + sources

- Ingest: markdown from repo, offer PDFs, FAQ export from Branchly
- Chunking: 400–800 tokens, overlap 50–100, heading in metadata
- Embed: `text-embedding-3-small` (cheaper) or `-large` (more accurate)
- Store: `document_chunks` table with `embedding vector(1536)` column. This matches `text-embedding-3-small`; `-large` returns 3072 dimensions unless you request fewer via the `dimensions` parameter
- Query: embed question → `ORDER BY embedding <=> query_vec LIMIT 5`
- Generate: GPT-4o-mini with instruction: “Answer only from context below”
Chunking — practical rules
Bad splits = bad answers, even with a good model.
- Split on Markdown headings (`##`), not mid-sentence
- Metadata: `source_url`, `locale`, `updated_at`, `section_title`
- Pricing tables: separate chunk with context “Pricing 2026”
- PL/EN duplicates: separate embeddings per language, `locale` filter in query
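The heading-based rule can be sketched as follows. This is a simplified illustration, not a production chunker: `chunkMarkdown` and `MarkdownChunk` are hypothetical names, word counts stand in for token counts, and overlap is approximated by repeating the last paragraph of the previous chunk.

```typescript
type MarkdownChunk = { sectionTitle: string; text: string };

const countWords = (s: string): number => s.split(/\s+/).filter(Boolean).length;

// Split Markdown on "## " headings, then pack each section's paragraphs into
// chunks of roughly maxWords words. Words approximate tokens (an assumption).
function chunkMarkdown(markdown: string, maxWords = 300): MarkdownChunk[] {
  const chunks: MarkdownChunk[] = [];
  let title = '';
  let paragraphs: string[] = [];
  let buffer: string[] = [];

  const endParagraph = () => {
    if (buffer.length > 0) paragraphs.push(buffer.join(' '));
    buffer = [];
  };

  // Emit chunks for the section collected so far.
  const flushSection = () => {
    let current: string[] = [];
    let count = 0;
    for (const p of paragraphs) {
      const words = countWords(p);
      if (current.length > 0 && count + words > maxWords) {
        chunks.push({ sectionTitle: title, text: current.join('\n\n') });
        // Overlap: carry the last paragraph over into the next chunk.
        const last = current[current.length - 1];
        current = [last];
        count = countWords(last);
      }
      current.push(p);
      count += words;
    }
    if (current.length > 0) chunks.push({ sectionTitle: title, text: current.join('\n\n') });
    paragraphs = [];
  };

  for (const line of markdown.split('\n')) {
    if (line.startsWith('## ')) {
      endParagraph();
      flushSection(); // never split across a heading
      title = line.slice(3).trim();
    } else if (line.trim() === '') {
      endParagraph();
    } else {
      buffer.push(line.trim());
    }
  }
  endParagraph();
  flushSection();
  return chunks;
}
```

Because the limit is checked before a paragraph is added, a chunk never breaks mid-sentence; the price is that `maxWords` is a soft limit once the overlap paragraph is carried over.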
Example table in Branchly (branchly.cloud) with pgvector:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE document_chunks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
content TEXT NOT NULL,
embedding vector(1536),
source_url TEXT,
locale TEXT DEFAULT 'en',
updated_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX ON document_chunks
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

For <10k chunks ivfflat is enough; above that consider HNSW or a dedicated engine (Qdrant, Pinecone), still behind the same API.
Ingest code in Next.js
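// scripts/ingest-knowledge.ts
import OpenAI from 'openai';
import { db } from '@/lib/db'; // assumed to export a Prisma client ($executeRaw below)

type ChunkMeta = { sourceUrl: string; locale: string };

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Split an array into batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) batches.push(items.slice(i, i + size));
  return batches;
}

async function embedAndStore(chunks: { text: string; meta: ChunkMeta }[]) {
  for (const batch of chunk(chunks, 50)) {
    // One API call per batch; res.data[i] is the embedding of batch[i].
    const res = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch.map((c) => c.text),
    });
    for (let i = 0; i < batch.length; i++) {
      // pgvector parses the '[0.1,0.2,...]' text form produced by JSON.stringify.
      await db.$executeRaw`
        INSERT INTO document_chunks (content, embedding, source_url, locale)
        VALUES (
          ${batch[i].text},
          ${JSON.stringify(res.data[i].embedding)}::vector,
          ${batch[i].meta.sourceUrl},
          ${batch[i].meta.locale}
        )
      `;
    }
  }
}

Run after content deploy or from cron on DevStudioIT Cloud, not on every user question.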
RAG chat endpoint
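// app/api/chat/route.ts
import { openai } from '@ai-sdk/openai'; // AI SDK provider; openai('...') below comes from here
import { embed, streamText } from 'ai';
import { db } from '@/lib/db'; // assumed to export a Prisma client

type Chunk = { content: string; source_url: string };

export async function POST(req: Request) {
  const { message, locale } = await req.json();
  // Embed the question with the same model that was used at ingest time.
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: message,
  });
  const queryVec = JSON.stringify(embedding); // '[0.1,0.2,...]' text for the ::vector cast
  const chunks = await db.$queryRaw<Chunk[]>`
    SELECT content, source_url
    FROM document_chunks
    WHERE locale = ${locale}
    ORDER BY embedding <=> ${queryVec}::vector
    LIMIT 5
  `;
  const context = chunks.map((c, i) => `[${i + 1}] ${c.content}`).join('\n\n');
  const result = streamText({
    model: openai('gpt-4o-mini'),
    system: `Answer in English. Use only the context below. If the data is missing, say you don't know.
Context:
${context}`,
    messages: [{ role: 'user', content: message }],
  });
  // streamText returns a result object; convert it to a streaming HTTP response.
  return result.toTextStreamResponse();
}

Front-end: widget on the corporate site with links to source_url under the answer, which builds trust and makes unsupported answers easier to spot.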
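One way to decide which links to show under an answer is to map the `[n]` markers in the model's reply back to the retrieved chunks. The sketch below assumes the system prompt also asks the model to cite chunks as `[n]`, which the prompt above does not yet do; `buildContext` and `citedSources` are hypothetical helper names.

```typescript
type Retrieved = { content: string; source_url: string };

// Number each chunk so the model can refer to it as [1], [2], ...
function buildContext(chunks: Retrieved[]): string {
  return chunks.map((c, i) => `[${i + 1}] ${c.content}`).join('\n\n');
}

// Collect the distinct [n] markers used in the answer and map them to URLs.
// Markers that point outside the retrieved set are ignored.
function citedSources(answer: string, chunks: Retrieved[]): string[] {
  const urls: string[] = [];
  const pattern = /\[(\d+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const chunk = chunks[Number(match[1]) - 1];
    if (chunk && urls.indexOf(chunk.source_url) === -1) urls.push(chunk.source_url);
  }
  return urls;
}
```

With streaming, the answer text is only complete at the end, so the widget would typically render the links once the stream closes.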
Retrieval quality — what improves relevance
- Hybrid search: BM25 (Postgres full-text) + vector, merge results — better for SKUs and product codes
- Re-ranking: after top-20 vector hits, cross-encoder or LLM picks top-5 (costlier, more precise)
- Query rewrite: LLM rewrites user question into a “search query” before embed
- Threshold: if max similarity < 0.75 — reply “I don't have that information” instead of guessing
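Two of these ideas are easy to illustrate in isolation. The article does not name a merge method for hybrid search; the sketch below swaps in reciprocal rank fusion (RRF), a common choice, as an assumption. For the threshold, note that pgvector's `<=>` operator returns cosine distance, so similarity is `1 - distance`. The function names and the constant `k = 60` are illustrative.

```typescript
// Reciprocal rank fusion: every ranking contributes 1 / (k + rank) per document,
// so items that rank well in several lists rise to the top.
function reciprocalRankFusion(rankings: string[][], k = 60): { id: string; score: number }[] {
  const scores: Record<string, number> = {};
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores[id] = (scores[id] || 0) + 1 / (k + rank);
    });
  }
  return Object.keys(scores)
    .map((id) => ({ id, score: scores[id] }))
    .sort((a, b) => b.score - a.score);
}

// Threshold gate: answer only if the best chunk is similar enough.
// `distances` are cosine distances as returned by `embedding <=> query_vec`.
function shouldAnswer(distances: number[], minSimilarity = 0.75): boolean {
  if (distances.length === 0) return false;
  const bestSimilarity = 1 - Math.min(...distances);
  return bestSimilarity >= minSimilarity;
}

// Hypothetical chunk ids from a vector query and a full-text (BM25-style) query.
const merged = reciprocalRankFusion([
  ['a', 'b', 'c'],
  ['c', 'a', 'd'],
]);
```

Here `a` and `c` appear in both lists and therefore outrank `b` and `d`, which each appear in only one.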
Log question, chunk_ids, similarity scores in Branchly — weekly review of 20 low-score conversations.
Cost and latency (rough 2026)
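- Embedding 1M tokens with `text-embedding-3-small`: a few US cents at OpenAI's launch list price (USD 0.02 per 1M tokens); verify current pricing
- GPT-4o-mini with 2k context tokens: a fraction of a cent per answer at launch list prices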
- pgvector query <50 ms with index — bottleneck is LLM (TTFT ~300–800 ms)
- Cache embeddings for identical questions (hash normalized question) — ~30% savings on support FAQ
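The caching idea can be sketched as a thin wrapper around whatever embedding call you use. Everything here is an assumption for illustration: the normalization rules, the in-memory `Map` (a shared store would be needed across instances), and the names `normalizeQuestion`, `cacheKey` and `createCachedEmbedder`.

```typescript
import { createHash } from 'node:crypto';

// Normalize so trivially different spellings share one cache entry.
function normalizeQuestion(question: string): string {
  return question.trim().toLowerCase().replace(/\s+/g, ' ').replace(/[?!.]+$/, '');
}

// Stable key: SHA-256 of the normalized question.
function cacheKey(question: string): string {
  return createHash('sha256').update(normalizeQuestion(question)).digest('hex');
}

// Wrap any embedding function with a lookup keyed by the question hash.
function createCachedEmbedder(embedFn: (text: string) => Promise<number[]>) {
  const store = new Map<string, number[]>();
  const stats = { hits: 0, misses: 0 };

  async function embedCached(question: string): Promise<number[]> {
    const key = cacheKey(question);
    const cached = store.get(key);
    if (cached) {
      stats.hits += 1;
      return cached;
    }
    stats.misses += 1;
    const embedding = await embedFn(normalizeQuestion(question));
    store.set(key, embedding);
    return embedding;
  }

  return { embedCached, stats };
}
```

The `stats` counters give a direct reading of the hit rate, which is the number to check against the savings estimate above.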
Hosting on DevStudioIT Cloud (devstudioit.cloud): Route Handler with streaming, rate limits per IP, OpenAI keys in env.
Security and GDPR
- Do not index customer PII or internal B2B pricing
- Rate limiting and CAPTCHA on public `/api/chat`
- Conversation log retention: 90-day policy, anonymization
- Privacy policy notice on processing via OpenAI (DPA, EU region if required)
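As an illustration of per-IP rate limiting, here is a minimal fixed-window counter. It is a sketch under stated assumptions: the state lives in process memory, so it does not carry across multiple server instances or restarts, and `createRateLimiter` with its parameters is a hypothetical name rather than part of any framework.

```typescript
type RateWindow = { start: number; count: number };

// Allow at most `limit` requests per IP in each window of `windowMs` milliseconds.
function createRateLimiter(limit: number, windowMs: number) {
  const windows = new Map<string, RateWindow>();

  // Returns true if the request may proceed. `now` is injectable for testing.
  return function allow(ip: string, now: number = Date.now()): boolean {
    const current = windows.get(ip);
    if (!current || now - current.start >= windowMs) {
      // First request from this IP, or the previous window has expired.
      windows.set(ip, { start: now, count: 1 });
      return true;
    }
    if (current.count >= limit) return false;
    current.count += 1;
    return true;
  };
}
```

In the route handler, a rejected request would typically get a 429 response before any embedding or LLM call is made, which is where the cost is.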
Success metrics after go-live
After 4 weeks in production measure:
- Answer accuracy — manual review of 50 random chats (target: >85% correct)
- Fallback rate — how often low similarity triggers “don't know” (target: 10–20%)
- Average chunks in context vs latency
- Clicks on source_url — user trust proxy
Dashboard in Branchly: chat_logs table with question, chunk_ids, user_rating. Iterate chunking quarterly — answer stability beats daily tweaking.
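A small sketch of how these numbers could be derived from exported log rows. The `question`, `chunk_ids` and `user_rating` fields come from the table described above; `max_similarity` is an assumed extra column holding the best similarity score of each request, and `summarizeLogs` is a hypothetical name.

```typescript
type ChatLog = {
  question: string;
  chunk_ids: string[];
  user_rating: number | null; // null when the user left no rating
  max_similarity: number; // assumed column: best similarity for this request
};

type Summary = { fallbackRate: number; avgChunks: number; avgRating: number | null };

function summarizeLogs(logs: ChatLog[], threshold = 0.75): Summary {
  if (logs.length === 0) return { fallbackRate: 0, avgChunks: 0, avgRating: null };

  // Requests below the threshold are the ones answered with "don't know".
  const fallbacks = logs.filter((l) => l.max_similarity < threshold).length;
  const totalChunks = logs.reduce((sum, l) => sum + l.chunk_ids.length, 0);

  // Average rating over rated conversations only.
  const rated = logs.filter((l) => l.user_rating !== null);
  const ratingSum = rated.reduce((sum, l) => sum + (l.user_rating || 0), 0);

  return {
    fallbackRate: fallbacks / logs.length,
    avgChunks: totalChunks / logs.length,
    avgRating: rated.length > 0 ? ratingSum / rated.length : null,
  };
}
```

A `fallbackRate` far outside the 10–20% target suggests either gaps in the knowledge base or a threshold that needs adjusting.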
FAQ
Does RAG replace the knowledge base article?
No. That article covers what to include (FAQ, policies); this one covers how to retrieve and generate answers technically.
pgvector vs Pinecone?
pgvector in Branchly = fewer moving parts for small/medium projects. Pinecone/Qdrant when >500k chunks or multi-tenant isolation.
How often to re-index?
On content change — CMS webhook → partial re-ingest of changed URLs, not the whole base.
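Partial re-ingest needs a way to tell which URLs actually changed. One possible approach, sketched here as an assumption rather than a described Branchly feature, is to store a content hash per URL and compare it against the current pages; `contentHash` and `planReindex` are hypothetical names.

```typescript
import { createHash } from 'node:crypto';

type Page = { url: string; content: string };

function contentHash(text: string): string {
  return createHash('sha256').update(text).digest('hex');
}

// Compare current pages with the hashes stored at the last ingest.
// `indexed` maps url -> hash of the content that was embedded last time.
function planReindex(
  pages: Page[],
  indexed: Record<string, string>,
): { upsert: string[]; remove: string[] } {
  // New or changed pages must be re-chunked and re-embedded.
  const upsert = pages
    .filter((p) => indexed[p.url] !== contentHash(p.content))
    .map((p) => p.url);
  // Pages that disappeared should have their chunks deleted.
  const remove = Object.keys(indexed).filter((url) => !pages.some((p) => p.url === url));
  return { upsert, remove };
}
```

For each URL in `upsert`, the old chunks for that `source_url` would be deleted before the new ones are inserted, so stale text never lingers in retrieval.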
Can we skip OpenAI?
Yes — local embedding models (Ollama, sentence-transformers) + Llama; trade-off: PL/EN quality and GPU ops.
Multilingual?
locale filter in SQL + separate chunks; do not mix PL and DE in one embed without language normalization.
CTA
Need a chatbot that cites your documentation instead of inventing answers?
- Get a RAG chatbot quote — architecture, Branchly pgvector, Next.js, DevStudioIT Cloud
- AI chatbot for business — business process and costs
About the author
We build fast websites, web/mobile apps, AI chatbots and hosting setups — with a focus on SEO and conversion.
