RAG chatbot — embeddings, vector search and deployment in 2026

TL;DR

RAG (Retrieval-Augmented Generation) is a pattern in which a chatbot retrieves relevant fragments from a knowledge base before the LLM generates an answer. Key pieces: chunking, embeddings, a vector store (e.g. pgvector in Branchly) and prompts that cite their sources. Below: the pipeline from PDF/FAQ to a Next.js API. This is the technical angle, complementing articles about knowledge base content; it is not about writing FAQ copy.

Who this is for

Product teams deploying chatbots on corporate sites or in SaaS
Developers looking beyond “dump the entire terms of service into the prompt”
Companies with hundreds of documentation pages where fine-tuning makes no economic sense
CTOs weighing inference cost vs answer quality

Keyword (SEO)

rag chatbot implementation, embeddings vector search, pgvector nextjs, ai chatbot knowledge base technical 2026

RAG vs prompt stuffing — why embeddings

Approach	Limit	Cost	Knowledge updates
Full FAQ in system prompt	Context window (~128k tokens, used inefficiently)	High on every question	Manual prompt edits
Fine-tuning	Expensive, slow iteration	One-time + retrain	Retrain
RAG + embeddings	Scales to thousands of chunks	Embed once + cheap query	Re-index chunks

An embedding is a vector (e.g. 1536 dimensions) representing semantics of a paragraph. The user question becomes a vector too — you find nearest cosine neighbors, not keyword matches.
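To make "nearest cosine neighbors" concrete, here is a minimal sketch of cosine similarity and top-k ranking in plain TypeScript. The function names cosineSimilarity and topK are illustrative assumptions, not part of any library; in production the ranking runs inside the vector store, not in application code.

```typescript
// Cosine similarity: dot product divided by the product of the vector lengths.
// 1 means same direction, 0 means unrelated (orthogonal).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same number of dimensions');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank items by similarity to the query vector and keep the k best.
function topK<T extends { embedding: number[] }>(
  query: number[],
  items: T[],
  k: number,
): T[] {
  return items
    .map((item) => ({ item, score: cosineSimilarity(query, item.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.item);
}
```

Note that the length of a vector does not matter, only its direction: [2, 0] and [5, 0] have similarity 1.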

Deployment pipeline — 6 steps

[Documents] → chunking → embedding API → [vector store]
                                              ↓
[User question] → embedding → top-k retrieval → prompt + LLM → answer + sources

Ingest — markdown from repo, offer PDFs, FAQ export from Branchly
Chunking — 400–800 tokens, overlap 50–100, heading in metadata
Embed — text-embedding-3-small (cheaper) or -large (more accurate)
Store — document_chunks table with embedding vector(1536) column
Query — embed question → ORDER BY embedding <=> query_vec LIMIT 5
Generate — GPT-4o-mini with instruction: “Answer only from context below”

Chunking — practical rules

Bad splits = bad answers, even with a good model.

Split on Markdown headings (##), not mid-sentence
Metadata: source_url, locale, updated_at, section_title
Pricing tables — separate chunk with context “Pricing 2026”
PL/EN duplicates — separate embeddings per language, locale filter in query
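The first two rules can be sketched as follows: split on ## headings, keep the heading as metadata, and pack whole paragraphs into a chunk until a size limit is reached. This is a simplified sketch under stated assumptions: chunkMarkdown and DocChunk are hypothetical names, size is measured in words rather than tokens (a real pipeline would use the embedding model's tokenizer), and overlap between chunks is left out for brevity.

```typescript
type DocChunk = {
  text: string;
  sectionTitle: string;
  sourceUrl: string;
  locale: string;
};

const countWords = (s: string): number => s.split(/\s+/).filter(Boolean).length;

function chunkMarkdown(
  markdown: string,
  meta: { sourceUrl: string; locale: string },
  maxWords = 600, // assumption: words as a rough stand-in for tokens
): DocChunk[] {
  const chunks: DocChunk[] = [];

  // Split right before every line starting with "## " so a heading stays with its body.
  for (const section of markdown.split(/^(?=## )/m)) {
    const body = section.trim();
    if (!body) continue;

    const firstLine = body.split('\n')[0];
    const sectionTitle = firstLine.startsWith('## ') ? firstLine.slice(3).trim() : '';

    let current: string[] = [];
    let words = 0;

    // Pack whole paragraphs, so a chunk never ends mid-sentence.
    for (const paragraph of body.split(/\n\s*\n/)) {
      const n = countWords(paragraph);
      if (current.length > 0 && words + n > maxWords) {
        chunks.push({ text: current.join('\n\n'), sectionTitle, ...meta });
        current = [];
        words = 0;
      }
      current.push(paragraph.trim());
      words += n;
    }
    if (current.length > 0) {
      chunks.push({ text: current.join('\n\n'), sectionTitle, ...meta });
    }
  }
  return chunks;
}
```

A single paragraph longer than maxWords still becomes one oversized chunk here; splitting it further on sentence boundaries would be the next refinement.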

Example record in Branchly (branchly.cloud) with pgvector:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  embedding vector(1536),
  source_url TEXT,
  locale TEXT DEFAULT 'en',
  updated_at TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX ON document_chunks
  USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

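For fewer than 10k chunks ivfflat is enough; create the index after the table has data, because ivfflat derives its lists from the rows that already exist. Above that, consider HNSW or a dedicated engine (Qdrant, Pinecone), kept behind the same retrieval API.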

Ingest code in Next.js

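// scripts/ingest-knowledge.ts
import OpenAI from 'openai';
import { db } from '@/lib/db';

// Reads OPENAI_API_KEY from the environment.
const openai = new OpenAI();

type ChunkMeta = { sourceUrl: string; locale: string };

// Split an array into batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

async function embedAndStore(chunks: { text: string; meta: ChunkMeta }[]) {
  for (const batch of toBatches(chunks, 50)) {
    // One API call per batch; res.data[i] is the embedding for batch[i].
    const res = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch.map((c) => c.text),
    });

    for (let i = 0; i < batch.length; i++) {
      // JSON.stringify yields the '[0.1,0.2,...]' text form that casts to vector.
      await db.$executeRaw`
        INSERT INTO document_chunks (content, embedding, source_url, locale)
        VALUES (
          ${batch[i].text},
          ${JSON.stringify(res.data[i].embedding)}::vector,
          ${batch[i].meta.sourceUrl},
          ${batch[i].meta.locale}
        )
      `;
    }
  }
}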

Run after content deploy or from cron on DevStudioIT Cloud — not on every user question.

RAG chat endpoint

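// app/api/chat/route.ts
// Assumes the Vercel AI SDK: the 'ai' package plus the '@ai-sdk/openai' provider.
import { openai } from '@ai-sdk/openai';
import { embed, streamText } from 'ai';
import { db } from '@/lib/db';

type Chunk = { content: string; source_url: string | null };

export async function POST(req: Request) {
  const { message, locale } = await req.json();

  // Embed the question with the same model that was used at ingest time.
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: message,
  });

  // <=> is pgvector's cosine distance operator: smaller means closer.
  const chunks = await db.$queryRaw<Chunk[]>`
    SELECT content, source_url
    FROM document_chunks
    WHERE locale = ${locale}
    ORDER BY embedding <=> ${JSON.stringify(embedding)}::vector
    LIMIT 5
  `;

  // Number the chunks so the answer can cite them as [1], [2], ...
  const context = chunks.map((c, i) => `[${i + 1}] ${c.content}`).join('\n\n');

  const result = streamText({
    model: openai('gpt-4o-mini'),
    system: `Answer in English. Use only the context below. If the context does not contain the answer, say you don't know.

Context:
${context}`,
    messages: [{ role: 'user', content: message }],
  });

  // streamText returns a result object; convert it into a streaming Response.
  return result.toTextStreamResponse();
}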

Front-end: a widget on the corporate site that shows links to source_url under the answer. This builds trust and lets users check the answer against the source, which makes hallucinations easier to catch.
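One way to prepare that source list for the widget is sketched below. The name buildSources and the returned shape are assumptions for illustration; the labels follow the [1], [2] numbering that the endpoint uses when it builds the context, so a citation in the answer lines up with a link.

```typescript
type RetrievedChunk = { content: string; source_url: string | null };
type SourceLink = { label: string; url: string };

// Turn retrieved chunks into a de-duplicated list of links for the widget.
// The label keeps the chunk's position in the context ([1], [2], ...).
function buildSources(chunks: RetrievedChunk[]): SourceLink[] {
  const seen = new Set<string>();
  const sources: SourceLink[] = [];

  chunks.forEach((chunk, i) => {
    // Skip chunks without a URL and URLs that were already listed.
    if (!chunk.source_url || seen.has(chunk.source_url)) return;
    seen.add(chunk.source_url);
    sources.push({ label: `[${i + 1}]`, url: chunk.source_url });
  });

  return sources;
}
```

Several chunks often come from the same page, so without de-duplication the widget would repeat the same link.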

Retrieval quality — what improves relevance

Hybrid search: BM25 (Postgres full-text) + vector, merge results — better for SKUs and product codes
Re-ranking: after top-20 vector hits, cross-encoder or LLM picks top-5 (costlier, more precise)
Query rewrite: LLM rewrites user question into a “search query” before embed
Threshold: if max similarity < 0.75 — reply “I don't have that information” instead of guessing
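Two of these ideas fit in a few lines. The list above says to merge BM25 and vector results without naming a method; reciprocal rank fusion is one common choice and is used here as an assumption, not as the article's prescribed approach. The threshold helper assumes the scores come from pgvector's <=> operator, which returns cosine distance, so similarity is 1 minus distance. The function names and the default k = 60 are illustrative.

```typescript
// Reciprocal rank fusion: every list adds 1 / (k + rank) to an item's score,
// so items ranked well in several lists rise to the top.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks start at 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Decide whether to answer "I don't have that information".
// `distances` are cosine distances of the retrieved chunks.
function shouldFallback(distances: number[], minSimilarity = 0.75): boolean {
  if (distances.length === 0) return true; // nothing retrieved at all
  const bestSimilarity = 1 - Math.min(...distances);
  return bestSimilarity < minSimilarity;
}
```

Usage: pass the chunk IDs from the vector query and from the full-text query as two ranked lists, take the first five fused IDs, and call shouldFallback before building the prompt. The 0.75 default mirrors the threshold above; the right value depends on the embedding model and should be tuned against logged conversations.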

Log the question, chunk_ids and similarity scores in Branchly, and review 20 low-score conversations every week.

Cost and latency (rough 2026)

Embedding 1M tokens with text-embedding-3-small: on the order of cents (the list price has been about USD 0.02 per 1M tokens; verify current pricing)
GPT-4o-mini with 2k context tokens: cents per conversation
pgvector query <50 ms with index — bottleneck is LLM (TTFT ~300–800 ms)
Cache embeddings for identical questions (hash normalized question) — ~30% savings on support FAQ
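The caching idea from the last point can be sketched like this. The normalization rules (lowercase, collapsed whitespace, trailing punctuation removed) and the in-memory Map are assumptions for illustration; a shared store would be needed once the endpoint runs on more than one instance. The embedFn parameter stands in for whatever embedding call the project uses.

```typescript
import { createHash } from 'node:crypto';

// Make trivially different spellings of the same question identical.
function normalizeQuestion(question: string): string {
  return question
    .trim()
    .toLowerCase()
    .replace(/\s+/g, ' ')
    .replace(/[?!.]+$/, '');
}

// The locale is part of the key because embeddings are stored per language.
function cacheKey(question: string, locale: string): string {
  return createHash('sha256')
    .update(`${locale}:${normalizeQuestion(question)}`)
    .digest('hex');
}

const embeddingCache = new Map<string, number[]>();

// Return a cached embedding when the normalized question was seen before.
async function cachedEmbedding(
  question: string,
  locale: string,
  embedFn: (text: string) => Promise<number[]>,
): Promise<number[]> {
  const key = cacheKey(question, locale);
  const hit = embeddingCache.get(key);
  if (hit) return hit;

  const embedding = await embedFn(question);
  embeddingCache.set(key, embedding);
  return embedding;
}
```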

Hosting on DevStudioIT Cloud (devstudioit.cloud): Route Handler with streaming, rate limits per IP, OpenAI keys in env.

Security and GDPR

Do not index customer PII or internal B2B pricing
Rate limiting and CAPTCHA on public /api/chat
Conversation log retention — 90-day policy, anonymization
Privacy policy notice on processing via OpenAI (DPA, EU region if required)
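A per-IP rate limit for /api/chat can be as small as a fixed-window counter. This is a minimal in-memory sketch with hypothetical names (createRateLimiter, allow); it assumes a single long-running process. On serverless or multi-instance hosting the counters would have to live in a shared store instead.

```typescript
type RateWindow = { count: number; resetAt: number };

// Fixed-window limiter: at most `limit` requests per `windowMs` for each IP.
function createRateLimiter(limit: number, windowMs: number) {
  const windows = new Map<string, RateWindow>();

  // `now` is injectable so the limiter can be tested without waiting.
  return function allow(ip: string, now: number = Date.now()): boolean {
    const current = windows.get(ip);

    // First request from this IP, or the previous window has expired.
    if (!current || now >= current.resetAt) {
      windows.set(ip, { count: 1, resetAt: now + windowMs });
      return true;
    }

    if (current.count >= limit) return false; // over the limit: reject
    current.count += 1;
    return true;
  };
}
```

In the route handler, a rejected request would typically get a 429 response before any embedding or LLM call is made, so abusive traffic does not generate API costs.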

Success metrics after go-live

After 4 weeks in production measure:

Answer accuracy — manual review of 50 random chats (target: >85% correct)
Fallback rate — how often low similarity triggers “don't know” (target: 10–20%)
Average chunks in context vs latency
Clicks on source_url — user trust proxy

Dashboard in Branchly: chat_logs table with question, chunk_ids, user_rating. Iterate chunking quarterly — answer stability beats daily tweaking.
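Two of the metrics above can be computed directly from logged rows. The sketch below assumes each log row also carries the best similarity score of its retrieval (maxSimilarity), which goes beyond the three columns named above; the type and function names are hypothetical.

```typescript
type ChatLogRow = {
  question: string;
  chunkIds: string[];
  maxSimilarity: number; // assumption: best similarity score logged per question
  userRating: number | null;
};

type ChatSummary = { fallbackRate: number; avgChunks: number };

// Fallback rate: share of questions whose best match was below the threshold.
// Average chunks: mean number of chunks placed in the context.
function summarizeLogs(logs: ChatLogRow[], minSimilarity = 0.75): ChatSummary {
  if (logs.length === 0) return { fallbackRate: 0, avgChunks: 0 };

  const fallbacks = logs.filter((row) => row.maxSimilarity < minSimilarity).length;
  const chunkTotal = logs.reduce((sum, row) => sum + row.chunkIds.length, 0);

  return {
    fallbackRate: fallbacks / logs.length,
    avgChunks: chunkTotal / logs.length,
  };
}
```

A fallback rate far above the 10–20% target usually points at gaps in the knowledge base or at chunking problems; a rate near zero may mean the threshold is too permissive.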

FAQ

Does RAG replace the knowledge base article?

No. That article covers what to include (FAQ, policies); this one covers how retrieval and generation work technically.

pgvector vs Pinecone?

pgvector in Branchly = fewer moving parts for small/medium projects. Pinecone/Qdrant when >500k chunks or multi-tenant isolation.

How often to re-index?

On content change — CMS webhook → partial re-ingest of changed URLs, not the whole base.

Can we skip OpenAI?

Yes — local embedding models (Ollama, sentence-transformers) + Llama; trade-off: PL/EN quality and GPU ops.

Multilingual?

locale filter in SQL + separate chunks; do not mix PL and DE in one embed without language normalization.

CTA

Need a chatbot that cites your documentation instead of inventing answers?

Get a RAG chatbot quote — architecture, Branchly pgvector, Next.js, DevStudioIT Cloud
AI chatbot for business — business process and costs
