RAG chatbot: embeddings, vector search and deployment in 2026

rag · 5 min read · July 20, 2026

Author: DevStudio.it

TL;DR

RAG (Retrieval-Augmented Generation) is a pattern in which the chatbot first retrieves relevant fragments from a knowledge base and only then generates an answer with an LLM. Key pieces: chunking, embeddings, a vector store (e.g. pgvector in Branchly) and prompts with cited sources. Below: the pipeline from PDF/FAQ to a Next.js API. This is the technical angle that complements articles about knowledge base content; it is not about writing FAQ copy.

Who this is for

  • Product teams deploying chatbots on corporate sites or in SaaS
  • Developers looking beyond “dump the entire terms and conditions into the prompt”
  • Companies with hundreds of documentation pages where fine-tuning makes no economic sense
  • CTOs weighing inference cost vs answer quality


RAG vs prompt stuffing — why embeddings

Approach | Limit | Cost | Knowledge updates
Full FAQ in system prompt | Context window (~128k tokens, used inefficiently) | High, paid on every question | Manual prompt edits
Fine-tuning | Expensive, slow iteration | One-time + retrain | Retrain
RAG + embeddings | Scales to thousands of chunks | Embed once + cheap query | Re-index chunks

An embedding is a vector (e.g. 1536 dimensions) that represents the meaning of a paragraph. The user's question is turned into a vector too, and retrieval returns the chunks whose vectors are nearest by cosine similarity rather than the ones that share keywords.
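To make "nearest by cosine" concrete, here is a minimal sketch of the similarity computation. The three-dimensional vectors are made-up toy values, not real model output, and in production the database does this work for you.

```typescript
// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy "embeddings" (hypothetical values chosen for illustration only).
const question = [0.9, 0.1, 0.0];
const pricingChunk = [0.8, 0.2, 0.1];
const careersChunk = [0.0, 0.1, 0.9];

// Rank chunks by similarity to the question, highest first.
const ranked = [
  { id: 'pricing', score: cosineSimilarity(question, pricingChunk) },
  { id: 'careers', score: cosineSimilarity(question, careersChunk) },
].sort((x, y) => y.score - x.score);
```

Note that pgvector's `<=>` operator returns cosine distance, which is 1 minus this similarity, so ordering ascending by `<=>` gives the same ranking.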

Deployment pipeline — 6 steps

[Documents] → chunking → embedding API → [vector store]
[User question] → embedding → top-k retrieval → prompt + LLM → answer + sources

  1. Ingest: markdown from repo, offer PDFs, FAQ export from Branchly
  2. Chunking: 400–800 tokens, overlap 50–100, heading in metadata
  3. Embed: text-embedding-3-small (cheaper) or -large (more accurate; it returns 3072 dimensions by default, so request dimensions: 1536 or widen the column)
  4. Store: document_chunks table with an embedding vector(1536) column
  5. Query: embed question → ORDER BY embedding <=> query_vec LIMIT 5
  6. Generate: GPT-4o-mini with instruction: “Answer only from context below”

Chunking — practical rules

Bad splits = bad answers, even with a good model.

  • Split on Markdown headings (##), not mid-sentence
  • Metadata: source_url, locale, updated_at, section_title
  • Pricing tables — separate chunk with context “Pricing 2026”
  • PL/EN duplicates — separate embeddings per language, locale filter in query
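The heading-based splitting rule can be sketched as below. This is an illustration under stated assumptions, not the article's ingest code: `chunkMarkdown` and `DocChunk` are hypothetical names, size is budgeted with a rough "4 characters per token" approximation instead of a real tokenizer, and overlap between chunks is left out to keep the example short.

```typescript
type DocChunk = { text: string; sectionTitle: string; sourceUrl: string; locale: string };

// Rough size budget: ~4 characters per token is an approximation, not a tokenizer.
const MAX_CHARS = 800 * 4;

function chunkMarkdown(markdown: string, sourceUrl: string, locale: string): DocChunk[] {
  const chunks: DocChunk[] = [];
  // Split before every "## " heading so each section starts with its heading line.
  const sections = markdown.split(/^(?=## )/m).filter((s) => s.trim().length > 0);

  for (const section of sections) {
    const lines = section.trim().split('\n');
    const hasHeading = lines[0].startsWith('## ');
    const sectionTitle = hasHeading ? lines[0].slice(3).trim() : '(intro)';
    const body = (hasHeading ? lines.slice(1) : lines).join('\n');
    // Paragraphs are the smallest unit, so a chunk never ends mid-sentence.
    const paragraphs = body.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);

    let current = '';
    for (const paragraph of paragraphs) {
      // Start a new chunk when adding this paragraph would exceed the budget.
      if (current && current.length + paragraph.length + 2 > MAX_CHARS) {
        chunks.push({ text: current, sectionTitle, sourceUrl, locale });
        current = '';
      }
      current = current ? `${current}\n\n${paragraph}` : paragraph;
    }
    if (current) chunks.push({ text: current, sectionTitle, sourceUrl, locale });
  }
  return chunks;
}
```

A single paragraph longer than the budget is kept whole here; a production splitter would need a sentence-level fallback for that case.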

Example record in Branchly (branchly.cloud) with pgvector:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  embedding vector(1536),
  source_url TEXT,
  locale TEXT DEFAULT 'en',
  updated_at TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX ON document_chunks
  USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

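For <10k chunks ivfflat is enough; above that consider HNSW (pgvector supports it natively) or a dedicated engine (Qdrant, Pinecone), still behind the same API. Build the ivfflat index after the initial load, because its lists are clustered from the rows present when the index is created.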

Ingest code in Next.js

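// scripts/ingest-knowledge.ts
import OpenAI from 'openai';
import { db } from '@/lib/db'; // assumed to export a Prisma client ($executeRaw is a Prisma API)

type ChunkMeta = { sourceUrl: string; locale: string };
type KnowledgeChunk = { text: string; meta: ChunkMeta };

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Split an array into batches of at most `size` items
function toBatches<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

export async function embedAndStore(chunks: KnowledgeChunk[]) {
  for (const batch of toBatches(chunks, 50)) {
    // One API call per batch of 50 texts
    const res = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch.map((c) => c.text),
    });

    for (let i = 0; i < batch.length; i++) {
      // pgvector parses the '[0.1,0.2,...]' text form, hence JSON.stringify + ::vector
      await db.$executeRaw`
        INSERT INTO document_chunks (content, embedding, source_url, locale)
        VALUES (
          ${batch[i].text},
          ${JSON.stringify(res.data[i].embedding)}::vector,
          ${batch[i].meta.sourceUrl},
          ${batch[i].meta.locale}
        )
      `;
    }
  }
}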

Run after content deploy or from cron on DevStudioIT Cloud — not on every user question.

RAG chat endpoint

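// app/api/chat/route.ts
// Assumes the Vercel AI SDK (v4 or later) with the @ai-sdk/openai provider
import { openai } from '@ai-sdk/openai'; // provider function, not the `openai` client class
import { embed, streamText } from 'ai';
import { db } from '@/lib/db'; // assumed to export a Prisma client

type Chunk = { content: string; source_url: string | null };

export async function POST(req: Request) {
  const { message, locale } = await req.json();

  // Embed the question with the same model that was used at ingest time
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: message,
  });

  // <=> is cosine distance, so ascending order returns the closest chunks first
  const chunks = await db.$queryRaw<Chunk[]>`
    SELECT content, source_url
    FROM document_chunks
    WHERE locale = ${locale}
    ORDER BY embedding <=> ${JSON.stringify(embedding)}::vector
    LIMIT 5
  `;

  const context = chunks.map((c, i) => `[${i + 1}] ${c.content}`).join('\n\n');

  const result = streamText({
    model: openai('gpt-4o-mini'),
    system: `Answer in English. Use only the context below. If the data is missing, say you don't know.

Context:
${context}`,
    messages: [{ role: 'user', content: message }],
  });

  // A Route Handler must return a Response, not the streamText result itself
  return result.toTextStreamResponse();
}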

Front-end: widget on the corporate site with links to source_url under the answer. This builds trust and lets users check each answer against its source, so hallucinations are easier to catch.
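One way to decide which links to show is to map the numbered context markers back to their sources. The sketch below assumes the model has been instructed to cite chunks as [1], [2], and so on, matching the numbering used when the context was built; `citedSources` and `RetrievedChunk` are hypothetical names.

```typescript
type RetrievedChunk = { content: string; source_url: string | null };

// Returns the unique source URLs for every [n] marker found in the answer,
// in order of first appearance. Markers pointing outside the chunk list are ignored.
function citedSources(answer: string, chunks: RetrievedChunk[]): string[] {
  const urls: string[] = [];
  const marker = /\[(\d+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = marker.exec(answer)) !== null) {
    const chunk = chunks[Number(match[1]) - 1]; // markers are 1-based
    const url = chunk?.source_url;
    if (url && !urls.includes(url)) urls.push(url);
  }
  return urls;
}
```

If the answer contains no markers at all, a reasonable fallback is to list the sources of all retrieved chunks.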

Retrieval quality — what improves relevance

  • Hybrid search: BM25 (Postgres full-text) + vector, merge results — better for SKUs and product codes
  • Re-ranking: after top-20 vector hits, cross-encoder or LLM picks top-5 (costlier, more precise)
  • Query rewrite: LLM rewrites user question into a “search query” before embed
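  • Threshold: if max similarity < 0.75, reply “I don't have that information” instead of guessing (treat 0.75 as a starting point and calibrate it on logged scores, since similarity ranges differ between embedding models)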
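For the hybrid search bullet, one common way to merge a keyword result list with a vector result list is reciprocal rank fusion (RRF). The article does not name a merge method, so this is a swapped-in technique shown as a sketch; the constant k = 60 is a conventional default, and the chunk ids are placeholders.

```typescript
// Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank),
// where rank starts at 1 for the best hit in each list.
function reciprocalRankFusion(lists: string[][], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort((a, b) => b.score - a.score);
}

// Placeholder ids: best-first results from the two retrievers.
const vectorHits = ['a', 'b', 'c'];
const keywordHits = ['c', 'a', 'd'];
const fused = reciprocalRankFusion([vectorHits, keywordHits]);
```

RRF only needs ranks, so it avoids having to normalize BM25 scores and cosine distances onto a common scale.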

Log question, chunk_ids, similarity scores in Branchly — weekly review of 20 low-score conversations.
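The threshold check and the values worth logging can be sketched together. The example assumes hits come back with the cosine distance from pgvector's `<=>` operator, so similarity is 1 minus distance; `decide`, `Hit` and `RetrievalDecision` are hypothetical names, and 0.75 is the article's starting value rather than a universal constant.

```typescript
type Hit = { id: string; distance: number }; // distance as returned by pgvector's <=>

type RetrievalDecision = {
  answerable: boolean;   // false means: reply "I don't have that information"
  maxSimilarity: number; // worth logging for the weekly review
  chunkIds: string[];
};

function decide(hits: Hit[], threshold = 0.75): RetrievalDecision {
  const similarities = hits.map((h) => 1 - h.distance);
  const maxSimilarity = similarities.length > 0 ? Math.max(...similarities) : 0;
  return {
    answerable: maxSimilarity >= threshold,
    maxSimilarity,
    chunkIds: hits.map((h) => h.id),
  };
}
```

Storing `maxSimilarity` next to each question makes it easy to pull the lowest-scoring conversations for review.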

Cost and latency (rough 2026)

  • Embedding 1M tokens with text-embedding-3-small: on the order of a few US cents (OpenAI's list price has been about $0.02 per 1M tokens; check current pricing)
  • GPT-4o-mini with 2k context tokens: cents per conversation
  • pgvector query <50 ms with index — bottleneck is LLM (TTFT ~300–800 ms)
  • Cache embeddings for identical questions (hash normalized question) — ~30% savings on support FAQ
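The embedding cache from the last bullet might look like the sketch below. The normalization rules, the in-memory `Map` and the injected `embedFn` are all assumptions made for illustration; a deployed version would more likely use a shared store such as Redis or a database table.

```typescript
import { createHash } from 'node:crypto';

// Normalize so trivially different spellings of a question share one cache entry.
function normalizeQuestion(question: string): string {
  return question.trim().toLowerCase().replace(/\s+/g, ' ').replace(/[?!.]+$/, '');
}

// The model name is part of the key: vectors from different models are not interchangeable.
function cacheKey(question: string, model: string): string {
  return createHash('sha256').update(`${model}:${normalizeQuestion(question)}`).digest('hex');
}

const embeddingCache = new Map<string, number[]>();

// `embedFn` stands in for the real embedding call (hypothetical parameter).
async function cachedEmbedding(
  question: string,
  model: string,
  embedFn: (text: string) => Promise<number[]>,
): Promise<number[]> {
  const key = cacheKey(question, model);
  const hit = embeddingCache.get(key);
  if (hit) return hit;
  const vector = await embedFn(question);
  embeddingCache.set(key, vector);
  return vector;
}
```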

Hosting on DevStudioIT Cloud (devstudioit.cloud): Route Handler with streaming, rate limits per IP, OpenAI keys in env.
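A per-IP rate limit can be as simple as a fixed-window counter. This is a minimal in-memory sketch with a hypothetical `createRateLimiter` helper; it only works within a single process, so a setup with several instances or serverless functions would need a shared store instead.

```typescript
type WindowState = { windowStart: number; count: number };

// Returns an `allow(key, now)` function that permits at most `limit`
// requests per `windowMs` milliseconds for each key (e.g. a client IP).
function createRateLimiter(limit: number, windowMs: number) {
  const windows = new Map<string, WindowState>();

  // `now` is injectable so the limiter can be tested without waiting.
  return function allow(key: string, now: number = Date.now()): boolean {
    const state = windows.get(key);
    if (!state || now - state.windowStart >= windowMs) {
      windows.set(key, { windowStart: now, count: 1 }); // start a fresh window
      return true;
    }
    if (state.count >= limit) return false; // over the limit: respond with HTTP 429
    state.count += 1;
    return true;
  };
}
```

In the Route Handler, the key would typically come from a proxy header such as `x-forwarded-for`, and a `false` result would translate to a 429 response.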

Security and GDPR

  • Do not index customer PII or internal B2B pricing
  • Rate limiting and CAPTCHA on public /api/chat
  • Conversation log retention — 90-day policy, anonymization
  • Privacy policy notice on processing via OpenAI (DPA, EU region if required)
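For the log retention and anonymization bullet, a rough sketch could mask obvious identifiers before a conversation is stored and flag entries older than 90 days for deletion. The regular expressions are deliberately crude assumptions that will miss many forms of PII; they illustrate the idea and are not a compliance solution.

```typescript
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE = /\+?\d[\d\s-]{7,}\d/g; // at least 9 digits/separators, optional leading +

// Replace e-mail addresses and phone-like numbers with placeholders.
function redact(text: string): string {
  return text.replace(EMAIL, '[email]').replace(PHONE, '[phone]');
}

const RETENTION_DAYS = 90;
const DAY_MS = 24 * 60 * 60 * 1000;

// True when a log entry is older than the retention period and should be purged.
function isExpired(loggedAt: Date, now: Date): boolean {
  return now.getTime() - loggedAt.getTime() > RETENTION_DAYS * DAY_MS;
}
```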

Success metrics after go-live

After 4 weeks in production measure:

  • Answer accuracy — manual review of 50 random chats (target: >85% correct)
  • Fallback rate — how often low similarity triggers “don't know” (target: 10–20%)
  • Average chunks in context vs latency
  • Clicks on source_url — user trust proxy

Dashboard in Branchly: chat_logs table with question, chunk_ids, user_rating. Iterate chunking quarterly — answer stability beats daily tweaking.
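The metrics above can be computed from logged conversations with a few lines. In this sketch the `ChatLog` shape is an assumption: `fallback` and `reviewedCorrect` are hypothetical fields added for illustration next to the question and chunk ids the article mentions.

```typescript
type ChatLog = {
  question: string;
  chunkIds: string[];
  fallback: boolean;         // true when the bot answered "don't know"
  reviewedCorrect?: boolean; // set only for manually reviewed chats
};

function summarize(logs: ChatLog[]) {
  const total = logs.length;
  const fallbacks = logs.filter((l) => l.fallback).length;
  const reviewed = logs.filter((l) => l.reviewedCorrect !== undefined);
  const correct = reviewed.filter((l) => l.reviewedCorrect === true).length;
  const chunkSum = logs.reduce((sum, l) => sum + l.chunkIds.length, 0);
  return {
    fallbackRate: total > 0 ? fallbacks / total : 0,
    // Accuracy is measured only over reviewed chats; null when none were reviewed.
    accuracy: reviewed.length > 0 ? correct / reviewed.length : null,
    avgChunks: total > 0 ? chunkSum / total : 0,
  };
}
```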

FAQ

Does RAG replace the knowledge base article?

No. That article covers what to include (FAQ, policies); this one covers how to search and generate technically.

pgvector vs Pinecone?

pgvector in Branchly means fewer moving parts for small and medium projects. Consider Pinecone/Qdrant above roughly 500k chunks or when you need multi-tenant isolation.

How often to re-index?

On content change — CMS webhook → partial re-ingest of changed URLs, not the whole base.
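A partial re-ingest needs to know which URLs actually changed. One possible approach, sketched here under the assumption that a content hash per URL is saved at ingest time, is to compare the current hashes with the stored ones; `planReingest` and the `storedHashes` map are hypothetical.

```typescript
import { createHash } from 'node:crypto';

type Page = { url: string; content: string };

function contentHash(content: string): string {
  return createHash('sha256').update(content).digest('hex');
}

// storedHashes maps url -> hash saved during the previous ingest.
function planReingest(pages: Page[], storedHashes: Map<string, string>) {
  const toEmbed: string[] = [];
  const currentUrls = new Set<string>();

  for (const page of pages) {
    currentUrls.add(page.url);
    // New pages have no stored hash; changed pages have a different one.
    if (storedHashes.get(page.url) !== contentHash(page.content)) {
      toEmbed.push(page.url);
    }
  }
  // URLs that disappeared from the CMS should have their chunks deleted.
  const toDelete = Array.from(storedHashes.keys()).filter((url) => !currentUrls.has(url));
  return { toEmbed, toDelete };
}
```

For each URL in `toEmbed`, the old chunks would be deleted before the new ones are inserted, so stale fragments never sit next to fresh ones.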

Can we skip OpenAI?

Yes — local embedding models (Ollama, sentence-transformers) + Llama; trade-off: PL/EN quality and GPU ops.

Multilingual?

locale filter in SQL + separate chunks; do not mix PL and DE in one embed without language normalization.

