Custom RAG & Enterprise Search Development
Quick answer

UnlockLive IT designs and ships production retrieval-augmented generation systems — the kind that hold up in front of real users with real questions on real, messy enterprise data. We build hybrid retrieval pipelines (BM25 + dense embeddings + reranking), permission-aware search across your Notion, SharePoint, Confluence, S3, and Salesforce data, and answer generation with strict citation requirements and automated evals on every change. Default stack: Python/FastAPI + Qdrant or pgvector + Cohere Rerank + Claude or GPT-5, deployed on AWS, Modal, or fully on-prem for regulated workloads. See our Local LLM case study for a worked example of air-gapped RAG.

What we build

Enterprise knowledge-base search: Search-and-answer over your company's Notion, Confluence, Google Drive, SharePoint, Slack, Zendesk, Jira, and internal wikis — with permission-aware retrieval (a user only sees results from documents they're allowed to read).
Customer-facing support copilots: RAG over your help center, product docs, release notes, and historical tickets. Streaming answers with inline citations, deflection metrics, and clean handoff to a human agent when confidence drops.
Domain-specific Q&A on regulated content: Legal contracts, medical literature, clinical guidelines, financial disclosures, building codes — with strict citation requirements and full audit trails for compliance review.
Sales enablement and proposal copilots: RAG over win/loss notes, past proposals, pricing decks, ICP profiles, and competitor battle cards — generating first-draft RFP responses and discovery prep in seconds.
Code-aware RAG over your repos: AST-aware chunking of your codebase plus README, ADR, and ticket context — for internal developer copilots and onboarding assistants.
Multi-modal RAG (PDFs, images, tables, video): Layout-aware PDF parsing (Unstructured, LlamaParse, Reducto), table extraction, vision models for charts and screenshots, and video transcript indexing.

Our RAG technology stack

Document parsing: Unstructured, LlamaParse, Reducto, Docling, Azure Document Intelligence, AWS Textract
Chunking strategies: Semantic chunking, late chunking, layout-aware, context-enriched (Anthropic contextual retrieval pattern)
Embedding models: OpenAI text-embedding-3-large, Voyage voyage-3, Cohere Embed v3, BGE, Nomic, Jina v3, BM25 hybrid
Vector databases: Pinecone, Weaviate, Qdrant, Chroma, pgvector, MongoDB Atlas Vector Search, Turbopuffer
Hybrid retrieval: BM25 + dense + reciprocal rank fusion, query expansion, HyDE, multi-query, parent-document retrieval
Rerankers: Cohere Rerank 3, Voyage rerank-2, BGE reranker, Jina reranker, ColBERT v2 for high-precision retrieval
Generation: OpenAI (GPT-5 family), Anthropic Claude (Sonnet 4.5, Opus), Gemini, open-source via Together / Groq / vLLM
Permission-aware retrieval: Per-document ACL filters in vector DB, post-retrieval permission checks, OAuth-based source access
Evals & quality: Ragas, TruLens, LangSmith, LangFuse, Promptfoo, Phoenix — automated retrieval and answer-quality scoring
Observability: LangFuse, Helicone, Sentry, OpenTelemetry — full prompt and retrieval trace per request
Deployment: AWS, Modal, Vercel, Cloudflare Workers, on-prem (air-gapped), Kubernetes
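To make the hybrid-retrieval line above concrete: reciprocal rank fusion merges the ranked lists from BM25 and dense retrieval using only each document's rank, not its raw score. This is a minimal self-contained sketch, not our production pipeline — the document ids and hit lists are toy data:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one combined ranking.

    `rankings` is a list of ranked id lists (e.g. one from BM25, one from
    dense retrieval). `k` dampens the influence of top ranks; 60 is the
    constant from the original RRF paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: a doc ranked well by both retrievers beats one ranked
# first by only a single retriever.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because fusion works on ranks alone, you never have to normalize BM25 scores against cosine similarities — one reason RRF is a robust default before reranking.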

Our RAG development process

  1. Use case definition (1 week): Define WHO is asking, WHAT they're trying to accomplish, what an acceptable answer looks like, and the failure modes that would damage trust. We refuse to start RAG projects without an explicit success metric — usually answer accuracy on a labeled eval set.
  2. Data ingestion & cleaning (1-3 weeks): Connectors to source systems (Confluence, Notion, S3, SharePoint, Salesforce, Zendesk), document parsing, deduplication, ACL extraction, and metadata enrichment. Most retrieval problems are actually ingestion problems.
  3. Retrieval baseline & eval set (1-2 weeks): Hand-curate a 100-300 question eval set covering the long tail of real queries. Build a baseline retrieval pipeline. Score it. Document where it fails.
  4. Iterate on retrieval (2-4 weeks): Hybrid search, reranking, query rewriting, contextual retrieval, parent-document retrieval, metadata filters. Each iteration is measured against the eval set, not vibes.
  5. Generation & guardrails (1-2 weeks): Prompt engineering for the answer model, citation requirements, refusal handling for out-of-scope questions, jailbreak/prompt-injection defense, output classifiers.
  6. Production deployment & monitoring (1-2 weeks): API endpoints, caching, rate limiting, per-tenant isolation, full observability (LangFuse / Helicone), and dashboards tied to your business metric — deflection rate, time-to-answer, or analyst hours saved.
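Step 3 above — "score it, document where it fails" — usually starts with something as simple as recall@k over the labeled eval set. A minimal sketch, with an illustrative keyword-match retriever standing in for a real pipeline (all names here are toy assumptions, not a specific eval library's API):

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of questions where at least one relevant doc is in the top k.

    `eval_set` is a list of (question, relevant_doc_ids) pairs;
    `retrieve` is any callable returning ranked doc ids.
    """
    hits = 0
    for question, relevant in eval_set:
        top_k = set(retrieve(question)[:k])
        if top_k & set(relevant):
            hits += 1
    return hits / len(eval_set)

# Toy stand-in pipeline: substring match over a two-document corpus.
corpus = {"d1": "reset your password", "d2": "invoice and billing"}

def retrieve(query):
    words = query.lower().split()
    return [d for d, text in corpus.items() if any(w in text for w in words)]

eval_set = [
    ("how do I reset my password", ["d1"]),
    ("where is my invoice", ["d2"]),
]
score = recall_at_k(eval_set, retrieve)
```

The same harness runs on every retrieval change in step 4, which is what "measured against the eval set, not vibes" means in practice.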

Frequently asked questions

What is RAG and why do I need a custom one?

Retrieval-augmented generation is the pattern where an LLM is given relevant documents from your knowledge base and asked to answer using only those documents. The off-the-shelf 'upload-PDFs-to-ChatGPT' versions work for demos but break in production because real enterprise data has access control, messy formats, custom chunking needs, citation requirements, and quality bars that demand iteration. Custom RAG is what you build when an answer being wrong has a real business cost.
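The pattern in miniature: retrieve relevant chunks, assemble a prompt that confines the model to those chunks, and require citations. Everything below is an illustrative toy — the corpus, the lexical scorer, and the prompt wording are stand-ins, not our production system:

```python
def retrieve(query, corpus, k=2):
    """Toy lexical retriever: rank docs by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda kv: len(q_words & set(kv[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, hits):
    """Assemble a grounded prompt that forbids answers outside the context."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return (
        "Answer using ONLY the sources below, citing their ids in brackets.\n"
        "If the sources do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {query}"
    )

corpus = {
    "kb-1": "Refunds are issued within 5 business days.",
    "kb-2": "Trials last 14 days and need no credit card.",
}
query = "how long do refunds take"
prompt = build_prompt(query, retrieve(query, corpus))
```

The prompt would then go to the generation model; the hard parts a custom build addresses are everything around this loop — parsing, chunking, permissions, and evals.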

How much does it cost to build a production RAG system?

A focused single-source RAG (one knowledge base, one user surface, English only) typically ranges from $30,000 to $80,000. Multi-source enterprise RAG with permission-aware retrieval and connectors to 5+ systems ranges from $80,000 to $200,000. Regulated-industry RAG with audit trails, on-prem deployment, and rigorous evals starts at $150,000. Inference and embedding costs are separate and depend on document volume and query rate.

Pinecone, Weaviate, Qdrant, or pgvector?

pgvector is our default if you already have PostgreSQL — for under ~10M vectors and moderate QPS, it's fast enough and removes a moving part. Qdrant is our default for self-hosted or air-gapped deployments. Pinecone or Turbopuffer are our defaults for very high-scale managed services where you want zero ops. Weaviate is a good choice when you want hybrid search out of the box and already have the operational appetite. We pick after benchmarking against your data.

How do you handle hallucinations and accuracy?

Three layers. (1) Retrieval quality — measured against a labeled eval set on every change, with reranking, hybrid search, and contextual retrieval to push recall higher. (2) Generation guardrails — prompts that require citation of retrieved chunks, refusal of out-of-scope questions, and a human-in-the-loop path for low-confidence answers. (3) Production monitoring — sampled human review of conversations, automated factuality checks, and immediate alerting on failure-mode regressions. Every system we ship has documented acceptable failure modes.
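One cheap, deterministic guardrail from layer (2) is enforcing the citation requirement mechanically: reject or flag any answer sentence that cites nothing, or cites a chunk id that was never retrieved. A minimal sketch — the bracketed-id convention is an assumption of this example, not a model feature:

```python
import re

def uncited_sentences(answer, retrieved_ids):
    """Return answer sentences that lack a citation or cite unknown ids.

    Assumes the generation prompt requires bracketed chunk ids like [kb-1]
    after each claim, so a clean answer produces an empty list here.
    """
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cited = re.findall(r"\[([\w-]+)\]", sentence)
        if not cited or any(c not in retrieved_ids for c in cited):
            flagged.append(sentence)
    return flagged

answer = "Refunds take 5 business days [kb-1]. Trials are 30 days."
bad = uncited_sentences(answer, {"kb-1", "kb-2"})
```

Anything flagged can be routed to the human-in-the-loop path rather than shown to the user, which is how a citation requirement becomes enforceable instead of aspirational.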

Can the RAG respect user permissions?

Yes. We extract ACLs from source systems at ingestion time, store them as filterable metadata in the vector database, and apply per-user filters at query time. For high-trust environments we also do post-retrieval permission verification against the source system as a defense-in-depth check. Permission-aware retrieval is non-negotiable for any deployment touching internal documents.
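The two layers described above — metadata filtering plus live re-verification — can be sketched as a post-retrieval pass. `can_read` here is a hypothetical callback representing the check against the source system; field names are illustrative:

```python
def permission_filter(hits, user_id, can_read):
    """Defense-in-depth ACL check applied after vector search.

    Each hit carries an `acl` list extracted at ingestion time; `can_read`
    re-verifies access against the live source system, so a stale ACL in
    the index can never leak a document.
    """
    allowed = []
    for hit in hits:
        if user_id not in hit["acl"]:
            continue  # ingestion-time metadata says no
        if not can_read(user_id, hit["doc_id"]):
            continue  # source system disagrees with the index: drop it
        allowed.append(hit)
    return allowed

hits = [
    {"doc_id": "d1", "acl": ["alice"]},
    {"doc_id": "d2", "acl": ["bob"]},
]
visible = permission_filter(hits, "alice", lambda user, doc: True)
```

In production the first check is pushed down into the vector database as a query-time filter; the in-process version above is the belt-and-suspenders second pass.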

Can we deploy this on-prem or in our own VPC?

Yes. We routinely deploy entirely inside customer AWS / Azure / GCP accounts, and for regulated workloads we deploy fully on-prem and air-gapped using open-source models (Llama 3.3, Qwen, Mistral) served on vLLM or TGI, plus Qdrant for retrieval. See our Local LLM case study for a worked example.

How is RAG different from fine-tuning?

RAG retrieves fresh facts at query time and is the right answer for changing knowledge bases, citation requirements, and access-controlled content. Fine-tuning bakes patterns and style into the model and is the right answer for teaching the model a domain-specific format, terminology, or persona. They're complementary — many production systems do both. We help pick the right tool in the discovery phase.

Ready to ship a RAG that actually answers correctly?

Tell us about the questions you want answered and the data you want them answered from. We'll respond within one business day. Book a free strategy call with our Toronto team.

Contact For Service