Case Study · Confidential enterprise client (regulated industry)

Private, On-Device LLM Deployment for a Privacy-Sensitive Enterprise (Ollama + llama.cpp)

How we replaced a $4.2K/month OpenAI bill with a fully on-device LLM workflow using Ollama, llama.cpp, and a FastAPI orchestration layer — keeping 100% of customer data on the user's laptop while delivering sub-second responses.

  • Industry: Enterprise / Healthcare
  • Year: 2025
  • Country: Canada
  • Duration: 3 months

At-a-glance results

  • 100%: on-device inference; zero customer data leaves the laptop
  • $4.2K/mo: OpenAI spend eliminated, replaced with $0 marginal inference cost
  • <800ms: p95 first-token latency on an M2 MacBook for the chat model
  • 1st: security review passed on first submission

The challenge

An enterprise client in a regulated industry was using a hosted LLM through their internal tooling, but every quarter their security review team flagged the same blocker: customer-identifiable text was being sent to a US cloud endpoint. Legal had paused the wider rollout, the OpenAI bill had drifted past $4,200/month with only a fraction of the planned user base, and the product team had been asked the impossible question: "can we keep all of this AI capability without any data leaving the user's laptop?"

They didn't need a research project. They needed a shippable workflow that an internal employee could install in an afternoon, that ran entirely on their existing M-series MacBooks and Windows ThinkPads, and that performed well enough that nobody would resent the privacy upgrade.

Our solution

We designed and shipped a fully on-device LLM stack: Ollama as the model runtime, llama.cpp under the hood for quantized GPU/CPU inference, a small FastAPI orchestration layer for tool-use and retrieval, and a Next.js desktop UI shipped via a thin Tauri wrapper. Every byte of context stays on the device. There is no remote inference, no telemetry callback, no proxy.

We benchmarked seven open-weight models (Llama 3.1 8B, Qwen 2.5 7B/14B, Phi-3, Mistral, and two domain-tuned variants) across the client's real prompts and shipped a per-task model router: a small fast model for classification and chat, a larger reasoning model on demand. RAG runs against a local ChromaDB index built from the user's own documents, with every embedding computed on-device.

The result is a workflow that's measurably faster than the cloud version on the median prompt (no network round-trip), passes the client's security review on first submission, and scales to every laptop in the company at zero per-seat inference cost.

  • Ollama runtime with per-task model router (small for chat, large for reasoning)
  • Quantized inference via llama.cpp tuned for Apple Silicon and modern Intel/AMD CPUs
  • Local RAG over the user's own files using ChromaDB and on-device embeddings
  • FastAPI orchestration layer with stable HTTP API for the Next.js / Tauri desktop UI
  • Verifiable network allowlist — no LLM traffic ever leaves the device
  • Built-in evaluation harness so every model upgrade is scored before rollout
  • Signed, notarized installers for macOS (M1/M2/M3) and Windows
  • Privacy panel inside the app showing exactly what the model can read
  • Optional model-update channel respecting corporate proxy and network policy
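The per-task routing in the first bullet can be sketched as a thin layer over Ollama's local HTTP API. The task names and model tags below are illustrative, not the client's actual configuration:

```python
import json
import urllib.request

# Illustrative task -> model-tag routing table (assumed names).
ROUTES = {
    "chat": "llama3.1:8b",     # small, fast default
    "classify": "phi3:mini",   # ultra-fast classification
    "reason": "qwen2.5:14b",   # larger model, on demand
}

def pick_model(task: str) -> str:
    """Resolve a task class to a local Ollama model tag; default to chat."""
    return ROUTES.get(task, ROUTES["chat"])

def generate(task: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """Send the prompt to the locally running Ollama server and return the
    completion. All traffic stays on localhost."""
    body = json.dumps({
        "model": pick_model(task),
        "prompt": prompt,
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

A request that needs deliberate reasoning calls `generate("reason", prompt)`; everything else stays on the small model, which is where the sub-second chat latency comes from.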

How we built it

  1. Discovery: prompt audit & device baseline

    We collected a representative set of ~400 real prompts from the existing OpenAI logs (sanitized), then benchmarked candidate open-weight models on the client's actual hardware mix — M1, M2, and M3 MacBooks plus a Windows ThinkPad reference machine — measuring tokens-per-second, p95 first-token latency, and quality vs. the GPT-4 baseline using a rubric the client owned.
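The latency bookkeeping behind a number like the p95 first-token figure can be sketched in a few lines (nearest-rank percentile; the real harness and quality rubric were owned by the client, so the shapes below are assumptions):

```python
import time

def p95(samples):
    """95th percentile of a list of latencies, by the nearest-rank method."""
    s = sorted(samples)
    idx = max(0, int(round(0.95 * len(s))) - 1)
    return s[idx]

def time_to_first_token(stream):
    """Seconds until the first chunk arrives from a token stream
    (e.g. the line-delimited JSON stream a local runtime emits)."""
    start = time.perf_counter()
    for _chunk in stream:
        return time.perf_counter() - start
    return float("inf")  # stream produced nothing
```

Collecting `time_to_first_token` per prompt and reporting `p95` over the set is what a "<800ms p95" claim reduces to.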

  2. Architecture: model router + local RAG

    We picked Ollama as the runtime (clean lifecycle, model versioning, GPU-aware quantization), then built a thin FastAPI orchestration layer that routes each task to the right model, handles retrieval against a local ChromaDB index, and exposes a stable HTTP API the desktop UI can call. The whole stack runs as three local processes managed by the desktop app.
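The retrieval step can be illustrated with a dependency-free stand-in. Production used a ChromaDB index with on-device embeddings; this sketch substitutes a bag-of-words cosine so the shape of the ranking logic is visible without any model download (document ids and contents are illustrative):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Bag-of-words stand-in for a real on-device embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: dict, k: int = 2):
    """Return the ids of the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(docs[d])), reverse=True)
    return ranked[:k]
```

In the real stack the `embed`/`retrieve` pair is replaced by ChromaDB's collection API; the point is that every step, including the embedding, runs locally.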

  3. Build: desktop UX, RAG ingest, evals

    Engineering happened in two-week sprints with a real evaluation harness — every change was scored against a held-out prompt set so quality regressions surfaced immediately. We added a one-click ingest flow for the user's own documents, a model-update channel that respects the user's network policy, and a privacy panel that shows exactly what the model has access to.
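The promote-or-hold gate at the heart of the evaluation harness can be sketched as follows (assumed shape; the actual threshold and scoring rubric belonged to the client):

```python
def should_promote(candidate_scores, baseline_scores, min_ratio=0.95):
    """Promote a candidate model only if its mean score on the held-out
    prompt set reaches at least min_ratio of the current model's mean."""
    if not candidate_scores or not baseline_scores:
        return False  # no data, no promotion
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    return candidate_mean >= min_ratio * baseline_mean
```

Running this gate on every change is what makes quality regressions surface immediately instead of in production.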

  4. Security review, packaging, rollout

    We packaged the stack as a signed installer (notarized on macOS, signed on Windows), wrote a short threat model the security team could read in 20 minutes, and shipped to a 25-user pilot before company-wide rollout. The security review passed on first submission — the killer feature was the verifiable network allowlist that proves no LLM traffic ever leaves the device.
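The allowlist check that anchored the review reduces to a few lines. This is a hedged sketch: the shipped verifier inspected the live sockets of the LLM processes, and the hosts below are examples, not the client's policy:

```python
# Only loopback destinations are permitted for LLM traffic.
ALLOWED_HOSTS = {"localhost", "127.0.0.1", "::1"}

def violations(connections):
    """Given (host, port) pairs observed as outbound connections from the
    LLM processes, return every pair not covered by the allowlist."""
    return [(h, p) for h, p in connections if h not in ALLOWED_HOSTS]
```

An empty `violations(...)` result over a full working session is the evidence a reviewer can verify themselves, rather than take on trust.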

Tech stack

  • Ollama
  • llama.cpp
  • Llama 3.1
  • Qwen 2.5
  • LangChain
  • Python
  • FastAPI
  • Next.js
  • SQLite
  • ChromaDB

"We thought private AI meant a worse product. The UnlockLive build is faster than the cloud version on most of what our team does, and our security review took 20 minutes instead of three months."

Director of Product · Enterprise client (name confidential)

Frequently asked questions

Can a local LLM really replace GPT-4 for production workflows?

For most enterprise tasks — classification, summarization, RAG over the user's own documents, structured extraction — yes, a well-chosen open-weight model in the 7B-14B range matches or beats GPT-3.5 and gets within 10-15% of GPT-4 on quality. The trick is honest evaluation: we always score candidate models on the client's real prompts, not generic benchmarks.

Which open-weight models do you recommend for on-device deployment in 2025?

We default to Llama 3.1 8B for general chat and Qwen 2.5 14B for reasoning-heavy tasks, with Phi-3 mini for ultra-fast classification. Final choice depends on the user's hardware (M2/M3 Macs handle 14B comfortably; older Intel laptops do better with 7B quantized) and the workload mix.

Is Ollama production-ready for a regulated industry?

Ollama itself is permissively licensed and offers a stable model lifecycle, GPU-aware quantization, and a clean HTTP API. We pair it with a thin FastAPI orchestration layer we own, a signed/notarized installer, and a verifiable network allowlist — that combination has passed multiple enterprise security reviews on first submission.

How do you keep model quality high when you can't update models on every API call?

We ship a per-tenant evaluation harness with the deployment. Every model upgrade is scored against a held-out prompt set the client owns, and only promoted if it meets a quality bar. Model updates roll out through a canary channel the user controls.

What does a private LLM deployment cost vs. continuing to use OpenAI / Anthropic?

Typical breakeven is 6-9 months. The build is a one-time engineering cost (8-14 weeks for a focused workflow), then per-seat inference cost drops to zero. Hosted APIs win for spiky, low-volume use; on-device wins for daily-active enterprise tools.
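The breakeven claim is simple arithmetic. Using the $4,200/month figure from this case and a hypothetical one-time build cost (the $30K below is a placeholder, not this project's price):

```python
def breakeven_months(build_cost: float, monthly_api_spend: float) -> float:
    """Months until the eliminated API spend repays the one-time build."""
    return build_cost / monthly_api_spend

# Hypothetical $30K build against the $4.2K/month spend from this case:
months = breakeven_months(30_000, 4_200)  # ~7.1 months, inside the 6-9 band
```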

Want a result like this?

Talk to the same team that built Private, On-Device LLM Deployment for a Privacy-Sensitive Enterprise (Ollama + llama.cpp). We’ll scope your project, give you a fixed-price proposal, and show you the closest analog from our portfolio.

Book a strategy call