Operator // JV — Field Document 001

The Local AI Deployment Hardening Checklist

Forty checks across six domains, written from real deployments and real incident reviews. It covers the stack this community actually runs — Ollama, llama.cpp, vLLM, Open WebUI, Chroma, Qdrant — not an abstract "enterprise AI platform."

No email gate. No PDF funnel. Print it (Cmd/Ctrl+P renders a clean black-on-white copy), share it, or open a PR if I got something wrong. If your team clears all forty, you're ahead of most production deployments I've reviewed.

§A — Inference Endpoints

The server you forgot is a server.

Most local-AI exposure starts here: an inference API bound to all interfaces with no authentication, because that's what the quickstart did.

A-01 Bind to loopback unless you have a reason not to. Ollama serves on 11434, llama.cpp's server on 8080, vLLM on 8000. Confirm each binds 127.0.0.1, not 0.0.0.0. Check with ss -tlnp or lsof -iTCP -sTCP:LISTEN.
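The A-01 check can be scripted. What follows is a minimal sketch, assuming the usual `ss -tlnp` column layout (local address in the fourth column) and the default ports named above; `exposed_listeners` and both constants are illustrative names, not part of any tool.

```python
# Default ports named in A-01; adjust for your own stack.
INFERENCE_PORTS = {11434: "Ollama", 8080: "llama.cpp server", 8000: "vLLM"}
# Forms that ss prints for "all interfaces" (IPv4 and IPv6).
WILDCARD_HOSTS = {"0.0.0.0", "*", "[::]"}


def exposed_listeners(ss_output: str) -> list:
    """Return (service, port) pairs listening on a wildcard address."""
    findings = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a listening socket.
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        # Local address looks like 127.0.0.1:11434, *:8000 or [::]:8080.
        host, _, port = fields[3].rpartition(":")
        if not port.isdigit():
            continue
        host = host.split("%")[0]  # drop an interface suffix such as %eth0
        if int(port) in INFERENCE_PORTS and host in WILDCARD_HOSTS:
            findings.append((INFERENCE_PORTS[int(port)], int(port)))
    return findings
```

Feed it the text of `ss -tlnp` (for example via `subprocess.run(["ss", "-tlnp"], capture_output=True, text=True).stdout`). An empty list means none of the three default ports is bound to every interface. It says nothing about servers moved to non-default ports.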

A-02 No inference API is directly reachable from the internet. Search your own egress IPs on Shodan/Censys. Thousands of open Ollama and llama.cpp endpoints are indexed at any given time; don't be one of them.

A-03 Auth in front of every OpenAI-compatible API. vLLM supports --api-key; for servers that don't, put a reverse proxy (Caddy, nginx, Traefik) with token auth or mTLS in front. "It's internal" is not auth.

A-04 Web UIs require login and signups are disabled. Open WebUI, SillyTavern, text-generation-webui: confirm auth is on, default admin credentials are rotated, and open registration is off.

A-05 No blanket port-forwards or "temporary" tunnels. Audit router port-forwards, cloudflared/ngrok/Tailscale Funnel configs. Temporary tunnels have a way of becoming permanent and forgotten.

A-06 Admin and inference planes are separated. Endpoints that can pull/delete models (e.g. Ollama's /api/pull, /api/delete) shouldn't share exposure with endpoints that only need /v1/chat/completions.

A-07 Rate limits exist, even internally. One runaway script can starve the GPU for everyone and bury real usage in noise. Per-key limits at the proxy are cheap.
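For a sense of how cheap a per-key limit is, here is a sketch of a token bucket keyed by API key ID. It is an in-memory illustration of the technique, not a replacement for your proxy's own rate limiting. `TokenBucket` and `PerKeyLimiter` are hypothetical names, and the rate and burst values are placeholders.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Refill at `rate` tokens per second, allow bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        # Credit tokens for the time elapsed since the last request.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerKeyLimiter:
    """One bucket per key, so one noisy caller cannot drain the others."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, key_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        bucket = self._buckets.get(key_id)
        if bucket is None:
            bucket = self._buckets[key_id] = TokenBucket(self.rate, self.capacity, now)
        return bucket.allow(now)
```

The `now` parameter exists so the limiter can be tested with a fake clock. In a request handler you would call `limiter.allow(key_id)` and return HTTP 429 when it is False.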

§B — RAG & Vector Stores

Your corpus is the crown jewels, embedded.

RAG quietly concentrates your most sensitive documents into one queryable store — then answers anyone who can reach it.

B-01 The vector store requires authentication. Chroma historically shipped with none; Qdrant's API key and Weaviate's anonymous-access flag are opt-in. Verify with an unauthenticated request from another host: it should fail.

B-02 You know exactly what got indexed. Keep a manifest of ingested sources. Recursive folder ingestion loves to swallow .env files, HR folders, and that one spreadsheet of customer data.
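A manifest can be as simple as a list of paths and hashes written before anything is embedded. The sketch below assumes ingestion starts from a folder on disk; `build_manifest` and the deny-list patterns are illustrative and far from exhaustive.

```python
import fnmatch
import hashlib
from pathlib import Path
from typing import List, Tuple

# Illustrative deny-list only; extend it for your own environment.
DENY_PATTERNS = [".env", ".env.*", "*.pem", "*.key", "id_rsa*"]


def build_manifest(root: Path, deny_patterns=DENY_PATTERNS) -> Tuple[List[dict], List[str]]:
    """Walk `root` and return (manifest entries, skipped relative paths)."""
    manifest, skipped = [], []
    for path in sorted(root.rglob("*")):  # rglob also yields dotfiles
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if any(fnmatch.fnmatch(path.name, pat) for pat in deny_patterns):
            skipped.append(rel)
            continue
        # read_bytes() is fine for documents; hash in chunks for huge files.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        manifest.append({"source": rel, "sha256": digest, "bytes": path.stat().st_size})
    return manifest, skipped
```

Ingest only what appears in the manifest, and keep the skipped list too: it is the quickest way to spot a folder that should never have been pointed at the indexer. A filename deny-list will not catch the customer spreadsheet; that still needs a human look at the manifest.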

B-03 Access control matches the source documents. If a document was restricted in SharePoint, its chunks must be restricted in retrieval. Collection-level ACLs are rarely enough; you may need per-document filtering at query time.
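Per-document filtering at query time can be sketched as below, assuming each chunk carries the group list of its source document, copied at ingest. `Chunk` and `authorized_chunks` are hypothetical names. Most vector stores can apply an equivalent metadata filter inside the query itself, which is preferable when available.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    source_doc: str
    allowed_groups: FrozenSet[str]  # copied from the source document's ACL at ingest
    score: float                    # similarity score from the vector search


def authorized_chunks(candidates: Iterable[Chunk], user_groups: Set[str], top_k: int = 3) -> List[Chunk]:
    """Drop chunks the caller may not read, then keep the best `top_k`."""
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]
```

The order matters: filter first, truncate second. Truncating to top-k before the ACL check can return fewer results than the caller is entitled to, and it tempts someone to "fix" that by skipping the check.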

B-04 Embeddings are treated as sensitive data. Embedding inversion recovers meaningful text from vectors. Encrypt at rest, and don't ship production vectors to third-party embedding APIs you haven't cleared.

B-05 There is a deletion path. When a source document is deleted or a right-to-erasure request lands, you can find and remove its chunks and vectors, and prove you did.
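A deletion path needs a source-to-chunk mapping kept at ingest. This sketch assumes you maintain that mapping yourself. `ChunkIndex` is a hypothetical name, and `delete_vectors` stands in for whatever delete-by-ID call your vector store provides.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List


class ChunkIndex:
    """Remembers which chunk IDs came from which source document."""

    def __init__(self):
        self._by_source: Dict[str, List[str]] = {}

    def record(self, source_doc: str, chunk_id: str) -> None:
        self._by_source.setdefault(source_doc, []).append(chunk_id)

    def erase(self, source_doc: str, delete_vectors: Callable[[List[str]], None]) -> dict:
        """Delete every chunk of `source_doc` and return a receipt."""
        chunk_ids = list(self._by_source.get(source_doc, []))
        if chunk_ids:
            delete_vectors(chunk_ids)  # hypothetical wrapper around the store's delete
        # Forget the mapping only after the store-side delete succeeded.
        self._by_source.pop(source_doc, None)
        return {
            "source_doc": source_doc,
            "deleted_chunk_ids": chunk_ids,
            "deleted_at": datetime.now(timezone.utc).isoformat(),
        }
```

The returned receipt is the "prove you did" part: ship it to the same central log store as everything else. It covers the live index only; backups and caches need their own answer.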

B-06 Retrieved context is bounded and logged. Log which chunks each query pulled. Without it you cannot answer the only question that matters after an incident: what did it leak, and to whom?
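One way to read "bounded and logged" is a hard cap on chunks per query plus one structured log line per retrieval. The sketch below assumes exactly that. The cap of 8, the field names, and `bound_and_log` are all placeholders.

```python
import json
from datetime import datetime, timezone
from typing import List, Tuple

MAX_CHUNKS = 8  # hypothetical cap; tune to your context budget


def bound_and_log(user: str, query_id: str, ranked_chunk_ids: List[str],
                  max_chunks: int = MAX_CHUNKS) -> Tuple[List[str], str]:
    """Cap the retrieved chunks and return (kept IDs, JSON log line)."""
    kept = ranked_chunk_ids[:max_chunks]
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "query_id": query_id,
        "chunks": kept,
        "dropped": len(ranked_chunk_ids) - len(kept),
    }
    return kept, json.dumps(record, sort_keys=True)
```

The log line records chunk IDs, not chunk text, so the log does not become a second copy of the corpus. Joined with a source-to-chunk mapping, it answers "what did it leak, and to whom?"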

B-07 Untrusted documents are treated as hostile input. Indirect prompt injection lives in your corpus: a poisoned PDF instructs the model, the model obeys. Isolate untrusted sources and strip active content on ingest.

§C — Model Artifacts

Weights are executables with better PR.

You would never run a random binary from the internet as root. A model file can be the same thing wearing a lab coat.

C-01 Prefer safetensors and GGUF over pickle formats. Legacy .pt/.bin pickle files can execute arbitrary code on load. If you must load one, scan it (picklescan or equivalent) and load with weights_only=True where supported.
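To see why pickle files deserve suspicion, the standard library's `pickletools` can list a pickle's opcodes without executing it. This is a coarse sketch of the idea behind a scanner, not a substitute for picklescan. `suspicious_opcodes` is a hypothetical helper, and the opcode set is my own selection.

```python
import pickletools
from typing import Set

# Opcodes that import names or call objects during unpickling.
SUSPICIOUS_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX",
}


def suspicious_opcodes(data: bytes) -> Set[str]:
    """Return the import/call opcodes present in a raw pickle stream.

    genops() only parses the stream; nothing is unpickled or executed.
    """
    return {op.name for op, _arg, _pos in pickletools.genops(data)} & SUSPICIOUS_OPCODES
```

Two caveats. Legitimate checkpoints also use these opcodes to rebuild tensors, so a hit means "look closer", not "malicious"; real scanners go further and check which names are imported. And newer PyTorch .pt files are zip archives, so the pickle stream has to be extracted from the archive before a check like this applies.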

C-02 Pin models to exact revisions and verify hashes. Pull by commit hash, not by main. Record the SHA-256 of every artifact you deploy. A repo can change under a stable-looking name.
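Hash verification is a few lines of standard library. A minimal sketch, assuming the expected SHA-256 was recorded when the artifact was vetted; the function names are illustrative.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so multi-gigabyte weights fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """True only if the file on disk matches the recorded hash."""
    return hmac.compare_digest(sha256_file(path), expected_sha256.strip().lower())
```

Run `verify_artifact` at deploy time and refuse to start the server on a mismatch. The hash only proves the file is the one you vetted; it says nothing about whether the vetting was any good.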

C-03 Provenance is checked before the download. Uploader account age, org verification, download counts, community reports. Typosquatted model repos are a real distribution channel.

C-04 Production pulls from an internal mirror, not the public hub. One vetted copy, hash-verified, served internally. Production boxes shouldn't have a reason to talk to huggingface.co at 2 a.m.

C-05 Licenses are reviewed for your actual use. Llama-family, research-only, and non-commercial licenses differ in ways your counsel cares about. Know what you're running before a customer asks.

C-06 Custom code paths are disabled unless audited. trust_remote_code=True means "run this repo's Python on my GPU box." Keep it False by default; audit the exceptions.

C-07 Fine-tunes and LoRAs get the same scrutiny as base models. Adapters are small, shared casually, and can carry backdoored behavior. Track their lineage like any other artifact.

§D — Host & Network

The GPU box is a high-value target.

It holds your models, your corpus, and often your credentials — and it was probably set up in a hurry.

D-01 Inference runs as an unprivileged user or container. No --privileged, no root, no docker socket mounted inside. Use --gpus device passthrough, not god-mode.
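The three D-01 conditions can be checked against `docker inspect` output. The sketch assumes one container object from that JSON array (the fields `HostConfig.Privileged`, `Config.User`, and `Mounts[].Source`); `container_findings` is a hypothetical helper.

```python
from typing import List

DOCKER_SOCKET = "/var/run/docker.sock"


def container_findings(inspect: dict) -> List[str]:
    """Flag privileged mode, root user, and a mounted Docker socket."""
    findings = []
    if inspect.get("HostConfig", {}).get("Privileged"):
        findings.append("container runs with --privileged")
    # An empty User normally means the image default, which is usually root.
    user = inspect.get("Config", {}).get("User", "") or ""
    if user.split(":")[0] in ("", "0", "root"):
        findings.append("container runs as root (or sets no user)")
    for mount in inspect.get("Mounts", []) or []:
        if mount.get("Source") == DOCKER_SOCKET:
            findings.append("docker socket is mounted inside the container")
    return findings
```

Typical use is `json.loads(subprocess.run(["docker", "inspect", name], capture_output=True, text=True).stdout)[0]` passed to `container_findings`. It does not look at added capabilities or host PID/network namespaces, which deserve the same scrutiny.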

D-02 The AI segment is isolated from the crown-jewel network. Separate VLAN or security group. The box that runs community models should not have a flat path to your database servers.

D-03 Egress is restricted and monitored. Inference servers rarely need outbound internet. Default-deny egress turns "model exfiltrates data" from silent to visible.

D-04 Drivers and runtimes are on a patch cadence. GPU drivers, CUDA, container toolkits, and the inference servers themselves all carry CVEs. "It works, don't touch it" is how boxes stay vulnerable for years.

D-05 Secrets are not in environment dumps or compose files. API keys and DB credentials belong in a secrets manager. docker inspect shouldn't be a credential store.

D-06 Disk encryption on any box holding models or corpora. Workstations and homelab-grade servers get stolen and decommissioned carelessly. Full-disk encryption is table stakes.

§E — Access & Usage

Know who asked, and what for.

Private AI removes the vendor's abuse controls. Whatever governance you want now has to be yours.

E-01 Every request is attributable to a person or service. Per-user or per-service API keys at the gateway. A single shared key means your audit trail is fiction.
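Attribution needs little more than a table from key to owner. A minimal in-memory sketch, assuming keys are random and only their SHA-256 fingerprints are stored; `KeyRegistry` is a hypothetical name, and a real gateway would persist the table.

```python
import hashlib
import secrets
from typing import Dict, Optional


def fingerprint(api_key: str) -> str:
    """Store and log this, never the key itself."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


class KeyRegistry:
    """Maps key fingerprints to the person or service that owns them."""

    def __init__(self):
        self._principals: Dict[str, str] = {}

    def issue(self, principal: str) -> str:
        api_key = secrets.token_urlsafe(32)  # high-entropy, shown to the owner once
        self._principals[fingerprint(api_key)] = principal
        return api_key

    def principal_for(self, presented_key: str) -> Optional[str]:
        """Return the owner of a presented key, or None if unknown."""
        return self._principals.get(fingerprint(presented_key))

    def revoke(self, principal: str) -> int:
        """Remove every key of `principal`; returns how many were removed."""
        doomed = [fp for fp, who in self._principals.items() if who == principal]
        for fp in doomed:
            del self._principals[fp]
        return len(doomed)
```

Write the result of `principal_for` into every access log line. The `revoke` method is the hook the leaver checklist needs: one call per departing person or retired service.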

E-02 A written policy says what data may enter prompts. Short and real beats long and ignored: what's allowed, what's banned (credentials, regulated data), who to ask.

E-03 Prompt/response logging has an explicit retention decision. Logging everything forever creates a new sensitive corpus; logging nothing blinds incident response. Decide deliberately, document the decision.

E-04 Tool-calling and agent permissions are least-privilege. If the model can call tools, enumerate them, scope their credentials, and require confirmation for destructive actions. An agent with your shell is an intern with root.
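Enumeration and confirmation can both live in one small gate between the model and its tools. This is a sketch under the assumption that tool calls arrive as a name plus keyword arguments. `Tool`, `ToolGate`, and the `confirm` callback are hypothetical, not any agent framework's API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable


@dataclass(frozen=True)
class Tool:
    name: str
    func: Callable[..., Any]
    destructive: bool = False  # deletes, writes, sends, spends


class ToolGate:
    """Only registered tools run; destructive ones need explicit confirmation."""

    def __init__(self, tools: Iterable[Tool], confirm: Callable[[str, Dict[str, Any]], bool]):
        self._tools = {t.name: t for t in tools}
        self._confirm = confirm  # e.g. a prompt shown to a human operator

    def call(self, name: str, **kwargs: Any) -> Any:
        tool = self._tools.get(name)
        if tool is None:
            raise PermissionError(f"tool not on the allowlist: {name}")
        if tool.destructive and not self._confirm(name, kwargs):
            raise PermissionError(f"destructive call not confirmed: {name}")
        return tool.func(**kwargs)
```

The gate covers "enumerate" and "confirm". Scoping credentials is a separate job: each tool's function should hold only the token it needs, so a call that slips through can still do little damage.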

E-05 Offboarding revokes AI access too. API keys, WebUI accounts, and VPN routes to the inference segment are part of the leaver checklist.

E-06 Someone owns this system by name. Not a committee. One person who can answer "what changed on the AI stack last month?" without a meeting.

§F — Evidence & Response

If you can't prove it, it didn't happen.

The auditor and the attacker are both coming eventually. These checks decide whether you have answers or apologies.

F-01 You can produce a full inventory of AI assets on demand. Every inference server, UI, vector store, model artifact (with hash), and dataset, in one document that's newer than last quarter.

F-02 Logs leave the box. Access, auth, and inference logs ship to central storage the GPU host can't rewrite. Local-only logs die with the incident.

F-03 Model lineage is recorded for every deployment. Which model, which revision, which quant, serving which app, since when. Six months from now, "what was running in March?" must have an answer.
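A lineage record needs only the fields listed in F-03 plus a date range. The sketch below shows one possible shape; `Deployment` and `running_on` are hypothetical names, and any model or app identifiers you pass in are your own.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Deployment:
    app: str        # which application the model serves
    model: str      # repository or artifact name
    revision: str   # exact commit hash, not "main"
    quant: str      # quantization label, e.g. as named in the artifact
    sha256: str     # hash of the deployed artifact
    started: date
    ended: Optional[date] = None  # None means still running


def running_on(deployments: Iterable[Deployment], day: date) -> List[Deployment]:
    """Answer "what was running on this day?" from the recorded history."""
    return [
        d for d in deployments
        if d.started <= day and (d.ended is None or day < d.ended)
    ]
```

Append a record on every deploy and set `ended` on every rollback or replacement. Kept in version control next to the configs, the file's history doubles as the change log.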

F-04 The incident plan has an AI page. Who can pull the plug on inference, how to snapshot the box, what "compromised model" containment looks like. Decide before 3 a.m., not at it.

F-05 Configs are in version control. Compose files, proxy configs, gateway rules. Reproducibility is also forensics: the diff is the timeline.

F-06 An outside party has checked your perimeter this year. A scan of your own attack surface by someone who didn't build it. Internal certainty is where exposure hides.

F-07 You re-run this checklist on a schedule. Quarterly, calendared, owned (see E-06). Drift is the default state of every system.

Colophon

Written by Joey Victorino, co-founder of Qompute AI. This document is free to share and reproduce with attribution. Corrections and additions are welcome — the stack moves fast and so should this list.

If you're deploying at a scale where a mistake is expensive, that's the work I do.
