Bastion AI
Architecture

How a Bastion AI deployment fits into your network.

The whole stack is open source, and the whole stack runs on your side of the firewall. We design, deploy, and operate it; you keep the keys, the hardware, and the logs.

[Diagram: Bastion AI on-prem reference architecture. Customer applications connect through a LiteLLM gateway hosted inside the customer's network. The gateway routes to vLLM and Ollama inference engines on private GPUs. Coder provides developer workspaces. Nothing leaves the secure perimeter.]
Components

Six pieces. All open source. All inside your perimeter.

LiteLLM gateway

Single entrypoint

Every application in your environment calls one OpenAI-compatible endpoint. LiteLLM enforces team-scoped API keys, rate limits, and model allowlists, and captures prompt and audit logs. It also handles routing: small models for cheap tasks, large models for hard ones, fallbacks when a node is draining.
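
From an application's side, that's a single client. A minimal sketch with the official openai Python SDK; the gateway hostname, team key, and model alias are placeholders for whatever your deployment uses:

```python
# A minimal client, using the official openai Python SDK.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.internal.example/v1",  # the LiteLLM gateway, inside your network
    api_key="sk-team-analytics-...",             # team-scoped key: quotas, allowlist, audit trail
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # an alias the gateway routes to vLLM or Ollama
    messages=[{"role": "user", "content": "Summarize this incident report."}],
)
print(response.choices[0].message.content)
```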

vLLM cluster

Production inference

vLLM serves the bulk of production traffic on your GPUs, using continuous batching for high throughput. We provision the cluster, tune kernel and KV-cache settings for your hardware, run it with autoscaling under Kubernetes or as plain systemd services, and ship the metrics into your existing Prometheus / Grafana / Datadog stack.
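
Production requests reach vLLM through the gateway, but the engine also exposes a direct Python API that's useful for batch work on the same GPUs. A sketch, with the model name standing in for whatever is mirrored into your registry:

```python
# vLLM's offline Python API; production requests arrive via the gateway instead.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # in practice, pulled from your registry
params = SamplingParams(temperature=0.2, max_tokens=256)

# Continuous batching: vLLM schedules all outstanding prompts onto the GPU together.
outputs = llm.generate(
    ["Draft a change-control summary.", "Classify this log line."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```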

Ollama nodes

Edge & developer inference

Ollama covers everything that doesn't belong on the central GPU pool: developer laptops, branch offices, ships, forward-deployed sites, and classified networks with no central GPU at all. Same model files, same API surface, same gateway in front.
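
Because Ollama speaks the same OpenAI-compatible protocol, client code doesn't change; only the base URL does. A sketch against a local node on Ollama's default port:

```python
# Identical client code, pointed at a local Ollama node instead of the cluster.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's default OpenAI-compatible endpoint
    api_key="ollama",                      # Ollama ignores the key; the SDK requires one
)

response = client.chat.completions.create(
    model="llama3.1:8b",  # the same weights the central cluster serves
    messages=[{"role": "user", "content": "Works on a laptop, offline."}],
)
print(response.choices[0].message.content)
```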

Coder workspaces

AI-native, on-prem IDE

Engineers and analysts build with AI inside browser-based workspaces that never leave the network. VS Code, JetBrains, and JupyterLab integrations. Workspaces auto-bind to the LiteLLM gateway so Copilot-style workflows work without sending source code outside.
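
One plausible shape for that auto-binding, sketched as an assumption rather than a fixed mechanism: the workspace injects OPENAI_BASE_URL and OPENAI_API_KEY, the two environment variables the openai SDK reads by default, so unmodified tooling finds the gateway on its own.

```python
# Inside a workspace: no hardcoded endpoints, no keys in source.
from openai import OpenAI

# The SDK reads OPENAI_BASE_URL and OPENAI_API_KEY from the environment,
# which the workspace template injects at creation time (deployment-specific).
client = OpenAI()

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # hypothetical gateway alias
    messages=[{"role": "user", "content": "Explain this stack trace."}],
)
print(resp.choices[0].message.content)
```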

Object storage & registry

Sovereign artifacts

Model weights, fine-tunes, and audit logs live on storage you operate (MinIO, NetApp, Pure, S3-compatible). We mirror the upstream model catalog into your registry so air-gapped sites still get verified, signed updates.
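
Since the store is S3-compatible, standard clients work unchanged. A sketch with boto3; the endpoint, credentials, bucket, and object names are all illustrative:

```python
# Fetching weights from a sovereign, S3-compatible store (MinIO shown here).
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.internal.example:9000",  # your object store, not AWS
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
)

s3.download_file(
    "model-registry",                           # mirrored, signed model catalog
    "llama-3.1-8b-instruct/model.safetensors",  # object key
    "/models/llama-3.1-8b-instruct/model.safetensors",
)
```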

Identity & access

Plug into what you have

SAML, OIDC, SCIM, mTLS. Active Directory, Okta, Ping, Keycloak, custom IdPs — all supported. No new user directory to manage, no new password to leak.

Timeline

From kickoff to production in about a month.

Most regulated organizations are used to AI projects taking quarters. Ours don't. The reference stack and our deployment automation get you to a working system fast.

  1. Week 1

    Discovery & sizing

    Workshops with your platform, security, and AI teams. We document the network, model needs, and compliance constraints, then size the GPU footprint.

  2. Week 2

    Pilot deployment

    We deploy the reference stack into a staging environment in your network. Your team gets working access through real applications by the end of the week.

  3. Weeks 3–4

    Production cutover

    Hardening, capacity sizing, runbooks, on-call integration, and SAML/OIDC wiring. Production launch with audit, monitoring, and rollback in place.

  4. Ongoing

    Operate & evolve

    Optional 24/7 managed operations: model upgrades, capacity expansion, incident response, and quarterly architecture reviews.

Want this diagram tailored to your environment?

Send us a few details and we'll come back with a concrete deployment proposal — hardware, network, and timeline — for your team.