How a Bastion AI deployment fits into your network.
The whole stack is open source, and the whole stack runs on your side of the firewall. We design, deploy, and operate it; you keep the keys, the hardware, and the logs.
Six pieces. All open source. All inside your perimeter.
LiteLLM gateway
Single entrypoint
Every application in your environment calls one OpenAI-compatible endpoint. LiteLLM enforces team-scoped API keys, rate limits, model allowlists, prompt logging, and audit trails. It also handles routing: small models for cheap tasks, large models for hard ones, fallbacks when a node is draining.
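To make "one endpoint" concrete, here is a minimal client sketch using the standard OpenAI Python client. The gateway hostname, key, and model alias are hypothetical placeholders, not values from a real deployment:

```python
# Minimal sketch: calling the gateway with the standard OpenAI client.
# The base_url, api_key, and model alias below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-gateway.internal.example/v1",  # LiteLLM gateway, not api.openai.com
    api_key="sk-team-scoped-key",  # issued per team; carries rate limits and allowlists
)

resp = client.chat.completions.create(
    model="general-chat",  # gateway-side alias; LiteLLM resolves it to a backing model
    messages=[{"role": "user", "content": "Summarize this incident report."}],
)
print(resp.choices[0].message.content)
```

Because applications only ever see the alias, routing changes (new models, fallbacks, drained nodes) never require an application redeploy.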
vLLM cluster
Production inference
vLLM serves the bulk of production traffic on your GPUs with continuous batching for high throughput. We provision the cluster, tune kernel and KV-cache settings for your hardware, set up autoscaling on Kubernetes or systemd, and ship the metrics into your existing Prometheus / Grafana / Datadog stack.
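The easiest way to see continuous batching at work is vLLM's Python API. A rough sketch with an illustrative model and untuned settings; in production the same engine runs behind vLLM's OpenAI-compatible HTTP server, with the gateway in front:

```python
# Offline sketch of vLLM's batched generation. Model name and settings are
# illustrative assumptions, not tuned values for any particular hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any HF-format model mirrored into your registry
    gpu_memory_utilization=0.90,   # fraction of VRAM given to weights + KV cache
    tensor_parallel_size=1,        # raise to shard one model across several GPUs
)

params = SamplingParams(temperature=0.2, max_tokens=256)
# Many prompts in one call: vLLM batches them continuously on the GPU.
outputs = llm.generate(
    ["Draft a maintenance checklist.", "Translate this log line to plain English."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```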
Ollama nodes
Edge & developer inference
Ollama covers everything that doesn't belong on the central GPU pool: developer laptops, branch offices, ships, forward-deployed sites, classified networks with no central GPU at all. Same model files, same API surface, same gateway in front.
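"Same API surface" is literal: Ollama speaks the OpenAI-compatible API (on port 11434 by default), so the client code from the gateway example works unchanged against a laptop or edge node; only the URL moves. The model tag below is an illustrative example of one you might have pulled locally:

```python
# The same OpenAI client, pointed at a local Ollama node instead of the gateway.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # local Ollama daemon's OpenAI-compatible API
    api_key="ollama",  # Ollama ignores the key, but the client requires one
)

resp = client.chat.completions.create(
    model="llama3.1:8b",  # whatever model tag is pulled on this node (example)
    messages=[{"role": "user", "content": "What changed in today's config diff?"}],
)
print(resp.choices[0].message.content)
```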
Coder workspaces
AI-native, on-prem IDE
Engineers and analysts build with AI inside browser-based workspaces that never leave the network. VS Code, JetBrains, and JupyterLab integrations. Workspaces auto-bind to the LiteLLM gateway so Copilot-style workflows work without sending source code outside.
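A sketch of one way that binding can look, assuming the workspace template injects the standard environment variables the OpenAI client already understands (the variable values shown are hypothetical):

```python
# Inside a workspace, tools need no hardcoded endpoint: the OpenAI client reads
# OPENAI_BASE_URL and OPENAI_API_KEY from the environment, which the workspace
# template can inject at creation. Values below are hypothetical, e.g.:
#   OPENAI_BASE_URL=https://llm-gateway.internal.example/v1
#   OPENAI_API_KEY=<per-user, team-scoped key>
import os
from openai import OpenAI

print("Gateway:", os.environ.get("OPENAI_BASE_URL"))

client = OpenAI()  # picks up both variables automatically
resp = client.chat.completions.create(
    model="code-assist",  # gateway-side alias (hypothetical)
    messages=[{"role": "user", "content": "Explain this stack trace."}],
)
print(resp.choices[0].message.content)
```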
Object storage & registry
Sovereign artifacts
Model weights, fine-tunes, and audit logs live on storage you operate (MinIO, NetApp, Pure, or any S3-compatible system). We mirror the upstream model catalog into your registry so air-gapped sites still get verified, signed updates.
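As a sketch of what "storage you operate" looks like to a node pulling weights, here's a boto3 fetch against an S3-compatible endpoint (MinIO shown). Endpoint, credentials, bucket, and object key are all hypothetical; the point is that weights come from your infrastructure, not the public internet:

```python
# Pull one shard of a mirrored model from an internal S3-compatible store.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.internal.example:9000",  # your MinIO or other S3 API endpoint
    aws_access_key_id="local-access-key",
    aws_secret_access_key="local-secret-key",
)

s3.download_file(
    Bucket="model-registry",
    Key="llama-3.1-8b-instruct/model-00001-of-00002.safetensors",
    Filename="/var/cache/models/model-00001-of-00002.safetensors",
)
```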
Identity & access
Plug into what you have
SAML, OIDC, SCIM, mTLS. Active Directory, Okta, Ping, Keycloak, and custom IdPs are all supported. No new user directory to manage, no new password to leak.
From kickoff to production in about a month.
Most regulated organizations are used to AI projects taking quarters. Ours don't. The reference stack and our deployment automation get you to a working system fast.
- Week 1
Discovery & sizing
Workshops with your platform, security, and AI teams. We document the network, model needs, and compliance constraints, then size the GPU footprint.
- Week 2
Pilot deployment
We deploy the reference stack into a staging environment in your network. Your team gets working access through real applications by the end of the week.
- Weeks 3-4
Production cutover
Hardening, capacity sizing, runbooks, on-call integration, and SAML/OIDC wiring. Production launch with audit, monitoring, and rollback in place.
- Ongoing
Operate & evolve
Optional 24/7 managed operations: model upgrades, capacity expansion, incident response, and quarterly architecture reviews.
Want this diagram tailored to your environment?
Send us a few details and we'll come back with a concrete deployment proposal — hardware, network, and timeline — for your team.