Bastion AI

LiteLLM, vLLM, and Ollama: how the three fit together

A short tour of why we standardize on these three projects and how they divide responsibility in our reference architecture.

By The Bastion AI team

People sometimes ask why we deploy three things instead of one. The short answer is that LiteLLM, vLLM, and Ollama solve different problems, and trying to make any one of them do all three jobs is how you end up with a brittle stack.

LiteLLM is the gateway

LiteLLM is the single OpenAI-compatible endpoint that every application in your environment talks to. It handles auth, rate limits, model allowlists, audit logs, fallbacks, and routing. It is deliberately not an inference engine — it is an honest broker in front of one.
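To make the broker role concrete, here is a toy sketch of gateway-style fallback routing in plain Python. It is not LiteLLM's implementation, and the backend names are hypothetical; it just shows the pattern: try backends in priority order, record failures for the audit trail, and return the first success.

```python
# Illustrative sketch of gateway-style fallback routing (not LiteLLM
# internals): try each backend in priority order, return the first
# successful response, and keep the failures for audit logging.

def route_with_fallback(backends, request):
    """`backends` is an ordered list of (name, callable) pairs; each
    callable raises on failure. Hypothetical shape, for illustration."""
    errors = []
    for name, call in backends:
        try:
            return name, call(request)
        except Exception as exc:
            errors.append((name, exc))  # would go to the audit log
    raise RuntimeError(f"all backends failed: {errors}")


# Two fake backends standing in for real engines:
def flaky_primary(request):
    raise ConnectionError("primary GPU pool unavailable")

def healthy_fallback(request):
    return {"model": "fallback", "content": f"echo: {request}"}

name, resp = route_with_fallback(
    [("vllm-pool", flaky_primary), ("ollama-edge", healthy_fallback)],
    "hello",
)
```

The point of putting this logic in the gateway rather than in every application is that the fallback policy changes in one place, and every client gets it for free.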

vLLM is the production engine

vLLM does one thing extremely well: it serves large open-weight models on GPUs at high throughput, with continuous batching and PagedAttention-style KV-cache management. It is what your real traffic should hit.
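Why continuous batching matters can be shown with a toy step-count model (purely illustrative, nothing like vLLM's actual scheduler): with static batching, a batch of requests must all finish before the next batch is admitted, so short requests wait on long ones; with continuous batching, a finished sequence's slot is refilled on the very next decode step.

```python
# Toy comparison of static vs. continuous batching. `lengths` are the
# decode steps each request needs; `capacity` is how many sequences fit
# in a batch at once. Illustrative only, not vLLM internals.

def static_batch_steps(lengths, capacity):
    # Static batching: run a full batch and wait until ALL sequences in
    # it finish before admitting the next batch.
    steps = 0
    for i in range(0, len(lengths), capacity):
        steps += max(lengths[i:i + capacity])
    return steps

def continuous_batch_steps(lengths, capacity):
    # Continuous batching: a finished sequence frees its slot
    # immediately, so a waiting request joins on the next step.
    pending = list(lengths)
    active = []
    steps = 0
    while pending or active:
        while pending and len(active) < capacity:
            active.append(pending.pop(0))
        steps += 1
        active = [n - 1 for n in active if n - 1 > 0]
    return steps

lengths = [100, 10, 10, 10]  # one long request, three short ones
```

With capacity 2, the static scheduler spends 110 steps (the three short requests queue behind the long one), while the continuous scheduler finishes in 100: the short requests drain through the slot next to the long one.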

Ollama is the edge engine

Ollama is what runs on developer laptops, on a single GPU box at a remote site, on a ship, on a forward-deployed kit, on the air-gapped subnet that doesn't have a central GPU pool. The same model families (typically in quantized form), a smaller footprint, more forgiving operations.

Put them behind one gateway and the result is a single API surface, three substrates, and a clean separation of concerns.
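"Single API surface" cashes out as: every client builds the same OpenAI-style chat-completions payload, and only the gateway address and the model alias differ between environments. A minimal sketch, with hypothetical model names:

```python
# Sketch of the single API surface: clients build the same OpenAI-style
# chat payload everywhere; which engine serves it is the gateway's
# concern. Model aliases below are hypothetical placeholders.

def chat_request(model, prompt):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

# Identical payload shape whether the gateway routes to vLLM or Ollama:
prod = chat_request("llama-3-70b", "status report")  # -> GPU pool (vLLM)
edge = chat_request("llama-3-8b", "status report")   # -> laptop (Ollama)
```

Because the shape never changes, an application written against the gateway in the datacenter runs unmodified against the edge box; only configuration moves.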