LiteLLM, vLLM, and Ollama: how the three fit together
A short tour of why we standardize on these three projects and how they divide responsibility in our reference architecture.
By The Bastion AI team
People sometimes ask why we deploy three things instead of one. The short answer is that LiteLLM, vLLM, and Ollama solve different problems, and trying to make any one of them do all three jobs is how you end up with a brittle stack.
LiteLLM is the gateway
LiteLLM is the single OpenAI-compatible endpoint that every application in your environment talks to. It handles auth, rate limits, model allowlists, audit logs, fallbacks, and routing. It is deliberately not an inference engine — it is an honest broker in front of one.
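Because the gateway speaks the OpenAI API, the application side is just the standard OpenAI client pointed at it. A minimal sketch, assuming a proxy reachable at litellm.internal:4000, a LiteLLM virtual key, and a model alias chat-default configured on the gateway (hostname, key, and alias are all illustrative):

```python
# The application talks only to the gateway, never to an engine directly.
# Hostname, key, and model alias below are illustrative assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://litellm.internal:4000",  # the LiteLLM proxy, not an engine
    api_key="sk-litellm-example-key",         # a virtual key issued by the gateway
)

resp = client.chat.completions.create(
    model="chat-default",  # a gateway alias; LiteLLM decides which backend serves it
    messages=[{"role": "user", "content": "Summarize our deployment model."}],
)
print(resp.choices[0].message.content)
```

The point of the sketch: nothing in application code names vLLM or Ollama. Swapping, upgrading, or failing over engines is a gateway-config change, not a code change.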
vLLM is the production engine
vLLM does one thing extremely well: it serves large open-weight models on GPUs at high throughput, with continuous batching and a sane KV-cache story courtesy of PagedAttention. It is what your real traffic should hit.
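vLLM ships an OpenAI-compatible server, which is exactly why it slots in behind LiteLLM so cleanly. A hedged sketch of talking to one directly, say to smoke-test a node before registering it with the gateway, assuming a server launched with something like vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 4 on vllm-host:8000 (model, host, and flags are illustrative):

```python
# Hitting vLLM's OpenAI-compatible endpoint directly. Host and model name are
# assumptions; the model string must match what the server is serving.
from openai import OpenAI

vllm = OpenAI(
    base_url="http://vllm-host:8000/v1",  # vLLM's OpenAI-compatible server
    api_key="EMPTY",                      # ignored unless the server sets --api-key
)

resp = vllm.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```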
Ollama is the edge engine
Ollama is what runs on developer laptops, on a single-GPU box at a remote site, on a ship, on a forward-deployed kit, on the air-gapped subnet that doesn't have a central GPU pool. Same open-weight models (usually as quantized GGUF builds), smaller footprint, more forgiving operations.
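At the edge the call pattern is the same idea. A sketch using the official ollama Python client against a local daemon, assuming a model already pulled with ollama pull llama3.1 (the model tag is illustrative); Ollama also exposes an OpenAI-compatible endpoint at http://localhost:11434/v1, so the gateway treats it like any other backend:

```python
# A local smoke test against the Ollama daemon on its default port (11434).
# Assumes the llama3.1 tag has been pulled; the tag is illustrative.
import ollama

resp = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp["message"]["content"])
```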
Put them behind one gateway, and the result is a single API surface, two execution substrates, and a clean separation of concerns.
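In code terms, the whole arrangement collapses to one routing table. A minimal sketch using LiteLLM's Router class (in production you would express the same thing in the proxy's config file), with the hostnames, model names, and fallback pairing all illustrative:

```python
# One alias, two substrates: vLLM serves it by default, an Ollama box catches
# traffic when the GPU pool is unreachable. All endpoints are illustrative.
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "chat-default",  # the alias applications use
            "litellm_params": {
                "model": "hosted_vllm/meta-llama/Llama-3.1-70B-Instruct",
                "api_base": "http://vllm-host:8000/v1",
            },
        },
        {
            "model_name": "chat-edge",
            "litellm_params": {
                "model": "ollama/llama3.1",
                "api_base": "http://edge-box:11434",
            },
        },
    ],
    fallbacks=[{"chat-default": ["chat-edge"]}],  # fail over to the edge engine
)

resp = router.completion(
    model="chat-default",
    messages=[{"role": "user", "content": "Which engine answered this?"}],
)
print(resp.choices[0].message.content)
```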