Pricing
Predictable tiers. No per-token surprises.
Inference runs on your hardware, so you never pay us per token. You pay for the platform, the deployment work, and the level of operations you want from our team.
Pilot
$25k
one-time, fixed scope
A 4-week pilot deployment in a non-production segment of your network.
- Reference architecture deployed in your environment
- Up to 2 inference nodes (your hardware)
- Up to 25 internal users
- LiteLLM + vLLM + Ollama configured (example below)
- Office-hours support during pilot
- Migration path to Production tier
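Because the pilot gateway exposes an OpenAI-compatible API, internal teams can point existing tooling at it without a vendor SDK. A minimal sketch of what pilot users see, assuming a hypothetical internal hostname, model alias, and key issued by your deployment; all inference stays on your hardware:

```python
# Minimal sketch: calling the on-prem LiteLLM gateway from an internal app.
# The hostname, port, model alias, and API key below are placeholders;
# the real values come from your pilot deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://litellm.internal.example:4000",  # assumed internal gateway address
    api_key="sk-internal-placeholder",                # key issued by the gateway, not OpenAI
)

response = client.chat.completions.create(
    model="llama3-internal",  # alias the LiteLLM config maps to a vLLM or Ollama backend
    messages=[{"role": "user", "content": "Summarize yesterday's incident report."}],
)
print(response.choices[0].message.content)
```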
Production
Most popular
From $12k
per month
Production-grade deployment with managed operations and on-call coverage.
- Unlimited internal users & applications
- HA inference cluster sized for your load
- Coder workspaces for development teams
- SAML / OIDC / SCIM identity integration
- Audit log shipping to your SIEM
- Business-hours managed operations
- Quarterly model & capacity review
Enterprise
Custom
annual
For government agencies, large hospital systems, and multi-site deployments.
- Multi-site & multi-region deployments
- Air-gapped & classified-network support
- 24/7 managed operations and on-call
- Dedicated solutions architect
- Custom model fine-tuning pipeline
- Compliance evidence packs (FedRAMP, HIPAA, etc.)
- Named SLA with penalties
FAQ
Common questions about pricing.
- Do we pay for inference compute?
- No. You provision and own the GPU hardware. Our pricing covers the platform, deployment, and operations — not the silicon underneath.
- Can we run this fully air-gapped?
- Yes. We support fully disconnected networks, with mirrored model registries and signed offline updates. The Enterprise tier is built around this.
- What hardware do we need?
- Anything from a single GPU server up to multi-node clusters; we commonly deploy on NVIDIA L40S, H100, and H200 hardware. We help you size during discovery.
- What happens if we want to take operations in-house?
- Everything is open source and deployed as code in your environment. You can drop the managed-ops portion at any renewal and keep running the platform yourselves.