How's your inference server doing?
Open Responses API now supported in the llmnop v0.6.0 release
I built llmnop because I wanted a fast and correct way to answer “how’s my inference server doing?” without any additional dependencies or overhead. It’s a single binary that works with any OpenAI-compatible endpoint, such as vLLM, Ollama, LM Studio, or the OpenAI API itself. It gives you the numbers that matter: time to first token, inter-token latency, throughput, and end-to-end latency.
I use it constantly while tuning servers. Change a batch size, run llmnop, see what happened. Swap quantization levels, run llmnop, compare. This project came from me wanting to understand LLM performance at a practical level.
Version 0.6.0 adds support for the Open Responses API.
What’s new
Responses API support. The --api responses flag switches from Chat Completions to the new Responses API. You get the same streaming metrics (TTFT, inter-token latency, throughput, and end-to-end latency), just over a different wire format.
llmnop --api responses \
--url https://api.openai.com/v1 \
--api-key $OPENAI_API_KEY \
--model gpt-5-mini
The Responses API combines the simplicity of Chat Completions with the capabilities of the Assistants API. OpenAI designed it with reasoning models in mind, and the emerging Open Responses standard means the same client code works across providers: vLLM, HuggingFace TGI, and others are adopting it.
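Because the protocol is the same, only the endpoint and model name change. Here’s a hypothetical run against a local vLLM server; the URL and model are placeholders for whatever you’re serving, and --api-key is omitted on the assumption that your local server doesn’t require one.
llmnop --api responses \
--url http://localhost:8000/v1 \
--model meta-llama/Llama-3.1-8B-Instruct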
Server-reported token counts. Some models don’t stream reasoning tokens. They just report the count in the final usage payload. The new --use-server-token-count flag tells llmnop to trust the server’s numbers instead of tokenizing locally.
This matters for hidden reasoning. Without it, throughput would be understated: you’d divide only the visible output tokens by a generation window that includes invisible reasoning time. With it, metrics stay accurate even when you can’t see what the model is thinking.
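For example, here’s a sketch that adds the flag to the earlier Responses API run; the endpoint and model are the same as above, and the flag order is just one way to write it.
llmnop --api responses \
--use-server-token-count \
--url https://api.openai.com/v1 \
--api-key $OPENAI_API_KEY \
--model gpt-5-mini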
Faster startup. First-run latency dropped from ~10 seconds to under a second for typical tokenizers. The Shakespeare corpus used for synthetic prompt generation now tokenizes in parallel chunks instead of one giant string. Same prompts, less waiting.
Breaking changes
None. The Chat Completions API remains the default.
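So an existing invocation like the one below keeps benchmarking Chat Completions exactly as before. The URL assumes Ollama’s OpenAI-compatible endpoint on its default port (11434), the model name is a placeholder, and no API key is passed on the assumption that a local server doesn’t need one.
llmnop --url http://localhost:11434/v1 \
--model llama3.1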
Try it
cargo install llmnop
Or grab a binary from the release page.
Let me know how it works for you!

