"The industry measures tokens per second. Users experience time to first token. These are not the same thing."
When evaluating LLM performance, the default metric has become "tokens per second" (TPS). It's a clean number, easy to benchmark, and satisfying to optimize. But it fundamentally misrepresents what users actually experience.
Time-to-First-Token (TTFT) measures how long a user waits before seeing any response. This is the moment of maximum uncertainty—the user doesn't know if their request was received, if the model is processing, or if something went wrong. Reducing TTFT from 2 seconds to 100ms transforms the experience from "waiting for a computer" to "having a conversation."
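To make the distinction concrete, here's a minimal sketch of how the two metrics are measured from a streaming response. The `stream_tokens` generator is a hypothetical stand-in for any streaming LLM client, with the delays invented for illustration; the timing logic is the point.

```python
import time
from typing import Iterable, Iterator

def stream_tokens() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM client (delays are made up)."""
    time.sleep(0.1)  # simulated prefill before the first token
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.02)  # simulated per-token decode delay
        yield tok

def measure_stream(tokens: Iterable[str]) -> dict:
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in tokens:
        if ttft is None:
            # TTFT: the wait before ANY output -- the part the user feels most
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return {
        "ttft_s": ttft,
        "total_s": total,
        # Sustained decode rate, measured only after the first token arrives
        "tps": (count - 1) / (total - ttft) if count > 1 else 0.0,
    }

print(measure_stream(stream_tokens()))
```

Note that TTFT and TPS come from different phases of the request: TTFT is dominated by queueing and prefill, TPS by decode. A single blended number hides whichever phase is hurting you.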
Consider two systems: System A starts generating after 2 seconds and produces 100 tokens per second. System B starts generating after 100ms and produces 50 tokens per second. For a 200-token response, System A finishes in 4 seconds. System B finishes in 4.1 seconds. The TPS-obsessed would choose System A. But users overwhelmingly prefer System B.
System A: 2s TTFT + 100 TPS = first token at 2.0s, 200 tokens done at 4.0s
System B: 100ms TTFT + 50 TPS = first token at 0.1s, 200 tokens done at 4.1s
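The arithmetic behind these numbers is a one-line model: total latency is TTFT plus token count divided by decode rate. A quick sketch, using the values from the example above:

```python
def total_latency(ttft_s: float, tps: float, n_tokens: int) -> float:
    """Total wall-clock time: initial wait plus steady-state decoding."""
    return ttft_s + n_tokens / tps

# System A: slow start, fast decode
print(total_latency(2.0, 100, 200))  # 4.0 seconds
# System B: fast start, slower decode
print(total_latency(0.1, 50, 200))   # 4.1 seconds
```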
Research in human-computer interaction shows that perceived wait time is non-linear. A user who waits 2 seconds before seeing a response perceives the wait as significantly longer than a user who sees immediate progressive output. The first token is not just data—it's acknowledgment. It says "I heard you, I'm working."
First token, fast. Every time.
We optimize for user experience, not benchmark performance. The first token should arrive before the user's attention wanders.
Total throughput still matters for long-form generation, but it should never come at the cost of responsiveness. The best systems optimize both: fast first token AND high sustained throughput.
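One way to encode that principle in practice is to gate benchmark results on both axes instead of ranking by a single throughput number. A sketch of the idea, with threshold values chosen purely for illustration, not as a standard:

```python
from dataclasses import dataclass

@dataclass
class LatencyBudget:
    max_ttft_s: float = 0.2  # illustrative threshold, not a standard
    min_tps: float = 30.0    # illustrative threshold, not a standard

    def passes(self, ttft_s: float, tps: float) -> bool:
        # A configuration must clear BOTH bars; raw TPS alone never wins.
        return ttft_s <= self.max_ttft_s and tps >= self.min_tps

budget = LatencyBudget()
print(budget.passes(ttft_s=2.0, tps=100))  # False: fast decode can't buy back a slow start
print(budget.passes(ttft_s=0.1, tps=50))   # True: responsive, and fast enough overall
```

Treating responsiveness as a hard constraint and throughput as the quantity to maximize within it keeps optimization work pointed at what users actually feel.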
The future of LLM optimization isn't about squeezing more tokens out of each second. It's about respecting the user's time from the very first millisecond.