"The industry measures tokens per second. Users experience time to first token. These are not the same thing."
When evaluating LLM performance, the default metric has become "tokens per second" (TPS). It's a clean number, easy to benchmark, and satisfying to optimize. But it fundamentally misrepresents what users actually experience.
Time-to-First-Token (TTFT) measures how long a user waits before seeing any response. This is the moment of maximum uncertainty—the user doesn't know if their request was received, if the model is processing, or if something went wrong. Reducing TTFT from 2 seconds to 100ms transforms the experience from "waiting for a computer" to "having a conversation."
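To make the distinction concrete, here's a minimal sketch of how the two metrics are measured from a streaming response. The `stream_tokens` generator is a hypothetical stand-in for any streaming LLM client, with the delays invented for illustration; the timing logic is the point.

```python
import time
from typing import Iterable, Iterator

def stream_tokens() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM client (delays are made up)."""
    time.sleep(0.1)  # simulated prefill before the first token
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.02)  # simulated per-token decode delay
        yield tok

def measure_stream(tokens: Iterable[str]) -> dict:
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in tokens:
        if ttft is None:
            # TTFT: the wait before ANY output -- the part the user feels most
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return {
        "ttft_s": ttft,
        "total_s": total,
        # Sustained decode rate, measured only after the first token arrives
        "tps": (count - 1) / (total - ttft) if count > 1 else 0.0,
    }

print(measure_stream(stream_tokens()))
```

Note that TTFT and TPS come from different phases of the request: TTFT is dominated by queueing and prefill, TPS by decode. A single blended number hides whichever phase is hurting you.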
Consider two systems: System A starts generating after 2 seconds and produces 100 tokens per second. System B starts generating after 100ms and produces 50 tokens per second. For a 200-token response, System A finishes in 4 seconds. System B finishes in 4.1 seconds. The TPS-obsessed would choose System A. But users overwhelmingly prefer System B.
System A: 2s TTFT + 100 TPS = first token at 2.0s, 200 tokens done at 4.0s
System B: 100ms TTFT + 50 TPS = first token at 0.1s, 200 tokens done at 4.1s
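The arithmetic behind these numbers is a one-line model: total latency is TTFT plus token count divided by decode rate. A quick sketch, using the values from the example above:

```python
def total_latency(ttft_s: float, tps: float, n_tokens: int) -> float:
    """Total wall-clock time: initial wait plus steady-state decoding."""
    return ttft_s + n_tokens / tps

# System A: slow start, fast decode
print(total_latency(2.0, 100, 200))  # 4.0 seconds
# System B: fast start, slower decode
print(total_latency(0.1, 50, 200))   # 4.1 seconds
```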
Research in human-computer interaction shows that perceived wait time is non-linear. A user who waits 2 seconds before seeing a response perceives the wait as significantly longer than a user who sees immediate progressive output. The first token is not just data—it's acknowledgment. It says "I heard you, I'm working."
First token, fast. Every time.
We optimize for user experience, not benchmark performance. The first token should arrive before the user's attention wanders.
Total throughput still matters for long-form generation, but it should never come at the cost of responsiveness. The best systems optimize both: fast first token AND high sustained throughput.
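One way to encode that principle in practice is to gate benchmark results on both axes instead of ranking by a single throughput number. A sketch of the idea, with threshold values chosen purely for illustration, not as a standard:

```python
from dataclasses import dataclass

@dataclass
class LatencyBudget:
    max_ttft_s: float = 0.2  # illustrative threshold, not a standard
    min_tps: float = 30.0    # illustrative threshold, not a standard

    def passes(self, ttft_s: float, tps: float) -> bool:
        # A configuration must clear BOTH bars; raw TPS alone never wins.
        return ttft_s <= self.max_ttft_s and tps >= self.min_tps

budget = LatencyBudget()
print(budget.passes(ttft_s=2.0, tps=100))  # False: fast decode can't buy back a slow start
print(budget.passes(ttft_s=0.1, tps=50))   # True: responsive, and fast enough overall
```

Treating responsiveness as a hard constraint and throughput as the quantity to maximize within it keeps optimization work pointed at what users actually feel.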
The future of LLM optimization isn't about squeezing more tokens out of each second. It's about respecting the user's time from the very first millisecond.