Counting piano tuners, counting tokens

The famous Fermi problem goes: how many piano tuners are there in Chicago? Enrico Fermi liked to make his students answer it in their heads. Three million people, maybe one piano per twenty households, pianos get tuned roughly once a year, a tuner can do a few houses a day for two hundred days a year. Out the other end falls a number that’s correct to about a factor of two. The point wasn’t piano tuners. The point was that you can pin a real quantity to within an order of magnitude from first principles, without measuring anything, as long as you pick the right multiplicands.

Peter Steinberger posted his Claude Code billing for the last 30 days: 603 billion tokens, about $1.3 million in API fees. One person. One month.

The number is striking on its own. What I keep getting stuck on is the pitch behind it. Every frontier lab is telling investors and customers the same story right now: Peter is just early. The plan is for every knowledge worker to have a swarm of agents running in the background, all day, every day. At Peter’s per-user intensity.

Joules per token

Every output token a frontier mixture-of-experts (MoE) model produces costs some energy at the chip. Multiply by datacenter overhead (PUE) and you get energy at the wall. Multiply that by a daily token rate and you get continuous power. Multiply that by population and you get a number you can compare to nameplate grid capacity.

The whole chain hangs off the first number. Joules per token at the wall. So we start there.

Eq `token_energy`: at the wall

$E_{tok} = \mathrm{PUE} \cdot \sum_{g \in G} f_g \cdot \frac{P_g}{T_g}$

Symbol	Meaning	Default
`G`	GPU types in the fleet	`{H100, B200}`
`f_g`	Fleet fraction of GPU `g`, sums to 1	0.5 / 0.5
`P_g`	GPU TDP, watts	700 / 1000
`T_g`	Batched output throughput, tokens/sec/GPU	200 / 600
`PUE`	Datacenter power usage effectiveness	1.4

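For a 50/50 H100/B200 mix at PUE 1.4, the defaults evaluate to about 3.6 J per output token: 1.4 · (0.5·700/200 + 0.5·1000/600) ≈ 3.62. The worked numbers in the rest of this post carry a rounded 3 J forward.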

A note on T_g. Frontier-MoE serving throughput depends on batch size, model architecture, KV-cache discipline, and whether you’re using speculative decoding. The 200 and 600 here are middle-of-the-road estimates for production serving of a Claude-Sonnet/GPT-5.5-class model on each GPU. Swap your own numbers in.
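To make swapping numbers in concrete, here is a minimal Python sketch of Eq `token_energy`. The fleet values are the table defaults; the dictionary layout and the function name `token_energy` are illustrative choices, not anything from a real serving stack.

```python
# Fleet defaults copied from the table above:
# (fleet fraction f_g, TDP P_g in watts, throughput T_g in tokens/sec/GPU).
FLEET = {
    "H100": (0.5, 700.0, 200.0),
    "B200": (0.5, 1000.0, 600.0),
}
PUE = 1.4


def token_energy(fleet, pue):
    """Joules per output token at the wall: PUE * sum_g f_g * P_g / T_g."""
    total_fraction = sum(f for f, _, _ in fleet.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("fleet fractions must sum to 1")
    return pue * sum(f * p / t for f, p, t in fleet.values())


e_tok = token_energy(FLEET, PUE)
print(f"E_tok = {e_tok:.2f} J per output token")
```

Edit the tuples in `FLEET` to try a different mix or your own throughput estimates; everything downstream scales linearly with the result.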

Eq `cache_adjust`: what agent traffic actually looks like

The 3 J number is for a naive token, billed at full output-equivalent compute. Real agent traffic is nothing like that. A typical session is mostly the same prompt prefix re-read on every turn, hitting prefix cache. Cache hits skip ~99% of the attention compute. So the effective cost per nominal token is much lower than the chip-level number suggests.

$E_{eff} = E_{tok} \cdot (\alpha_o + \alpha_m \cdot \beta_m + \alpha_h \cdot \beta_h)$

Symbol	Meaning	Realistic default
`α_o`	Share of nominal tokens that are output	0.05
`α_m`	Share that are cache-miss input	0.20
`α_h`	Share that are cache-hit input	0.75
`β_m`	Cache-miss input cost vs output token	0.10
`β_h`	Cache-hit input cost vs output token	0.01

The defaults represent an average user running an agent through a working day: frequent tool use, some RAG, gaps between sessions. Peter himself sits at the optimistic edge of this range (closer to 0.02 / 0.08 / 0.90, since his sessions are long and continuous on a stable codebase); workloads that don’t cache cleanly sit further out. Plugging the realistic defaults in:

E_eff ≈ E_tok · (0.05 + 0.20·0.10 + 0.75·0.01)
      ≈ E_tok · 0.0775
      ≈ 0.23 J / nominal token

Agent-shaped traffic at this profile is roughly 13× cheaper than the naive number (1 / 0.0775 ≈ 12.9). Bill every nominal token at the full output rate instead and the same workload costs 13× more: closer to 700 kW per person at the per-user rate we compute next.
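The same arithmetic as a small Python sketch, in case you want to try other shares. It assumes the rounded E_tok of 3 J used in the worked numbers above; the function name and argument order are illustrative.

```python
E_TOK = 3.0  # J per output token; the rounded value used in the worked numbers


def effective_energy(e_tok, a_out, a_miss, a_hit, b_miss=0.10, b_hit=0.01):
    """Eq cache_adjust: joules per nominal token after the cache discount."""
    if abs(a_out + a_miss + a_hit - 1.0) > 1e-9:
        raise ValueError("token shares must sum to 1")
    return e_tok * (a_out + a_miss * b_miss + a_hit * b_hit)


# Realistic defaults from the table: 5% output, 20% cache-miss, 75% cache-hit.
e_eff = effective_energy(E_TOK, 0.05, 0.20, 0.75)
discount = E_TOK / e_eff  # how much cheaper than billing every token as output

print(f"E_eff = {e_eff:.3f} J per nominal token ({discount:.1f}x cheaper)")
```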

Eq `user_load`: from tokens to watts

$P_{user} = R \cdot E_{eff} / 86400$

R is tokens per day per user, 86400 is seconds in a day. At Peter’s volume (603B/month → ~20B/day) and the realistic E_eff = 0.23 J/tok:

P_user = 20e9 · 0.23 / 86400 ≈ 53,000 W ≈ 53 kW continuous

One Peter-rate user (someone burning Peter’s daily token volume at average cache behavior) draws roughly the same continuous power as forty average US households. Peter himself, on his more cache-friendly profile, comes in around 26 kW. Half that, but still twenty households’ worth of grid for one person.
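A short sketch of Eq `user_load` for both profiles. It assumes the post's rounded inputs (20B tokens/day, E_tok = 3 J); the helper name `user_power_watts` is made up for illustration.

```python
SECONDS_PER_DAY = 86_400
TOKENS_PER_DAY = 20e9  # 603B tokens over 30 days, rounded as in the text
E_TOK = 3.0            # J per output token, rounded as in the text


def user_power_watts(tokens_per_day, e_eff):
    """Eq user_load: continuous watts for one user at a given daily token rate."""
    return tokens_per_day * e_eff / SECONDS_PER_DAY


# E_eff for the average-cache defaults (0.05 / 0.20 / 0.75).
e_eff_default = E_TOK * (0.05 + 0.20 * 0.10 + 0.75 * 0.01)
# E_eff for Peter's more cache-friendly profile (0.02 / 0.08 / 0.90).
e_eff_peter = E_TOK * (0.02 + 0.08 * 0.10 + 0.90 * 0.01)

p_default = user_power_watts(TOKENS_PER_DAY, e_eff_default)
p_peter = user_power_watts(TOKENS_PER_DAY, e_eff_peter)

print(f"default profile: {p_default / 1e3:.1f} kW continuous")
print(f"Peter's profile: {p_peter / 1e3:.1f} kW continuous")
```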

Eq `pop_load`: scaling up

$P_{pop}(N) = N \cdot P_{user}$

This is where the numbers start to misbehave.

Eq `grid_ratio`: against the grid

$\rho_{grid}(N, r) = P_{pop}(N) / P_{grid}^{r}$

Here P_grid^r is the average electricity load of region r. The ratio answers one question: what fraction of that load does Peter-rate adoption by N users consume?

Eq `kardashev_frac`: against the sun

$K_1(N) = P_{pop}(N) / (1.74 \times 10^{17}\ \mathrm{W})$

The denominator is the solar power hitting the top of Earth’s atmosphere. That’s the Type-I civilization threshold on the Kardashev scale.
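The last three equations are one multiplication and two divisions, sketched below. The 53 kW per-user figure, the 480 GW US average load, and the population counts are the ones quoted elsewhere in this post; the function names are illustrative.

```python
SOLAR_INPUT_W = 1.74e17  # solar power at the top of Earth's atmosphere
P_USER_W = 53e3          # one Peter-rate user, from Eq user_load
US_GRID_W = 480e9        # average US electricity load used in this post


def population_power(n_users, p_user_w=P_USER_W):
    """Eq pop_load: total continuous watts for n_users."""
    return n_users * p_user_w


def grid_ratio(p_pop_w, p_grid_w):
    """Eq grid_ratio: share of a region's average load."""
    return p_pop_w / p_grid_w


def kardashev_fraction(p_pop_w):
    """Eq kardashev_frac: share of the Type-I threshold."""
    return p_pop_w / SOLAR_INPUT_W


p_nyc = population_power(8.3e6)    # an NYC-sized cohort
p_world = population_power(8.1e9)  # everyone

rho_nyc_vs_us = grid_ratio(p_nyc, US_GRID_W)
k1_world = kardashev_fraction(p_world)

print(f"NYC cohort: {p_nyc / 1e9:.0f} GW, {rho_nyc_vs_us:.0%} of US load")
print(f"World: {p_world / 1e12:.0f} TW, K_1 = {k1_world:.2e}")
```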

When the cache doesn’t bail you out

β_h = 0.01 is doing most of the work in this post. A factor-of-100 discount on the cache-hit portion of every nominal token is what keeps per-token energy for an agent affordable. The 75% hit-share in the body’s defaults is doing the rest.

Prefix caching keeps the attention KV cache for a prompt prefix in GPU memory after the first request. The next request with the same prefix skips prefill for the cached portion. No forward pass through those tokens, just a memory load. That is where the 100× comes from. The cache lives in HBM as a per-conversation resource with a TTL around five minutes on Anthropic and similar windows on the others. When it expires, the next request pays full prefill plus a cache-write at roughly 1.25× input price to rebuild it.

Cases where the 75% hit-share doesn’t hold:

Cold starts. First request of any conversation pays full prefill. A user with many short conversations a day runs α_h closer to 0.4 than 0.75.
Bursty turns. Ten minutes between agent messages and the cache has aged out. Each refresh is a cache-write, and a few per session erase most of the discount.
Tool-mutating agents. Every tool result is appended to the conversation. The cache is only valid up to that point, so long agent runs with 20+ tool calls invalidate and rebuild the prefix over and over. Effective α_h lands somewhere between 0.9 and 0.5.
RAG-heavy workloads. Retrieved chunks are cache-miss by definition. Only the system prompt caches, so α_h drops to whatever fraction the stable prefix is, often 30 to 50%.
No cross-user sharing. Two users with identical system prompts don’t share a cache. Every Cursor user with the same 8,000-token preamble pays for their own copy in HBM.
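The cold-start and bursty-turn cases can be illustrated with a toy model. This is a sketch under loud assumptions, not how any provider's cache works: one conversation, a prefix that grows by a fixed number of tokens per turn, a cache that survives only if the next turn arrives within the TTL, and no accounting for output tokens or the cache-write premium. The name `hit_share` and its parameters are hypothetical.

```python
def hit_share(turn_times_s, new_tokens_per_turn, ttl_s=300.0):
    """Fraction of input tokens served from cache in a toy single-conversation model."""
    hit_tokens = 0
    miss_tokens = 0
    prefix_tokens = 0  # tokens already in the conversation before this turn
    last_turn = None
    for t in turn_times_s:
        cache_is_warm = last_turn is not None and (t - last_turn) <= ttl_s
        if cache_is_warm:
            hit_tokens += prefix_tokens   # old prefix is read from cache
        else:
            miss_tokens += prefix_tokens  # expired or cold: full prefill again
        miss_tokens += new_tokens_per_turn  # newly appended tokens always miss
        prefix_tokens += new_tokens_per_turn
        last_turn = t
    total = hit_tokens + miss_tokens
    return hit_tokens / total if total else 0.0


# 20 turns of 2,000 new tokens each.
steady = hit_share([60 * i for i in range(20)], 2_000)   # one turn per minute
bursty = hit_share([600 * i for i in range(20)], 2_000)  # ten minutes apart

print(f"steady session hit share: {steady:.2f}")
print(f"bursty session hit share: {bursty:.2f}")
```

In this toy, a steady session sits near the cache-friendly end while turns spaced beyond the TTL never hit at all; real traffic falls between the two.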

Peter himself sits at the cache-friendly end. Plug in his profile of 0.02 / 0.08 / 0.90 and E_eff drops from 0.23 J/tok to 0.11 J/tok. Per-user power halves, every population number halves. Now plug in a pessimistic profile of 0.05 / 0.45 / 0.50 and E_eff rises to 0.30 J/tok (3 · (0.05 + 0.45·0.10 + 0.50·0.01) = 3 · 0.10), so every number grows by about 1.3×. The realistic middle sits where the body’s defaults are; the range runs from roughly half the default to roughly 1.3× it.
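To compare profiles without committing to a value of E_tok, it is enough to compare the bracketed factor from Eq `cache_adjust`. A minimal sketch, using the β values from the defaults table; the dictionary keys are just labels.

```python
BETA_MISS = 0.10  # cache-miss input cost relative to an output token
BETA_HIT = 0.01   # cache-hit input cost relative to an output token

# (output share, cache-miss share, cache-hit share) for each named profile.
PROFILES = {
    "default": (0.05, 0.20, 0.75),
    "peter": (0.02, 0.08, 0.90),
    "pessimistic": (0.05, 0.45, 0.50),
}


def cost_factor(a_out, a_miss, a_hit):
    """The multiplier applied to E_tok in Eq cache_adjust."""
    return a_out + a_miss * BETA_MISS + a_hit * BETA_HIT


factors = {name: cost_factor(*shares) for name, shares in PROFILES.items()}
relative = {name: f / factors["default"] for name, f in factors.items()}

for name in PROFILES:
    print(f"{name:12s} factor = {factors[name]:.4f}  ({relative[name]:.2f}x default)")
```

Every downstream quantity (P_user, P_pop, the grid ratios) scales by the same relative multiplier, since the chain is linear in E_eff.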

Cache hits cost 1% of an output token because the work was already done; the model is reading, not computing. The factor of 100 is what falls out when you replace compute with memory access. The same trade shows up in a few different shapes: KV cache, semantic recall over past sessions, cross-user prefix sharing.

Running the chain

The equations chain in one direction. The algorithms run that chain forward or backward, depending on what question you’re asking.

Alg 0: forward pass

Inputs: fleet composition, PUE, cache parameters, per-user rate R, population vector N[]
Outputs: P_pop[], ρ_grid[], K_1[]

1.  E_tok ← PUE · Σ_g f_g · P_g / T_g
2.  E_eff ← E_tok · (α_o + α_m·β_m + α_h·β_h)
3.  P_user ← R · E_eff / 86400
4.  for i in 1..|N|:
5.      P_pop[i]   ← N[i] · P_user
6.      ρ_grid[i]  ← P_pop[i] / P_grid_region[i]
7.      K_1[i]     ← P_pop[i] / 1.74e17
8.  return (P_pop, ρ_grid, K_1)

O(|G| + |N|). The whole point of the equation chain is that this loop is trivial.
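A direct Python transcription of Alg 0, as a sketch. The argument layout (tuples for the fleet, shares, and betas) is an illustrative choice. Because it computes E_tok from the fleet table instead of starting from the rounded 3 J, its outputs come out roughly 20% above the 53 kW-based figures.

```python
SECONDS_PER_DAY = 86_400
SOLAR_INPUT_W = 1.74e17


def forward_pass(fleet, pue, shares, betas, tokens_per_day, populations, grid_w):
    """Alg 0: returns one (P_pop, rho_grid, K_1) tuple per population entry.

    fleet:  iterable of (f_g, P_g watts, T_g tokens/sec/GPU)
    shares: (alpha_o, alpha_m, alpha_h); betas: (beta_m, beta_h)
    grid_w: average grid load in watts for each population entry
    """
    e_tok = pue * sum(f * p / t for f, p, t in fleet)          # step 1
    a_o, a_m, a_h = shares
    b_m, b_h = betas
    e_eff = e_tok * (a_o + a_m * b_m + a_h * b_h)              # step 2
    p_user = tokens_per_day * e_eff / SECONDS_PER_DAY          # step 3
    results = []
    for n, p_grid in zip(populations, grid_w):                 # steps 4-7
        p_pop = n * p_user
        results.append((p_pop, p_pop / p_grid, p_pop / SOLAR_INPUT_W))
    return results


rows = forward_pass(
    fleet=[(0.5, 700.0, 200.0), (0.5, 1000.0, 600.0)],
    pue=1.4,
    shares=(0.05, 0.20, 0.75),
    betas=(0.10, 0.01),
    tokens_per_day=20e9,
    populations=[8.3e6, 8.1e9],  # NYC-sized cohort, world
    grid_w=[480e9, 3.3e12],      # US and world average electricity, as quoted in this post
)
for p_pop, rho, k1 in rows:
    print(f"P_pop = {p_pop:.3e} W, rho_grid = {rho:.2f}, K_1 = {k1:.2e}")
```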

Alg 1: how many fit

Inverse: given a power budget and a Peter-rate R, how many users fit?

inputs:  G, PUE, cache params, R, P_budget
output:  N_max

1.  E_eff   ← (steps 1–2 above)
2.  P_user  ← R · E_eff / 86400
3.  N_max   ← floor(P_budget / P_user)

For R = 20B tok/day, E_eff = 0.23 J/tok, and P_budget = current US electricity average (480 GW), N_max ≈ 9 million. Roughly the population of New York City. That is the ceiling on Peter-rate workers if we spent the entire US grid on agent inference and nothing else. Which we won’t.
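Alg 1 as a few lines of Python, a sketch using the same inputs as the paragraph above. The function name `max_users` is illustrative.

```python
import math

SECONDS_PER_DAY = 86_400


def max_users(p_budget_w, tokens_per_day, e_eff):
    """Alg 1: how many users at rate R fit inside a continuous power budget."""
    p_user = tokens_per_day * e_eff / SECONDS_PER_DAY
    return math.floor(p_budget_w / p_user)


n_max = max_users(p_budget_w=480e9, tokens_per_day=20e9, e_eff=0.23)
print(f"N_max = {n_max:,} users")
```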

Alg 2: the gap

The roadmap question. What E_eff do we need to make N_target users at rate R fit inside budget P_budget?

$E_{target} = P_{budget} \cdot 86400 / (N_{target} \cdot R)$

inputs:  N_target, R, P_budget, E_eff_current
output:  E_eff_target, gap

1.  P_user_max   ← P_budget / N_target
2.  E_eff_target ← P_user_max · 86400 / R
3.  gap          ← E_eff_current / E_eff_target

Set N_target = world population, P_budget = world electricity (3.3 TW), R = 20B tok/day:

P_user_max   = 3.3e12 / 8.1e9 ≈ 407 W
E_eff_target = 407 · 86400 / 20e9 ≈ 1.76 mJ/tok
gap          = 0.23 / 0.00176 ≈ 130×

We’d need a 130× reduction in J-per-effective-token for the whole world to run agents at Peter-rate inside today’s electricity supply. Not 130× the GPU throughput. 130× the energy per token at the wall.
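Alg 2 as a sketch, with the world-scale inputs from the worked example above. The function name `efficiency_gap` is illustrative.

```python
SECONDS_PER_DAY = 86_400


def efficiency_gap(n_target, tokens_per_day, p_budget_w, e_eff_current):
    """Alg 2: required joules per nominal token, and how far today is from it."""
    p_user_max = p_budget_w / n_target                         # step 1
    e_eff_target = p_user_max * SECONDS_PER_DAY / tokens_per_day  # step 2
    gap = e_eff_current / e_eff_target                         # step 3
    return e_eff_target, gap


e_target, gap = efficiency_gap(
    n_target=8.1e9,      # world population
    tokens_per_day=20e9,  # Peter-rate R
    p_budget_w=3.3e12,   # world average electricity
    e_eff_current=0.23,  # J per nominal token today
)
print(f"E_eff_target = {e_target * 1e3:.2f} mJ/tok, gap = {gap:.0f}x")
```

Lowering `n_target` or `tokens_per_day` shrinks the gap proportionally, which is a quick way to see what partial adoption would require.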

At scale

With E_eff = 0.23 J/tok and P_user = 53 kW:

Cohort	`N`	`P_pop`	Reference
Small town	25,000	1.3 GW	one Hoover Dam
Major US city	75,000	4.0 GW	four large nuclear units
Major US metro	5,000,000	265 GW	55% of total US electricity
NYC	8,300,000	440 GW	92% of US electricity
USA	335,000,000	17.8 TW	94% of total world primary energy
China	1,410,000,000	75 TW	4× world primary energy
World	8,100,000,000	429 TW	23× world primary energy

For the world-scale row, K_1 ≈ 2.5 × 10⁻³. Current humanity sits at K_1 ≈ 1.1 × 10⁻⁴ (19 TW of primary energy), so this is roughly 23× our current energy footprint.

Inputs

Input	Value
Tokens / day / user `R`	20.0B
Population `N`	8.10B
Fleet mix	50% H100 / 50% B200
PUE	1.40
Output share `α_o`	0.05
Cache-hit share `α_h`	0.75 (`α_m` = 0.20)
Cache-hit cost `β_h`	0.010

Derived

Quantity	Value
E_tok (naive)	3.62 J/tok
E_eff (after cache)	0.280 J/tok (13× cheaper)
P_user	64.9 kW (52 households)
P_pop	525.55 TW
vs US electricity (480 GW)	1094.9×
vs world primary energy (19 TW)	27.7×
Kardashev fraction K_1	3.02e-3

"World" at this profile weighs 28× total human energy use.

So watt?

One Peter-rate user runs about 53 kW continuously. Forty US households of grid, plugged into one IDE.

Scale that up the obvious way. A town of them draws a Hoover Dam. An NYC of them draws 92% of total US electricity, used for nothing else. A whole world of them draws twenty-three times the energy humanity currently uses for everything we do: heating, transport, industry, light.

Almost all of it rides on β_h, the cache-hit cost. Today’s value sits at about a hundredth of an output token. Drop it tenfold and the whole stack shifts down with it. Lose the cache discount entirely and one Peter-rate user stops fitting in a rack.

“Peter is just early” weighs twenty-three civilizations.