TokenGolf hit v1.0 because the two-mode system was the wrong abstraction

Josh Echeverri
10 min read

The two-mode system was the wrong abstraction.

That took a while to see clearly. Flow mode and roguelike mode had been there since the beginning. They made sense as a concept. Passive tracking for people who didn't want to think about it, committed tracking for people who wanted stakes. Clean separation.

The problem was that the separation created two different products, and neither one was the right one.

Flow mode had no pressure. It watched you. That was it. No par, no rating, no death. Just a number at the end. Useful the way a receipt is useful. You could look at it. It didn't change how you worked.

Roguelike mode had pressure, but it started with a lie. The wizard asked you to commit a budget before you knew what the session would need. Pick a quest. Pick a tier. Pick a model. Commit to a dollar amount. Then go work.

That felt decisive. It was actually just guessing in formal clothes.

The budget wizard was the wrong question

The deeper problem wasn't the UX friction of three extra steps. It was that pre-committed budgets don't measure what they claim to measure.

If I commit $1.50 on Sonnet and spend $0.90, is that efficient? Depends entirely on whether $1.50 was the right commitment. If the task only needed $0.30, I was loose. If it genuinely needed $2.00, I got lucky or I cut corners. The budget didn't know. It was anchored to my guess, not to the work.

That meant the score was partially a test of my ability to estimate cost in advance. That's not the same thing as efficiency. It's forecasting. And forecasting AI session cost is a skill nobody has yet, because the variance is too high and the feedback loop is too noisy.

So the whole commitment model had to go.

No more tokengolf start. No more wizard. No more quest names, no more pre-committed budgets, no more choosing between two modes that each solved half the problem. Every session is a run. The budget is computed after the fact from what actually happened.

That's where par came from.

Par had to grow slower than spending

The replacement needed to do something specific: create pressure that scaled with session length without requiring the user to predict anything.

Golf had the answer. Par isn't a commitment. It's an expectation. You don't choose par. The course does. Your job is to beat it.

So par in TokenGolf became a function of two things: which model you're using, and how many prompts you've sent.

// src/lib/score.js
// MODEL_PAR_RATES, MODEL_PAR_FLOORS, and getModelKey() live elsewhere in
// this module: per-model dollar rates per sqrt-prompt, minimum-par floors,
// and a normalizer that maps full model IDs to keys like "sonnet".
export function getParBudget(
  model,
  promptCount,
  rateOverrides,
  floorOverrides
) {
  // User overrides are merged over the calibrated defaults.
  const rates = { ...MODEL_PAR_RATES, ...rateOverrides }
  const floors = { ...MODEL_PAR_FLOORS, ...floorOverrides }
  const key = getModelKey(model)
  // Unrecognized models fall back to the Sonnet calibration.
  const rate = rates[key] ?? rates.sonnet
  const floor = floors[key] ?? floors.sonnet
  return Math.max(rate * Math.sqrt(promptCount), floor)
}

The sqrt is the key decision. Linear scaling would've been simpler but wrong. If par grew linearly with prompts, every prompt would carry the same weight. The 15th prompt would be as forgiving as the 2nd. That doesn't match reality. Early in a session you're exploring, orienting, loading context. That should have headroom. Later in a session, if you're still spending at the same rate, something is wrong. You're either stuck in a loop or the task scope crept.

Sqrt scaling creates exactly that shape. Early prompts are cheap relative to par. Long sessions have to get increasingly efficient or they bust. The pressure curve is convex. It rewards front-loaded clarity and punishes aimless continuation.

The floor prevents a different kind of unfairness. A single agentic exchange can be expensive out of all proportion to prompt count: one complex Opus prompt can easily cost $3.00. On Opus that's survivable, because par for one prompt is $8.00 times sqrt(1), which is $8.00. Generous. But at a lower rate, that same heavy first exchange could kill the run before the second prompt even lands. So each model also gets a floor: a minimum par that applies regardless of prompt count.
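To make the sqrt-plus-floor shape concrete, here's a quick sketch of the curve. The $1.50 rate and $2.00 floor are made-up illustrative numbers, not TokenGolf's real defaults:

```javascript
// Illustrative only: 1.50 and 2.00 are invented, not real defaults.
const rate = 1.5   // dollars per sqrt-prompt
const floor = 2.0  // minimum par in dollars
const par = (n) => Math.max(rate * Math.sqrt(n), floor)

for (const n of [1, 4, 9, 16, 25]) {
  // Par grows, but par-per-prompt shrinks: the convex pressure curve.
  console.log(`${n} prompts: par $${par(n).toFixed(2)}, $${(par(n) / n).toFixed(2)}/prompt`)
}
// One prompt hits the floor ($2.00); by 25 prompts par is $7.50,
// but only $0.30 of headroom per prompt.
```

The floor binds only at the start, where rate * sqrt(n) would otherwise dip below it; after that the sqrt term takes over.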

Death is still cosmetic. Bust means you spent more than par. The run gets marked died, the accent bar turns red, the death achievements fire. But it doesn't stop anything. It's a signal, not a gate.

That was deliberate. The game should name failure. It shouldn't prevent work.

The ratings needed a wider spread

The old efficiency tiers had five levels. That wasn't enough resolution once par replaced fixed budgets.

The problem was specific. With fixed budgets, the spread between "barely under" and "way under" was narrow enough that five tiers could name the interesting zones. But par is dynamic. A 10-prompt Sonnet session and a 40-prompt Sonnet session have very different par values. The interesting efficiency territory got wider.

So the scale went from five tiers to six, and the thresholds tightened at the top.

LEGENDARY moved from under 25% to under 15%. That made it genuinely rare. A new tier, EPIC, took the 15-30% range. PRO filled in 30-50%. SOLID stayed at under 75%. CLOSE CALL stayed at under 100%. BUST stayed above 100%.
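Spelled out as a lookup, with `pct` meaning spend as a percentage of par (the function name is mine; the thresholds are the ones above):

```javascript
// Six-tier scale; pct = cost / par * 100.
function efficiencyTier(pct) {
  if (pct < 15) return 'LEGENDARY'   // genuinely rare now
  if (pct < 30) return 'EPIC'        // the new tier
  if (pct < 50) return 'PRO'
  if (pct < 75) return 'SOLID'
  if (pct < 100) return 'CLOSE CALL'
  return 'BUST'                      // spent more than par
}
```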

The same thing happened with spend tiers. A new top tier, Mythic, showed up for sessions that were extraordinarily cheap for their model class. Sonnet under $0.10. Opus under $0.50. The thresholds are model-calibrated now instead of universal, which means the same tier name carries the same meaning regardless of which model you ran.

That calibration mattered more than the individual numbers. The point was never to make the math pretty. It was to make "Diamond on Haiku" and "Diamond on Opus" represent comparable discipline, not identical dollar amounts.
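Model calibration can be sketched as a per-model threshold table. Only the two Mythic cutoffs stated above ($0.10 Sonnet, $0.50 Opus) come from the post; the lookup structure and fallback are my illustration:

```javascript
// Only the sonnet and opus cutoffs are from the post; the rest is a sketch.
const MYTHIC_CUTOFFS = { sonnet: 0.10, opus: 0.50 }

const isMythic = (modelKey, costUsd) =>
  costUsd < (MYTHIC_CUTOFFS[modelKey] ?? MYTHIC_CUTOFFS.sonnet)
```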

The game needed to feel something

Once par was working and the ratings felt right, the HUD had an interesting gap.

It showed numbers. Cost, par percentage, context load, model, efficiency tier. All useful. All inert. The HUD was informative in the same way a spreadsheet is informative. Accurate and emotionally dead.

That felt like a missed opportunity for a game.

So the emotion system started there. Not as decoration. As signal compression.

A session where you're at 80% of par, 60% context, with three failed tool calls and twelve prompts feels different from a session at 20% of par, 15% context, zero failures, and two prompts. Those are different states. The numbers describe them. But a single mood icon communicates the vibe faster than reading four metrics.

Twelve emotion states in all. Among them: VIBING for smooth, efficient runs. CRUISING when things are going well but not exceptional. FOCUSED for the middle ground. GRINDING when it's getting long. TENSE when par pressure is building. SWEATING when it's getting close. FRUSTRATED when failures are stacking up. OVERWHELMED when everything is bad at once. DEAD when the run busted. SLEEPING when there's no active run.

Each one is a composite of par percentage, context percentage, failed tool calls, and prompt count. No single metric controls the mood. The mapping is heuristic, not formulaic. That was intentional. Emotions should feel like an interpretation, not a calculation.
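A composite like that might look something like this. This is my illustration, not TokenGolf's actual mapping; every threshold here is invented:

```javascript
// Invented thresholds; TokenGolf's real heuristic differs.
function mood({ parPct, contextPct, failures, prompts, active = true }) {
  if (!active) return 'SLEEPING'
  if (parPct > 100) return 'DEAD'                            // busted
  if (failures >= 3 && contextPct > 70) return 'OVERWHELMED' // everything bad at once
  if (failures >= 3) return 'FRUSTRATED'
  if (parPct > 90) return 'SWEATING'
  if (parPct > 75) return 'TENSE'
  if (prompts > 15) return 'GRINDING'
  if (parPct < 25 && contextPct < 25 && failures === 0) return 'VIBING'
  if (parPct < 50) return 'CRUISING'
  return 'FOCUSED'
}
```

The ordering matters as much as the cutoffs: terminal states first, distress states next, good states last, so no single metric can claim the mood on its own.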

Three display modes. Emoji mode replaces the golf flag on line one with the mood icon. ASCII mode adds a third HUD line with a kaomoji and the emotion label spelled out. Off mode goes back to the classic flag. Configurable via tokengolf config emotions.

That configuration surface was part of a broader change. Par rates and floors became configurable too. If the defaults don't match your spending patterns, you can tune them. tokengolf config par sonnet 2.0 sets your Sonnet par rate to $2.00 per sqrt-prompt. tokengolf config floor opus 5.0 sets the Opus floor. reset restores defaults.

That mattered because par rates are opinions about what "normal" spending looks like. Those opinions should be adjustable. The defaults are calibrated from real session data, but real session data varies a lot across projects, codebases, and working styles.

The model had been lying about the model

There was a bug that had been quietly corrupting data for a while.

Every run was recording as Sonnet.

The reason was structural. session-start.js fires at the beginning of the session. At that point, the hook doesn't know which model Claude Code is actually using. So it defaulted to claude-sonnet-4-6. Reasonable guess. Wrong for anyone running Opus or Haiku.

The fix came from the statusline. The statusline hook receives the full session JSON on every render, and that JSON includes the actual model. So statusline.sh started checking the model it received against the model stored in current-run.json. If they differed, it wrote the real model back.

That's a weird place for a data integrity fix to live. It worked because the statusline runs frequently and has access to truth that the session-start hook doesn't.
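The idea reduces to a small reconciliation step. The real fix lives in statusline.sh against current-run.json; this pure-function sketch is mine, and the payload field name is an assumption:

```javascript
// Sketch of the write-back idea; the payload shape is assumed, not
// taken from Claude Code's actual statusline JSON.
function reconcileModel(run, payload) {
  const actual = payload?.model?.id     // assumed field carrying the real model
  if (actual && run.model !== actual) {
    return { ...run, model: actual }    // correct the session-start guess
  }
  return run
}
```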

Once the model was correct, par calculations were correct. Achievement thresholds were correct. The entire scoring system got more honest without changing any scoring logic. Just better input data.

Plugin distribution changed who could reach it

Up through v0.5.4, installing TokenGolf meant npm install -g tokengolf followed by tokengolf install to patch your Claude Code settings. That worked. It also meant you needed npm, you needed to know about global installs, and you needed to run a separate setup command that modified a config file in a non-obvious location.

Claude Code's plugin system changed that equation.

claude plugin install tokengolf. One command. No npm. No manual hook patching. No global install. The plugin scaffold handles hook registration, script bundling, and path resolution. Updates happen through the plugin system instead of the auto-sync logic that was checking version stamps on every session start.

Building the plugin meant extracting the ANSI scorecard renderer into a shared module, bundling session-end.js with esbuild (it's the only hook that uses async imports), and writing a hooks.json that maps all nine hooks to their bundled scripts using ${CLAUDE_PLUGIN_ROOT} for portable paths.

The npm install path still works. The auto-sync still works. But the plugin became the recommended path because it removed every piece of friction that wasn't the game itself.

What got deleted matters as much as what got added

The StartRun wizard was 278 lines. ActiveRun was 141 lines. The demo fixtures for ActiveRun were another 56. The FLOORS progression array. The run.quest field. The run.budget field. The run.mode field. All the null-budget checks that existed because flow mode runs didn't have a budget.

All gone.

That deletion is the actual story of this version.

The system got simpler. Not in the sense of fewer features. In the sense of fewer concepts. There's one mode now. Every session is a run. Par replaces budgets. Death is par-relative. Achievements reference par instead of budget. The entire codebase stopped having to ask "is this a flow run or a roguelike run?" because the question no longer exists.

The test suite grew from around 151 tests to over 164. Thirteen new tests just for getParBudget() covering all models, edge cases, rate overrides, floor overrides, zero prompts, null models. The rule surface got more complex in some places and radically simpler in others.

Why this is v1.0

Calling it v1.0 was a statement about which abstraction won.

The two-mode system was a design hypothesis. Flow was for people who didn't want pressure. Roguelike was for people who did. The hypothesis was that those were different audiences who needed different products.

They weren't.

What they needed was the same product with better automatic pressure. Par delivers that. It doesn't ask you to commit. It doesn't ask you to choose a mode. It just watches how the session unfolds and tells you, relative to what a session like this should cost, how you did.

That's a simpler idea. It's also a more honest one. The wizard felt like agency. It was actually noise. The par formula has less ceremony and more signal.

TokenGolf is still a Node CLI built on Commander, Ink, React, and esbuild. It still lives in nine hooks and a statusline script. It still tracks cost by parsing JSONL transcripts because the hook payload doesn't include total_cost_usd.

But the game it plays is cleaner now. One mode. Dynamic par. Emotions. Configurable rates. Plugin distribution. And a version number that says: this is the abstraction I'm committing to.

The next interesting question is what happens when other people start tuning their par rates and the game has to accommodate working styles I haven't seen yet.


Building TokenGolf is an ongoing series about turning Claude Code sessions into a roguelike. You can see the product and follow along at https://josheche.github.io/tokengolf/
