Haiku alone: 2/7 shipped. Haiku with context: 7/7.
The tokenmaxxing counter-argument. Same model, same tasks, same prompts. The only variable was what Haiku knew before the first turn. Context is the lever, not the model.
X / Twitter
Post 1 231 / 280
Haiku without context: 2 of 7 tasks shipped. Haiku with Hydrate: 7 of 7. Same model. Same tasks. Same prompts. Same token budget. The only variable was what Haiku knew before its first turn. Context is the lever. Not the model.
Hook tweet. Very quotable. Tuesday to Thursday 8-10am UK. Let it stand alone.
Post 2 275 / 280
Conventional fix when Haiku fails on complex tasks: upgrade to Sonnet. It works, costs more, and misses the root cause, which is rarely reasoning capacity. Haiku just starts every session cold, without architectural context. Warm context. Same model. Five more tasks shipped.
Contrarian follow-on. Thread it under Post 1 or let it stand alone.
Post 3 1402 chars
I set up a benchmark with seven sequential implementation tasks on the same codebase. Five cells: Haiku baseline, Haiku with Hydrate, Sonnet baseline, Sonnet with Hydrate, and a hybrid routing cell.

Haiku alone shipped 2 of 7. Haiku with Hydrate shipped 7 of 7. Same model. Same tasks. Same prompts. The only variable was what Haiku knew before its first turn.

The result that surprised me was not the 7/7. It was the 2/7. Haiku is a capable model. On isolated tasks it performs well. But on a multi-session project where each task builds on decisions made in earlier sessions, cold-start Haiku keeps running into architectural context it does not have. It makes defensible choices, but they conflict with what was established earlier. Tasks fail, or produce work that needs rework.

When Hydrate injects the prior session context, Haiku knows what was decided. It does not need to re-discover the module structure, the error format, the established conventions. It arrives knowing them and applies them correctly from turn one.

The conclusion I draw from this: when an agent fails on a complex multi-session task, the first hypothesis should not be "needs a bigger model." It should be "needs better context." The two solutions have different costs. One is a 10x token spend. The other is a 195-token injection.

Full benchmark breakdown: gethydrate.dev/blog/tokenmaxxing-and-the-empirical-answer
The 2/7 finding needs the why spelled out. Lead with the numbers, then the mechanism. Tuesday or Wednesday.
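If a reply asks what "inject context before the first turn" actually means, a minimal sketch helps. Everything here is illustrative: the injected notes, the message format, and `build_messages` are invented stand-ins, not Hydrate's actual API or output.

```python
# Hypothetical sketch of pre-session context injection.
# The injected block plays the role of the ~195-token context the
# post describes; its wording is invented for illustration only.

PRIOR_SESSION_CONTEXT = """\
Architecture notes from earlier sessions:
- Errors: return JSON {"error": {"code": ..., "message": ...}}
- Helpers: a response wrapper already exists; do not duplicate it
- Dependencies: stdlib only, no third-party packages
"""

def build_messages(task_prompt: str, inject_context: bool) -> list[dict]:
    """Assemble the message list for one benchmark session.

    inject_context=False is the cold-start cell (2/7);
    inject_context=True puts prior decisions in front of the model
    before its first turn (the 7/7 cell).
    """
    messages = []
    if inject_context:
        messages.append({"role": "user", "content": PRIOR_SESSION_CONTEXT})
    messages.append({"role": "user", "content": task_prompt})
    return messages

cold = build_messages("Task 3: add rate limiting", inject_context=False)
warm = build_messages("Task 3: add rate limiting", inject_context=True)
print(len(cold), len(warm))  # 1 vs 2 messages
```

The point of the sketch is that nothing about the model changes between the two cells; only the starting message list does.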
Bluesky
Post 1 143 / 300
Haiku without context: 2/7 tasks shipped. Haiku with Hydrate: 7/7. Same model. Same tasks. Same budget. Context is the lever. Not the model.
Short Bluesky version.
r/ClaudeAI
Title
Haiku ships 2/7 tasks cold. Same model ships 7/7 with context injection. Here's the benchmark.
Body
Part of a larger benchmark series on how memory injection affects task completion rates across model tiers. This particular result was the most striking.

Setup: seven sequential implementation tasks on a Go REST API project. All in isolated Docker containers, same model per cell, same prompts, same max-turns cap (25). Five cells in total.

Haiku baseline (no context): 2/7 shipped. The failures were not random. Haiku made locally reasonable decisions that conflicted with architectural choices made in earlier sessions:

- It did not know about the established error format.
- It duplicated a helper that already existed.
- It chose a dependency that violated the project's stdlib-only policy.

Haiku with Hydrate (context injected before every session): 7/7 shipped. The injected context told Haiku what had been established in prior sessions. It applied those conventions correctly throughout.

For comparison: Sonnet baseline shipped 4/7, Sonnet with Hydrate 7/7, hybrid (Sonnet seed session + Haiku+Hydrate for all subsequent sessions) 7/7 at $0.20 per session vs $1.64 for raw Sonnet.

The interpretation I take from this: cold-start failure on complex multi-session projects is usually a context problem, not a reasoning problem. The fix is not always a bigger model. It is sometimes a better starting state.

Happy to share the full task list and grading criteria if useful.
Technical depth. The specific failure modes (error format, duplicate helper, dependency policy) make this credible. Post weekdays.
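If Reddit commenters ask about the harness shape, a toy sketch of one benchmark cell can anchor the discussion. This is a sketch under stated assumptions, not the real benchmark code: `run_session`, `task_passes`, the toy grader, and the context-refresh hook are all invented stand-ins for the container runner and grading step described in the post.

```python
# Toy sketch of one benchmark cell: 7 sequential tasks, a 25-turn
# cap per session, context optionally injected before each session.
# The runner and grader are hypothetical stand-ins.

MAX_TURNS = 25

def run_cell(tasks, run_session, task_passes,
             initial_context=None, update_context=None):
    """Run one cell and return how many tasks shipped (e.g. 2 of 7)."""
    shipped = 0
    context = initial_context
    for task in tasks:
        # Each session starts in a fresh container; the only carry-over
        # between sessions is the injected context, when the cell uses one.
        transcript = run_session(task, context=context, max_turns=MAX_TURNS)
        if task_passes(task, transcript):
            shipped += 1
        if update_context is not None:
            # Refresh the injected context from the latest session.
            context = update_context(transcript)
    return shipped

# Toy runner/grader: warm sessions always pass; cold sessions only
# pass the first two tasks, before cross-session decisions pile up.
def fake_session(task, context, max_turns):
    return {"task": task, "saw_context": context is not None}

def fake_grader(task, transcript):
    return transcript["saw_context"] or task <= 2

cold = run_cell(range(1, 8), fake_session, fake_grader)
warm = run_cell(range(1, 8), fake_session, fake_grader,
                initial_context="arch notes",
                update_context=lambda t: "arch notes")
print(cold, warm)  # 2 7
```

The toy grader hard-codes the outcome, of course; it only exists to show where "inject context before every session" sits in the loop relative to running and grading each task.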