ai · local-llm · comparison

I Tried to Build Wordle with Gemma 4

Google’s latest open-weights model versus Claude Code, head to head on a simple coding task. One took 2 hours. The other took 2 minutes.

Running Gemma 4 Locally

Google recently released Gemma 4, a 26-billion-parameter open-weights model and its latest entry in the local LLM space. The pitch is compelling: a model capable enough for real coding tasks, running entirely on your own hardware with no API keys and no per-token costs.

Setting it up is straightforward. Pull the model through Ollama, then point Aider at it:

ollama pull gemma4:26b
aider --model ollama_chat/gemma4:26b

I ran this on an Apple M4 Pro with 64GB of unified RAM, which is enough to hold the full 26B model in memory. For the first few exchanges, inference speed is perfectly usable. You get a response in under a minute, which feels acceptable for a coding assistant workflow.

The prompt I gave it was deliberately ambitious but not unreasonable:

Write a clone of Wordle with a much nicer front-end
(make it pure client side, no server, you can hash the
date to pick a word for the day, use a completion count
on the hash so after the user gets the word of the day,
they can make more words, you'll need a list of words
too). Make an amazing front end UI using framer.

This is the kind of task where a good coding model should shine. It requires creating a self-contained game with UI, state management, and game logic. Nothing exotic, just solid full-stack implementation.
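The daily-word mechanic in the prompt is simple to sketch. Here is a minimal version of the date-hash selection it describes; the names and hash choice are illustrative, not from either model's actual output, and the word list is a placeholder:

```javascript
// Sketch of the "hash the date to pick a word" mechanic from the prompt.
// WORDS is a tiny placeholder; a real clone needs thousands of entries.
const WORDS = ["crane", "slate", "audio", "pride", "lemon"];

// Simple deterministic string hash (djb2 variant), kept unsigned 32-bit.
function hashString(s) {
  let h = 5381;
  for (const ch of s) h = ((h * 33) ^ ch.charCodeAt(0)) >>> 0;
  return h;
}

// Pick the word for a given date. The completion count shifts the hash,
// so after solving the daily word the player gets a fresh one.
function wordFor(date, completions = 0) {
  const key = date.toISOString().slice(0, 10) + ":" + completions;
  return WORDS[hashString(key) % WORDS.length];
}
```

Because the hash is pure, every player sees the same word on the same day with no server involved, which is exactly what the prompt asks for.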

What Went Wrong

The first attempt didn’t compile. The model generated a truncated variable declaration, const [completions, setComplet, that just stopped mid-identifier. The file was syntactically broken before it ever had a chance to run.

That was the beginning of a long debugging session. Each fix the model applied introduced a new bug somewhere else. It was playing whack-a-mole with its own typos:

BUG: isCurrentcurrentRow
Two variable names fused together. The model concatenated a partial rewrite with the original.

BUG: text-3rab
An invalid Tailwind class. It seems to have corrupted 'text-3xl' during an edit.

BUG: httpshttps://
A doubled URL scheme, introduced while fixing a link: the original 'https://' was prepended to itself.

BUG: sans:sans-empty
A malformed CSS class, mixing font-family syntax with state-based styling.

BUG: bg-yellow:500
Tailwind uses a dash, not a colon (bg-yellow-500). This is the kind of error you'd expect from a model that has seen both Tailwind and non-Tailwind CSS.

Context degradation was real. As the conversation grew past 10 exchanges, two things happened. Responses got noticeably slower (the model was processing an increasingly large context window on local hardware). And the typo rate increased visibly. The model seemed to lose track of what it had already changed and would partially re-apply earlier edits on top of new ones.

The architecture choice made things worse. Gemma 4 chose to build the Wordle clone using ESM modules with Babel transpilation and dynamic script creation in the browser. This is a fragile approach that introduces many points of failure. A simpler architecture (plain React with CSS) would have avoided most of the issues, but the model doubled down on the complex approach rather than simplifying when things went wrong.

Web search was essentially unusable in this setup. Between Aider’s harness overhead and the model’s context limits, looking anything up was impractical. You’re flying blind with whatever the model already knows.

After approximately 10 rounds of “here’s the error” followed by “I fixed it” followed by “now there’s a different error,” I gave up and asked Claude to clean it up. Total time: roughly 2 hours of active debugging for a task that should take minutes.

There’s one more issue that’s easy to overlook: both models generated a tiny word list. Gemma produced 60 words. Claude’s initial version had about 120. The real Wordle uses 2,300+ answer words, and a proper clone needs thousands to stay interesting. Neither model thought to use an external word list or generate a comprehensive one. I had to manually source the Stanford GraphBase list (5,757 common five-letter words) and plug it into both versions. It’s a good reminder that even when the code “works,” the product thinking (will this actually be fun to play?) is still on you.

Gemma’s Version (After Fixes)

Here is the Wordle that Gemma 4 produced, after Claude cleaned up the remaining bugs. It works, but it took ~2 hours and 10+ debug cycles to get here. Try it:

The Same Task with Claude

For a direct comparison, I asked Claude Code (Opus 4.6) to build the same Wordle clone. The game below is the result. Try it out:

(Playable embedded version of the Claude-built Wordle clone, with a running solved-count.)
That game was built by Claude in a single pass, with no debugging required. It worked on first run. No truncated variables, no doubled URL schemes, no invented CSS classes.

The key differences go beyond just “fewer bugs.” Claude didn’t pick a fragile architecture. There’s no Babel-in-browser, no ESM/UMD conflict. It used straightforward React with CSS animations and transforms. It handled the full word list, keyboard input (both on-screen and physical), tile flip animations, shake feedback, and game logic in a single coherent output.
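For reference, the core of any Wordle clone — and the part that is easy to get subtly wrong — is scoring a guess when letters repeat. A two-pass sketch of that logic (my own illustration, not Claude's exact code):

```javascript
// Score a guess against the answer, Wordle-style.
// Returns an array of "green" | "yellow" | "gray", one per letter.
// Pass 1 marks exact matches and counts the remaining answer letters;
// pass 2 marks present-but-misplaced letters, consuming those counts
// so repeated letters in the guess aren't over-marked yellow.
function scoreGuess(guess, answer) {
  const result = new Array(guess.length).fill("gray");
  const remaining = {};
  for (let i = 0; i < answer.length; i++) {
    if (guess[i] === answer[i]) {
      result[i] = "green";
    } else {
      remaining[answer[i]] = (remaining[answer[i]] || 0) + 1;
    }
  }
  for (let i = 0; i < guess.length; i++) {
    if (result[i] === "gray" && remaining[guess[i]] > 0) {
      result[i] = "yellow";
      remaining[guess[i]]--;
    }
  }
  return result;
}
```

A single-pass version marks too many tiles yellow when the guess repeats a letter the answer contains only once; the count-consuming second pass is what both models need to get right for the game to feel correct.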

The cost comparison is worth considering honestly. Claude Code costs money. Every prompt burns API tokens. Gemma 4 is free to run locally once you have the hardware. But 2 hours of a developer’s time debugging a local model’s output is not free. The value proposition of a capable model is not just “it works” but “it works on the first try and you can move on to the next thing.”

Metric                  Gemma 4 (local)          Claude Opus 4.6
Time to working game    ~2 hours                 ~2 minutes
Debug cycles            10+                      0
Typos introduced        5+                       0
Architecture quality    Fragile (Babel/ESM)      Clean (React/CSS)
Cost per run            $0 (local)               API tokens
Developer time cost     ~2 hours of debugging    Effectively zero

Conclusions

Gemma 4 is impressive for an open-weights model you can run on a laptop. The fact that it produces structured React code at all is remarkable. A year ago, local models of this size were barely coherent at writing prose, let alone generating working applications.

But for real productivity, the gap between local open models and frontier cloud models is still enormous. This is not a 10% difference in output quality. It is a qualitative difference in reliability. Claude produces working code on the first try. Gemma 4 produces code that is almost right but requires significant human intervention to get across the finish line.

Claude Code's price is a tremendous value once you factor in developer time saved. Two hours of debugging is not free, even if the model running on your laptop technically is. The real cost of a local model is the time you spend babysitting its output.

The sweet spot for local models today might be simpler tasks: quick scripts, text processing, boilerplate generation, or anything where the output is short enough that the model can maintain consistency. For anything requiring architectural decisions and self-consistent code across hundreds of lines, frontier models still dominate.

The trajectory matters though. If Gemma 5 next year closes even half this gap, the local-first coding workflow becomes genuinely viable. Every generation of open models gets meaningfully better. We are watching that gap shrink in real time. The question is not whether local models will eventually match frontier performance for common tasks, but when.