Meta just shipped Muse Spark — their first natively multimodal reasoning model. The blog post was the usual parade of benchmark tables and cherry-picked demos. Benchmarks are benchmarks. I wanted to see for myself.
So I ran three tests across five frontier models: Meta Muse Spark (Thinking), Claude Opus 4.6 (Thinking), GPT-5.4, Gemini 3.1 (Thinking), and Grok 4.2 (Expert). Same prompts, same context, zero retries. One shot each.
Test 1: Read the Menu
I grabbed a photo of a chalkboard menu from Yezzi’s — handwritten chalk, glass reflections, multiple sections with prices, add-ons, and fine print. Then I asked each model: “What’s on the menu?”

This tests whether a model can actually read messy real-world images, not just describe them. The difference between “Boursin Turkey Sandwich” and “Brazini Sandwich” is the difference between useful and hallucinated.
Menu Reading: Accuracy vs. Hallucinations
Green = items correctly identified (of 17). Red = hallucinated items or prices.
Consensus Scores
Instead of hand-verifying every item, I used a consensus method: if 4+ out of 5 models agreed on an item or price, it was marked as ground truth. Here’s how each model aligned with that consensus.
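The consensus rule is easy to mechanize. Here's a minimal sketch in JavaScript, using hypothetical item lists rather than the actual transcripts (which aren't reproduced here):

```javascript
// Sketch of the consensus-scoring method described above.
// An item counts as ground truth when 4+ of the 5 models report it.
function consensusGroundTruth(modelItems, threshold = 4) {
  const counts = new Map();
  for (const items of modelItems) {
    // de-duplicate per model so one model can't vote twice
    for (const item of new Set(items)) {
      counts.set(item, (counts.get(item) ?? 0) + 1);
    }
  }
  return [...counts]
    .filter(([, votes]) => votes >= threshold)
    .map(([item]) => item);
}

// Fraction of the consensus items a given model found.
function alignmentScore(items, groundTruth) {
  const truth = new Set(groundTruth);
  const hits = items.filter((i) => truth.has(i)).length;
  return hits / truth.size;
}
```

The obvious caveat, which applies to the scores below too: if four models hallucinate the same item, consensus calls it real. That didn't appear to happen here, but it's a known weakness of the method.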
The Hallucination Hall of Fame
Grok 4.2 described the spicy chicken sandwich as having “Carolina spices, turbo w/ lemon, and a salty martini olive.” It was Nashville spices, lettuce, and spicy mayo. Not even close.

GPT-5.4 confidently identified a “Slapped Wagyu Dog” for $8. It was a Salami Wrapped Dog. I wish the Wagyu version existed.

Claude read “Junior Beef” as “Lemon Beef,” “Boursin” as “Brazini,” and “Fried Chx” as “French Dip.” Barely recognizable. Like reading a menu through a waterfall.

Gemini added “apples” to the chicken salad that weren’t there and renamed “Spicy Aioli” to “Garlic Aioli.” At least these sound like real food. The smoothest hallucinator.
The most telling pattern: each model handles uncertainty differently. Meta gets it right or stays vague. Gemini smooths things over. GPT-5.4 guesses confidently. Grok invents food. Claude gives up and misreads the word entirely.
Test 2: Stock Analysis with Real Numbers
I asked each model to find current stock prices for NVIDIA, AMD, and Intel, calculate their P/E ratios from latest reported earnings, and pick the best value. This tests tool use (can it fetch live data?), math (does the arithmetic check out?), and reasoning (does the recommendation follow from the evidence?).
| Data Point | Meta | Claude | GPT-5.4 | Gemini | Grok |
|---|---|---|---|---|---|
| NVDA Price | $177.64 | $177.64 | $181.53 | $177.64 | $181.60 |
| NVDA EPS | $4.90 | $4.90 | $4.90 | $4.90 | $4.90 |
| NVDA P/E | 36.3x | 36.3x | 37.0x | 36.3x | 37.1x |
| AMD P/E | 82–84x | 81.5x | 87.5x | 83.6x | 89.2x |
| INTC P/E | N/A | N/A | N/A | N/A | N/A |
| Best Value | NVDA | NVDA | NVDA | NVDA | NVDA |
All five models picked NVIDIA. Unanimous. NVIDIA’s $4.90 EPS was the one hard number every model nailed. The differentiation was in the depth of analysis, not the conclusion.
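The P/E arithmetic behind the table is just price divided by trailing EPS, which also explains the spread in the NVDA row: the models fetched slightly different prices, not different earnings. A quick check with the table's own numbers:

```javascript
// Trailing P/E = share price / EPS (TTM). Inputs are the figures
// from the table above; nothing here is fetched live.
const peRatio = (price, eps) => price / eps;

const nvdaConsensus = peRatio(177.64, 4.90); // ≈ 36.25, reported as 36.3x
const nvdaGpt54     = peRatio(181.53, 4.90); // ≈ 37.05, reported as 37.0x
```

Every NVDA P/E in the table is consistent with the $4.90 EPS, so the 36.3x vs. 37.0x gap is purely a price-freshness difference. Intel's P/E is N/A because trailing earnings were negative, which makes the ratio meaningless.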
Test 3: Build a Snow Globe
I asked each model to generate a single HTML file with a 3D snow globe using Three.js — glass sphere with refraction, pine trees and a house inside, falling snow particles, auto-orbiting camera, translucent materials, the works. One prompt, one shot, no iteration.
This is where things got interesting. I analyzed the code before opening the files in a browser. The code-level ranking looked like this: Meta had the best glass material, Gemini had 478 lines of sophisticated particle physics with shadow maps, and GPT-5.4 wrote the simplest, least ambitious code.
Then I opened them in a browser.
Code Analysis vs. What You Actually See
Code quality (static analysis): Meta wrote the most technically correct Three.js. Then it rendered a black screen.
The ranking completely inverted. Meta’s technically correct glass material produced a black screen — the lighting couldn’t penetrate. Gemini’s 478 lines of sophisticated snow physics had a clock bug (getElapsedTime() internally calls getDelta(), so calling both each frame gives dt ≈ 0) that froze the snow completely. GPT-5.4’s simple, safe code was the only one where you could actually see what was happening.
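For the curious, here's the clock bug in isolation. This is a simplified sketch of three.js's Clock, not the library's actual source, reduced to the two methods involved:

```javascript
// Simplified model of three.js's Clock, enough to show the trap.
// getElapsedTime() calls getDelta() internally, which advances oldTime.
// So a second getDelta() in the same frame sees almost no time passed
// and returns ~0, freezing any animation driven by that delta.
class Clock {
  constructor() {
    this.oldTime = performance.now();
    this.elapsedTime = 0;
  }
  getDelta() {
    const newTime = performance.now();
    const diff = (newTime - this.oldTime) / 1000;
    this.oldTime = newTime; // side effect: resets the delta baseline
    this.elapsedTime += diff;
    return diff;
  }
  getElapsedTime() {
    this.getDelta(); // the hidden call that causes the bug
    return this.elapsedTime;
  }
}

// Buggy per-frame pattern (what Gemini's render loop did):
//   const t  = clock.getElapsedTime();
//   const dt = clock.getDelta(); // ~0 every frame, so the snow never moves
// Fix: call getDelta() once per frame and accumulate elapsed time yourself.
```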
Nobody passed this test. Zero models produced a snow globe you’d actually want to look at. But GPT-5.4 came closest — not by writing the best code, but by getting the basics right: enough light, visible glass, a scene you can see.
See For Yourself
All five snow globes, running live. Same prompt, one shot each.
The Final Scoreboard
Each model gets a composite score from 0–100. The formula: convert each test’s rank (1st–5th) to points (100, 75, 50, 25, 0), then average across all three tests.
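In code, the scoring looks like this. Meta's ranks (1st on vision, 1st on stocks, 4th on code) are stated in the writeup; the snippet is just the formula applied to them, not a re-derivation of every model's score:

```javascript
// Composite scoring as described: rank 1st-5th maps to
// 100, 75, 50, 25, 0 points, averaged across the three tests.
const rankToPoints = (rank) => (5 - rank) * 25; // 1st -> 100 ... 5th -> 0

function compositeScore(ranks) {
  const points = ranks.map(rankToPoints);
  return points.reduce((a, b) => a + b, 0) / points.length;
}

// Meta: 1st on vision, 1st on stocks, 4th on code
// -> (100 + 100 + 25) / 3 = 75
const metaScore = compositeScore([1, 1, 4]);
```

A side effect of this formula worth noting: rank-based scoring throws away margins. A model that lost the menu test by one item and one that hallucinated half the board get the same 2nd-place points.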
Meta Muse Spark takes it. Two first-place finishes on vision and analysis gave it enough runway to absorb a 4th-place finish on code generation. Nobody else won more than one test. GPT-5.4’s snow globe victory wasn’t enough to offset a mediocre showing on the other two. Claude was the most consistent — never first, never last (except on that menu) — landing solidly in second.
What I Actually Learned
1. Confabulation style is more revealing than accuracy. Every model hallucinated something. The way they hallucinate tells you more than benchmark scores. Meta stays quiet when unsure. Grok invents plausible-sounding food (“salty martini olive”). GPT-5.4 states wrong numbers with full confidence. Gemini smooths over gaps with generic substitutes. Claude just misreads the word entirely.
2. Code quality ≠ output quality. The snow globe test was the clearest example: Meta’s code was technically the best Three.js in the batch and rendered a black screen. GPT-5.4’s code was the simplest and was the only watchable result. In coding, “it works” beats “it’s correct” every time.
3. Meta Muse Spark is genuinely impressive at multimodal tasks. On the menu reading, it was the only model with zero consensus-breaking hallucinations. On stocks, it caught same-day market news and calculated forward P/E ratios nobody else considered. For perception and tool use, it’s the real deal.
4. Meta Muse Spark is the overall winner — but barely. Two first-place finishes carried it past a rough code test. The model that crushed the vision test came last on code. The model that wrote the best code produced the worst visual. The model that had the deepest financial analysis couldn’t read a menu. You’re not choosing a model — you’re choosing a model for a task.
Methodology note: All tests were run on April 8, 2026. Models used: Meta Muse Spark (Thinking), Claude Opus 4.6 (Thinking), GPT-5.4 (ChatGPT), Gemini 3.1 (Thinking), and Grok 4.2 (Expert). Each model received identical prompts with no system instructions, no retries, and no follow-up clarifications. For the menu test, ground truth was established by consensus (4+ of 5 models agreeing) rather than human verification. The stock test was run during market hours. The snow globe test evaluated both code quality (static analysis) and visual output (human review in Chrome).