Meta just shipped Muse Spark — their first natively multimodal reasoning model. The blog post was the usual parade of benchmark tables and cherry-picked demos. Benchmarks are benchmarks. I wanted to see for myself.
So I ran three tests across five frontier models: Meta Muse Spark (Thinking), Claude Opus 4.6 (Thinking), GPT-5.4, Gemini 3.1 (Thinking), and Grok 4.2 (Expert). Same prompts, same context, zero retries. One shot each.
Test 1: Read the Menu
I grabbed a photo of a chalkboard menu from Yezzi’s — handwritten chalk, glass reflections, multiple sections with prices, add-ons, and fine print. Then I asked each model: “What’s on the menu?”

This tests whether a model can actually read messy real-world images, not just describe them. The difference between “Boursin Turkey Sandwich” and “Brazini Sandwich” is the difference between useful and hallucinated.
Menu Reading: Accuracy vs. Hallucinations
Green = items correctly identified (of 17). Red = hallucinated items or prices.
Consensus Scores
Instead of hand-verifying every item, I used a consensus method: if 4+ out of 5 models agreed on an item or price, it was marked as ground truth. Here’s how each model aligned with that consensus.
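The consensus rule is easy to mechanize. Here's a minimal sketch in JavaScript, using hypothetical item lists rather than the actual transcripts (which aren't reproduced here):

```javascript
// Sketch of the consensus-scoring method described above.
// An item counts as ground truth when 4+ of the 5 models report it.
function consensusGroundTruth(modelItems, threshold = 4) {
  const counts = new Map();
  for (const items of modelItems) {
    // de-duplicate per model so one model can't vote twice
    for (const item of new Set(items)) {
      counts.set(item, (counts.get(item) ?? 0) + 1);
    }
  }
  return [...counts]
    .filter(([, votes]) => votes >= threshold)
    .map(([item]) => item);
}

// Fraction of the consensus items a given model found.
function alignmentScore(items, groundTruth) {
  const truth = new Set(groundTruth);
  const hits = items.filter((i) => truth.has(i)).length;
  return hits / truth.size;
}
```

The obvious caveat, which applies to the scores below too: if four models hallucinate the same item, consensus calls it real. That didn't appear to happen here, but it's a known weakness of the method.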
The Hallucination Hall of Fame
Grok 4.2 described the spicy chicken sandwich as having “Carolina spices, turbo w/ lemon, and a salty martini olive.” It was Nashville spices, lettuce, and spicy mayo. Not even close.

GPT-5.4 confidently identified a “Slapped Wagyu Dog” for $8. It was a Salami Wrapped Dog. I wish the Wagyu version existed.

Claude read “Junior Beef” as “Lemon Beef,” “Boursin” as “Brazini,” and “Fried Chx” as “French Dip.” Barely recognizable. Like reading a menu through a waterfall.

Gemini added “apples” to the chicken salad that weren’t there and renamed “Spicy Aioli” to “Garlic Aioli.” At least these sound like real food. The smoothest hallucinator.
The most telling pattern: each model handles uncertainty differently. Meta gets it right or stays vague. Gemini smooths things over. GPT-5.4 guesses confidently. Grok invents food. Claude gives up and misreads the word entirely.
Test 2: Stock Analysis with Real Numbers
I asked each model to find current stock prices for NVIDIA, AMD, and Intel, calculate their P/E ratios from latest reported earnings, and pick the best value. This tests tool use (can it fetch live data?), math (does the arithmetic check out?), and reasoning (does the recommendation follow from the evidence?).
| Data Point | Meta | Claude | GPT-5.4 | Gemini | Grok |
|---|---|---|---|---|---|
| NVDA Price | $177.64 | $177.64 | $181.53 | $177.64 | $181.60 |
| NVDA EPS | $4.90 | $4.90 | $4.90 | $4.90 | $4.90 |
| NVDA P/E | 36.3x | 36.3x | 37.0x | 36.3x | 37.1x |
| AMD P/E | 82–84x | 81.5x | 87.5x | 83.6x | 89.2x |
| INTC P/E | N/A | N/A | N/A | N/A | N/A |
| Best Value | NVDA | NVDA | NVDA | NVDA | NVDA |
All five models picked NVIDIA. Unanimous. NVIDIA’s $4.90 EPS was the one hard number every model nailed. The differentiation was in the depth of analysis, not the conclusion.
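The P/E arithmetic behind the table is just price divided by trailing EPS, which also explains the spread in the NVDA row: the models fetched slightly different prices, not different earnings. A quick check with the table's own numbers:

```javascript
// Trailing P/E = share price / EPS (TTM). Inputs are the figures
// from the table above; nothing here is fetched live.
const peRatio = (price, eps) => price / eps;

const nvdaConsensus = peRatio(177.64, 4.90); // ≈ 36.25, reported as 36.3x
const nvdaGpt54     = peRatio(181.53, 4.90); // ≈ 37.05, reported as 37.0x
```

Every NVDA P/E in the table is consistent with the $4.90 EPS, so the 36.3x vs. 37.0x gap is purely a price-freshness difference. Intel's P/E is N/A because trailing earnings were negative, which makes the ratio meaningless.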
Test 3: Build a Snow Globe
I asked each model to generate a single HTML file with a 3D snow globe using Three.js — glass sphere with refraction, pine trees and a house inside, falling snow particles, auto-orbiting camera, translucent materials, the works. One prompt, one shot, no iteration.
This is where things got interesting. I analyzed the code before opening the files in a browser. The code-level ranking looked like this: Meta had the best glass material, Gemini had 478 lines of sophisticated particle physics with shadow maps, and GPT-5.4 wrote the simplest, least ambitious code.
Then I opened them in a browser.
Code Analysis vs. What You Actually See
Code quality (static analysis): Meta wrote the most technically correct Three.js. Then it rendered a black screen.
The ranking completely inverted. Meta’s technically correct glass material produced a black screen — the lighting couldn’t penetrate. Gemini’s 478 lines of sophisticated snow physics had a clock bug (getElapsedTime() internally calls getDelta(), so calling both each frame gives dt ≈ 0) that froze the snow completely. GPT-5.4’s simple, safe code was the only one where you could actually see what was happening.
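For the curious, here's the clock bug in isolation. This is a simplified sketch of three.js's Clock, not the library's actual source, reduced to the two methods involved:

```javascript
// Simplified model of three.js's Clock, enough to show the trap.
// getElapsedTime() calls getDelta() internally, which advances oldTime.
// So a second getDelta() in the same frame sees almost no time passed
// and returns ~0, freezing any animation driven by that delta.
class Clock {
  constructor() {
    this.oldTime = performance.now();
    this.elapsedTime = 0;
  }
  getDelta() {
    const newTime = performance.now();
    const diff = (newTime - this.oldTime) / 1000;
    this.oldTime = newTime; // side effect: resets the delta baseline
    this.elapsedTime += diff;
    return diff;
  }
  getElapsedTime() {
    this.getDelta(); // the hidden call that causes the bug
    return this.elapsedTime;
  }
}

// Buggy per-frame pattern (what Gemini's render loop did):
//   const t  = clock.getElapsedTime();
//   const dt = clock.getDelta(); // ~0 every frame, so the snow never moves
// Fix: call getDelta() once per frame and accumulate elapsed time yourself.
```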
Nobody passed this test. Zero models produced a snow globe you’d actually want to look at. But GPT-5.4 came closest — not by writing the best code, but by getting the basics right: enough light, visible glass, a scene you can see.
See For Yourself
All five snow globes, running live. Same prompt, one shot each.
The Final Scoreboard
Each model gets a composite score from 0–100. The formula: convert each test’s rank (1st–5th) to points (100, 75, 50, 25, 0), then average across all three tests.
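In code, the scoring looks like this. Meta's ranks (1st on vision, 1st on stocks, 4th on code) are stated in the writeup; the snippet is just the formula applied to them, not a re-derivation of every model's score:

```javascript
// Composite scoring as described: rank 1st-5th maps to
// 100, 75, 50, 25, 0 points, averaged across the three tests.
const rankToPoints = (rank) => (5 - rank) * 25; // 1st -> 100 ... 5th -> 0

function compositeScore(ranks) {
  const points = ranks.map(rankToPoints);
  return points.reduce((a, b) => a + b, 0) / points.length;
}

// Meta: 1st on vision, 1st on stocks, 4th on code
// -> (100 + 100 + 25) / 3 = 75
const metaScore = compositeScore([1, 1, 4]);
```

A side effect of this formula worth noting: rank-based scoring throws away margins. A model that lost the menu test by one item and one that hallucinated half the board get the same 2nd-place points.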
Meta Muse Spark takes it. Two first-place finishes on vision and analysis gave it enough runway to absorb a 4th-place finish on code generation. Nobody else won more than one test. GPT-5.4’s snow globe victory wasn’t enough to offset a mediocre showing on the other two. Claude was the most consistent — never first, never last (except on that menu) — landing solidly in second.
What I Actually Learned
1. Confabulation style is more revealing than accuracy. Every model hallucinated something. The way they hallucinate tells you more than benchmark scores. Meta stays quiet when unsure. Grok invents plausible-sounding food (“salty martini olive”). GPT-5.4 states wrong numbers with full confidence. Gemini smooths over gaps with generic substitutes. Claude just misreads the word entirely.
2. Code quality ≠ output quality. The snow globe test was the clearest example: Meta’s code was technically the best Three.js in the batch and rendered a black screen. GPT-5.4’s code was the simplest and was the only watchable result. In coding, “it works” beats “it’s correct” every time.
3. Meta Muse Spark is genuinely impressive at multimodal tasks. On the menu reading, it was the only model with zero consensus-breaking hallucinations. On stocks, it caught same-day market news and calculated forward P/E ratios nobody else considered. For perception and tool use, it’s the real deal.
4. Meta Muse Spark is the overall winner — but barely. Two first-place finishes carried it past a rough code test. The model that crushed the vision test came last on code. The model that wrote the best code produced the worst visual. The model that had the deepest financial analysis couldn’t read a menu. You’re not choosing a model — you’re choosing a model for a task.
Methodology note: All tests were run on April 8, 2026. Models used: Meta Muse Spark (Thinking), Claude Opus 4.6 (Thinking), GPT-5.4 (ChatGPT), Gemini 3.1 (Thinking), and Grok 4.2 (Expert). Each model received identical prompts with no system instructions, no retries, and no follow-up clarifications. For the menu test, ground truth was established by consensus (4+ of 5 models agreeing) rather than human verification. The stock test was run during market hours. The snow globe test evaluated both code quality (static analysis) and visual output (human review in Chrome).