Same Size, 3 Years of Progress: Retraining My Image Prompt Generator
In 2023 I fine-tuned GPT-2 355M to write image prompts. Now I retrained the concept with Qwen3-0.6B on Claude-generated data. Same size class, dramatically different results.
In April 2023, I fine-tuned GPT-2 Medium (355M parameters) to expand simple image prompts into detailed ones. It worked — kinda. The model learned to pad prompts with “breathtaking,” “captivating,” and “whimsical,” throw in some camera terms, and call it a day. But the outputs were generic, repetitive, and occasionally wrong in embarrassing ways.
Three years later, everything has changed. Open-source model architectures have leapfrogged. Distillation from frontier models is a proven technique. And I have a much better idea of what constitutes a good image prompt.
So I retrained the concept. Same size class, completely different approach. The question: how much do architecture generation and data quality matter when model size stays the same?
Three Years of Progress
GPT-2 was released in 2019. Qwen3 in 2025. Same size class, completely different generation of architecture — better attention mechanisms, more efficient tokenization, trained on vastly more data during pretraining.
What Changed
Two things changed between v1 and v2, and I wanted to test whether they matter more than raw scale:
- Architecture: RoPE embeddings, GQA attention, SwiGLU activations, RMSNorm. Six years of transformer research baked in.
- Data: 9,400 prompt pairs generated by Claude across 11 categories and 5 styles. Every pair validated. Zero noise, zero duplicates.
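A "validated, zero duplicates" pass over generated pairs can be sketched as below. The post doesn't show the actual checks; `MIN_WORDS` and the `BANNED_FILLER` list are illustrative assumptions (the filler words are the v1 tics mentioned earlier):

```python
# Illustrative validation/dedup pass over (basic, expanded) prompt pairs.
# MIN_WORDS and BANNED_FILLER are assumptions, not the post's actual checks.
MIN_WORDS = 25  # expansions shorter than this add little detail
BANNED_FILLER = {"breathtaking", "captivating", "whimsical"}  # v1-style fluff

def validate_pairs(pairs):
    """pairs: list of (basic_prompt, expanded_prompt). Returns clean, deduped pairs."""
    seen = set()
    clean = []
    for basic, expanded in pairs:
        key = expanded.strip().lower()
        if key in seen:                 # zero duplicates
            continue
        words = key.split()
        if len(words) < MIN_WORDS:      # too short to count as an expansion
            continue
        if BANNED_FILLER & set(words):  # reject generic-fluff vocabulary
            continue
        seen.add(key)
        clean.append((basic, expanded))
    return clean
```

The point of a pass like this is that every rejected pair costs nothing (Claude can regenerate it), while every bad pair that slips through teaches the model a bad habit.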
Side by Side
Same input, same task. Toggle between v1 and v2 to see the difference three years makes. Pay attention to specificity, vocabulary, and accuracy:
Each example toggles between the basic input, the v1 prompt, and the v2 prompt, with generated images via Grok Imagine. The v1 outputs:

- "A breathtaking image of a tranquil, calm image of the horizon at sunset, revealing a serene, crystal-clear ocean with a tranquil white sand beach below."
- "A captivating image of a bustling, neon-lit street at night, with pedestrians weaving through the mix of pedestrian traffic."
- "A whimsical image of a cherry blossom blossoms in a lush, tropical rainforest"
- "A haunting image of a deer stalking the night sky"
- "A vibrant and dramatic image of a bustling tokyo street, showcasing the vibrant colors, textures."

By the Numbers
Across 20 test prompts, v2 consistently generates longer, more detailed outputs:
Average words per output
Speed tradeoff
v2 is nearly 2x slower — the cost of a bigger vocabulary and more layers.
What v2 Does Better
Specificity. v1 generates “a breathtaking image” and “vibrant colors.” v2 names specific cameras (Fuji X-T5), locations (Shibuya), colors (amber, grey, rose), and times (2am). This is the difference between a prompt that describes and one that directs.
Technical vocabulary. v2 includes f-stops, focal lengths, ISO values, and film stock references. It learned these from the Claude-generated training data, which was designed to mimic how professional photographers think about images.
Compositional awareness. v2 talks about where things are in the frame — “shot from a low angle,” “the child and kite at the top of the frame,” “shot from slightly to the side.” v1 rarely considers composition.
Mood and meaning. v2 often ends with a thematic statement — “the image is about domestic life at its most domestic” or “the persistence of strategy and the cost of knowledge.” v1 never does this.
Where v2 Still Fails
Hallucination. v2 turned “old man playing chess” into a scene with a mahjong table. The chess pieces are there, but so is the wrong game. v1 hallucinated too (deer instead of wolf), but v2’s hallucinations are subtler and harder to catch.
Repetitive phrasing patterns. v2 has its own verbal tics. It loves the em-dash construction (“the sky above — the colors from deep rose”) and often ends with philosophical statements that can feel overwrought.
Speed. Nearly 2x slower at 4.8 seconds per prompt vs 2.6 seconds. The larger vocabulary (151K vs 50K tokens) and deeper architecture (28 vs 24 layers) have a real cost.
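Per-prompt latency like this is easy to measure with a small harness; `generate` here is a stand-in for a call into either model, not an actual API from the post:

```python
import time

def average_seconds_per_prompt(generate, prompts, warmup=1):
    """Time generate() over prompts, excluding warm-up runs (model load, caches)."""
    for p in prompts[:warmup]:
        generate(p)
    start = time.perf_counter()
    for p in prompts:
        generate(p)
    return (time.perf_counter() - start) / len(prompts)

# With the post's numbers: 4.8 s (v2) / 2.6 s (v1) is roughly a 1.85x slowdown.
```

Warm-up matters especially on MPS, where the first generation pays one-time kernel compilation costs that would otherwise inflate the average.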
Training on Apple Silicon
I trained v2 on an M4 Pro MacBook with 64GB unified memory. It was not a smooth ride.
MPS Training Survival Guide
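A minimal sketch of the usual MPS setup, assuming PyTorch. The fallback env var and the device check are standard PyTorch behavior, not specifics from this training run:

```python
import os

# Ops without MPS kernels crash unless PyTorch is allowed to fall back to CPU.
# This must be set before torch is imported anywhere in the process.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

def pick_device():
    """Prefer MPS on Apple Silicon, fall back to CPU elsewhere."""
    try:
        import torch
        if torch.backends.mps.is_available():
            return "mps"
    except ImportError:
        pass
    return "cpu"
```

Beyond device selection, common MPS gotchas include ops that silently fall back to CPU (slow, not wrong) and memory pressure from the unified-memory pool; gradient accumulation is the usual way to keep per-step batch memory in check.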
Final Training Config
The Verdict
Architecture and data quality matter more than size. A 596M model from 2025, trained on 9,400 Claude-generated pairs, decisively beats a 355M model from 2019 trained on scraped web data. The improvement isn’t subtle — it’s the difference between “generic AI fluff” and “this reads like a photographer’s notes.”
But it’s not a clean win. v2 is slower, still hallucinates (just more subtly), and has its own repetitive patterns. The philosophical endings feel like a Claude fingerprint in the training data — the model learned Claude’s tendency to find meaning in everything.
The biggest surprise was the training data. 9,400 clean, purposeful pairs produced a better model than whatever scraped dataset v1 used. In retrospect, this is obvious: a model can only be as good as what it learns from. Garbage in, generic out. Claude in, Claude-like out.
The Plot Twist: Do We Even Need This?
Here’s the thing nobody talks about: image generators have gotten really good at understanding simple prompts.
When I generated images from the basic inputs, v1 outputs, and v2 outputs using Grok Imagine, something unexpected happened. The basic one-liners ("sunset over the ocean," "rainy tokyo street") often produced images that were just as good as the expanded prompts. Sometimes better.
Modern image generators internally enhance your prompt before generating. They add lighting, composition, and style details automatically. The gap between “sunset over the ocean” and a 150-word detailed expansion has narrowed dramatically since 2023.
Where prompt expansion still matters:
- Specific creative vision: When you want something particular that the generator wouldn’t default to
- Technical control: Specific camera, lens, film stock when the default isn’t what you want
- Avoiding hallucinations: v1 turned a wolf into a deer; a more detailed, correct prompt prevents that
- Consistency: Detailed prompts produce more predictable results across multiple generations
Where it doesn’t: if you just want “a good sunset photo,” typing those four words might genuinely be enough in 2026.
What’s Next
The real value of prompt expansion may be shifting from “make the image better” to “make the image yours.” Default AI images are polished but generic. Detailed prompts let you express a specific creative intent that the generator can’t guess from four words.
The training data and model are published — anyone can try it, fine-tune further, or use the dataset for their own experiments.
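Trying the model yourself amounts to a standard Hugging Face load-and-generate loop. The repo id and the prompt template below are placeholders, not the actual published names:

```python
# Sketch of using a published prompt-expander with Hugging Face transformers.
# The repo id and instruction format are hypothetical placeholders.
def build_prompt(basic: str) -> str:
    # Match whatever template the published dataset actually uses.
    return f"Expand this image prompt into a detailed one: {basic}"

def expand(basic: str, repo_id: str = "your-username/qwen3-0.6b-prompt-expander"):
    from transformers import AutoModelForCausalLM, AutoTokenizer  # lazy import
    tok = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    inputs = tok(build_prompt(basic), return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

Sampling with a moderate temperature (rather than greedy decoding) is what keeps a small fine-tuned expander from collapsing onto its most repetitive phrasings.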