
Same Size, 3 Years of Progress: Retraining My Image Prompt Generator

In 2023 I fine-tuned GPT-2 355M to write image prompts. Now I retrained the concept with Qwen3-0.6B on Claude-generated data. Same size class, dramatically different results.

In April 2023, I fine-tuned GPT-2 Medium (355M parameters) to expand simple image prompts into detailed ones. It worked — kinda. The model learned to pad prompts with “breathtaking,” “captivating,” and “whimsical,” throw in some camera terms, and call it a day. But the outputs were generic, repetitive, and occasionally wrong in embarrassing ways.

Three years later, everything has changed. Open-source model architectures have leapfrogged. Distillation from frontier models is a proven technique. And I have a much better idea of what constitutes a good image prompt.

So I retrained the concept. Same size class, completely different approach. The question: how much does architecture generation + data quality matter when model size stays the same?

596M parameters · 9.4K training pairs · 2h 50m training time · +27% more words

Three Years of Progress

GPT-2 was released in 2019. Qwen3 in 2025. Same size class, completely different generation of architecture — better attention mechanisms, more efficient tokenization, trained on vastly more data during pretraining.

  • 2019: OpenAI releases the GPT-2 architecture
  • 2023: v1 trained (355M params, scraped data)
  • 2025: Alibaba releases the Qwen3 architecture
  • 2026: v2 trained (596M params, Claude data)

What Changed

Two things changed between v1 and v2, and I wanted to test whether they matter more than raw scale:

Architecture
GPT-2 (2019) → Qwen3 (2025)

RoPE embeddings, GQA attention, SwiGLU activations, RMSNorm. Six years of transformer research baked into the architecture itself.
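To make one of those changes concrete, here is a minimal pure-Python sketch of RMSNorm, the normalization Qwen3 uses in place of GPT-2's LayerNorm. This is illustrative only, not the actual Qwen3 implementation:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the reciprocal root-mean-square of the inputs.

    Unlike LayerNorm (what GPT-2 uses), there is no mean subtraction and
    no bias term, which makes it cheaper while remaining just as stable.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

# With weight = 1, the output vector has (approximately) unit
# root-mean-square, regardless of the input's scale.
y = rms_norm([3.0, 4.0], [1.0, 1.0])
```

The real version operates on tensors over the hidden dimension, with `weight` as a learned per-channel scale, but the arithmetic is exactly this.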

Data Quality
Scraped web → Claude-generated

9,400 prompt pairs generated by Claude across 11 categories and 5 styles. Every pair validated. Zero noise, zero duplicates.
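The validation pass was simple in spirit. Here is a hedged sketch of the kind of checks involved; the thresholds and tuple layout are my illustration, not the actual pipeline:

```python
def clean_pairs(pairs, min_words=20):
    """Drop malformed and duplicate prompt pairs.

    Each pair is a (short_prompt, expanded_prompt) tuple. Assumed checks:
    non-empty input, expansion longer than the input and above a minimum
    length, and exact-duplicate removal on the normalized short prompt.
    """
    seen = set()
    kept = []
    for short, expanded in pairs:
        key = " ".join(short.lower().split())  # normalize case/whitespace
        if not key or key in seen:
            continue  # empty or duplicate input
        if len(expanded.split()) < min_words:
            continue  # expansion too short to be useful
        if len(expanded.split()) <= len(short.split()):
            continue  # "expansion" that didn't expand anything
        seen.add(key)
        kept.append((short, expanded))
    return kept
```

Checks like these are cheap insurance: with only 9,400 pairs, a handful of duplicates or truncated outputs would be a meaningful fraction of the dataset.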

Side by Side

Same input, same task. The v1 outputs below show the difference three years makes; pay attention to specificity, vocabulary, and accuracy:

1. "sunset over the ocean"
   v1: "A breathtaking image of a tranquil, calm image of the horizon at sunset, revealing a serene, crystal-clear ocean with a tranquil white sand beach below."

2. "neon-lit street at night"
   v1: "A captivating image of a bustling, neon-lit street at night, with pedestrians weaving through the mix of pedestrian traffic."
   Repeats "pedestrian."

3. "cherry blossoms in rain"
   v1: "A whimsical image of a cherry blossom blossoms in a lush, tropical rainforest"
   Wrong! Puts cherry blossoms in a rainforest.

4. "wolf howling at the moon"
   v1: "A haunting image of a deer stalking the night sky"
   Hallucinated a deer instead of a wolf!

5. "rainy tokyo street"
   v1: "A vibrant and dramatic image of a bustling tokyo street, showcasing the vibrant colors, textures."
   Generic, repeats "vibrant."

For each example, images were generated via Grok Imagine from the basic input, the v1 prompt, and the v2 prompt.

By the Numbers

Across 20 test prompts, v2 consistently generates longer, more detailed outputs:

Average words per output

  • v1 (GPT-2 355M): 92 words
  • v2 (Qwen3 596M): 117 words
  • +27% more words on average with v2

Speed tradeoff

v2 is nearly 2x slower — the cost of a bigger vocabulary and more layers.

  • v1 (GPT-2): 2.6s per prompt
  • v2 (Qwen3): 4.8s per prompt
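Both headline numbers fall out of the same arithmetic. A minimal sketch using the measured averages quoted above (these are the post's reported figures, not a live benchmark):

```python
def pct_change(old, new):
    # Percent change from old to new, rounded to the nearest whole percent
    return round(100 * (new - old) / old)

v1_words, v2_words = 92, 117   # average words per output, 20 test prompts
v1_secs, v2_secs = 2.6, 4.8    # average seconds per prompt

word_gain = pct_change(v1_words, v2_words)  # +27%
slowdown = v2_secs / v1_secs                # ~1.85x, i.e. "nearly 2x"
```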

What v2 Does Better

Specificity. v1 generates “a breathtaking image” and “vibrant colors.” v2 names specific cameras (Fuji X-T5), locations (Shibuya), colors (amber, grey, rose), and times (2am). This is the difference between a prompt that describes and one that directs.

Technical vocabulary. v2 includes f-stops, focal lengths, ISO values, and film stock references. It learned these from the Claude-generated training data, which was designed to mimic how professional photographers think about images.

Compositional awareness. v2 talks about where things are in the frame — “shot from a low angle,” “the child and kite at the top of the frame,” “shot from slightly to the side.” v1 rarely considers composition.

Mood and meaning. v2 often ends with a thematic statement — “the image is about domestic life at its most domestic” or “the persistence of strategy and the cost of knowledge.” v1 never does this.

Where v2 Still Fails

Hallucination. v2 turned “old man playing chess” into a scene with a mahjong table. The chess pieces are there, but so is the wrong game. v1 hallucinated too (deer instead of wolf), but v2’s hallucinations are subtler and harder to catch.

Repetitive phrasing patterns. v2 has its own verbal tics. It loves the em-dash construction (“the sky above — the colors from deep rose”) and often ends with philosophical statements that can feel overwrought.

Speed. Nearly 2x slower at 4.8 seconds per prompt vs 2.6 seconds. The larger vocabulary (151K vs 50K tokens) and deeper architecture (28 vs 24 layers) have a real cost.

Training on Apple Silicon

I trained v2 on an M4 Pro MacBook with 64GB unified memory. It was not a smooth ride.

MPS Training Survival Guide

  • Use HuggingFace Trainer, not raw PyTorch loops
  • float32 precision, not BF16 (yet)
  • batch_size=1 with gradient accumulation
  • Gradient checkpointing for anything over ~300M params
  • PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable memory limits
  • Monitor disk space before long training runs
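The environment side of that list is two lines. A sketch of the pre-run setup (the watermark variable is the actual PyTorch MPS knob; the disk check is generic housekeeping):

```shell
# Let MPS allocate beyond the default high-watermark limit. Use with care:
# 0.0 disables the cap entirely, so a leak can exhaust unified memory.
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0

# Check free disk space before a multi-hour run; checkpoints add up fast.
df -h .
```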

Final Training Config

  • Model: Qwen3-0.6B (596M params)
  • Method: full fine-tune, all params unfrozen
  • Schedule: 4 epochs (1,120 steps)
  • Precision: float32 (~9.2 GB peak)
  • Wall time: 2h 50m on an M4 Pro, 64GB
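As a HuggingFace `TrainingArguments` fragment, that config looks roughly like this. This is a sketch, not my exact invocation: the output directory name and the gradient accumulation value are placeholders (32 is an inference from 9,400 pairs over 4 epochs in 1,120 steps), and hyperparameters the post doesn't state (learning rate, warmup) are left at defaults:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qwen3-prompt-expander",  # hypothetical name
    num_train_epochs=4,
    per_device_train_batch_size=1,       # MPS-friendly micro-batch
    gradient_accumulation_steps=32,      # placeholder: effective batch of 32
    gradient_checkpointing=True,         # needed above ~300M params on MPS
    bf16=False,                          # full float32 on MPS
    fp16=False,
    logging_steps=50,
)
```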

The Verdict

Architecture and data quality matter more than size. A 596M model from 2025, trained on 9,400 Claude-generated pairs, decisively beats a 355M model from 2019 trained on scraped web data. The improvement isn’t subtle — it’s the difference between “generic AI fluff” and “this reads like a photographer’s notes.”

But it’s not a clean win. v2 is slower, still hallucinates (just more subtly), and has its own repetitive patterns. The philosophical endings feel like a Claude fingerprint in the training data — the model learned Claude’s tendency to find meaning in everything.

The biggest surprise was the training data. 9,400 clean, purposeful pairs produced a better model than whatever scraped dataset v1 used. In retrospect, this is obvious: a model can only be as good as what it learns from. Garbage in, generic out. Claude in, Claude-like out.

The Plot Twist: Do We Even Need This?

Here’s the thing nobody talks about: image generators have gotten really good at understanding simple prompts.

When I generated images from the basic inputs, v1 outputs, and v2 outputs using Grok Imagine, something unexpected happened. The basic one-liner — “sunset over the ocean,” “rainy tokyo street” — often produced images that were just as good as the expanded prompts. Sometimes better.

Modern image generators internally enhance your prompt before generating. They add lighting, composition, and style details automatically. The gap between “sunset over the ocean” and a 150-word detailed expansion has narrowed dramatically since 2023.

Where prompt expansion still matters:

  • Specific creative vision: When you want something particular that the generator wouldn’t default to
  • Technical control: Specific camera, lens, film stock when the default isn’t what you want
  • Avoiding hallucinations: v1 turned a wolf into a deer — a more detailed correct prompt prevents that
  • Consistency: Detailed prompts produce more predictable results across multiple generations

Where it doesn’t: if you just want “a good sunset photo,” typing those four words might genuinely be enough in 2026.

What’s Next

The real value of prompt expansion may be shifting from “make the image better” to “make the image yours.” Default AI images are polished but generic. Detailed prompts let you express a specific creative intent that the generator can’t guess from four words.

The training data and model are published — anyone can try it, fine-tune further, or use the dataset for their own experiments.