
Same Size, 3 Years of Progress: Retraining My Image Prompt Generator

In 2023 I fine-tuned GPT-2 355M to write image prompts. Now I retrained the concept with Qwen3-0.6B on Claude-generated data. Same size class, dramatically different results.

In April 2023, I fine-tuned GPT-2 Medium (355M parameters) to expand simple image prompts into detailed ones. It worked — kinda. The model learned to pad prompts with “breathtaking,” “captivating,” and “whimsical,” throw in some camera terms, and call it a day. But the outputs were generic, repetitive, and occasionally wrong in embarrassing ways.

Three years later, everything has changed. Open-source model architectures have leapfrogged. Distillation from frontier models is a proven technique. And I have a much better idea of what constitutes a good image prompt.

So I retrained the concept. Same size class, completely different approach. The question: how much does architecture generation + data quality matter when model size stays the same?

596M parameters · 9.4K training pairs · 2h 50m training time · +27% more words

Three Years of Progress

GPT-2 was released in 2019. Qwen3 in 2025. Same size class, completely different generation of architecture — better attention mechanisms, more efficient tokenization, trained on vastly more data during pretraining.

  • 2019: OpenAI releases the GPT-2 architecture
  • 2023: v1 trained (355M params, scraped data)
  • 2025: Alibaba releases the Qwen3 architecture
  • 2026: v2 trained (596M params, Claude data)

What Changed

Two things changed between v1 and v2, and I wanted to test whether they matter more than raw scale:

Architecture
GPT-2 (2019) → Qwen3 (2025)

RoPE embeddings, GQA attention, SwiGLU activations, RMSNorm. Six years of transformer research baked into the architecture itself.
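To make one of those changes concrete, here is a minimal pure-Python sketch of RMSNorm, the normalization Qwen3 uses in place of GPT-2's LayerNorm. This is illustrative only, not the actual Qwen3 implementation:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the reciprocal root-mean-square of the inputs.

    Unlike LayerNorm (what GPT-2 uses), there is no mean subtraction and
    no bias term, which makes it cheaper while remaining just as stable.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

# With weight = 1, the output vector has (approximately) unit
# root-mean-square, regardless of the input's scale.
y = rms_norm([3.0, 4.0], [1.0, 1.0])
```

The real version operates on tensors over the hidden dimension, with `weight` as a learned per-channel scale, but the arithmetic is exactly this.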

Data Quality
Scraped web → Claude-generated

9,400 prompt pairs generated by Claude across 11 categories and 5 styles. Every pair validated. Zero noise, zero duplicates.
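The validation pass was simple in spirit. Here is a hedged sketch of the kind of checks involved; the thresholds and tuple layout are my illustration, not the actual pipeline:

```python
def clean_pairs(pairs, min_words=20):
    """Drop malformed and duplicate prompt pairs.

    Each pair is a (short_prompt, expanded_prompt) tuple. Assumed checks:
    non-empty input, expansion longer than the input and above a minimum
    length, and exact-duplicate removal on the normalized short prompt.
    """
    seen = set()
    kept = []
    for short, expanded in pairs:
        key = " ".join(short.lower().split())  # normalize case/whitespace
        if not key or key in seen:
            continue  # empty or duplicate input
        if len(expanded.split()) < min_words:
            continue  # expansion too short to be useful
        if len(expanded.split()) <= len(short.split()):
            continue  # "expansion" that didn't expand anything
        seen.add(key)
        kept.append((short, expanded))
    return kept
```

Checks like these are cheap insurance: with only 9,400 pairs, a handful of duplicates or truncated outputs would be a meaningful fraction of the dataset.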

Side by Side

Same input, same task. The v1 outputs below show the difference three years makes; pay attention to specificity, vocabulary, and accuracy:

1. "sunset over the ocean"
   v1: "A breathtaking image of a tranquil, calm image of the horizon at sunset, revealing a serene, crystal-clear ocean with a tranquil white sand beach below."

2. "neon-lit street at night"
   v1: "A captivating image of a bustling, neon-lit street at night, with pedestrians weaving through the mix of pedestrian traffic."
   Repeats "pedestrian."

3. "cherry blossoms in rain"
   v1: "A whimsical image of a cherry blossom blossoms in a lush, tropical rainforest"
   Wrong! Puts cherry blossoms in a rainforest.

4. "wolf howling at the moon"
   v1: "A haunting image of a deer stalking the night sky"
   Hallucinated a deer instead of a wolf!

5. "rainy tokyo street"
   v1: "A vibrant and dramatic image of a bustling tokyo street, showcasing the vibrant colors, textures."
   Generic, repeats "vibrant."

For each example, images were generated via Grok Imagine from the basic input, the v1 prompt, and the v2 prompt.

By the Numbers

Across 20 test prompts, v2 consistently generates longer, more detailed outputs:

Average words per output

  • v1 (GPT-2 355M): 92 words
  • v2 (Qwen3 596M): 117 words
  • +27% more words on average with v2

Speed tradeoff

v2 is nearly 2x slower — the cost of a bigger vocabulary and more layers.

  • v1 (GPT-2): 2.6s per prompt
  • v2 (Qwen3): 4.8s per prompt
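Both headline numbers fall out of the same arithmetic. A minimal sketch using the measured averages quoted above (these are the post's reported figures, not a live benchmark):

```python
def pct_change(old, new):
    # Percent change from old to new, rounded to the nearest whole percent
    return round(100 * (new - old) / old)

v1_words, v2_words = 92, 117   # average words per output, 20 test prompts
v1_secs, v2_secs = 2.6, 4.8    # average seconds per prompt

word_gain = pct_change(v1_words, v2_words)  # +27%
slowdown = v2_secs / v1_secs                # ~1.85x, i.e. "nearly 2x"
```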

What v2 Does Better

Specificity. v1 generates “a breathtaking image” and “vibrant colors.” v2 names specific cameras (Fuji X-T5), locations (Shibuya), colors (amber, grey, rose), and times (2am). This is the difference between a prompt that describes and one that directs.

Technical vocabulary. v2 includes f-stops, focal lengths, ISO values, and film stock references. It learned these from the Claude-generated training data, which was designed to mimic how professional photographers think about images.

Compositional awareness. v2 talks about where things are in the frame — “shot from a low angle,” “the child and kite at the top of the frame,” “shot from slightly to the side.” v1 rarely considers composition.

Mood and meaning. v2 often ends with a thematic statement — “the image is about domestic life at its most domestic” or “the persistence of strategy and the cost of knowledge.” v1 never does this.

Where v2 Still Fails

Hallucination. v2 turned “old man playing chess” into a scene with a mahjong table. The chess pieces are there, but so is the wrong game. v1 hallucinated too (deer instead of wolf), but v2’s hallucinations are subtler and harder to catch.

Repetitive phrasing patterns. v2 has its own verbal tics. It loves the em-dash construction (“the sky above — the colors from deep rose”) and often ends with philosophical statements that can feel overwrought.

Speed. Nearly 2x slower at 4.8 seconds per prompt vs 2.6 seconds. The larger vocabulary (151K vs 50K tokens) and deeper architecture (28 vs 24 layers) have a real cost.

Training on Apple Silicon

I trained v2 on an M4 Pro MacBook with 64GB unified memory. It was not a smooth ride.

MPS Training Survival Guide

  • Use HuggingFace Trainer, not raw PyTorch loops
  • float32 precision, not BF16 (yet)
  • batch_size=1 with gradient accumulation
  • Gradient checkpointing for anything over ~300M params
  • PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable memory limits
  • Monitor disk space before long training runs
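The environment side of that list is two lines. A sketch of the pre-run setup (the watermark variable is the actual PyTorch MPS knob; the disk check is generic housekeeping):

```shell
# Let MPS allocate beyond the default high-watermark limit. Use with care:
# 0.0 disables the cap entirely, so a leak can exhaust unified memory.
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0

# Check free disk space before a multi-hour run; checkpoints add up fast.
df -h .
```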

Final Training Config

  • Model: Qwen3-0.6B (596M params)
  • Method: full fine-tune, all params unfrozen
  • Schedule: 4 epochs (1,120 steps)
  • Precision: float32 (~9.2 GB peak)
  • Wall time: 2h 50m on an M4 Pro, 64GB
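As a HuggingFace `TrainingArguments` fragment, that config looks roughly like this. This is a sketch, not my exact invocation: the output directory name and the gradient accumulation value are placeholders (32 is an inference from 9,400 pairs over 4 epochs in 1,120 steps), and hyperparameters the post doesn't state (learning rate, warmup) are left at defaults:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qwen3-prompt-expander",  # hypothetical name
    num_train_epochs=4,
    per_device_train_batch_size=1,       # MPS-friendly micro-batch
    gradient_accumulation_steps=32,      # placeholder: effective batch of 32
    gradient_checkpointing=True,         # needed above ~300M params on MPS
    bf16=False,                          # full float32 on MPS
    fp16=False,
    logging_steps=50,
)
```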

The Verdict

Architecture and data quality matter more than size. A 596M model from 2025, trained on 9,400 Claude-generated pairs, decisively beats a 355M model from 2019 trained on scraped web data. The improvement isn’t subtle — it’s the difference between “generic AI fluff” and “this reads like a photographer’s notes.”

But it’s not a clean win. v2 is slower, still hallucinates (just more subtly), and has its own repetitive patterns. The philosophical endings feel like a Claude fingerprint in the training data — the model learned Claude’s tendency to find meaning in everything.

The biggest surprise was the training data. 9,400 clean, purposeful pairs produced a better model than whatever scraped dataset v1 used. In retrospect, this is obvious: a model can only be as good as what it learns from. Garbage in, generic out. Claude in, Claude-like out.

The Plot Twist: Do We Even Need This?

Here’s the thing nobody talks about: image generators have gotten really good at understanding simple prompts.

When I generated images from the basic inputs, v1 outputs, and v2 outputs using Grok Imagine, something unexpected happened. The basic one-liner — “sunset over the ocean,” “rainy tokyo street” — often produced images that were just as good as the expanded prompts. Sometimes better.

Modern image generators internally enhance your prompt before generating. They add lighting, composition, and style details automatically. The gap between “sunset over the ocean” and a 150-word detailed expansion has narrowed dramatically since 2023.

Where prompt expansion still matters:

  • Specific creative vision: When you want something particular that the generator wouldn’t default to
  • Technical control: Specific camera, lens, film stock when the default isn’t what you want
  • Avoiding hallucinations: v1 turned a wolf into a deer — a more detailed correct prompt prevents that
  • Consistency: Detailed prompts produce more predictable results across multiple generations

Where it doesn’t: if you just want “a good sunset photo,” typing those four words might genuinely be enough in 2026.

What’s Next

The real value of prompt expansion may be shifting from “make the image better” to “make the image yours.” Default AI images are polished but generic. Detailed prompts let you express a specific creative intent that the generator can’t guess from four words.

The training data and model are published — anyone can try it, fine-tune further, or use the dataset for their own experiments.