ai · ml · stable-diffusion

I Trained a GPT-2 Model to Write Better Image Prompts Than I Can

Fine-tuned a 355M-parameter model on thousands of high-quality Midjourney and Stable Diffusion prompts. Give it a few words, and it gives you back a richly detailed, production-ready description.

I’ve been generating a lot of AI images lately. Midjourney, Stable Diffusion, DALL-E—they’re all incredible. But there’s a skill gap between “a cat sitting on a mountain” and the kind of richly detailed prompt that actually produces stunning output.

So I did what any reasonable person would do: I trained a language model to bridge that gap.

The Problem

Good image prompts are surprisingly hard to write. The best ones read like poetry crossed with a camera manual. Most people type “lighthouse on a cliff” and wonder why the results are mediocre.

I wanted a model that could take the simple version and produce the detailed version automatically.

See It In Action

Click Basic, 355M, or 7B on each image to see how the model transforms a simple prompt into a detailed one—and how the generated image changes:

Example inputs: “basketball”, “mayan temple”, “exotic bird”, “woman snowboarding”, “marble statue”, “halloween”, “roast chicken”.

How It Works

I fine-tuned GPT-2 Medium (355M parameters) on a curated dataset of high-quality image prompts. The training data includes thousands of prompts that produced excellent results across Midjourney, Stable Diffusion, and DALL-E, tagged with structured metadata:

  • BRF: The brief—the simple input prompt
  • POS: The positive expansion—the detailed, production-ready prompt
  • ENH: Enhancers—camera settings, lens type, lighting conditions
  • INS: Inspiration—photographers or artists whose style matches
  • NEG: Negative prompt—what to avoid
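To make the structure concrete, here is a minimal sketch of how one curated example might be serialized into a single training string. The tag names come from the list above; the delimiter, ordering, and the sample prompt itself are my assumptions for illustration, not the post’s confirmed format.

```python
def format_example(brief, positive, enhancers, inspiration, negative):
    """Flatten one curated prompt into a tagged training string.

    Tag names (BRF/POS/ENH/INS/NEG) follow the post; the newline
    delimiter is an assumption made for this sketch.
    """
    return (
        f"BRF: {brief}\n"
        f"POS: {positive}\n"
        f"ENH: {enhancers}\n"
        f"INS: {inspiration}\n"
        f"NEG: {negative}"
    )

# Hypothetical example record, invented for illustration:
sample = format_example(
    brief="lighthouse on a cliff",
    positive=(
        "a weathered lighthouse perched on a storm-carved cliff, "
        "crashing waves below, dramatic skies at dusk"
    ),
    enhancers="35mm lens, f/1.8, cinematic lighting, volumetric fog",
    inspiration="in the style of Ansel Adams",
    negative="blurry, low quality, watermark, text artifacts",
)
print(sample)
```

During fine-tuning, the model sees thousands of records in this shape, so at inference time a bare `BRF:` line is enough to coax it into completing the remaining fields.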

The model learns the relationship between simple descriptions and the rich, detailed expansions that image generators respond to best. The key insight is that image generators are extremely sensitive to specific vocabulary—words like “cinematic,” “golden hour,” and “bokeh” aren’t just descriptive; they’re technical triggers that meaningfully change the output.
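As a toy illustration of that trigger-vocabulary idea, the sketch below appends generator-friendly terms to a bare prompt. The trigger list here is hand-picked for the example—the actual model learns which terms to add, and where, from the training data rather than from a fixed table.

```python
# Hand-picked trigger vocabulary, grouped by the kind of effect it has.
# These groupings are illustrative assumptions, not the model's internals.
TRIGGERS = {
    "lighting": ["golden hour", "cinematic lighting"],
    "optics": ["85mm lens", "shallow depth of field", "bokeh"],
    "quality": ["highly detailed", "sharp focus"],
}

def enrich(brief, categories=("lighting", "optics", "quality")):
    """Naively expand a brief prompt by appending trigger terms."""
    extras = [term for cat in categories for term in TRIGGERS[cat]]
    return ", ".join([brief] + extras)

expanded = enrich("lighthouse on a cliff")
print(expanded)
# -> lighthouse on a cliff, golden hour, cinematic lighting, 85mm lens,
#    shallow depth of field, bokeh, highly detailed, sharp focus
```

The fine-tuned model goes well beyond this kind of bolt-on enrichment—it rewrites the whole description—but the static version shows why a handful of the right words can move the output so much.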

355M vs 7B

I also trained a 7B parameter version. The quality difference is striking—the 7B model produces more creative, more coherent prompts that better match unusual input concepts. You can see the difference in the examples above by toggling between 355M and 7B.

The 355M is “good enough for most people.” The 7B is “this is genuinely impressive.”

Try It Yourself

Type a simple prompt and let the model do its thing. Tip: Keep it short—“a dragon in a library” works better than a paragraph.
