ai · ml · stable-diffusion

I Trained a GPT-2 Model to Write Better Image Prompts Than I Can

Fine-tuned a 355M-parameter model on thousands of high-quality Midjourney and Stable Diffusion prompts. Give it a few words, and it gives you back a richly detailed, production-ready description.

I’ve been generating a lot of AI images lately. Midjourney, Stable Diffusion, DALL-E—they’re all incredible. But there’s a skill gap between “a cat sitting on a mountain” and the kind of richly detailed prompt that actually produces stunning output.

So I did what any reasonable person would do: I trained a language model to bridge that gap.

The Problem

Good image prompts are surprisingly hard to write. The best ones read like poetry crossed with a camera manual. Most people type “lighthouse on a cliff” and wonder why the results are mediocre.

I wanted a model that could take the simple version and produce the detailed version automatically.

See It In Action

Click Basic, 355M, or 7B on each image to see how the model transforms a simple prompt into a detailed one—and how the generated image changes:

Example inputs: “basketball”, “mayan temple”, “exotic bird”, “woman snowboarding”, “marble statue”, “halloween”, “roast chicken”.

How It Works

I fine-tuned GPT-2 Medium (355M parameters) on a curated dataset of high-quality image prompts. The training data includes thousands of prompts that produced excellent results across Midjourney, Stable Diffusion, and DALL-E, tagged with structured metadata:

  • BRF: The brief—the simple input prompt
  • POS: The positive expansion—the detailed, production-ready prompt
  • ENH: Enhancers—camera settings, lens type, lighting conditions
  • INS: Inspiration—photographers or artists whose style matches
  • NEG: Negative prompt—what to avoid
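To make the structure concrete, here is a minimal sketch of how one curated example might be serialized into a single training string. The tag names come from the list above; the delimiter, ordering, and the sample prompt itself are my assumptions for illustration, not the post’s confirmed format.

```python
def format_example(brief, positive, enhancers, inspiration, negative):
    """Flatten one curated prompt into a tagged training string.

    Tag names (BRF/POS/ENH/INS/NEG) follow the post; the newline
    delimiter is an assumption made for this sketch.
    """
    return (
        f"BRF: {brief}\n"
        f"POS: {positive}\n"
        f"ENH: {enhancers}\n"
        f"INS: {inspiration}\n"
        f"NEG: {negative}"
    )

# Hypothetical example record, invented for illustration:
sample = format_example(
    brief="lighthouse on a cliff",
    positive=(
        "a weathered lighthouse perched on a storm-carved cliff, "
        "crashing waves below, dramatic skies at dusk"
    ),
    enhancers="35mm lens, f/1.8, cinematic lighting, volumetric fog",
    inspiration="in the style of Ansel Adams",
    negative="blurry, low quality, watermark, text artifacts",
)
print(sample)
```

During fine-tuning, the model sees thousands of records in this shape, so at inference time a bare `BRF:` line is enough to coax it into completing the remaining fields.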

The model learns the relationship between simple descriptions and the rich, detailed expansions that image generators respond to best. The key insight is that image generators are extremely sensitive to specific vocabulary—words like “cinematic,” “golden hour,” and “bokeh” aren’t just descriptive; they’re technical triggers that meaningfully change the output.
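As a toy illustration of that trigger-vocabulary idea, the sketch below appends generator-friendly terms to a bare prompt. The trigger list here is hand-picked for the example—the actual model learns which terms to add, and where, from the training data rather than from a fixed table.

```python
# Hand-picked trigger vocabulary, grouped by the kind of effect it has.
# These groupings are illustrative assumptions, not the model's internals.
TRIGGERS = {
    "lighting": ["golden hour", "cinematic lighting"],
    "optics": ["85mm lens", "shallow depth of field", "bokeh"],
    "quality": ["highly detailed", "sharp focus"],
}

def enrich(brief, categories=("lighting", "optics", "quality")):
    """Naively expand a brief prompt by appending trigger terms."""
    extras = [term for cat in categories for term in TRIGGERS[cat]]
    return ", ".join([brief] + extras)

expanded = enrich("lighthouse on a cliff")
print(expanded)
# -> lighthouse on a cliff, golden hour, cinematic lighting, 85mm lens,
#    shallow depth of field, bokeh, highly detailed, sharp focus
```

The fine-tuned model goes well beyond this kind of bolt-on enrichment—it rewrites the whole description—but the static version shows why a handful of the right words can move the output so much.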

355M vs 7B

I also trained a 7B parameter version. The quality difference is striking—the 7B model produces more creative, more coherent prompts that better match unusual input concepts. You can see the difference in the examples above by toggling between 355M and 7B.

The 355M is “good enough for most people.” The 7B is “this is genuinely impressive.”

Try It Yourself

Type a simple prompt and let the model do its thing. Tip: Keep it short—“a dragon in a library” works better than a paragraph.
