I Built a Browser-Based AI Music Generator

A working AI music platform — type a prompt, get a full song with vocals, lyrics, and cover art. One person, open-source models.

I typed “melancholic French jazz with a female vocalist and upright bass” into a text box and got a full song back. Vocals, piano, brushed drums, the works. It took about 90 seconds.

The site is theory808.com. It’s live, it works, and you can try it right now. What I want to share is what’s possible today with publicly available AI models, and why I think browser-based music generation is about to get very interesting.

What It Does

You type a prompt. Something simple like “lo-fi jazz for a rainy evening” or “dark synthwave with 808s and a driving bassline.” The system does three things — the first feeds the other two, which run in parallel:

  1. An LLM preprocesses your prompt — it splits your casual description into an optimized music caption (instruments, genre, production style), structured lyrics with section tags, a song title, and suggested BPM/key. One input, five outputs.
  2. A music generation model produces the audio — a commercially-licensed, open-source model running on serverless GPUs. Each generation produces two variants so you can pick the better one.
  3. An image model generates cover art — because every track deserves album art. It runs concurrently with the music, so it doesn’t add wait time.

The result: a complete “release” — two audio variants with vocals and lyrics, plus album cover art — from a single sentence.
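The three-stage flow above can be sketched in TypeScript. Everything here is illustrative: the function names (`preprocessPrompt`, `generateMusic`, `generateCoverArt`) and their stub bodies are hypothetical stand-ins for the real LLM, music-model, and image-model calls, not Theory 808’s actual code.

```typescript
interface SongPlan {
  title: string;
  caption: string; // optimized music description for the audio model
  lyrics: string;  // tagged with [Verse]/[Chorus]/[Bridge]
  bpm: number;
  key: string;
}

// Stubbed stages; real versions would call an LLM, a music model on
// serverless GPUs, and an image model respectively.
async function preprocessPrompt(prompt: string): Promise<SongPlan> {
  return {
    title: prompt,
    caption: `jazz, piano, brushed drums, ${prompt}`,
    lyrics: "[Verse] ... [Chorus] ...",
    bpm: 92,
    key: "Am",
  };
}
async function generateMusic(plan: SongPlan): Promise<string[]> {
  return ["variant-a.mp3", "variant-b.mp3"]; // two variants per generation
}
async function generateCoverArt(plan: SongPlan): Promise<string> {
  return "cover.png";
}

async function generateRelease(prompt: string) {
  // Step 1 runs first: its output feeds both downstream models.
  const plan = await preprocessPrompt(prompt);
  // Steps 2 and 3 run concurrently, so cover art adds no wait time.
  const [variants, coverArt] = await Promise.all([
    generateMusic(plan),
    generateCoverArt(plan),
  ]);
  return { plan, variants, coverArt };
}
```

The key design point is that only the cheap LLM call is sequential; the two expensive generations overlap.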

What Surprised Me

The quality of open-source music models. There are commercially-licensed, MIT-licensed music generation models available today that produce genuinely good output. Vocals in 50+ languages. Structure tags that give you control over verse/chorus/bridge arrangement. BPM, key, and time signature control. Up to 10 minutes per track at 48kHz.

A year ago, this tier of quality required API access to closed platforms. Now you can download the weights and run it yourself.

The preprocessing matters more than the model. The raw model is good but inconsistent. The real differentiator is what happens before the audio generation starts. An LLM takes your casual “90s grunge anthem about highway driving” and transforms it into a precise music description with specific instruments, vocal character, mood, and production style — plus writes structured lyrics with proper verse/chorus/bridge tags. This preprocessing step is what makes the difference between “interesting demo” and “this actually sounds like a song.”

You can build a production platform solo. Theory 808 has user accounts (Google OAuth + email), a credit-based billing system with Stripe, a persistent music player, an explore feed, a library, settings, legal pages, and admin tools. One person. The AI coding tools available today make this kind of scope realistic for a solo builder.

Listen For Yourself

Every track below was generated from a single text prompt. No editing, no post-processing, no cherry-picking. What you hear is what the model produced.


Try it yourself on theory808.com

The Architecture (High Level)

The platform is a Next.js app deployed on Vercel. The music generation model runs on serverless GPUs — cold starts take about 40 seconds, but once warm, a 60-second track generates in about 10 seconds. Audio files are stored on Cloudflare R2 and served via CDN.

The smart part is the prompt pipeline. When you type “midnight soul train” the system doesn’t just pass that string to the music model. An LLM (Gemini Flash) first expands it into:

  • A title: “Midnight Soul Train”
  • A music caption: specific instruments, genre tags, vocal character, production style (150–500 characters, optimized for the music model’s training distribution)
  • Structured lyrics with [Verse], [Chorus], [Bridge] tags
  • Suggested BPM and key
  • A cover art prompt for the image generator

This preprocessing runs in about 2 seconds and dramatically improves output quality. The music model gets a precisely formatted input instead of a casual prompt, and the user gets a complete package instead of a bare audio file.
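A minimal sketch of that expansion step, assuming a generic `callLLM` helper wrapping whatever LLM API you use (the post uses Gemini Flash). The interface fields mirror the five outputs listed above; the system prompt text and validation thresholds are illustrative.

```typescript
interface ExpandedPrompt {
  title: string;
  caption: string;       // 150-500 chars: instruments, genre, production style
  lyrics: string;        // tagged with [Verse]/[Chorus]/[Bridge]
  bpm: number;
  key: string;
  coverArtPrompt: string;
}

const SYSTEM_PROMPT = `You are a music director. Given a casual song idea,
return JSON with keys: title, caption, lyrics, bpm, key, coverArtPrompt.
The caption must be 150-500 characters naming instruments, genre, vocal
character, and production style. Lyrics must use [Verse]/[Chorus]/[Bridge] tags.`;

async function expandPrompt(
  userPrompt: string,
  callLLM: (system: string, user: string) => Promise<string>,
): Promise<ExpandedPrompt> {
  const raw = await callLLM(SYSTEM_PROMPT, userPrompt);
  const parsed = JSON.parse(raw) as ExpandedPrompt;
  // Validate before handing off: a malformed caption degrades output
  // quality more than any generation setting does.
  if (parsed.caption.length < 150 || parsed.caption.length > 500) {
    throw new Error("caption outside the 150-500 character budget");
  }
  if (!/\[(Verse|Chorus|Bridge)\]/.test(parsed.lyrics)) {
    throw new Error("lyrics missing structure tags");
  }
  return parsed;
}
```

Validating the LLM’s JSON before it reaches the music model is what keeps one flaky expansion from wasting an expensive GPU generation.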

Roadblocks & Key Learnings

This wasn’t a smooth ride. A few things I hit along the way:

Serverless GPU cold starts are brutal. The music model needs ~11GB of VRAM. The first request after a cold start takes nearly 40 seconds just to load the model, on top of the actual generation time. Once warm, a 60-second song generates in about 10 seconds. The UX fix was an instant redirect after submitting — a background poller updates the UI when the track is ready, so you’re never staring at a spinner.
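The poller behind that UX fix can be sketched as below. `fetchStatus` and `onReady` are hypothetical stand-ins for the real status endpoint and UI callback; the timing defaults are sized to cover a 40-second cold start plus generation.

```typescript
type JobStatus = { state: "queued" | "running" | "done"; trackUrl?: string };

// Submit returns instantly and redirects; this runs in the background
// and updates the UI when the track is ready, so there is no spinner.
async function pollUntilReady(
  fetchStatus: () => Promise<JobStatus>,
  onReady: (url: string) => void,
  intervalMs = 2000,
  maxAttempts = 60, // ~2 minutes total at the default interval
): Promise<void> {
  for (let i = 0; i < maxAttempts; i++) {
    const status = await fetchStatus();
    if (status.state === "done" && status.trackUrl) {
      onReady(status.trackUrl);
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("generation timed out");
}
```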

GPU availability is the real infrastructure bottleneck. No single cloud GPU provider has reliable 24/7 availability for the tier of hardware you need (24GB+ VRAM). During peak hours, your preferred provider might simply have no machines. I had to build the inference layer to be provider-agnostic — a thin abstraction over RunPod, Modal, and others so the system can cascade between providers transparently. The actual API surface is almost identical (POST a job, poll for status, get output), but the operational reality of GPU scarcity was the thing I least expected. If you’re building anything that depends on GPUs at inference time, plan for multi-provider from day one.
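A thin abstraction like this is roughly what “provider-agnostic” means here. The `GpuProvider` interface is illustrative — RunPod’s and Modal’s real clients differ in detail — but both reduce to submit-a-job, poll, fetch output, which is what makes cascading feasible.

```typescript
interface GpuProvider {
  name: string;
  // Submit a generation job; resolves to a provider-side job id.
  submit(job: { caption: string; lyrics: string }): Promise<string>;
}

// Try each provider in priority order. During peak hours the preferred
// provider may simply have no machines, so failures cascade to the next.
async function submitWithFallback(
  providers: GpuProvider[],
  job: { caption: string; lyrics: string },
): Promise<{ provider: string; jobId: string }> {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      const jobId = await p.submit(job);
      return { provider: p.name, jobId };
    } catch (e) {
      errors.push(`${p.name}: ${(e as Error).message}`);
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}
```

The same pattern extends to polling and output retrieval; the point is that no caller above this layer knows or cares which provider actually ran the job.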

Getting the model to actually run is the hard part. You have the weights, you have the code, you have a GPU — and it still doesn’t work. Every provider has a different base image, a different CUDA version, different PyTorch builds. A model that runs perfectly on your local 4090 throws cryptic errors on a cloud instance because of a version mismatch in some transitive dependency you’ve never heard of. There’s no shortcut here. You build the Docker image, it fails, you read the traceback, you pin a version, you rebuild. Three, four, five iterations until it finally runs clean. And then you switch providers and half of it breaks again because their builder handles things differently. The lesson: budget time for this. It’s not a config step, it’s an engineering problem.

What’s Next

The music generation space is moving fast. Models that generate vocals, understand song structure, and produce studio-quality audio at 48kHz are now available under MIT licenses, trained on licensed and public domain data. The tools to build on top of them — serverless GPU hosting, edge CDNs, AI coding assistants — have gotten good enough that one person can ship a production platform.

I don’t think the question is whether AI music generation will go mainstream — it’s how fast the quality ceiling rises. Right now it’s good enough to be useful and fun. In a year, it might be indistinguishable from human-produced music.

Try Theory 808 and decide for yourself.