I Spent a Day with the Best Open-Source Music AI

ACE-Step v1.5 is an engineering marvel. But after a full day of testing, the problem isn't the model — it's the data it was trained on.

When ACE-Step v1.5 dropped last week, I cleared my schedule. This is arguably the most capable open-source music generation model available. I installed it and spent an entire day pushing it as hard as I could.

What I found wasn't a technology problem. The architecture is genuinely impressive. ACE-Step runs locally on my Mac, generates audio in seconds, and supports LoRA fine-tuning so you can train it on your own material. The engineering is extraordinary. The fact that this is open-source, free for anyone to run, study, and build on, is something worth celebrating.

But something is deeply wrong with the output. And the more I listened, the clearer it became that the issue isn't the model. It's what the model has been fed.

The data problem

Meta's MusicGen was explicitly trained on Shutterstock and Pond5 music data. ACE-Step doesn't name its sources, but lists "royalty-free / no-copyright music" and "synthetic data via MIDI-to-audio conversion" among its training material. Stable Audio was trained on stock library content. The pattern across open-source music AI is consistent: the models are learning from stock music and synthetic renders.

Here's the thing about stock libraries like Pond5 that the AI community doesn't seem to understand: there are no gatekeepers. Anyone can upload anything. No review process, no quality threshold, no editorial curation. You make a track, you upload it, it's in the library.

The result is a platform with millions of tracks, the vast majority of which have never been purchased. Not once. They sit at zero sales because they're not good enough to sell. A music supervisor looking for a trailer cue listens to the first three seconds and moves on. A video editor previewing background tracks skips past them instantly. The market has already passed judgment on this material. Nobody wanted it.

And when your other major data source is "synthetic MIDI-to-audio conversion," you're training on music that was never even performed. Just programmed and rendered through a synthesis pipeline.

This is the training data. Not the top-selling tracks. Not the material that was good enough to be licensed. The entire long tail of tracks that exist only because no one ever told their creators they couldn't upload them, plus MIDI files run through a converter.

What the model actually learned

I prompted ACE-Step for an orchestral trailer track. What came back sounded like a Yamaha PSR keyboard running an auto-accompaniment preset. Not because the model can't generate complex audio, but because it has learned orchestral music from people who don't know how to write it.

The brass was wrong. Not wrong in a subtle, taste-dependent way. The supposed-to-be-epic brass motifs sounded like Mexican trumpets. Every note had the same velocity and articulation, with no sense of how a brass section actually voices a chord. It sounded like a parody of a trailer track. Almost humorous.

The drums sounded like cheap MIDI drums from a 90s ROMpler, mixed too loud and too bright, sitting on top of the track in their own reverb space, completely disconnected from the rest of the arrangement.

These aren't artifacts of the generation process. These are the characteristics of the training data, faithfully reproduced. The model has learned that this is what orchestral music sounds like, because thousands of unsold stock tracks and synthetic MIDI renders told it so.

The errors become convention

At scale, these mistakes stop looking like mistakes. If hundreds of tracks use horrible sounds and bad instrumentation, the model learns those as convention. If thousands have badly mixed MIDI drums, the model learns that as how drums are supposed to sound. The errors become the norm. The model has no frame of reference for anything better.

I tried fine-tuning. I trained a LoRA adapter on my own productions, professionally recorded and carefully mixed. The training converged and the output was closer to a professional production. But the residue was still there underneath, like a stain you can't get out. A small dataset of professional work can't overwrite millions of bad tracks and synthetic renders baked into the base weights.
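
For a rough sense of scale, here is a back-of-the-envelope sketch of how little of a layer a LoRA adapter actually trains. LoRA freezes the base weight matrix and learns only two small low-rank matrices on top of it. The layer size and rank below are hypothetical numbers picked for illustration, not ACE-Step's actual configuration, and adapter capacity is only one plausible factor next to dataset size.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in one dense weight matrix (frozen during LoRA training)."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in a LoRA adapter: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only to make the arithmetic concrete.
d_out, d_in, rank = 1024, 1024, 16

trainable = lora_params(d_out, d_in, rank)
frozen = full_params(d_out, d_in)
fraction = trainable / frozen

print(f"trainable: {trainable}, frozen: {frozen}, fraction: {fraction}")
```

Under these assumed numbers the adapter touches a few percent of the layer. Everything the base model learned from its original training data stays exactly where it was.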

An open-source opportunity

None of this is a criticism of the teams building these models. The engineering is remarkable. What ACE-Step has achieved on consumer hardware would have seemed impossible two years ago. The architecture, the speed, the accessibility — all of it is genuinely groundbreaking work.

The problem is upstream. Every team in AI audio is racing to train on more data. Nobody is racing to train on better data. And the data that's freely available and easily licensed at scale — stock libraries with no quality filter, synthetic MIDI renders — is the data that produces output like what I heard.
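
To make "better data" concrete, here is a minimal sketch of the kind of filter I mean, assuming a catalog exposes a per-track sales count and a flag for synthetic renders. Both fields, the Track record, and the select_training_tracks helper are hypothetical. I don't know whether any stock library exposes this metadata, and sales are only a crude proxy for quality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    # Hypothetical catalog record; field names are illustrative only.
    track_id: str
    sales: int          # how many times the track was licensed
    is_synthetic: bool  # True for MIDI-to-audio renders


def select_training_tracks(catalog, min_sales=1):
    """Keep tracks licensed at least `min_sales` times that are not synthetic renders."""
    return [t for t in catalog if t.sales >= min_sales and not t.is_synthetic]


catalog = [
    Track("a", sales=0, is_synthetic=False),   # never purchased
    Track("b", sales=12, is_synthetic=False),
    Track("c", sales=0, is_synthetic=True),    # MIDI render
    Track("d", sales=3, is_synthetic=False),
]

kept = select_training_tracks(catalog)
print([t.track_id for t in kept])
```

Even a threshold of one sale would throw out the long tail described earlier. The cost is a much smaller dataset, which is exactly the trade nobody seems willing to make.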

The open-source community has solved the model problem. The data problem is still wide open. At this point, you'd get more value training the model on these tracks as examples of what not to do.