I've got on my to do list something similar, feed the raw wave data from some po...

jerf · on June 11, 2015

I'd suggest feeding the raw FFT into the RNN, then translating back to normal sound afterwards. Raw sound has an awful lot of just bouncing up and down... an RNN isn't going to be any better than you at figuring out what's going on from there, whereas the FFT view of a song is much more meaningful... I can glance at one and at least have an idea of what the song is doing.

I'm still not all that optimistic, frankly, but it'll be better than raw data. Some further experimentation with different encodings might be necessary.

That said, I'd bet $10 that just putting an FFT representation through wouldn't necessarily produce a new pop song, but it would sound uniquely spooky. If nothing else you might have something you can hook up to some speakers next Halloween and make some kids cry.
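For what it's worth, the round trip suggested above (sound -> frequency frames -> model -> sound) might look roughly like the sketch below. It is only an illustration: the frame size, the non-overlapping framing, and every function name are assumptions of mine, the model step is left as a pass-through, and a naive DFT stands in for a real FFT library so that the snippet runs with the standard library alone.

```python
import cmath
import math


def dft(frame):
    """Naive O(n^2) discrete Fourier transform of one frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT; keeps the real part, since the input audio was real."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def analyze(samples, size):
    """Cut the signal into non-overlapping frames and transform each one."""
    frames = [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]
    return [dft(f) for f in frames]


def synthesize(spectra):
    """Turn a sequence of spectra back into one long list of samples."""
    out = []
    for spectrum in spectra:
        out.extend(idft(spectrum))
    return out


# Toy signal: a sine wave with 3 cycles per 32-sample frame.
samples = [math.sin(2 * math.pi * 3 * t / 32) for t in range(128)]
spectra = analyze(samples, 32)
# A model would read and generate `spectra` here; this sketch passes them through.
rebuilt = synthesize(spectra)
```

Real pipelines usually use overlapping, windowed frames, which this sketch skips to stay short.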

kastnerkyle · on June 11, 2015

This is pretty hard - we use raw data for speech [1, talked about in comment above] but it still needs some work to do really good synthesis. FFT is not really the way to go either - then you still need to deal with the problems of complex-valued data, which is very, very unpleasant. Most people use FFT -> IDCT (cepstrum) or a filtered version (mel-frequency cepstral coefficients, MFCC). This can work but it takes a lot of domain knowledge.
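To make the complex-data point concrete, here is a small sketch showing that the transform of a real frame comes out complex, and how a cepstrum-style feature turns it back into plain real numbers. Assumptions to flag: it follows the usual textbook real cepstrum (log magnitude, then an inverse transform), it uses an inverse DFT rather than the IDCT mentioned above, and a naive DFT replaces a real FFT library so the snippet needs only the standard library. All names are illustrative.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) DFT; returns a list of complex bins."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(bins):
    """Naive inverse DFT; returns complex samples."""
    n = len(bins)
    return [
        sum(bins[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def real_cepstrum(x, eps=1e-12):
    """Inverse transform of the log magnitude spectrum (phase is discarded)."""
    log_mag = [math.log(abs(c) + eps) for c in dft(x)]
    # The log magnitude of a real signal is symmetric, so the result is
    # real up to rounding error; keep only the real part.
    return [c.real for c in idft(log_mag)]


n = 64
frame = [
    math.sin(2 * math.pi * 5 * t / n) + 0.5 * math.sin(2 * math.pi * 12 * t / n)
    for t in range(n)
]
spectrum = dft(frame)            # complex: each bin has magnitude and phase
cepstrum = real_cepstrum(frame)  # plain floats, easier to feed to a model
```

The catch is that dropping the phase makes the feature lossy, so getting back to audio needs extra work.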

One thing we tried in early testing, but did not pursue further, was vector quantized X (where X is MFCCs, LPC, LSF, FFT, cepstrum). Basically you use K-means to find clusters for some large K, then simply assign every real value (or real-valued vector) to the closest cluster. The cluster mapping becomes a codebook, and your problem goes from input vectors like [0.2, 0.7, 0.111, ...] to [0, 1, 0, ...] where the length of the vector of 0s and 1s is the number of clusters K.

This is a much easier learning problem, and closely corresponds to most "bag-of-words" or word-level models. The quantization is lossy but for large enough K I do not think it would be noticeable. After all, we listen to discrete audio every day, all the time in wav format :)
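A minimal sketch of the quantization step described above, assuming plain Lloyd-style K-means written with the standard library only. The toy 2-D "frames", K = 3, and all function names are made up for illustration; a real setup would use much larger K, real feature vectors, and a tuned library implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centers):
    """Index of the cluster center closest to vec."""
    return min(range(len(centers)), key=lambda i: sq_dist(vec, centers[i]))


def kmeans(vectors, k, iters=20, seed=0):
    """Basic K-means; returns the list of k cluster centers (the codebook)."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            buckets[nearest(v, centers)].append(v)
        for i, members in enumerate(buckets):
            if members:  # keep the old center if a cluster went empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def one_hot(index, k):
    """Length-k vector of 0s with a single 1 at position index."""
    return [1 if i == index else 0 for i in range(k)]


# Toy stand-in for real-valued feature frames (e.g. MFCC vectors).
data_rng = random.Random(1)
frames = [
    [data_rng.gauss(m, 0.05), data_rng.gauss(-m, 0.05)]
    for m in (0.0, 1.0, 2.0)
    for _ in range(50)
]

K = 3
codebook = kmeans(frames, K)
codes = [one_hot(nearest(f, codebook), K) for f in frames]
```

Each entry of `codes` is now a discrete symbol, so the sequence can be modeled the same way a word-level model treats tokens.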

To synthesize, you can either map codebook points back to the corresponding cluster center, or as most people do, map it to the cluster center with some small variance so you have a little bit of interesting variation.
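Both decoding options can be sketched in a few lines. This assumes a codebook stored as a list of cluster centers and a one-hot code per frame; the hard-coded codebook, the `sigma` parameter, and the use of independent Gaussian noise per dimension are illustrative choices of mine, not anything specified above.

```python
import random


def decode(one_hot_code, codebook, sigma=0.0, rng=random):
    """Map a one-hot code back to its cluster center.

    With sigma == 0 this returns the center itself; with sigma > 0 it adds
    a little Gaussian noise to each dimension for some variation.
    """
    center = codebook[one_hot_code.index(1)]
    return [c + rng.gauss(0.0, sigma) for c in center]


# Hypothetical 3-entry codebook of 2-D cluster centers.
codebook = [[0.0, 0.0], [1.0, -1.0], [2.0, -2.0]]

exact = decode([0, 1, 0], codebook)
jittered = decode([0, 1, 0], codebook, sigma=0.01, rng=random.Random(0))
```

How large `sigma` should be is a judgment call; too much noise would swamp the differences between neighboring clusters.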

[1] http://arxiv.org/abs/1506.02216

jerf · on June 11, 2015

Thank you for expanding on my uninformed, off-the-cuff comment like that.

chromaton · on June 11, 2015

I was going to try the same thing but ran into trouble installing the RNN software dependencies.

Please try it and let us know how it turns out, good or bad.