Music video generated by a neural net

Mario Klingemann fed a song into an image-generation neural network, and produced this music video.

I’ll just quote Mario’s explanation of how it works; rough code sketches of the two steps follow the quote:

This consists of several components: the images are generated by a neural network that is given a 4096 dimensional feature vector. What happens is that you give the NN 4096 numbers and it produces a 256x256 image out of those. In theory this network could produce any possible image, but in reality most vectors that you feed in produce garbage or one particular image (I call it the “fox with eye shadow” - sometimes you see it pop up because I didn’t really clean up my data). So the first task is to find feature vectors in this huge space of possibilities that produce images that look interesting or, even better, like something that humans would recognize as a certain object. In order to do that a second neural network is used to classify whatever the generator has produced. Using gradient descent the algorithm tries to tweak the feature vector so that the resulting image looks more and more like a certain category. For this clip I was more interested in abstract looking images so what I did is to stop the gradient descent before it got too concrete and save the feature vectors.

After I had about 1000 different vectors I moved on to step 2 which is making the music video. The idea here is that I want similar sounding samples to produce similar images. So what I did is to sample a song, transform it into frequency bands using FFT and then cluster the short snippets into 100 clusters using k-means. When I now play back a song it will use the learned k-means to give me a number between 0 and 100 for a certain frequency pattern. Surprisingly this works even for songs that are totally different from the one I trained the k-means on. That new number I get every frame becomes the index of my pre-calculated feature vectors which you can also see as a coordinate in 4096-dimensional space. That coordinate becomes the current target for my “playback particle” which tries to get from its current position in 4096-dim space to the new target. It uses a kind of gravity/spring physics to get there - or you could also see it like a mouse-follower script, so there is a bit of inertia in order to get those morphing transitions. Because that is the fascinating part of the latent space: you can interpolate between two feature vectors and will get a weird-smooth transition between the two images.
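The first step Mario describes, searching the 4096-dimensional space for vectors whose images a classifier recognizes, amounts to running gradient descent on the latent vector itself rather than on any network weights. Here is a minimal sketch of that idea, assuming PyTorch; `generator` and `classifier` are placeholder models standing in for his two networks, and the step counts and learning rate are invented values, not his settings:

```python
import torch

def find_feature_vector(generator, classifier, target_class,
                        steps=200, lr=0.05, stop_early_at=80):
    """Nudge a random 4096-d vector until the image the generator makes
    from it starts to score highly for target_class."""
    z = torch.randn(1, 4096, requires_grad=True)   # random point in latent space
    opt = torch.optim.Adam([z], lr=lr)
    for step in range(steps):
        img = generator(z)                   # e.g. a (1, 3, 256, 256) image tensor
        score = classifier(img)[0, target_class]
        loss = -score                        # descending on -score pushes the score up
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step >= stop_early_at:            # stop before the image gets too "concrete"
            break
    return z.detach()
```

Repeating this with different target classes and random starting points would build up the pool of roughly 1000 saved vectors he mentions.

The second step, driving those vectors from audio, can be sketched in the same spirit: FFT magnitudes per audio frame, a 100-cluster k-means over them, and a spring-damper “particle” that chases whichever precomputed vector the current cluster index points at. This assumes numpy and scikit-learn; the frame size, stiffness, and damping are guesses, and since the quote doesn’t say how the ~1000 saved vectors map onto 100 cluster indices, the sketch simply indexes into whatever list it is given:

```python
import numpy as np
from sklearn.cluster import KMeans

FRAME = 2048  # audio samples per analysis frame (an assumed value)

def spectra(samples):
    """Chop a 1-D audio signal into frames and take FFT magnitudes per frame."""
    n = len(samples) // FRAME
    frames = samples[:n * FRAME].reshape(n, FRAME)
    return np.abs(np.fft.rfft(frames, axis=1))

def train_clusterer(training_song, n_clusters=100):
    """Cluster one song's frequency patterns into 100 groups."""
    return KMeans(n_clusters=n_clusters).fit(spectra(training_song))

def latent_path(playback_song, kmeans, feature_vectors,
                stiffness=0.02, damping=0.9):
    """Yield one 4096-d latent position per audio frame.

    Each frame's spectrum picks a cluster index, the index picks a target
    vector, and a spring-damper "particle" chases that target, so the
    rendered frames morph instead of jumping.
    """
    position = np.zeros(4096)
    velocity = np.zeros(4096)
    for spectrum in spectra(playback_song):
        idx = int(kmeans.predict(spectrum[None, :])[0])        # 0..n_clusters-1
        target = feature_vectors[idx % len(feature_vectors)]
        velocity = damping * velocity + stiffness * (target - position)
        position = position + velocity
        yield position   # feed this vector to the image generator for the frame
```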

Note that last point: the high-dimensional latent space lets you smoothly interpolate between feature vectors, so the video morphs between images rather than cutting between them. The whole thing is also designed around giving order to chaotic images, which shouldn’t be a surprise with this artist.
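The simplest form of that idea is plain linear interpolation between two saved vectors; rendering each in-between vector through the generator gives the morph. (The video itself uses the spring follower sketched above rather than a fixed-length blend; this is just an illustration.)

```python
import numpy as np

def morph(z_a, z_b, n_frames=30):
    """Yield n_frames vectors blending linearly from z_a to z_b.

    Feeding each blended vector to the image generator produces a smooth
    morph between the two corresponding images.
    """
    for t in np.linspace(0.0, 1.0, n_frames):
        yield (1.0 - t) * z_a + t * z_b
```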

Be warned that the videos have some rapid flashing (I slowed the GIF above down a bit).

An experimental music video clip generated by a neural network
Another music video generated by a neural network
A generated music video based on extracted poses from still images