Composing with a Neural Net

The first two blog posts in this series on machine learning and music covered some of the basic conceptual considerations for analyzing music through the use of neural networks. In this post, I want to explore the creative, generative possibilities of neural networks.

Whereas I used Python in the previous two examples, in this example I’ll be using Max to get the audio side of things up and running more quickly. I’ll use a machine learning library for Max called ml.star, developed by Benjamin Day Smith. You can download it through the Package Manager (File -> Show Package Manager).

Just as with the analysis described in the previous post, it took me a little while to determine an application for machine learning in the compositional process that would be both manageable and interesting. In addition to these criteria, I wanted something that would function in real time (or close to it) and that would specifically use machine learning to simplify the choices and/or interface presented to the user, rather than overwhelming them with arcane parameters. I also imagined that a binary-encoded input layer would make things simple and easier to follow.

After some trial and error, I decided on developing a program to compose a drum sequence. However, instead of having the user input each sound individually (as with most sequencers), the user is presented with two phrases generated at random. They listen to the two phrases, and then click a button to indicate which they like better. If they prefer phrase #1, a new phrase #2 is generated, and vice versa. Over time, the idea is that the neural network will be able to infer the user’s preferences and generate phrases that more closely match the phrases the user has previously preferred.

I decided to use a very simple type of artificial neural network called a multilayer perceptron. A perceptron is a function that attempts to represent the relationship between its input and output by weighting corresponding nodes. While all perceptrons contain an input and output layer (the data to be evaluated and the evaluation, respectively), multilayer perceptrons are unique in that they contain one or more hidden layers that are weighted during the training phase. The ml.star package contains a multilayer perceptron object called [ml.mlp].

After some experimentation, I finally settled on a structure for the program. Each drum phrase would be an eight-step sequence consisting of three instrumental parts (clap, kick, and hi-hat samples from an 808). The program begins with a completely random pattern for each phrase and loops until the user makes a choice. When the user clicks the button indicating their preference, the preferred phrase remains as is and the other phrase is randomly modified. Both continue to loop. Each time the user makes a choice, that choice is passed along to the neural net. A perceptron used this way acts as a binary classifier: it is trained on targets of 0 or 1 that indicate whether something is a good match. In our case, a “match” is equivalent to the user’s preference. When passed to the neural net, the preferred phrase is accompanied by a high value (1), and the other phrase is given a low output value (0). This is, in a nutshell, how the computer learns.

A brief technical aside: multilayer perceptrons are defined by the size and number of layers they comprise. In this case, the input layer has 24 data points or features (3 parts * 8 steps), each of which can have a value of 0 (no note) or 1 (note). The output layer consists of a single data point. In the training set, this is clearly defined: 1 if it is preferred and 0 if it is not. In the testing (prediction) set, the output layer will give intermediate values which reflect a better match as they approach 1 and a worse match as they approach 0.
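For readers who prefer code to patch cords, the encoding described above can be sketched in a few lines of Python. This is only an illustration, not the Max patch: the part names, their order in the vector, and the helper names (`random_phrase`, `encode`, `training_pair`) are all assumptions made for the example.

```python
import random

PARTS = ("clap", "kick", "hihat")  # assumed names and ordering
STEPS = 8                          # eight-step sequence, as in the post

def random_phrase(rng):
    """Return a phrase as {part: [0 or 1 for each step]} with random hits."""
    return {part: [rng.randint(0, 1) for _ in range(STEPS)] for part in PARTS}

def encode(phrase):
    """Flatten a phrase into 24 binary features (3 parts * 8 steps)."""
    return [bit for part in PARTS for bit in phrase[part]]

def training_pair(preferred, rejected):
    """Label the preferred phrase 1 and the rejected phrase 0."""
    return [(encode(preferred), 1), (encode(rejected), 0)]

rng = random.Random(0)
phrase_a, phrase_b = random_phrase(rng), random_phrase(rng)
examples = training_pair(phrase_a, phrase_b)  # the user preferred phrase_a
print(len(examples[0][0]), "features per phrase")
```

Each click would contribute one such pair of labeled 24-element vectors to the training set.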

The number and size of the hidden layers are what give each neural net its unique performance and computational characteristics. There is no firm rule for the size of the hidden layers, but frequently they are somewhere in between the size of the input and output layers. Reviewing a wide range of applications suggested to me that a single hidden layer would be sufficient for a relatively simple problem such as this, so I chose a single hidden layer with a size of 12, about halfway between the input and output layer sizes. Experimenting with different parameters (easy to do by modifying the arguments of the [ml.mlp] object) did not yield dramatically different results.
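To make the 24-12-1 shape concrete, here is a minimal Python sketch of a forward pass through a network of that size. It is not a description of how [ml.mlp] works internally: the sigmoid activation, the random initial weights, and the function names are assumptions chosen for illustration, and no training happens here.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out, rng):
    """One row of weights per output node; the last entry in each row is the bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def layer_forward(weights, inputs):
    """Weighted sum of the inputs plus bias, squashed into the range 0..1."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + row[-1])
            for row in weights]

def predict(net, features):
    """Run 24 binary features through the hidden layer, then the output node."""
    hidden = layer_forward(net["hidden"], features)
    return layer_forward(net["output"], hidden)[0]

rng = random.Random(1)
net = {"hidden": init_layer(24, 12, rng), "output": init_layer(12, 1, rng)}
features = [rng.randint(0, 1) for _ in range(24)]
score = predict(net, features)
print(round(score, 3))  # untrained, so the value is arbitrary but lies between 0 and 1
```

Changing the hidden layer size only means changing the `12` in both `init_layer` calls, which mirrors how little effort it takes to vary the arguments in Max.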

After training the neural net, the last step is the generative process. After all, it’s one thing to analyze music using a neural net, but something else entirely to use it to create. My solution was to generate a corpus of random phrases for each round (i.e. each click), run them all through the neural net, and then choose the phrase with the highest output value (0-1). Theoretically, the phrase with the highest output value would be the best match for the target preferred phrase. Each time the user indicated their preference for one phrase or the other, the neural net would be updated and retrained, and then the newly generated random phrases would run through it. What this means, again at least in theory, is that the “random” phrases loaded each time should get closer and closer to the preferred phrase.
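The generate-and-select step can be sketched in Python as follows. Because a trained network is not available in a short example, the scoring function here is a hypothetical stand-in that simply measures how many steps match a target phrase; in the real program the score would be the neural net's output. The function names are likewise assumptions.

```python
import random

N_FEATURES = 24  # 3 parts * 8 steps

def generate_candidates(rng, n):
    """Build a corpus of n random phrases, each a list of 24 bits."""
    return [[rng.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(n)]

def pick_best(candidates, score):
    """Return the candidate with the highest score."""
    return max(candidates, key=score)

def similarity_to(target):
    """Stand-in scorer (NOT the neural net): fraction of steps matching target."""
    def score(features):
        matches = sum(1 for a, b in zip(features, target) if a == b)
        return matches / len(target)
    return score

rng = random.Random(2)
target = [rng.randint(0, 1) for _ in range(N_FEATURES)]
score = similarity_to(target)
winner = pick_best(generate_candidates(rng, 1000), score)
print(round(score(winner), 2))
```

Nothing here composes a phrase directly. The selection pressure comes entirely from scoring many random candidates and keeping the top one.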

I emphasize “in theory” because in machine learning, as in life, things don’t always work out exactly as we might expect. For example, in my first few trials, I found that the output value of the “winning” random phrase was getting progressively lower over time. In other words, the quality of these phrases was decreasing even as the machine was supposedly learning more and more. I found this completely baffling for a while, until I had an epiphany: as I had set it up, the neural network was doomed to sabotage itself.

To understand what I mean, let’s rewind a bit. Each time I clicked the button to indicate my preference, two new pieces of data were added to the training set: the preferred phrase (with a training value of 1), and the rejected phrase (with a training value of 0). Conceptually, it makes sense that this would produce better and better results over time. However, what ends up happening in practice is that as the random phrases match the preferred phrase more and more closely (which actually happens quite quickly), the positive and negative reinforcement tend to cancel each other out. Think about it this way: if you have a phrase that you reject because it is only a few notes off the model, there’s nothing to stop the neural net from assuming that the entire pattern is wrong. Consequently, all of the correct aspects of the phrase are also registered as wrong, and most of the positive weighting for the correct features in the preferred phrase is cancelled out. As a result, the neural net ends up focusing on the wrong features.
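A small Python example makes the conflict visible. The two phrases below are invented for illustration: the rejected one differs from the preferred one in only two steps, so the other 22 features are presented to the network once with a label of 1 and once with a label of 0.

```python
def shared_and_differing(preferred, rejected):
    """Split feature indices into those that agree and those that differ."""
    pairs = list(enumerate(zip(preferred, rejected)))
    shared = [i for i, (a, b) in pairs if a == b]
    differing = [i for i, (a, b) in pairs if a != b]
    return shared, differing

# Hypothetical preferred phrase (24 bits: clap, kick, hi-hat)
preferred = [1, 0, 0, 0, 1, 0, 0, 0,   # clap
             1, 0, 1, 0, 1, 0, 1, 0,   # kick
             1, 1, 1, 1, 1, 1, 1, 1]   # hi-hat

# A near miss: one extra clap and one missing kick
rejected = list(preferred)
rejected[1] = 1
rejected[10] = 0

shared, differing = shared_and_differing(preferred, rejected)
print(len(shared), "features receive contradictory labels")
print("only these steps actually differ:", differing)
```

With so much overlap, the label of 0 on the near miss pulls against almost everything the label of 1 on the preferred phrase was reinforcing.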

Ultimately, my (admittedly inelegant) solution was to only use the first handful of rejections as part of the training set. That way, as the rejected phrases got closer to the target phrase, they wouldn’t be marked as “wrong”—only the target phrase would be marked as “right.” This immediately produced better results, both observationally and in terms of the average output value for the randomly generated phrases over time. I settled on including only the first five rejections, though other variations are almost certainly possible.
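In Python terms, the workaround amounts to a training set that stops accepting negative examples after a fixed count. The sketch below assumes that the preferred phrase is still added on every click, as described earlier; the class and attribute names are made up for the example.

```python
MAX_REJECTIONS = 5  # the cutoff settled on in this post

class TrainingSet:
    """Collects (features, label) pairs, keeping only the earliest rejections."""

    def __init__(self, max_rejections=MAX_REJECTIONS):
        self.examples = []
        self.max_rejections = max_rejections
        self.rejections_seen = 0

    def add_choice(self, preferred, rejected):
        # The preferred phrase is always recorded as a positive example.
        self.examples.append((list(preferred), 1))
        # Rejected phrases count as negatives only for the first few clicks.
        if self.rejections_seen < self.max_rejections:
            self.examples.append((list(rejected), 0))
        self.rejections_seen += 1

training = TrainingSet()
for _ in range(8):  # simulate eight clicks with placeholder phrases
    training.add_choice([1] * 24, [0] * 24)

positives = sum(1 for _, label in training.examples if label == 1)
negatives = sum(1 for _, label in training.examples if label == 0)
print(positives, negatives)
```

Passing a different `max_rejections` value is an easy way to try the other variations mentioned above.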

Machine learning is a huge area to explore, and consequently this post leaves much unsaid. There are many further parameters that could be tweaked, such as how precisely to train the neural net: should it be based on a specific number of epochs, or a target minimum of error? In either case, what is the right number of epochs, or the best error to aim for? Likewise, changing the number of random phrases generated in each round has a significant impact on the results: the more I added, the better the results tended to be (I experimented with values from 10 to 10,000). Yet at the same time, generating (and therefore analyzing) more phrases slows down the system, potentially reducing its real-time applicability.
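The two stopping strategies, a fixed number of epochs or a target error, can be compared with a small Python sketch. The `run_epoch` callable is a hypothetical stand-in for one training pass: here the error simply halves each time, which real training will not do so neatly.

```python
def train(run_epoch, max_epochs=None, target_error=None):
    """Call run_epoch() until either limit is reached; return (epochs, last_error)."""
    if max_epochs is None and target_error is None:
        raise ValueError("set max_epochs, target_error, or both")
    epochs, error = 0, float("inf")
    while True:
        if max_epochs is not None and epochs >= max_epochs:
            break
        if target_error is not None and error <= target_error:
            break
        error = run_epoch()
        epochs += 1
    return epochs, error

def make_toy_epoch(start_error=1.0):
    """Stand-in for a real training pass: each call halves the error."""
    state = {"error": start_error}
    def run_epoch():
        state["error"] /= 2
        return state["error"]
    return run_epoch

by_epochs = train(make_toy_epoch(), max_epochs=2)
by_error = train(make_toy_epoch(), target_error=0.1)
print(by_epochs, by_error)
```

Setting both limits at once is often a practical compromise, since the epoch cap keeps a real-time system from stalling when the error target is never reached.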

In future posts I hope to explore these details—as well as other applications, such as pitch- and timbre-based matching systems. I am also interested in addressing the crossover between this approach and the reinforcement learning methods I’ve explored in previous posts. Until then, take care and check out the finished (for now) program below. (Click here to download Max.)

Download the software bundle (save in your Max file directory)