Mapping Gestures to Sound with a Neural Net

In this post, I want to discuss a slightly different usage for neural networks than has been explored so far. Rather than using neural nets to define relationships between strictly musical data (as in the previous post), here I will use them to define relationships between gestures and sounds.

Gesture recognition—the interpretation of human gestures by computers—is an important topic for the application of machine learning methods. Of course, what constitutes a “gesture” can vary widely depending on the context. In this post, I’ll be imagining the movement of a hand, fingertip, or other single point in a vertically oriented, two-dimensional plane. (Imagine tracing a shape on a steamed-up bathroom mirror.)

Just as in the previous post, I’ll use the library for Max developed by Benjamin Day Smith. You can download it through the Package Manager (File -> Show Package Manager). As before, we’ll also use a simple type of neural network called a multilayer perceptron (mlp), represented by the object [ml.mlp].

The basic idea for this patch is that each gesture will be represented as a list of numbers. When training the neural net, the user will specify a gesture label, which will then be associated with that particular list. I will use eight distinct labels, though this is easily modifiable. Over time, the idea is that the neural net can generalize what is distinctive about each gesture. Once the training is complete, the neural net will be able to categorize new gestures by assigning one of these labels to each.

There are many different ways of defining gestures. For example, gestures can be dynamic (moving) or static (not moving). I’m especially interested in dynamic gestures, since their beginnings and endings can be recognized to automatically tell the computer when a gesture is starting or ending. To keep things simple, I will divide up the two-dimensional plane into eight columns and characterize each gesture according to the average values in these columns. (The fact that I have eight labels and eight columns is entirely coincidental—these do not have to be the same number.) The user will sweep their hand back and forth across the plane horizontally, with the variation in vertical position defining each gesture.

The image above illustrates a typical gesture, with the continuous human-generated gesture given by the black line, and the mean-based simplification given by the eight blue bars. (There is a slight horizontal discrepancy between the display position of the bars and the data segments they represent.)

Gesture recognition typically involves a lot of pre-processing because you have to transform human gestures—which are typically complex and time-based—into an input format that a neural network can recognize and work with. In this case, the eight vertical columns are the eight data points that make up the input layer of the neural network. The output layer will have eight points as well, but these will represent the eight possible gesture categories. These categories, it should be emphasized, will be completely user-defined: whatever the user labels “Gesture 1” will become gesture 1, etc.
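Outside of Max, the relationship between the eight user-defined labels and the eight output points can be sketched in Python. This is only an illustration: the function names are hypothetical, and the way [ml.mlp] represents labels internally is not described here, so the one-hot target below is an assumption.

```python
import numpy as np

N_LABELS = 8  # eight gesture categories, as in the patch


def label_to_target(label, n_labels=N_LABELS):
    """Encode a 1-based gesture label as a target vector for the output layer."""
    target = np.zeros(n_labels)
    target[label - 1] = 1.0
    return target


def target_to_label(output):
    """Return the 1-based label whose output value is highest."""
    return int(np.argmax(output)) + 1
```

During training, "Gesture 3" would be paired with a vector whose third entry is 1; during prediction, the label is read back from whichever output point is largest.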

There were a handful of other minor processing details. For example, the [lcd] object used for drawing the line counts y values from the top down. In order to not confuse myself, I flipped this by running the xy coordinate list output through a [vexpr] object with a simple expression. I also had to calculate the mean of each segment separately, which involved sorting the values from the “dump” output of the [coll] object storing the line’s coordinates into eight bins of equal size. I ended up solving this in reasonably elegant fashion by scaling the x values to the range of one to eight and using them as the control for an 8-outlet [gate], with each output leading to a separate [mean] object.
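For readers who think more easily in code than in patch cords, the same pre-processing (flip the y axis, sort points into eight columns, average each column) might look roughly like this in Python. The function name, the normalization to 0..1, and the choice to leave empty columns at 0 are all assumptions made for this sketch; they are not taken from the patch.

```python
import numpy as np


def gesture_to_features(points, width, height, n_columns=8):
    """Reduce a drawn gesture to one mean height per column.

    points: sequence of (x, y) pairs with y counted from the top down.
    """
    pts = np.asarray(points, dtype=float)
    x = pts[:, 0]
    y = height - pts[:, 1]  # flip y so that larger values mean "higher"
    # Map each x position to a column index in 0..n_columns-1
    cols = np.clip((x / width * n_columns).astype(int), 0, n_columns - 1)
    features = np.zeros(n_columns)
    for c in range(n_columns):
        in_col = y[cols == c]
        if in_col.size:  # empty columns stay at 0 (an assumption)
            features[c] = in_col.mean() / height  # normalize to 0..1
    return features
```

The resulting eight numbers are what would be handed to the input layer.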

When in prediction mode, the patch gives not only the most likely label for a gesture, but also the likelihood for each label (as a multicolored [multislider] object). The final version of this patch, while functional and easy to use, also has plenty of room for improvement. For example, the inner workings of the neural net remain hidden to the user so as not to clutter the interface, but this also prevents the user from adjusting the inner structure to produce better predictions. The patch would also benefit from some gesture “filtering,” by which gestures are not recognized unless they pass across all eight columns. This will become especially important when we link it up with a computer vision system such as the Leap Motion in a later post.
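The proposed filtering step is simple to express in code. The following Python sketch (with a hypothetical function name, and assuming x positions between 0 and the drawing width) accepts a gesture only if at least one of its points lands in every column:

```python
def covers_all_columns(x_values, width, n_columns=8):
    """Return True only if the gesture has a point in every column."""
    visited = set()
    for x in x_values:
        # Column index in 0..n_columns-1; x == width falls into the last column
        col = min(int(x / width * n_columns), n_columns - 1)
        visited.add(col)
    return len(visited) == n_columns
```

A gesture that fails this check would simply be ignored rather than passed to the neural net.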

Download the code bundle here.

Composing with a Neural Net

The first two blog posts in this series on machine learning and music covered some of the basic conceptual considerations for analyzing music through the use of neural networks. In this post, I want to explore the creative, generative possibilities of neural networks.

Whereas I used Python in the previous two examples, in this example I’ll be using Max to get the audio side of things up and running more quickly. I’ll use a machine learning library for Max developed by Benjamin Day Smith. You can download it through the Package Manager (File -> Show Package Manager).

Just as with the analysis described in the previous post, it took me a little while to determine an application for machine learning in the compositional process that would be both manageable and interesting. In addition to these criteria, I wanted something that would function in real time (or close to it) and that would specifically use machine learning to simplify the choices and/or interface presented to the user, rather than overwhelming them with arcane parameters. I also imagined that a binary-encoded input layer would make things simple and easier to follow.

After some trial and error, I decided on developing a program to compose a drum sequence. However, instead of having the user input each sound individually (as with most sequencers), the user is presented with two phrases generated at random. They listen to the two phrases, and then click a button to indicate which they like better. If they prefer phrase #1, a new phrase #2 is generated, and vice versa. Over time, the idea is that the neural network will be able to infer the user’s preferences and generate phrases that more closely match the phrases the user has previously preferred.

I decided to use a very simple type of artificial neural network called a multilayer perceptron. A perceptron is a function that attempts to represent the relationship between its input and output by weighting corresponding nodes. While all perceptrons contain an input and output layer (the data to be evaluated and the evaluation, respectively), multilayer perceptrons are unique in that they contain one or more hidden layers that are weighted during the training phase. The package contains a multilayer perceptron object called [ml.mlp].

After some experimentation, I finally settled on a structure for the program. Each drum phrase would be an eight-step sequence consisting of three instrumental parts (clap, kick, and hi-hat samples from an 808). The program begins with a completely random pattern for each phrase and loops until the user makes a choice. When the user clicks the button indicating their preference, the preferred phrase remains as is and the other phrase is randomly modified. Both continue to loop. Each time the user makes a choice, that choice is passed along to the neural net. A perceptron acts as a binary classifier: its output (0 or 1) indicates whether something is a good match. In our case, a “match” is equivalent to the user’s preference. When passed to the neural net, the preferred phrase is accompanied by a high output value (1), and the other phrase is given a low output value (0). This is, in a nutshell, how the computer learns.

A brief technical aside: multilayer perceptrons are defined by the size and number of layers they comprise. In this case, the input layer has 24 data points or features (3 parts * 8 steps), each of which can have a value of 0 (no note) or 1 (note). The output layer consists of a single data point. In the training set, this is clearly defined: 1 if it is preferred and 0 if it is not. In the testing (prediction) set, the output layer will give intermediate values which reflect a better match as they approach 1 and a worse match as they approach 0.
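To make the 24-feature encoding concrete, here is a small Python sketch of it. It is not the Max patch itself: the part order, the function names, and the use of NumPy's random generator are assumptions chosen for illustration.

```python
import numpy as np

PARTS = ("clap", "kick", "hihat")  # order is an assumption
STEPS = 8


def random_phrase(rng):
    """Return a random 3 x 8 grid of 0s (no note) and 1s (note)."""
    return rng.integers(0, 2, size=(len(PARTS), STEPS))


def phrase_to_input(phrase):
    """Flatten a 3 x 8 phrase into the 24 features of the input layer."""
    return np.asarray(phrase).reshape(-1)


def training_pair(preferred, rejected):
    """Build the two training examples produced by a single click."""
    X = np.stack([phrase_to_input(preferred), phrase_to_input(rejected)])
    y = np.array([[1], [0]])  # 1 = preferred, 0 = rejected
    return X, y
```

Each click therefore contributes two rows of 24 binary features and two target values.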

The number and size of the hidden layers are what gives each neural net its unique performance and computational characteristics. There is no firm rule for the size of the hidden layers, but frequently they are somewhere in between the size of the input and output layers. Reviewing a wide range of applications suggested to me that a single hidden layer would be sufficient for a relatively simple problem such as this, so I chose a single hidden layer with a size of 12—about halfway between the input and output layer sizes. Experimenting with different parameters—easy to do by modifying the arguments of the [ml.mlp] object—did not yield dramatically different results.
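The 24-12-1 layout can be written out as a forward pass in a few lines of Python. This is a generic sketch, not a description of how [ml.mlp] works internally: the sigmoid activation, the random initial weights, and all names here are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_mlp(rng, n_in=24, n_hidden=12, n_out=1):
    """Create randomly initialized weights for a 24-12-1 network."""
    return {
        "W1": rng.normal(0, 0.5, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0, 0.5, size=(n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }


def forward(params, x):
    """Map 24 binary features to a single value between 0 and 1."""
    hidden = sigmoid(x @ params["W1"] + params["b1"])
    return sigmoid(hidden @ params["W2"] + params["b2"])
```

Changing `n_hidden` here corresponds to changing the hidden layer size in the object's arguments.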

After training the neural net, the last step is the generative process. After all, it’s one thing to analyze music using a neural net, but something else entirely to use it to create. My solution was to generate a corpus of random phrases for each round (i.e. each click), run them all through the neural net, and then choose the phrase with the highest output value (0-1). Theoretically, the phrase with the highest output value would be the best match for the target preferred phrase. Each time the user indicated their preference for one phrase or the other, the neural net would be updated and retrained, and then the newly-generated random phrases would run through it. What this means—again, at least in theory—is that the “random” phrases loaded each time should get closer and closer to the preferred phrase.
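The generate-and-select step can be sketched as follows. Here `score_fn` is a hypothetical stand-in for whatever maps a 24-feature phrase to the trained network's output value; the function name and default corpus size are also assumptions.

```python
import numpy as np


def best_random_phrase(score_fn, rng, n_candidates=1000, n_features=24):
    """Generate random phrases and return the one the scorer rates highest."""
    candidates = rng.integers(0, 2, size=(n_candidates, n_features))
    scores = np.array([score_fn(c) for c in candidates])
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])
```

In the real system, this function would be called again after every click, once the network has been retrained on the updated training set.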

I emphasize “in theory” because in machine learning, as in life, things don’t always work out exactly as we might expect. For example, in my first few trials, I found that the output value of the “winning” random phrase was getting progressively lower over time. In other words, the quality of these phrases was decreasing even as the machine was supposedly learning more and more. I found this completely baffling for a while, until I had an epiphany: as I had set it up, the neural network was doomed to sabotage itself.

To understand what I mean, let’s rewind a bit. Each time I clicked the button to indicate my preference, two new pieces of data were added to the training set: the preferred phrase (with a training value of 1), and the rejected phrase (with a training value of 0). Conceptually, it makes sense that this would produce better and better results over time. However, what ends up happening in practice is that as the random phrases match the preferred phrase more and more closely (which actually happens quite quickly), the positive and negative reinforcement tend to cancel each other out. Think about it this way: if you have a phrase that you reject because it is only a few notes off the model, there’s nothing to stop the neural net from assuming that the entire pattern is wrong. Consequently, all of the correct aspects of the phrase are also registered as wrong, and most of the positive weighting for the correct features in the preferred phrase is cancelled out. As a result, the neural net ends up focusing on the wrong features.

Ultimately, my (admittedly inelegant) solution was to only use the first handful of rejections as part of the training set. That way, as the rejected phrases got closer to the target phrase, they wouldn’t be marked as “wrong”—only the target phrase would be marked as “right.” This immediately produced better results, both observationally and in terms of the average output value for the randomly generated phrases over time. I settled on including only the first five rejections, though other variations are almost certainly possible.
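In code, this workaround amounts to a training set that always records the preferred phrase but stops recording rejections after a fixed number. The class below is a hypothetical Python sketch of that bookkeeping, not the Max implementation.

```python
class PreferenceTrainingSet:
    """Collect training examples, keeping only the first few rejections."""

    def __init__(self, max_rejections=5):
        self.max_rejections = max_rejections
        self.inputs = []
        self.targets = []
        self.rejections_seen = 0

    def add_choice(self, preferred, rejected):
        # The preferred phrase is always added with a target of 1
        self.inputs.append(list(preferred))
        self.targets.append(1)
        # Rejected phrases are added with a target of 0 only early on
        if self.rejections_seen < self.max_rejections:
            self.inputs.append(list(rejected))
            self.targets.append(0)
        self.rejections_seen += 1
```

After seven clicks, for example, the set would hold seven positive examples but only five negative ones.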

Machine learning is a huge area to explore, and consequently this post leaves much left unsaid. There are many further parameters that could be tweaked, such as how precisely to train the neural net: should it be based on a specific number of epochs, or a target minimum of error? And if so, what is the right number of epochs, or the best error to aim for? Likewise, changing the number of random phrases generated in each round has a significant impact on the results—the more I added, the better the results tended to be (I experimented with values from 10 to 10,000). Yet at the same time, generating (and therefore, analyzing) more phrases slows down the system, potentially reducing its real-time applicability.

In future posts I hope to explore these details—as well as other applications, such as pitch- and timbre-based matching systems. I am also interested in addressing the crossover between this approach and the reinforcement learning methods I’ve explored in previous posts. Until then, take care and check out the finished (for now) program below. (Click here to download Max.)

Download the software bundle (save in your Max file directory)

Building a Neural Net

In the previous post, I described the process by which I reached the question I’d like to use an artificial neural network to answer: whether a given major triad in pitch space (i.e. any voicing) is a C major triad. In this post, I’ll describe the implementation.

As I mentioned in the previous post, I used this tutorial as the basis for creating my own neural network. In the tutorial’s example, the input layer has three nodes. Since I will be working in the pitch space defined by the MIDI note numbers 0-127, I need 128 nodes in my input layer. Each of these nodes has a value of 1 if that note is contained within the chord, and 0 if not. This is a variant of what’s known as one-hot encoding (strictly speaking, a “multi-hot” encoding, since several nodes can be 1 at once). My output consists of a single node, which will approach 1 if the input is recognized as a C major triad and 0 if it is not.

While it’s easy to generate examples when each sample has only three nodes, with 128 nodes I needed a way to automate example generation. I wrote the following function to generate a training set of arbitrary size that would contain only major triads, but with a variety of voicings, inversions, and pitch cardinalities (using the numpy library):

import random
import numpy as np

def make_training_set(set_size):
    # One row per example; one column per MIDI note number (0-127)
    arr = np.zeros((set_size, 128))
    for row in arr:
        transpose = random.randint(1, 12)  # random transposition in semitones
        for i in range(len(row)):
            # Each chord tone (root, major third, perfect fifth) is
            # included with a probability of 1 in 4
            if (i + transpose) % 12 in (0, 4, 7):
                if random.randint(1, 4) == 1:
                    row[i] = 1
    return arr

Each example in the training set must be accompanied by the correct answer so that the model can “learn” by shifting weights and minimizing the error. This is also easily automatable, along the same lines as described in the previous post. This function takes the training set as input and generates a list of outputs in the same format used in the tutorial (i.e. a numpy array comprising a single column of 0s and 1s):

def make_output_list(training_list):
    temp_list = []
    for each_row in training_list:
        possible_triad = []
        for note in range(len(each_row)):
            if each_row[note] == 1:
                possible_triad.append(note % 12)  # append pitch class
        # remove duplicates and sort so the comparison is order-independent
        possible_triad = sorted(set(possible_triad))
        if possible_triad == [0, 4, 7]:  # if C major...
            temp_list.append(1)  # ...output 1
        else:
            temp_list.append(0)  # if not, output 0
    # reshape into a single column of 0s and 1s, as in the tutorial
    return np.array(temp_list, dtype=int).reshape(-1, 1)

The code in the tutorial for the neural network itself is highly generalizable, and only needed one tweak in the __init__ function to be adapted to the new input layer size. This line, which initializes the weights at 0.50 for each node:

self.weights = np.array([[.50], [.50], [.50]])

Becomes this:

self.weights = np.array([[i] for i in np.full((1,128), 0.50)[0]])

Again, because we are working with much larger samples (128 nodes vs. 3 nodes in the tutorial), it makes sense to automate the array of weights rather than specifying them manually. It’s worth pointing out that in many cases, weights are initialized randomly, rather than all at a particular value. I haven’t tested whether this makes a difference in the performance of the ANN, but it might be worth exploring in the future.
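As a side note, the same constant-valued column of weights can be built more directly with `np.full`, and a random initialization is a one-line change. The sketch below assumes a uniform range of -1 to 1 for the random case; that range is an illustrative choice, not something taken from the tutorial.

```python
import numpy as np

# The list-comprehension form and np.full((128, 1), ...) give the same array
original_style = np.array([[i] for i in np.full((1, 128), 0.50)[0]])
weights_constant = np.full((128, 1), 0.50)

# Random initialization (the range is an assumption for this sketch)
rng = np.random.default_rng(1)
weights_random = rng.uniform(-1.0, 1.0, size=(128, 1))
```

Either array could be assigned to `self.weights`, which would make it straightforward to compare the two approaches.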

I decided to organize the code as a script that would ask for the training set size from the user and then automatically generate the training set and train the model. The basic structure is given as follows (import statements and definition of the NeuralNetwork class are omitted):

print('Enter size of desired training set: ')
training_set_size = input()

inputs = make_training_set(int(training_set_size))
outputs = make_output_list(inputs)

NN = NeuralNetwork(inputs, outputs)

The last thing I added to the script was a function that would make generating examples for the prediction (testing) phase easier. While one-hot encoded data is highly machine-readable, it is not very human-readable. Therefore, I created a function that allows a user to input the MIDI note numbers, and then automatically converts these into one-hot format when the example is passed to the neural network as input:

def create_example(note_list):
 ex_out = np.zeros((1,128))
 for midi_nn in note_list:
  ex_out[0][midi_nn] = 1
 return ex_out

When you run the script (in the enclosed ZIP file), you will first be prompted to set the size of the desired training set. I started out with a training set of 1000 examples—a relatively small set as far as machine learning goes. Depending on the size of the set and your computer setup, running the script may take a few seconds to a few minutes.

Once it’s complete, you can generate examples to test using the neural net. Some examples are provided in the accompanying document. For instance, let’s try a closely spaced C major chord with the root on middle C (don’t forget to call the “create_example” function to correctly encode your input):

example_1 = create_example([60, 64, 67])

Then use the predict() method to see whether the neural net recognizes this example as a C major chord or not (remember, values closer to 1 mean more likely C major; values closer to 0 mean more likely not):


We get a prediction value of above 0.99—very close to 1, and therefore a very confident prediction that this chord is C major.

Now let’s try a chord that’s definitely not a triad of any kind—in fact, it’s a dissonant cluster:

example_2 = create_example([59, 36, 49])


Our prediction value is 0.00008927—a value extremely close to zero—indicating confidence that this example is not a C major triad.

A third example, a D major triad, also produces a low prediction value of 0.000178:

example_3 = create_example([62, 66, 69])


In other words, it appears that our neural net basically works. The next step will be optimizing various parameters, including the size of the training set and the number of epochs of training. We can also track the error over time to see how quickly the model learns. Until the next post, enjoy exploring the code below:

Download the code bundle here.

Introduction to Using Neural Nets

Many of my recent posts have explored how machine learning techniques can be integrated into creative and analytical musical projects. Today, I continue to explore this area by introducing artificial neural networks as a tool for analyzing music. I should preface this by stating that I am by no means an expert in this area. However, I find the possibilities fun to work out and fascinating in their implications. Writing up my experimentations also helps me clarify my own thinking and suggests avenues for improvement or further discovery.

I started out by doing a lot of research on machine learning online. Some of the best resources I’ve come across include Andrew Ng’s video series on machine learning and 3Blue1Brown’s YouTube channel. I also looked into previous research specifically using machine learning to synthesize and analyze music. I found the work of David Cope and Roger Dannenberg (here and here) especially helpful in framing and evaluating research questions pertaining specifically to music. This paper surveying AI methods in algorithmic composition has also been extremely helpful.

The most popular tools for implementing machine learning techniques tend to advertise their ease of use. This is a good thing in general, but for my own purposes (namely, learning) it was not ideal, since much of the detail of the actual operation is hidden away from me as a user (see “blackboxing”). Consequently, instead of starting with a well-known package like Keras or scikit-learn, I found a terrific tutorial that walked through building a neural network from individual functions; in other words, almost from scratch.

After going through the tutorial myself and making sure everything worked as expected, I turned my attention to finding an appropriate musical question that a neural net would be suited to solve. This turned out to be more challenging than I expected, as many of the questions that first occurred to me were actually better suited to other methodologies. My (admittedly incomplete) understanding is that the “sweet spot” for the use of ANNs lies in questions that are difficult to define using concrete rules, but that ultimately rely on consistent patterns.

Thinking in this way, I found that many of the fundamental patterns of music (as understood through Western music theory) are too well-defined to be appropriate for machine learning methods. For example, specific chords and set classes can be identified through reasonably simple rules and algorithms. When machine learning methods (such as ANNs) are applied to questions like these, they tend to “overfit” the training data—as it is sometimes put, they “memorize” the training data, rather than “learning” from the underlying trends. Because of this, they tend to perform poorly on new data.

To illustrate what I mean, let’s imagine we want to use an artificial neural network to determine whether a given chord is a major chord. Our data set will comprise random chords in pitch space (I’ll use the MIDI note numbers 0-127 to refer to specific pitches). Major chords are extremely well-defined from a mathematical perspective: they comprise a set of three pitch classes separated by specific, unchanging intervals (major third, minor third, perfect fourth). If we perform a modulo 12 operation on all of the notes in each chord in our data set, we can measure the constituent intervals and determine the quality of the chord with certainty. If the chord follows the rules, it is a major chord; if it doesn’t, it’s not.
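The rule-based test described here fits in a few lines of Python, which is part of the point. The function name is hypothetical, and the sketch treats any voicing or inversion of a major triad as a major chord.

```python
def is_major_chord(midi_notes):
    """Return True if the notes reduce to exactly one major triad."""
    pitch_classes = {note % 12 for note in midi_notes}
    if len(pitch_classes) != 3:
        return False
    # Try each pitch class as the root and measure intervals above it
    for root in pitch_classes:
        if {(pc - root) % 12 for pc in pitch_classes} == {0, 4, 7}:
            return True
    return False
```

No training data is needed: the function is correct for every possible input by construction.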

At first glance, we might imagine that an ANN would be able to infer these rules given enough training examples. However, ANNs work by assigning weights to the connections between nodes (i.e. data points), and in pitch space, there is no correlation between a particular MIDI note number and the quality of the chord. In other words, a C major chord might contain the note C4 (MIDI note number 60)—though not necessarily—and a D major chord will never contain C4. Therefore, it is unclear how a neural network would successfully weight this note (and by extension, every other note). Instead, it is likely to “memorize” the chords in the training set, rather than “learn” any consistent underlying principle, as is the objective in machine learning.

One possible solution is careful preprocessing of the data in order to ensure that individual data points would be meaningful to the neural network when used as input. Yet this ultimately reveals how well-defined major chords are and, consequently, how unnecessary machine learning would be in this scenario. For example, we might preprocess the data by applying a modulo 12 operation to each note. This would simplify the examples considerably: from 128 different possible data points (i.e. pitches) to 12 pitch classes. Yet we encounter the same problem as above: the note C (pitch class 0) will be present in some major chords but not others. We might expand the preprocessing stage to transpose and/or invert all chords so they are based on C (this is loosely analogous to normalizing or centering the training data). However, at that point a simple comparison between the example chord and a model set of (0,4,7) would always produce correct results, rendering a machine learning approach completely superfluous.
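The modulo 12 preprocessing step can be sketched in Python as a function that folds a 128-element note vector into 12 pitch classes. The function name is hypothetical; the input format (1 where a MIDI note is present, 0 elsewhere) follows the description above.

```python
import numpy as np


def fold_to_pitch_classes(note_vector):
    """Reduce a 128-element binary note vector to 12 pitch classes."""
    flat = np.asarray(note_vector).reshape(-1)
    pitch_classes = np.zeros(12, dtype=int)
    for midi_nn in np.flatnonzero(flat):
        pitch_classes[midi_nn % 12] = 1
    return pitch_classes
```

Folding a C major triad switches on pitch classes 0, 4, and 7, while a D major triad switches on 2, 6, and 9. Pitch class 0 is present in one major chord and absent from the other, which is exactly the problem described above.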

The research question I finally decided on was determining whether a given major triad in pitch space was a C major triad. Intuitively, this seemed like an appropriate problem for an ANN because there is consistency with respect to the individual data points (each data point is either potentially part of a C major triad or not), but examples will vary greatly in terms of the octave and number of component pitches, making it (somewhat) difficult to describe in a finite number of rules. In the next post, I’ll jump into the code and implementation.

Experimental Genre Associations

This post summarizes one part of a larger digital humanities project on the use of the term "experimental" to describe music. For more on this project, including the data and code, see my GitHub.

The term “experimental” is used in discussions of musical genre in two contradictory ways: (1) to describe music that does not fit into any existing category, and (2) as a qualifier to describe music that occupies an aesthetically marginal position within a category (similar to the term “avant-garde”). To better understand the latter usage, I designed this quantitative project to identify the genre associations of musicians considered to be “experimental” using data from Wikipedia. I found that experimental musicians were most likely to also be categorized under rock music.

Because “experimental” is not consistently recognized as a genre in and of itself, instead of using genre labels I used a list of 184 experimental musicians from this page.

I used BeautifulSoup to parse the list and obtain the web addresses for each musician’s Wikipedia page. By scraping each musician’s Wikipedia page, I generated a list of 554 genre labels comprising 159 unique entries. I removed 93 results of “None,” in addition to one entry that was an editorial indication rather than a genre (“Edit section: Genres”), resulting in a dataset of 460 labels, of which 157 were unique.

As a final step, I consolidated the 157 unique labels into a handful of larger genre categories using substring matching. I began with well-established genre labels including hip hop/rap, pop, classical, rock, jazz, electronic, and dance. For hip hop/rap I combined the results of the substrings “hip hop” and “rap”; for electronic I used the substring “electro” (rather than “electronic”) so as to encompass words like “electroacoustic.” Next, I analyzed the list to determine if other terms were especially prevalent, and therefore warranted consolidation so as to be compared with the larger categories. I found that the terms metal, punk, and industrial were especially prevalent, and added these as well for a total of ten categories for comparison.
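The consolidation step can be illustrated with a short Python sketch. The substrings follow the description above, but the function name and the decision to let one label count toward several categories (for example, a label containing both “industrial” and “metal”) are assumptions; the project's actual code is on GitHub.

```python
CATEGORY_SUBSTRINGS = {
    "hip hop/rap": ("hip hop", "rap"),
    "pop": ("pop",),
    "classical": ("classical",),
    "rock": ("rock",),
    "jazz": ("jazz",),
    "electronic": ("electro",),  # also matches "electroacoustic"
    "dance": ("dance",),
    "metal": ("metal",),
    "punk": ("punk",),
    "industrial": ("industrial",),
}


def count_categories(labels):
    """Count how many genre labels fall into each larger category."""
    counts = {category: 0 for category in CATEGORY_SUBSTRINGS}
    for label in labels:
        lowered = label.lower()
        for category, substrings in CATEGORY_SUBSTRINGS.items():
            if any(s in lowered for s in substrings):
                counts[category] += 1
    return counts
```

One caveat of plain substring matching is that short substrings such as “rap” will also match inside longer, unrelated words, so the results are worth checking by eye.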

Musicians considered to be experimental were most likely to be associated with rock music subgenres by a wide margin, followed by electronic, pop, and metal. Remarkably, the number of rock-related labels (88) exceeded even the number of instances of the label “experimental” (70).

Prevalence of Genre Labels

Of course, it bears mentioning that the sample size for this project is extremely small, and was drawn from a list that was generated manually, rather than automatically. Consequently, the list may be especially vulnerable to systemic biases of Wikipedia editors. Nevertheless, this brief study is a starting point for better understanding the application of an especially ambiguous term.