RubyKaigi 2017

Do Androids Dream of Electronic Dance Music?

http://rubykaigi.org/2017/presentations/juliancheal.html

AI is everywhere in our lives these days: recommending our TV shows, planning our car trips, and running our day-to-day lives through artificially intelligent assistants like Siri and Alexa. But are machines capable of creativity? Can they write poems, paint pictures, or compose music that moves human audiences? We believe they can! In this talk, we’ll use Ruby and cutting-edge machine learning tools to train a neural network on human-generated Electronic Dance Music (EDM), then see what sorts of music the machine dreams up.


00:00:00.030 Hello, can everyone hear me? Okay, good. Thank you.
00:00:13.320 So hello! I think we'll go ahead and get started. We are Julian Cheal and Eric Weinstein.
00:00:21.390 We're about to present our talk here at RubyKaigi 2017, titled 'Do Androids Dream of Electronic Dance Music?'
00:00:26.490 If you're wondering what we're talking about, the title is inspired by the book by Philip K. Dick, 'Do Androids Dream of Electric Sheep?'. If you are still not with me, you may know a film loosely based on it, called 'Blade Runner'. And if that doesn't resonate, there's a new one as well, so go check that out.
00:00:43.860 So, hello! Konnichiwa! You've now heard about 10 to 20% of all my Japanese, so thank you so much for coming. We really appreciate it.
00:00:52.430 Thank you to our sponsors, thank you to the organizers, and thank you to the venue. But, really, thank you for coming and sharing your time with us. We truly appreciate it. This is a computer talk, so we have to start with the basics. Part zero is hello, and who are we? So, let’s introduce ourselves.
00:01:14.610 Watashi no namae wa Julian desu ('My name is Julian'). I live in Bath. That's about the extent of my Japanese, so please bear with me. I work at a small open-source company called Red Hat, and my main job is on a project called ManageIQ, where I manage clouds. As you can see, there are no clouds over Hiroshima today, so I'm clearly good at my job.
00:01:41.400 My name is Eric Weinstein. I'm a Rubyist from Los Angeles, California. All of my contact information is here in this Ruby hash describing a human that I made. I work for a company called Fox Networks Group, building a platform that brings shows like 'Archer' and 'The Simpsons', along with Fox Sports and National Geographic content, to our viewers. I also wrote a book on Ruby for children called 'Ruby Wizardry', which is currently 30% off with the code 'kaiki30' on the No Starch Press website. Thank you also to the good folks at No Starch!
00:02:22.750 This is a relatively short talk, but I think it's still good to have an agenda. We'll start by defining machine learning and the various tools we can employ, such as recurrent neural networks. We'll then move on to the music-specific portion of the talk where we'll discuss music formats like MIDI. Julian will dive into that, and we’ll see how we can train a machine learning model to actually generate its own music based on the musical data it has.
00:03:01.209 I’ll take us through the first few slides, and then Julian will discuss the music. Before I get started, I want to mention that I tend to talk very quickly when I’m excited. Being here with all of you and discussing Ruby and machine learning is thrilling for me. So, if I begin to talk too fast, feel free to make a gesture to help slow me down.
00:03:39.220 Now, let's start with something simple: what is machine learning? The short answer is you can watch this talk I gave at RubyConf last year called 'Domo Arigato, Mr. Roboto: Machine Learning with Ruby'. But it might be simpler if we define it for you here. Machine learning is about generalization—moving from known information to unknown information.
00:04:03.250 The idea is to help a program assemble rules to deal with data so the machine can act without being explicitly programmed. For example, we might want to perform some kind of pattern recognition, such as identifying whether an image is of a car. Given the prices of known houses, we could also estimate the price of an unknown house. Alternatively, we might want the machine to separate data into groups based on patterns it observes, even if it doesn’t know why it's separating them in that way.
00:04:45.610 This second approach is known as unsupervised learning, which we won't delve into too much, but the first approach, supervised learning, is an important topic in machine learning. This involves transitioning from labeled data (information we know about) to unlabeled data. A classification example would be identifying if an image contains a car based on previous training. In contrast, regression would involve predicting the price of a house based on known data.
00:05:40.120 In our examples, labeled data serves as our training data, the information we provide to the machine learning model, while unlabeled data serves as test data. Understanding the structure of this data is vital. When we talk about MIDI files, Julian will explain more about their structure and how we can input them into the model.
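The labeled-training-data idea can be sketched with a deliberately naive Ruby example; the house figures and the average-price-per-square-metre "model" below are invented purely for illustration and are not from the talk:

```ruby
# Labeled examples become training data; a held-out example acts as
# "unlabeled" test data for checking how well the model generalizes.
labeled_houses = [
  { sqm: 50,  price: 200_000 },
  { sqm: 75,  price: 290_000 },
  { sqm: 100, price: 410_000 },
  { sqm: 120, price: 480_000 },
]

# Hold out the last house: we pretend we don't know its price.
training_data = labeled_houses[0..-2]
test_house    = labeled_houses.last

# A deliberately naive regression "model": average price per square metre.
price_per_sqm = training_data.sum { |h| h[:price].to_f / h[:sqm] } / training_data.size
prediction    = (price_per_sqm * test_house[:sqm]).round
```

A real model would learn weights from the data rather than a single average, but the shape of the workflow (train on labeled data, predict on unlabeled data) is the same.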
00:06:17.080 Now, let’s transition to part two: neural networks. How many of you are familiar with machine learning or neural networks? Okay, good! I won’t spend too much time on it. I will go over neural networks in general, followed by two specific kinds that are very good for modeling music generation problems.
00:06:55.810 Now that we understand what machine learning is, we can discuss different models. One such model is the neural network, a tool loosely modeled on the human brain. It consists of many neurons, which are really just functions: they take in inputs, much like a biological neuron's dendrites, and produce an output representing whether and how strongly the neuron activates. Neurons can produce binary outputs or continuous values, which map onto different problem types like classification and regression.
00:07:34.320 In our case, we selected a specific type of neural network known as a Recurrent Neural Network (RNN), more specifically an LSTM or Long Short-Term Memory Network. This type of network is particularly good for handling the time series data associated with music generation. Recurrent Neural Networks, including LSTMs, utilize a feedback loop that allows them to retain a certain amount of memory, enabling them to tackle temporal dimensions effectively.
00:08:25.320 As mentioned, the artificial neural network is modeled after biological brains. The LSTM network is interesting because it can handle data with a temporal dimension. Sorting images of cars is a static problem; our challenge, by contrast, is to determine the next note given the notes that came before it, so time is essential.
00:09:03.550 To illustrate, an LSTM processes a sequence of inputs through its hidden layers, whose outputs feed back in at the next time step. This feedback allows the network to maintain a memory of past notes, making it effective at generating music.
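That feedback loop can be illustrated with a toy recurrent step in Ruby. This is not a real LSTM (there are no gates and no learned weights); the constants and note values are made up for illustration:

```ruby
# Each step combines the current input with a hidden state that carries
# information forward from earlier notes in the sequence.
def rnn_step(input, hidden, w_in: 0.5, w_hidden: 0.8)
  # tanh squashes the combined signal into (-1, 1), like an activation.
  Math.tanh(w_in * input + w_hidden * hidden)
end

notes  = [0.1, 0.4, 0.2, 0.9]  # a sequence of normalized note values
hidden = 0.0
notes.each { |n| hidden = rnn_step(n, hidden) }
# `hidden` now depends on the whole sequence, not just the last note.
```

A real LSTM adds input, forget, and output gates so the network can learn *what* to remember and *what* to discard, which is what makes it practical for long musical sequences.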
00:09:43.800 This diagram shows an artificial neuron. You can see the multiple inputs that go into it, the activation function, and the output, which either fires or does not fire, similar to brain neurons. In regression problems—where continuous values are required—thresholding isn’t needed, and the signal can be output directly. Selecting and tuning activation functions is part of the art and science of machine learning, involving plenty of trial and error.
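The neuron in that diagram can be sketched in a few lines of Ruby. The weights, bias, and inputs below are illustrative, and the step function stands in for the fire/don't-fire thresholding the diagram shows:

```ruby
# An artificial neuron: weighted inputs are summed with a bias and
# passed through an activation function.
def neuron(inputs, weights, bias, &activation)
  sum = inputs.zip(weights).sum { |x, w| x * w } + bias
  activation.call(sum)
end

# A step activation models binary firing; for regression you would
# return the sum (or a continuous function of it) directly instead.
step = ->(x) { x >= 0 ? 1 : 0 }

fires = neuron([1.0, 0.5], [0.6, -0.2], 0.1, &step)
# 1.0*0.6 + 0.5*(-0.2) + 0.1 = 0.6, which is >= 0, so the neuron fires.
```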
00:10:23.260 In our case, the inputs and outputs correspond to numeric values parsed from MIDI files, indicating what music is coming in and what the next note should be. Assembling these neurons into an artificial neural network involves specific layers: an input layer reflective of the data type, at least one hidden layer for learning, and an output layer for predictions. In classification problems, these output neurons correspond to distinct results.
00:11:03.610 You may have heard of deep learning or deep neural networks, which are networks containing many layers. Deciding how many to include is part of the intricate design process in machine learning. If the network becomes overly deep, the error signal can weaken as it is carried backward through the layers during training (the algorithm that carries it is called backpropagation), which makes learning difficult. This is known as the vanishing gradient problem.
00:11:40.310 There is also the potential for overfitting, where the model learns the peculiarities of its training data instead of the overarching patterns we seek. So there is significant variation in how we design a neural network. We are using a recurrent neural network, where outputs feed back into the network, allowing memory retention to handle the temporal dimension.
00:12:23.160 We picked LSTMs because they can retain memory over long spans and handle signals well in music, where notes can arrive at irregular intervals. Now we transition to part three. After Eric's talk last year, we began discussing the kinds of data machine learning could be applied to.
00:13:03.560 If you attended my RubyKaigi talk last year, it focused on creating music with computers. Eric's discussion on machine learning intrigued me, as computers can create music too. For those watching on YouTube, a link to my prior talk will appear, and don't forget to subscribe!
00:13:25.990 First, let's discuss MIDI. MIDI was introduced in the early 1980s (the specification was published in 1983), mainly thanks to Dave Smith. Before MIDI, each manufacturer's instruments spoke their own incompatible protocols. MIDI, which stands for Musical Instrument Digital Interface, established a standard way for instruments to communicate. It does this using MIDI note data, which is a series of bytes.
00:14:07.170 The first byte, the status byte, indicates the message type, such as note on or note off, together with the channel number, which allows up to sixteen channels on one cable. This was vital in the 80s: think analog cables daisy-chaining an entire orchestra on stage! The second byte specifies which note is played, corresponding to the keys of a piano. The final byte is the note velocity: press a key softly and the velocity is near zero, while a hard press can reach 127.
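Those three bytes can be decoded with a few lines of Ruby. The bit layout follows the MIDI specification (Note On status bytes are 0x90 through 0x9F); the helper name is our own:

```ruby
# Decode a three-byte MIDI Note On message.
def parse_note_on(status, note, velocity)
  raise "not a Note On message" unless (status & 0xF0) == 0x90
  {
    channel:  status & 0x0F,  # low nibble: one of 16 channels per cable
    note:     note,           # 0-127; 60 is middle C on a piano keyboard
    velocity: velocity,       # 0 = silent, 127 = hardest press
  }
end

msg = parse_note_on(0x90, 60, 100)
# => { channel: 0, note: 60, velocity: 100 }
```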
00:15:06.000 A visualization of MIDI data on a piano keyboard illustrates how note data appears in binary format. The bars represent the duration notes are held. In software like Ableton Live, you can edit MIDI data in such visual formats.
00:15:54.000 To incorporate music into our machine learning algorithm, we needed MIDI data. We sourced royalty-free music by Deadmau5 from a service called Splice. The legal implications of taking someone else's music and training a neural network to generate new work are still unclear. Thus, we aimed for royalty-free data to avoid potential copyright issues.
00:16:43.850 Once gathered, we converted these MIDI files into a format compatible with TensorFlow, which entails complex transformations from one binary format to another. This took time and required accurate file identification, ultimately using 110 MIDI files from Deadmau5.
00:17:29.460 However, numerous MIDI files are available online; about 130,000 files equal around four gigabytes, encompassing genres from classical to pop and even TV show themes like 'Sesame Street.' It’s crucial to have your own data. In our initial training phase, Eric utilized generic pop songs. When I introduced my hardcore trance files, we forgot to clear previous training data. Consequently, the resulting 'music' sounded like a strange hybrid of trance and pop.
00:18:26.030 As you can tell, I’ve said ‘data’ numerous times. The next logical step is to show you our code! Here's our complete program, consisting of just 18 lines of Ruby code that reads in MIDI files, processes them into the required format, trains the model on that data, and outputs new MIDI files.
00:19:17.839 You may notice a quirk on lines 7 and 8: a lot of the heavy lifting isn't actually done in Ruby, and the Ruby code is really orchestrating other tools. Still, we can accomplish all this in only 18 lines, which I think is quite impressive. Feel free to take pictures of this part, but maybe don't document the sections that follow.
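The talk's actual 18-line script isn't reproduced in the transcript, but a pipeline of this shape can be sketched in Ruby. The command names below are Magenta's melody RNN command-line tools; the flags and paths are illustrative assumptions, not the speakers' actual configuration:

```ruby
# Illustrative paths, not from the talk.
MIDI_DIR = "data/midi"
SEQ_FILE = "tmp/notesequences.tfrecord"
RUN_DIR  = "tmp/melody_rnn/run1"

# Build the pipeline as shell commands; Ruby just orchestrates.
def magenta_commands
  [
    # 1. Convert a directory of MIDI files into Magenta's NoteSequence format.
    "convert_dir_to_note_sequences --input_dir=#{MIDI_DIR} --output_file=#{SEQ_FILE}",
    # 2. Train an LSTM-based melody model on the converted data.
    "melody_rnn_train --sequence_example_file=#{SEQ_FILE} --run_dir=#{RUN_DIR}",
    # 3. Ask the trained model to dream up new MIDI files.
    "melody_rnn_generate --run_dir=#{RUN_DIR} --output_dir=out/generated",
  ]
end

# In a real script each command would be executed with `system(cmd)`.
```

Keeping the commands as data like this makes the Ruby side tiny, which matches the talk's point that Ruby mostly glues the Python/TensorFlow tooling together.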
00:19:49.700 Now that we understand machine learning, neural networks, and data structure, let's explore how we generate music. We will have a live demo, so all of my diagrams will not be for naught!
00:20:02.220 We utilized TensorFlow, Python, Ruby, and a library called Magenta designed specifically for music generation. Our initial idea was to build an LSTM from scratch in Ruby, but the result was too slow. Libraries like NumPy, SciPy, or TensorFlow have performance advantages because they call into C for intensive calculations.
00:20:53.569 We realized that Julian's advice required us to preprocess the data not once but twice for compatibility. The necessary library infrastructure exists in Python and TensorFlow but lacks support in Ruby. Consequently, we couldn’t build our network entirely from scratch; we needed the library support.
00:21:49.170 While there are great projects like SciRuby and TensorFlow.rb, we still required extensive Python integration. Ultimately, our project is about 93% Python and roughly 5% Ruby, with the remainder consisting of shell scripts that are now obsolete thanks to our Ruby updates.
00:22:44.370 Now, let’s ensure you're ready for the dance party! I hope you are enthusiastic because we are about to present our demo, which we hope you will enjoy. Here we go!
00:23:40.000 Demo time! I’ll start by running the setup, allowing you to see the connection between our code and the music output. Here, we utilize Ableton Live, a digital audio workspace, enabling us to work with MIDI files and create instruments. The sample we’ll use for the demo is from Deadmau5.
00:24:38.890 As you can hear, Deadmau5 is known for his arpeggios—taking chords apart and playing individual notes. Each song carries his unique sound, and below, you’ll see visual representations of MIDI files—note activations shown as dots. These MIDI representations indicate note length and speed, producing a familiar sound when we use this data for training.
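The arpeggio idea Julian describes, taking a chord apart and playing its notes one at a time, can be sketched in Ruby. The chord and the up-down pattern below are illustrative, not taken from the Deadmau5 sample:

```ruby
# MIDI note numbers for a C minor triad: C4, Eb4, G4.
C_MINOR = [60, 63, 67]

# Turn a chord into an up-down arpeggio pattern, one octave.
def arpeggiate(chord)
  up = chord.sort
  up + up[1..-2].reverse  # go up, then back down without repeating the ends
end

pattern = arpeggiate(C_MINOR)
# => [60, 63, 67, 63]
```

Looping a pattern like this, one note per step of a clock, is exactly the kind of note data you see as dots in the MIDI visualization.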
00:26:06.150 Now, are you ready to hear the result of 5% Ruby and 95% Python combined? Let’s find out! I hope it sounds okay because, if all goes as planned, we'll let the code perform its generated music!
00:27:31.350 As you can observe, the sound retains similarities to Deadmau5's work, but all of this music was created through machine learning. The machine predicts notes that fit well together. Is it harmonically correct? It does have a consistent flow, but occasionally the computer selects notes that clash within a conventional framework.
00:30:00.780 Sometimes you'll notice irregular notes where the machine struggles with traditional key changes. This particular output captures a rhythm resembling Deadmau5's style while introducing slightly off-key variations. A larger training set would likely have improved the quality, but that takes time: our 110 MIDI files already required approximately 40 minutes of training.
00:31:50.880 In conclusion, we’ve discussed machine learning, especially supervised learning versus unsupervised learning, and focused on the LSTM neural network that manages time series data efficiently—ideal for musical outputs. Our journey began with a modest 5% Ruby, but we aspire to increase this to a more significant percentage.
00:32:34.880 We will diligently review our data files before open-sourcing later today, ensuring compliance with copyright regulations. We invite you to contribute through pull requests, assisting us in developing machine learning tools for Ruby.
00:32:48.000 Thank you so much. We appreciate your time and would love to hear any questions you might have!
00:33:04.960 Audience Member: What constitutes the error, and how do you address corrections? In our context, the machine listens to a series of notes and compares the note it predicted would come next with the note that actually came next, learning from any discrepancy.
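That predicted-versus-actual comparison can be sketched numerically. The note sequences and the mean-squared-error loss below are illustrative; the talk doesn't specify the model's actual loss function:

```ruby
# Notes that actually occurred versus notes the model predicted,
# as MIDI note numbers (invented values for illustration).
actual_notes    = [60, 62, 64, 65, 67]
predicted_notes = [60, 62, 63, 65, 69]

# Mean squared error over the sequence, a common training loss:
# training adjusts the network's weights to push this number down.
def mean_squared_error(predicted, actual)
  predicted.zip(actual).sum { |p, a| (p - a)**2 } / predicted.size.to_f
end

loss = mean_squared_error(predicted_notes, actual_notes)
# => 1.0  (squared errors 0, 0, 1, 0, 4 averaged over 5 notes)
```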
00:34:22.200 Given that we used 110 files and around 1,000 training rounds, the outcome was distinctly Deadmau5-like. Future experiments could blend multiple artists' data to produce new styles. You might encounter terms like overfitting, where the machine sounds too close to its source material, or underfitting, where the output is nonsensical.
00:35:20.680 Do we have time for one or two more questions? Yes? Audience Member: Is there any understanding of how data affects the learning process? Yes, the MIDI data contains length and velocity, which we may tweak alongside sound production for further refinement of the output.
00:36:19.610 You could explore potential adjustments using physical data or more intricate audio files in future developments.
00:36:59.860 Audience Member: Would it be possible for you to identify original sources in the output music? We can visually inspect or listen to the music fed into the network, but deciphering the network's decisions remains challenging. There's much we're still learning about how these libraries model their decisions.
00:37:23.450 Audience Member: What license will you utilize for the open-source release? We’re considering the GPL or MIT licenses.
00:37:40.630 Thank you for your engaging questions, and congratulations to our presenters for an excellent talk today!