Summarized using AI

In the Name of Whiskey

by Julia Ferraioli

In the video titled "In the Name of Whiskey," Julia Ferraioli presents a fascinating blend of whiskey appreciation and machine learning, using TensorFlow to analyze whiskey flavor profiles. As a software engineer at Google, Julia brings a unique perspective to the topic, sharing her journey from machine learning research to her current experiments with whiskey data.

Key points discussed throughout the video include:

  • Introduction to Machine Learning: Julia introduces the concept of machine learning as a method to derive insights from data by using algorithms. Her initial interactions with machine learning stemmed from a desire to solve complex problems, culminating in her intriguing project that assesses whether one should hug an object based on its image.

  • Whiskey Dataset: Julia shares her enthusiasm for scotch whiskey, mentioning that she started exploring the whiskey landscape about three years prior, prompted by a tutorial on using a whiskey dataset. This dataset allowed her to analyze various characteristics of whiskeys, converting qualitative attributes into quantitative feature vectors for further processing.

  • TensorFlow and K-Means Clustering: A significant portion of the video is dedicated to demonstrating how Julia uses TensorFlow for numerical computation. She explains the K-means algorithm, which clusters data into groups based on flavor profiles. This involves selecting the right number of clusters (K) and iterating until satisfactory groupings emerge.

  • Challenges with Limited Data: Julia candidly discusses the limitations posed by her dataset, which contained only 86 data points. She emphasizes that to produce reliable models in machine learning, sufficient data is necessary, revealing that she underestimated the importance of data quantity in generating accurate insights.

  • Neural Networks and Analysis: The exploration extends to neural networks, which Julia describes as a robust way to analyze patterns in high-dimensional data. She details how hidden layers let a network learn patterns that simpler networks cannot, using whiskey data to train models that classify scotches based on flavor profiles, while acknowledging challenges such as the need for more extensive datasets and more complex architectures.

  • Conclusion and Call to Action: Julia concludes by recalling the importance of persistence and learning from failures in the machine learning process. She encourages the audience to explore whiskey further and offers to share resources on machine learning and TensorFlow to foster interest in real-world applications of these technologies.

Overall, the video is an engaging demonstration of how machine learning can be used in unconventional applications, urging a deeper exploration of both whiskey and data science.

00:00:14.450 Okay, hopefully the connection will stay strong. Hi, I'm Julia. I am from Google, where I work as a software engineer on the open-source team. I’m pretty excited about that. I'm so thankful that I get to talk to all of you. First of all, because I haven't been able to participate in a community in a long time, and y'all are awesome, so thank you. But secondly, I get to talk about whiskey and machine learning. Since I failed the Turing test, I identify really well with some of these algorithms I'll be discussing.
00:00:42.480 You know, back in the day, I did some stuff. This was like a lifetime ago. It wasn't my first or second career; it's my third career. I did machine learning research, where I got to play with robots. I set them loose in the lab and pretended like that was my social life. As any good theoretical computer scientist would do, I built an autonomous aerial robot, otherwise known as a blimp. We calculated how much lift we would need and decided what shape would fit within those constraints. It turns out you can fit a dodecahedron inside the constraints we had, and clearly that was the optimal structure. I made a pretty big dent in the world's helium supply, and I apologize for that. We even received a pity award for innovative hardware design, which I think was code for 'I'm not really sure how this works'. To be honest, I don't know how it works either anymore.
00:01:50.020 That was a lifetime ago. I went into a PhD program for machine learning, and it really wasn't for me. I was pretty sad because, in my naive worldview, I thought I could no longer do machine learning since they surely checked whether you received your PhD before letting you run any algorithms. I realized far too late that this wasn’t in fact true: I can actually do machine learning.
00:02:06.910 I've had the chance to do it in the past few months because, at its heart, all machine learning is just taking data, running it through an algorithm, answering questions, and getting insights. When I first started, I thought I wanted to solve the world's problems, like identifying diseases and making people less sick. But it boiled down to wanting to solve the fact that there weren't enough hugs in the world, so I built a system called 'Can I Hug That?', which takes an image and tells you if you should hug it. I polled my colleagues for examples of things they would hug or not hug, and everyone agreed that you should not hug a cactus. I agree, but it also gave me insights like this: if we look at a crocheted version of an octopus, it gets a thumbs up, while the real thing, surprisingly, does not.
00:02:31.580 So, how can I apply this to whiskey? Specifically, scotch? I only got into scotch about three years ago, which coincided with when I read a simple tutorial on doing machine learning with a whiskey dataset. If any of you are machine learning aficionados, you may have come across it—there's only one. It provided some pretty good objective data, and while I'm not great at internalizing matrix operations, I can make computers perform them for me, which lets me do some cool things.
00:03:10.490 Let’s talk about the data I have. To set some vocabulary, who’s familiar with the term 'feature vector'? I'll define it, so don't worry. In a 2D space, you have basic X and Y coordinates and a point, so your feature vector is X and Y. If you add a dimension, you get X, Y, and Z; that’s your feature vector. It describes a data point. As you add more dimensions, it becomes harder to visualize, since our brains don’t easily handle more than three dimensions. At one point, I was dreaming in ten dimensions, and it wasn't pleasant. But this gives us the ability to represent every point in a dataset as a feature vector.
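To make the idea concrete, here is a minimal sketch (the values are made up) of feature vectors of increasing dimension in Python:

```python
# A feature vector is just an ordered list of numbers describing one data point.
point_2d = [1.0, 2.0]                   # (x, y)
point_3d = [1.0, 2.0, 0.5]              # (x, y, z)
point_10d = [0.2, 0.9, 0.0, 1.0, 0.4,
             0.7, 0.1, 0.3, 0.8, 0.5]   # ten dimensions: easy for code, hard to picture
```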
00:04:24.960 So, what data do I have? Well, this takes us to my first whiskey, my gateway scotch. It is a Speyside, not a Highland, as I had mistakenly thought for too long. We have information about it, like how robust the body is and what kind of notes it has, such as smoky and medicinal qualities. However, the dataset did not include the region or the latitude and longitude. I painstakingly compiled the region from Wikipedia, and derived the latitude and longitude by converting from a different coordinate system. I learned that the world does not operate solely on latitude and longitude.
00:05:32.310 Now, if we distill this particular one into a feature vector, it looks something like this. We gather the notes, latitude, and longitude, where some values are strings, others integers, and some floats. We convert the strings into numbers so we can differentiate between them. This data condenses into something meaningful, resulting in a representation like this for our gateway scotch. Those who have done matrix operations will note that stacking these vectors for every whiskey gives us a fully-fledged matrix, which we can use traditional matrix math on.
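A hypothetical sketch of that conversion (the column names and values are illustrative, not the dataset's actual schema):

```python
# Map region names (strings) to numbers so every feature is numeric.
regions = {"Speyside": 0, "Highland": 1, "Islay": 2, "Lowland": 3}

def to_feature_vector(row):
    return [
        float(row["body"]),             # flavor notes are small integers
        float(row["smoky"]),
        float(row["medicinal"]),
        float(regions[row["region"]]),  # string converted to a number
        row["lat"],                     # coordinates are floats
        row["lon"],
    ]

example = {"body": 2, "smoky": 1, "medicinal": 0,
           "region": "Speyside", "lat": 57.5, "lon": -3.2}
print(to_feature_vector(example))  # [2.0, 1.0, 0.0, 0.0, 57.5, -3.2]
```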
00:06:43.100 Who's heard of TensorFlow? It turns out more of you than I was expecting! TensorFlow can primarily be described as a tool for deep learning research, which attempts to solve problems we previously thought required human intelligence; the AlphaGo match, for example, served as a testament to deep learning's capabilities. At its core, TensorFlow is for numerical computation: it performs matrix math quickly and efficiently. It operates with a concept called deferred execution, allowing us to define computations without running them until later.
00:07:26.140 Let me show you what this looks like. Right now, I have to confess that TensorFlow does not have a Ruby interface, and I apologize for that. There is a call for one, but we don't have many Ruby experts among us at Google, Asha notwithstanding. However, if you'd like, you can contribute to its open-source development. TensorFlow does have Python interfaces, so let's dive into what the code looks like in an interactive session.
00:08:09.070 We start by defining a simple matrix for our whiskey data. If I define a matrix with three rows and one column, we can also define another matrix with three columns and one row, allowing multiplication between them. If we print out these matrices, we'll get a tensor object—this just represents something we can use in TensorFlow. The tensor does not have an assigned value until we evaluate it by calling an eval function, which is when we get the results of our computations. Personally, this required a shift in mindset, and I struggled to adjust to this way of thinking. Let's close the session for now.
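A minimal sketch of the demo she describes, written against the TensorFlow 1.x-style session API of the talk's era (available in current releases via the compatibility module):

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # restore the deferred-execution model described above

m1 = tf.constant([[3.0], [3.0], [3.0]])  # three rows, one column
m2 = tf.constant([[2.0, 2.0, 2.0]])      # one row, three columns
product = tf.matmul(m1, m2)              # defines the computation; nothing runs yet

print(product)  # a Tensor object with a shape and dtype, but no values

with tf.Session() as sess:
    print(product.eval())  # calling eval() actually runs the multiplication

# The session closes automatically when the `with` block ends.
```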
00:10:37.430 What do we have? We have some whiskeys mapped out with coordinates on a Google Maps interface, color-coded by region. We can hover over them to see their locations; for example, there's a distillery in Speyside. Those familiar with whiskies will have a lot of questions looking at this map—and rightfully so. I wanted to see what types of groupings might emerge from this data, which leads us to the algorithm called K-means.
00:11:58.639 K-means has some basic components. The first is K, which represents the number of groupings we want from our data. It’s as simple as that. Using a 2D dataset like the famous iris dataset, if we set K to four, we could visually see the groupings. But as the dimensions increase, visualization becomes extremely complicated. Essentially, after picking a good value for K, you assign each data point to the closest cluster center, then update the cluster centers based on the data points assigned to them. With enough iterations, you can get a reasonable grouping.
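The talk doesn't walk through the implementation here, but a minimal numpy sketch of the loop just described (assign points to the nearest center, update the centers, repeat) looks like this:

```python
import numpy as np

def kmeans(points, k, steps=100, seed=0):
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen data points as the initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(steps):
        # Assign every point to its nearest cluster center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers
```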
00:12:48.420 However, a major weakness of K-means is that you must know the right value of K in advance. You can also run into issues based on how you pick the initial cluster centers. A poor choice in your initial centers can lead to unsatisfactory results, and your outcomes can vary widely depending on that initial state.
00:13:32.510 Next, let's look at a Docker container that's running TensorFlow. While it's possible to install and compile it on your own, you might run into some issues with Fortran, and trust me—you don’t want to deal with that. Docker simplifies this process. Now, I’ll run a simple K-means algorithm with our whiskey dataset, gathering flavors and regions while letting it iterate for a specified number of steps.
00:14:20.960 The algorithm outputs cluster assignments, and I’ll pair those with the distillery names. As I import the centroid data, we can see where each cluster's center lies in terms of those flavors. Let’s refresh the previous run. Right now, we have four clusters mapped on our whiskey map. The clusters may have less spatial coherence than we expected, since they correspond only to flavor profiles.
00:15:12.060 After zooming in on the map of clusters, it’s interesting that some scotches assigned to the same cluster are far apart geographically, indicating that flavor profiles and geographical regions don’t necessarily align as expected. This clustering exercise offered a critical insight: flavor profile and region aren't synonymous.
00:16:02.630 I would love to show you the code for the K-means algorithm, but it’s a little too complex for a 35-minute talk. However, I wanted to experiment with neural networks, using a simple feed-forward neural network to do what we did with K-means: classify scotches into groups.
00:16:52.870 First, let's clarify what a neural network is. Essentially, it's a graph that processes input to generate output. If we think of analyzing movie reviews, we want to determine whether a review is good or bad based on its contents. However, basic neural networks struggle with understanding nuanced language. The research community hit obstacles in the 70s when it emerged that simple, single-layer networks cannot learn certain tasks—famously, functions that are not linearly separable.
00:18:12.830 However, with the addition of hidden layers, we can enhance the network's capability to learn from data. When we pass input data through the layers, the hidden layer helps the network capture the complexities involved in determining outcomes. Overcoming the limitations of the simple, single-layer version lets us ascertain whether a movie review reads positive or negative.
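As a toy illustration of that idea, here is a single forward pass through one hidden layer in numpy (the weights are random stand-ins, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.array([0.3, 0.7, 0.1])        # input features, e.g. word counts from a review
W1 = rng.normal(size=(3, 4)) * 0.5   # weights: input -> hidden
W2 = rng.normal(size=(4, 2)) * 0.5   # weights: hidden -> output

hidden = np.maximum(0.0, x @ W1)     # the hidden layer re-represents the input (ReLU)
scores = hidden @ W2                 # e.g. one score each for "positive" and "negative"
print(scores)
```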
00:19:11.700 The algorithm for a feed-forward neural network initializes the layers' weights with (typically random) numbers, and then you optimize the network through training. For my whiskey dataset, I split the data into training and evaluation sets: the training set has 76 instances, and the evaluation set has 10. This allows the network to learn from examples and fine-tune its predictions.
00:20:05.690 If we run the neural network with our training data file containing the 76 scotches and the testing data with 10 instances, the network will train over many iterations, allowing it to identify patterns in the grouping of scotches. If the training goes well—as it has previously—then we will see reasonable accuracy, but we might also experience the opposite.
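Her demo used 2016-era TensorFlow tooling; a comparable sketch with today's tf.keras, using synthetic stand-in data since the real dataset isn't reproduced here, might look like:

```python
import numpy as np
import tensorflow as tf

# Stand-ins for the talk's data: 86 scotches with 12 flavor scores each,
# labeled with one of 4 groups (the group count is illustrative).
rng = np.random.default_rng(0)
X = rng.random((86, 12)).astype("float32")
y = rng.integers(0, 4, size=86)
x_train, y_train = X[:76], y[:76]   # 76 training instances
x_eval, y_eval = X[76:], y[76:]     # 10 evaluation instances

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),    # one hidden layer
    tf.keras.layers.Dense(4, activation="softmax"),  # one output per group
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=200, verbose=0)

_, accuracy = model.evaluate(x_eval, y_eval, verbose=0)
# With only 10 evaluation points, accuracy is extremely noisy -- the
# coin-flip-versus-0.7 swing described next.
print(f"evaluation accuracy: {accuracy:.2f}")
```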
00:21:23.500 During one run, I found the system performed about as well as a coin flip, which was not ideal given the amount of hope I had placed in it. A better training session produced accuracy around 0.7, and further exploration showed some distinctions among specific scotches. Some of the predictions aligned quite well with regions.
00:22:44.960 Throughout this talk, I intended to discuss how to work out flavor profiles from distillery coordinates. However, my approach was flawed from the beginning due to the dataset: with only 86 data points, it became evident that there was not enough data for effective machine learning modeling. More examples yield better predictions, and when training classifiers, quantity matters significantly.
00:23:38.610 Initially, I underestimated the challenges posed by limited data, but it’s evident that something had to change. Looking at the mapping between regions and the consistency of the clusters revealed issues; at one point, I had mistakenly categorized a Scotch as being located in Germany.
00:24:35.970 Even with some solutions in mind, it became apparent I needed to treat data cleaning like a muscle that needed training. Once practiced, it becomes more manageable. A willingness to fail and learn is just as crucial in this field as understanding code and algorithms. It felt empowering to hear that this sentiment echoed around the conference.
00:25:27.320 I realized I needed more data for my experiments. I know many people who have a passion for whiskey, and therefore, I could have created my own dataset. By doing this, I would easily have accumulated over 200 data points—more than double what I currently had. In hindsight, I should have thrown a party to share in the research and recruit more participants.
00:26:02.140 If you're interested in machine learning and TensorFlow, I can refer you to some awesome resources. I recently reread a gentle introduction to machine learning, which is a nice primer on the subject. It inspired me to see how many people are using TensorFlow to tackle real-world problems.
00:27:06.990 At this point, real-world applications of techniques like neural networks can generate art by applying styles to sketches, allowing creativity without strict artistic abilities. So the takeaway here is simple: More whiskey! To comply with the theme today, I have a little sign displayed—a thank you to the octopus above, but now he appears to be armed.
00:27:54.190 If there are any questions, I'm here to answer them!
00:29:25.900 That's a wonderful question! If you were to add the age of the Scotch to the dataset, it would add a dimension to your data. This means you would have an additional feature for each Scotch, and it could enhance the quality of your clustering algorithm through a more comprehensive evaluation. Honestly, incorporating age could justify some expensive purchases in the name of research. Is there any other question? Yes?
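Concretely (with hypothetical values), adding age just appends one more number to each scotch's feature vector:

```python
# Hypothetical feature vector: flavor notes, region code, coordinates...
features = [2.0, 1.0, 0.0, 0.0, 57.5, -3.2]
# ...grows by one dimension when we record that it's a 12-year-old scotch.
features_with_age = features + [12.0]
```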
00:31:20.270 That’s an excellent point! When comparing a human’s ability to group scotches against a neural network's capability, the disparity mainly lies in the breadth of experience. While a neural network works through defined patterns in data, humans draw from a rich history of taste that informs our nuanced evaluations of what we consume.
00:32:03.920 To address the nature of the datasets: indeed, none were streaming, but there is a class of algorithms known as online algorithms that evolve as fresh data flows in. Incorporating real-time data generally enhances performance, up to a point.
00:32:49.610 Regarding your question about hidden layers, approaches vary quite a bit. Selecting an appropriate number of nodes for a layer is crucial, and there are resources available to help you decide. You could think of the hidden layers as translating input data into intermediate descriptors, and choosing their size relies on trial and error in the beginning. As much as I attempted to follow established guides, I often just experimented.
00:33:51.275 While this talk aimed to cover material within a constrained timeframe, I want to respect your time, so if you have further inquiries, please reach out either now or later. Thank you for your attention, and let's give a round of applause for Julia!
Recorded at Ruby on Ales 2016