RailsConf 2017

Is it Food? An Introduction to Machine Learning

by Matthew Mongeau

In the video titled "Is it Food? An Introduction to Machine Learning," Matthew Mongeau presents an accessible introduction to machine learning, particularly within the context of Rails applications. This talk was delivered at RailsConf 2017 and aims to bridge the gap between overly technical and simplistic explanations of machine learning concepts. Mongeau begins by noting his personal interest in practical machine learning applications and emphasizes a need for balance in understanding technical concepts. The primary focus is on image classification, specifically determining whether an image is of food.

Key points discussed in the video include:

- The Challenge of Image Recognition: The talk opens with the question of recognizing images as food and the complexities machine learning faces in image classification.
- Types of Machine Learning: Mongeau explains three categories of machine learning: unsupervised learning, supervised learning, and reinforcement learning, with an emphasis on supervised learning for image recognition using neural networks.
- TensorFlow and Neural Networks: He introduces TensorFlow as a powerful framework for building and training neural networks, explaining key concepts such as tensors and data flow graphs.
- Real-World Application at Cookpad: Mongeau shares a case study drawn from his work at Cookpad, which manages a vast number of images submitted by users. The company needed to ensure that images submitted are indeed food-related, leading to the development of an automated classification system.
- Implementation Steps: He details the steps of creating a simple Rails application that classifies images as food by calling a Python backend built with Flask, highlighting the value of leaning on existing frameworks like TensorFlow for practical applications.
- Best Practices and Lessons Learned: Throughout his journey, Mongeau stresses the significance of utilizing appropriate tools and frameworks, learning from experiences, and adapting processes to facilitate machine learning integration in Rails applications.

In conclusion, the core takeaway is the importance of applying machine learning practically and of reaching for robust tools such as TensorFlow, especially when building applications in environments like Rails. He advocates embracing Python for machine learning tasks because of its strong community support and more mature tooling. Mongeau ends on a hopeful note that image classification technology will continue to evolve, offering increasing utility across various industries.

00:00:11.509 All right! Great, it's working. I wanted to make sure that was working. Welcome, everyone! How is everyone's RailsConf going? Good?
00:00:22.699 So I'm going to start with: who am I? I go by Goose, but I'm also Matthew Mongeau, or halogenandtoast online. You can find me under that name pretty much anywhere; that's the usefulness of it: no one else uses it. You may be wondering where 'Goose' comes from. It actually comes from my last name, Mongeau.
00:00:40.170 Now, you might not be able to spell or pronounce this, and because of that, you might call me 'Mongoose,' which just ends up getting shortened to 'Goose.' Everyone frequently reminds me that this comes from the movie Top Gun, but I haven't seen it. There's a very dark truth here, though, which led me to do a little investigation.
00:00:59.370 There are a number of movies featuring characters named 'Goose.' For instance, in the movie Mad Max, Goose dies in a horrible fire. The movie City of God has a character named Goose who gets shot and dies. As for Top Gun, I assume he dies in some kind of automobile accident. The alternative to being called 'Goose' was that people once tried to shorten my name to 'Mongo,' which was fine, but in developer chats it became kind of confusing. People don't always say the best things about databases, so you run into this problem where you're not sure if they're talking about you or the database, and you don't want to have an existential crisis with a NoSQL database, as it makes building relationships really hard.
00:01:32.610 Perfect! Now you know everything you need to know about me. We're going to get started with 'Is It Food? A Journey of Minds, Machines, and Meals.' When I give talks, I like to think about why I want to give this particular talk. The reason I wanted to give this one is that I wanted to learn about machine learning. I watched a bunch of talks on it, but I always felt like the information was either way too high-level or way too low-level for me. I couldn't find a happy medium.
00:02:04.980 One of the problems I often had was watching these talks and seeing a lot of charts discussing topics like linear regression or gradient descent. I would try to look up definitions for terms like gradient descent and get definitions like this: 'Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.' That is how I feel when I read these types of definitions! So I wanted to set up some goals to ensure my talk wouldn't be like that.
00:02:50.220 I have two main goals for this talk. First, I want to focus on practicality, discussing a real-world use case and framing everything I talk about around that idea. But I should also start with some disclaimers. I will try my best to balance the practical and the technical, and as a result I might end up oversimplifying some things. This is really just meant to be a high-level overview, and my goal is for a few of you, by the end of this, to think, 'I could do that. Maybe I'll try it too.'
00:03:21.690 I want to start off with a brief experiment. I want to find out if something is food, or more specifically, as people like to remind me, if it is a photograph of food. So I hope this will be a fruitful experiment. Let's look at this image. Is it food? Or, sorry, is it a photograph of food? How do we know? We've seen things that look like this, or exactly this kind of thing, and we know it's food, but that doesn't really get to the meat of our problem.
00:04:11.769 Now, some of you may have never seen this particular dish before: yakitori. But how do we know that this is food? Does it smell good? Well, part of it is that we can immediately recognize the ingredients, break it down, and kind of process it. It looks like something we've seen before—that's pretty sweet! But what about this? Is this food? See, this one is really questionable. However, if I contrast it here and say, 'Is this food?' and then ask, 'Is this food?' you might say this one isn't food, but this one might be.
00:04:50.320 We can tell there's something inside, and we actually have a lot of experiences where we've opened up a package and eaten those delicious crispy Kit Kats and felt very happy. If you hand a child a package of Kit Kats, and they're not used to this experience yet, they can get very disappointed because there's nothing in there! Now, I just have a couple more questions. Is this food? Is it a picture of food? I'm not sure. I think we can clearly identify this as food, but what about this? I really hope nobody said yes.
00:05:37.960 What I want to address is that we're able to look at these pictures and identify them as food because we're good at pattern recognition. In many ways, this has been core to our survival. Pattern recognition is one of the key factors that led us to develop language, tools, and agriculture. We are really exceptional at it, except we hate slow and menial tasks. If I gave you ten thousand images and said, 'Are these all food?' you would get bored really quickly and not want to do that. So we try to automate this problem.
00:06:40.130 Automating this presents an interesting challenge because, as humans, we can recall information in an extraordinarily fuzzy way. Through past experiences, we can piece together information and form new understandings about the world around us, but this process is very hard to program. Therefore, machine learning can be used as an approximation of this kind of behavior. I want to talk about three different types of machine learning in this situation.
00:07:24.290 First, we have unsupervised learning, which refers to taking information and clustering it into groups; the computer decides what those groups mean. You might have a chart with points plotted out, and you can see centroids indicating groups based on proximity: anything in the green area would be green, anything in the red area would be red, and anything in the blue area would be blue. This clustering is the essence of unsupervised learning. Sometimes we want to know if we can label something, which is where supervised learning comes in.
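Before moving on to supervised learning, here is a concrete (and simplified) sketch of the clustering idea using scikit-learn's KMeans; the points, the choice of three clusters, and the library are my own illustration, not something from the talk.

```python
# Hypothetical illustration of unsupervised clustering: k-means groups
# 2-D points around three centroids, much like the green/red/blue regions
# described above.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([
    [1.0, 1.2], [0.8, 1.1], [1.1, 0.9],   # one loose group
    [5.0, 5.2], [5.3, 4.9], [4.8, 5.1],   # another group
    [9.0, 1.0], [9.2, 1.3], [8.8, 0.9],   # a third group
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)  # the centroid of each discovered group
print(kmeans.labels_)           # which group each point was assigned to
```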
00:08:16.490 In supervised learning, you provide all the information upfront and say, 'These things are this, these things are that.' Then, the system tries to learn how to identify new information based on that training. Lastly, we have reinforcement learning, which is similar but involves telling the system whether it made the right or wrong guess at each step. The system takes that information to improve over time. For the purpose of identifying images, we will only discuss supervised learning.
00:09:02.810 We will focus specifically on neural networks. I want to give you a brief history of artificial neural networks. One of the first artificial neural networks created was the perceptron, invented in 1957 by Frank Rosenblatt. This machine was designed for image recognition back in 1957, which I find extraordinarily fascinating. However, one limitation was that it could only learn to recognize linearly separable patterns, which led to some stagnation in the research and development of neural networks.
00:09:43.520 Now, fast forward to 2015 with TensorFlow. What is TensorFlow? TensorFlow was developed as a system for building and training neural networks, represented as something called a data flow graph. Each layer of this graph, depicted in oval shapes, takes in a tensor and returns a tensor, performing some operation on the input tensor. So, what is a tensor? When I looked it up, I found an image that looked extremely complicated. However, the simple answer is that a tensor is just an N-dimensional array of information. You can choose whatever dimension you want and represent it as a set of inputs that then pass through each layer of the network.
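To make 'tensors flowing through a data flow graph' a little more concrete, here is a minimal sketch in the TensorFlow 1.x Python style of that era; the numbers and the operation are my own illustration, not code from the talk.

```python
# A tiny data flow graph in TensorFlow 1.x: tensors (N-dimensional arrays)
# flow through operation nodes, and nothing runs until a session executes it.
import tensorflow as tf

a = tf.constant([[1.0, 2.0],
                 [3.0, 4.0]])   # a 2x2 tensor
b = tf.matmul(a, a)             # an operation that takes a tensor and returns a tensor

with tf.Session() as sess:
    print(sess.run(b))          # runs the graph: [[ 7. 10.] [15. 22.]]
```

The graph on the slide is, of course, enormously bigger than this two-node example, but the principle is the same.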
00:10:42.160 Now, the data flow graph I've been showing represents Inception, specifically Inception version 3. You might ask, 'What is Inception?' Inception is a prebuilt data flow graph useful for categorizing images, originally trained on the ImageNet dataset. The ImageNet website is pretty interesting because you can use it to obtain human-classified images by category. There are tons of them available, so it's extremely useful, if you're trying to do any kind of machine learning, to download these sets of images and use them in your project for training purposes.
00:11:43.670 But how does all this tie together? I want to return to my original question: is it food? Why am I trying to answer this question? Well, as you may have guessed from my shirt, I work for a company called Cookpad, which has lots of data about food, specifically images. However, when you have a website that allows users to submit content, you run into some problems.
00:12:05.690 Users may not always be trying to submit food to your website, and that’s significant because if you don’t care about that, very bad things can happen. People can get upset quickly. Therefore, we want to protect our users by ensuring that the images posted are indeed food for various reasons. We don't want to show them something inappropriate. We also found that users sometimes like to add text to their images, saying things like 'You can find the real recipe over on this other competing website.' We want to prevent that, too.
00:12:57.960 So at Cookpad, we essentially have an app that looks at an image and classifies whether or not it's food. I thought it would be interesting to try and recreate what was already created and do it with a Rails app. Our main application is a Rails website, but the machine learning part is all done in Python. So I decided to build a Rails app, which was mistake number one. Actually, mistake number one was telling myself that I wouldn't use Python—I'd use Ruby because I'm giving this talk at RailsConf, and no one wants to hear the dirty word Python.
00:13:58.880 There’s a gem for handling this called TensorFlow.rb. I tried this a number of times but couldn't get it running on my machine. My guess is that there’s an issue with Clang, and this gets into compilers, which nobody enjoys debugging. You just want to download it and have it work, so that didn’t happen. One of the suggestions was to use Docker, which leads me to mistake number two: a flawed Docker setup. After setting up the Docker image, I tried compiling C and C++ programs necessary for my requirements, but they didn’t work. I kept retrying and encountering my favorite Docker problem: my startup disk was almost full.
00:14:45.660 After sifting through the documentation to delete all of my images, I decided to start over. I want to document my journey to success in solving this problem, with one small note: your mileage may vary. Let's go over the installation and setup. My starting point was to install Python, set up a virtual environment, and then install TensorFlow. I do want to note here, as a Rubyist, that I still want to embrace the Ruby community.
00:15:23.860 While I’m recommending Python, this does not mean we, as a Ruby community, shouldn’t also embrace machine learning and try to incorporate it. However, if you’re getting started, you don’t want to fight against your tools. Right now, the best toolset I’ve found is with Python. The community is really strong in machine learning, and they have put together the best tooling. I highly suggest not resisting that. So, I got this set up, and it just worked!
00:16:00.310 After installing TensorFlow, the next step was figuring out how to make it solve my specific problem. There is already a tool called Inception, which works on ImageNet images and classifies those, but I want to do something different: retraining. I want to change which image set it should identify to one that I care about. We can do this through a process called transfer learning.
00:16:51.440 If we look at this data flow graph, we have one step at the end. If we pull off that step, we can replace it with our own and benefit from all that was previously learned while applying it to our current data set. So if you want to work with Inception v3 and have it recognize your own image set, you need to collect a folder full of data. You can name the folder whatever you want; I called mine 'data' because I’m not very creative.
00:17:45.540 Inside that folder, you just need to decide your categories. Choosing categories is crucial; it won't suffice to only have a bunch of pictures of food. If all you’ve ever seen are food images, then all the model will know is food. Thus, identify things that are not food. For our case, we didn't particularly care about flowers, but those came with the example, so my company used them for some reason. We do care about avoiding pictures of people; we want to protect our users' privacy by removing any images of humans.
00:18:45.360 We also wanted to remove any images with text because of the issue mentioned before, where users wanted to say, 'Go watch the video for this recipe over here.' Thus, we created these categories and placed lots and lots of images inside those categories. For my training purposes, the folders I had contained between 1,000 to 2,000 images, except for text, which had around 600 images. The nice thing is that the images can be really small; TensorFlow operates on images sized 299 pixels by 299 pixels.
00:19:27.940 If you don’t have an image of that size, it will resize it automatically. This is an important consideration with your test data: if your subject isn't centered when it tries to resize, you might lose crucial information. So it makes sense to resize images beforehand to ensure that your subject is included properly. Now that I've collected all my data, the next step is retraining, and I'm no expert in machine learning. I didn't want to write this script myself, but luckily, there's already a script that does it.
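Since preparing images ahead of time matters, here is a minimal sketch of one way to square-crop and resize images to 299x299 with Pillow before training; the library choice, the center-crop strategy, and the file paths are my own assumptions rather than anything from the talk, and images whose subject sits off-center may need manual handling, which is exactly the speaker's point.

```python
# Hypothetical pre-processing step: center-crop each image to a square so
# resizing doesn't distort it, then scale to the 299x299 input size the
# Inception v3 graph expects. Paths and strategy are illustrative only.
from PIL import Image

def prepare_image(src_path, dst_path, size=299):
    img = Image.open(src_path)
    width, height = img.size
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    img = img.crop((left, top, left + side, top + side))  # square center crop
    img = img.resize((size, size), Image.LANCZOS)
    img.save(dst_path)

prepare_image("raw/sushi_pizza.jpg", "data/food/sushi_pizza.jpg")
```

With the images prepared, back to that ready-made retraining script.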
00:20:38.680 I pulled that script down from the TensorFlow repository on GitHub, changed some directories so that instead of outputting to temp, it wrote to my local directory, and then I ran the retrain program, telling it that the image directory was my data directory. Then I waited... and waited... and waited. While waiting, it output messages like 'Looking for images in all those directories' and started building bottlenecks. What is a bottleneck? I didn’t have any clue, so I looked it up.
00:21:42.440 A 'bottleneck' is an informal term for the output of the layer just before the final classification layer. It matters because, to train the network, every image gets passed through repeatedly to see how the model performs. You don't want each image to travel through the entire data flow graph every single time; instead, that result is computed once and cached for reuse. If you run this from scratch, you won't just see a tidy '100 bottleneck files created' message right away; it works through each file individually, reporting as the bottlenecks are created.
00:22:25.290 Next, you'll see outputs like train accuracy, cross-entropy, and validation accuracy, which piqued my interest. The retrain script does something interesting, splitting your data into three parts: training, validation, and testing. The training set is the data used to tune your model so it learns to recognize patterns. The validation set checks progress during training, and the testing set is held back until the end, so you can confirm that improvements in training accuracy also show up on data the model has never seen, rather than being the product of overfitting. The last piece, cross-entropy, serves as the loss metric.
00:23:03.590 Every setup requires some kind of loss metric to indicate what is undesirable, and you aim to minimize that value to improve your system. Running the program yields two output files: the graph and the labels. The graph file is complicated, a mix of text and binary encoding, and not particularly interesting to look at. The labels file, however, is simple: it lists the categories, so the numeric outputs 0 through 4 can be looked up to see what each one represents.
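Backing up to that cross-entropy number for a moment: as a rough illustration of what the loss metric measures, here is a minimal NumPy sketch (my own example, not code from the retrain script). The loss is small when the predicted probabilities put most of their weight on the correct label and large when they don't.

```python
# Hypothetical illustration of cross-entropy as a loss metric.
import numpy as np

def cross_entropy(predicted_probs, true_label_index):
    # Negative log of the probability assigned to the correct label;
    # lower is better, so training tries to minimize this value.
    return -np.log(predicted_probs[true_label_index])

confident_and_right = np.array([0.90, 0.05, 0.03, 0.02])  # say "food" is label 0
unsure = np.array([0.40, 0.30, 0.20, 0.10])

print(cross_entropy(confident_and_right, 0))  # ~0.105 (small loss)
print(cross_entropy(unsure, 0))               # ~0.916 (larger loss)
```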
00:23:53.780 Once I've retrained the network, I want to label images, and I found code online for that. I copied the label image script into my current directory, changed the graph and labels to point to my local directory, and it ended up being a much simpler process than the retrain; about fifty lines compared to around a thousand. This script only runs the image through the last layer that has been trained, and this process is quite quick.
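The labeling step is short enough to sketch end to end. Here is a minimal version in the TensorFlow 1.x style of the time; the file names and tensor names follow the defaults the retrain script used back then, but treat them as assumptions to verify against your own output rather than the exact code from the talk.

```python
# Minimal sketch of classifying one image with a retrained Inception v3 graph
# (TensorFlow 1.x era). File names and tensor names are assumptions based on
# the retrain script's defaults ('output_graph.pb', 'output_labels.txt',
# 'final_result:0', 'DecodeJpeg/contents:0').
import sys
import tensorflow as tf

image_path = sys.argv[1]

# Labels written out by the retrain step, one per line.
labels = [line.rstrip() for line in tf.gfile.GFile("output_labels.txt")]

# Load the retrained graph.
with tf.gfile.FastGFile("output_graph.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

with tf.Session() as sess:
    softmax = sess.graph.get_tensor_by_name("final_result:0")
    image_data = tf.gfile.FastGFile(image_path, "rb").read()
    predictions = sess.run(softmax, {"DecodeJpeg/contents:0": image_data})[0]

# Print the categories from most to least likely.
for i in predictions.argsort()[::-1]:
    print("%s: %.3f" % (labels[i], predictions[i]))
```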
00:24:57.910 We reach the point where it's time to test our labeling. I have a few test images: sushi pizza, a post box, a hamburger, a regular pizza, and some dirt. Let's determine which images are food and which aren't. To do this, I run the Python label image command followed by my image. Starting with sushi pizza, we see that it registered a 0.9 probability of being food, indicating a strong belief that it is indeed food. Next, we examine the dirt and discover it’s classified as 'other,' which is excellent since we don't want dirt to show up as food.
00:26:16.823 For the third test image, pizza, the model is less confident, registering only 84%, suggesting it is food but not with absolute certainty. This highlights an essential point: perfect accuracy is unattainable unless the model is trained solely on food. There's always a small margin of error, and chasing full accuracy often leads to overfitting, which hurts the model's ability to generalize. Having established that the system performs well, I now want to integrate this functionality with my Rails application via a web server.
00:27:45.930 So, how do I combine the Python code with Rails? As I mentioned before, we're already running this exact kind of system at my workplace. My suggestion is to treat the machine learning piece as a separate service and call it from the Rails application. To build that service, I decided to use Flask, which is quite similar to Sinatra. I just needed to install Flask and convert that label-image Python file into a small server capable of producing responses.
00:29:02.070 The final output looks quite similar to the label image Python file, now structured as a server with one route: 'classify.' This route returns predictions, allowing us to analyze not just the top prediction but the percentages to check for high probabilities of unwanted classifications. We flag images that fall into categories we are concerned about—they are then reviewed by a human if necessary. As a result, we've observed that we achieve about 98% accuracy on all images uploaded.
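To give a feel for the shape of that service, here is a minimal Flask sketch along the same lines; the route name matches the 'classify' route mentioned above, but the upload handling, the JSON shape, the label set, and the classify_image helper (imagined as a function extracted from the label-image sketch earlier) are my assumptions about the structure, not the actual Cookpad code.

```python
# Minimal sketch of a Flask classification service (assumed structure).
from flask import Flask, jsonify, request

app = Flask(__name__)

def classify_image(image_bytes):
    # Placeholder: in a real service this would feed image_bytes through the
    # retrained TensorFlow graph (as in the earlier label-image sketch) and
    # return a {label: probability} dict. The label set here is illustrative.
    return {"food": 0.0, "flowers": 0.0, "people": 0.0, "text": 0.0, "other": 1.0}

@app.route("/classify", methods=["POST"])
def classify():
    image_bytes = request.files["image"].read()
    predictions = classify_image(image_bytes)
    return jsonify(predictions)

if __name__ == "__main__":
    app.run(port=5000)
```

The Rails side then only has to POST an image to this route and inspect the returned percentages, flagging anything that scores high in a category of concern for human review, as described above.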
00:29:59.920 I built a small Rails application that takes the images and identifies whether or not they are food by calling the service we created. Let’s explore this now. I can choose a file, starting off with sushi pizza. It confirms it’s food! Let's try something more complex—taco baby. It correctly identifies it as not food. Excellent! We can apply this process to any number of images, which is essentially all that it took: downloading a few scripts, collecting images, running the code, and figuring out how to make it work.
00:30:45.819 When I began this journey, I was adamant about not using Python—let that be a lesson! I wasted an entire day battling with Docker command lines, attempting to find the right flags for Clang, and eventually switched to the Python route, realizing that it was far simpler. The takeaway here is to understand your objectives and not fight against the current—if you’re learning, make the process easier. Once you have learned, then you can explore different paths if you wish.
00:31:59.460 I also want to recommend additional reading material related to my talk. I highly recommend 'Demystifying Deep Neural Nets' by Rosie Campbell. It offers a great in-depth perspective on the technical aspects we touched on. Additionally, for those interested in a book, I suggest 'Python Machine Learning' by Sebastian Raschka. So, that wraps up my talk!
00:32:45.560 Now, let’s assess the probability of taco baby. It registers as 'other,' potentially resembling a person with about 30% certainty—I’m not entirely sure. As for the question about how many images Cookpad classifies daily, I don’t have that exact number, but it is quite a few. When we classify text, we deal primarily with Arabic-speaking countries, as that has posed our most significant challenges, leading to a lot of Arabic text examples with different fonts and overlays.
00:33:52.130 As for the Google Cloud Vision API, I have not personally used it. I’m not on the machine learning team, so my exposure has been limited to trickle-down information. However, I’m becoming more intrigued by Python as I explore avenues to learn more about machine learning. That being said, if there's a desire to implement something similar in Ruby, it will require the community to take action, collaborating on getting TensorFlow.rb compiling on all systems.
00:34:56.410 The main developer primarily uses Linux, which means it's not being designed with OSX support in mind. The community's involvement in understanding TensorFlow, C++ coding, and Ruby extensions would be vital in making this a reality. Regarding image classification performance, it generally doesn't take too long per image. Even with extensive training data, classification occurs relatively quickly.
00:35:12.840 Conducting this process asynchronously would be advisable, as it allows for much smoother operations. Now, as you collect images, it's essential to recognize that there's no exact number of images to meet your needs—it's an imperfect science. Therefore, you should aim to gather as many images as you can, while also ensuring that you have balanced data of both food and non-food items.
00:35:30.060 As you notice shortcomings in your classification, such as wanting it to identify certain items but failing, getting more examples relevant to that case is crucial. Once you have your system stable, it’s not often necessary to retrain. If a significant issue arises, or if a new form of user abuse is identified, then you can revisit training with additional data to handle those problems effectively.
00:35:58.800 I appreciate your time, everyone!