wroc_love.rb 2017

Machine Learning to the Rescue

00:00:11.570 Thank you, thank you! That was a very nice introduction. Hello! My name is Mariusz Gil, and today I would like to talk a little bit about machine learning. I would like to share some concepts about what machine learning is and where we can use these techniques in our applications. To be honest, don't expect a comprehensive introduction to ML, because there are very nice resources available like Coursera or Udacity, where you can gain knowledge about the algorithms and the basic concepts of machine learning.
00:00:18.180 This talk will focus on certain problems we discovered in some of our past projects. I have to be honest, I'm not much of a Ruby guy; I wrote my last application in Ruby about three years ago using the CodeIgniter framework. So, remember the CodeIgniter framework? Yes, that was my last big application written in Ruby. However, I love Ruby for its syntax, and sometimes I still use it. I am primarily a PHP guy, and I want to acknowledge the PHP community. I'm one of the members of the PHP community in Europe, and we are trying to adopt and promote good practices for developing applications.
00:01:03.259 On a daily basis, I run my own company called Source Ministry. It is a very small, one-person company; however, one thing is very important to me: I get to do what I love every day. For example, today I spent the day brainstorming with a team about an interesting domain. This is what I do every day; I try to help people and companies get better. I work with very different clients, from huge portals and large applications to very small ones, and there are a number of problems from that work I would like to share.
00:02:10.590 I would like to start with one very interesting project that got me thinking about using machine learning. One of my clients, a friend of mine, runs a very successful and popular website in Poland, a lifestyle portal focused on women. One day, my friend realized there was a problem: the trends in Google Analytics were not looking good, traffic was down, and users were starting to disappear. If there is no traffic and no users, there is no money at the end of the day. My friend began investigating what was going on, and after some research, they discovered that a competitor had most likely created a lot of bad backlinks to their website, which resulted in a penalty from Google.
00:03:23.640 Some of the pages were removed from the search index by Google, and if you're not in Google's index, you're not in the game. So, my friend contacted Google for help in solving this problem, and the response was very straightforward: they needed to submit a disavow file. In this file they had to list all the backlinks they wanted Google to ignore. The challenge was that the site was very popular, and the total number of backlinks was several million.
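For context, a disavow file is just a plain text list of URLs or whole domains, one per line, with comments starting with a hash sign. The entries below are made-up examples, not from the actual project:

```
# Spammy directories created only for link building
domain:spammy-links.example
domain:cheap-backlinks.example

# Individual bad pages
http://some-forum.example/thread/12345#post-9
```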
00:04:10.170 Just imagine how much time it would take to classify these backlinks by hand! So the backlinks had to be analyzed and classified automatically. They had a single file containing about a million backlinks, and it had to be split into three groups. The first group was to contain only good links, links from reputable websites; for instance, a backlink from a news portal with a PageRank of eight or nine would be very beneficial. The second group consisted of links that needed to be removed, including many websites created by spammers solely for the purpose of building backlinks. We identified several pages with tens of thousands of links that had no real value.
00:05:31.830 The third group was a more ambiguous category where we didn't really care whether the link stayed or not. This example might seem artificial, but there was a similar, very real business case with a company called Knockout. Some of you might remember when Knockout was punished by Google for spammy links, just like my client's website. At the time, Knockout was in the process of an initial public offering (IPO), and when this problem arose, the IPO was canceled. That shows how critical it is to manage your backlinks carefully.
00:06:18.870 So, let's imagine we have a single file containing one million links. The task at hand was to classify these links, which meant transforming each URL into numerical representations. The URL itself is just a string; you would need to examine the content of the page to understand its context. As a human, you could open the browser, check what is on the page, and classify it, but how do you do that for one million URLs? It's a complicated job, so our client thought about transforming these URLs into sets of numbers – vectors, if you will. For example, we could determine the age of the domain, the number of outbound links on every page, and more data points to feed into a CSV file.
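To make that idea concrete, here is a minimal sketch of turning each URL into a row of numbers written to a CSV file. The feature names are invented for illustration, and the "real" features such as domain age or outbound link count are stubbed out, since in practice they would come from crawling the page or calling external services:

```python
import csv
from urllib.parse import urlparse

def extract_features(url):
    # Hypothetical features for one backlink URL.
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "path_depth": len([p for p in parsed.path.split("/") if p]),
        "is_https": int(parsed.scheme == "https"),
        "domain_age_days": 0,   # stub: would come from a WHOIS lookup
        "outbound_links": 0,    # stub: would come from fetching the page
    }

urls = ["https://example.com/articles/health",
        "http://spammy-site.example/links.html"]

with open("backlinks.csv", "w", newline="") as f:
    fieldnames = ["url"] + list(extract_features(urls[0]).keys())
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for url in urls:
        writer.writerow({"url": url, **extract_features(url)})
```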
00:07:50.440 Our first approach was somewhat rudimentary: we wrote some ugly proof-of-concept code to see if this method might work. We developed a PHP application using RabbitMQ and later tested Kafka, as RabbitMQ wasn't handling the data volume well. The application fetched all the URLs, checked them, and generated various metrics. We sent a lot of files to our client. However, what do you think was the client's response? They told us they didn't know how to find a formula based on the collected data. With eight columns of data, finding correlations might be manageable, but if you have dozens of columns, the correlations can become much more complex. Our client decided to cancel the project because we couldn't identify the formula we needed.
00:09:24.970 After spending several days on it, they felt it was too much to handle. But we wanted to help our friend, so we convinced them to let us try another approach. We weren't experts in machine learning, but we suggested we take a couple of weeks to explore the possibilities further. We had a collection of data: we had CSV files and vectors of numbers for every URL, but we needed to classify this information. So, our next goal was to perform machine learning tasks to get meaningful results. However, our second attempt ended up failing for a couple of reasons.
00:10:53.620 First, trying to merge Python and PHP together didn’t turn out well, and secondly, we were testing our methods on data that didn't match our objectives. Both of those aspects led to failure. The moral of the story is that you ought to know what you're doing. Acting without understanding will lead to failure, without a doubt. If you try to use any framework or library without grasping the underlying concepts, you may succeed initially, but soon enough, you’ll come across problems.
00:12:06.280 After some reflection, we decided to try a third approach: a data-oriented machine learning workflow. Eventually, this application was introduced into a production system with an impressive 95 percent accuracy, which was quite remarkable. So, to summarize, machine learning is fundamentally about a system that learns from experience. Experience is critical: it is the data the algorithm learns from, and the system gets better at its task as it sees more of it.
00:12:55.040 You have data and a machine learning task, and you need to prepare your data adequately for that job. Then you can run learning algorithms and validate the resulting models. Afterward, you can use the models to get results, but measuring performance is crucial; if you do not evaluate your model, you will run into problems. The way I see machine learning, it is about understanding the task you want to solve in its specific context.
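A minimal sketch of that whole loop with scikit-learn might look like this; synthetic data stands in for the prepared URL vectors and their labels, and the point is simply that the model is scored on data it has never seen:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data standing in for the prepared feature vectors and labels.
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)

# Hold some data back so performance is measured on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```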
00:14:22.290 Machine learning can solve various tasks, whether it’s classifying objects into categories or predicting continuous values. For instance, in our project, we aimed to classify every single URL into one of three groups. If we have developers in the room, perhaps some of you are juniors and others seniors; it's possible to create classification models to categorize developers based on experience levels. Additionally, we could predict salary based on the number of years worked, skills, and other variables.
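As an illustration of that developer example, a tiny classifier could look like the sketch below; the data is completely made up, and the feature choice (years of experience, number of known technologies) is only an assumption for the sake of the example:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy, invented data: [years_of_experience, known_technologies]
X = [[1, 2], [2, 3], [3, 4], [5, 6], [7, 8], [10, 9]]
y = ["junior", "junior", "mid", "mid", "senior", "senior"]

clf = DecisionTreeClassifier()
clf.fit(X, y)

# Classify a new developer profile into one of the three levels.
print(clf.predict([[4, 5]]))
```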
00:15:27.100 Machine learning can also be employed for clustering data automatically or dimensionality reduction—compressing 100-dimensional data into two-dimensional plots. You might also take advantage of association rule mining, as demonstrated by analyzing the Titanic tragedy dataset to explore which factors influenced survival rates.
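For the clustering and dimensionality-reduction tasks, a short sketch with synthetic 100-dimensional data (again, not the talk's real data) could be:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic 100-dimensional feature vectors.
X, _ = make_blobs(n_samples=300, n_features=100, centers=4, random_state=0)

# Dimensionality reduction: compress 100 dimensions down to 2 for plotting.
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering: group the points without any labels.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print(X_2d.shape, labels[:10])
```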
00:16:00.490 Understanding the tasks you can solve with machine learning is essential. Different techniques can be applied based on your requirements. For example, supervised learning involves providing input and expected output to a model to learn a function. In contrast, unsupervised learning allows clustering data without labeled outputs. Reinforcement learning requires integrating the models into an environment to receive feedback and learn from results.
00:17:12.130 In my experience, I worked as a CTO at a company focusing on real-time content recommendation platforms. We utilized machine learning to predict the probability of user interactions with links. For example, with millions of users and links, constructing accurate classification models is crucial for maximizing revenue, as each misjudgment can lead to lost income.
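The key detail there is predicting a probability rather than a hard yes/no answer. A hedged sketch of that idea, with synthetic data in place of real user and link features, might be:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for (user, link) feature vectors with click / no-click labels.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.9, 0.1], random_state=1)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Use the predicted click probability to rank which links to show.
click_probabilities = model.predict_proba(X[:5])[:, 1]
print(click_probabilities)
```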
00:18:52.890 A few days ago, while in Canada, I listened to an interesting talk about event sourcing and its application in code reviews. In such applications, the developers need to estimate the cost of reviewing a particular pull request. The input data could include the URL and the number of files or lines submitted for review, which we can draw upon from our historical data, allowing us to form predictions.
00:20:46.330 The script I worked on is very simple, perhaps just a demonstration. It determines the cost of code reviews based on how many files and lines are included in the request. We can slightly modify data inputs to validate different scenarios. I found many libraries for machine learning; however, most are in Java, Ruby, or Python.
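The original script is not reproduced here, but a minimal sketch of the same idea, with invented historical numbers, could be a simple linear regression on files and lines changed:

```python
from sklearn.linear_model import LinearRegression

# Made-up historical data: [files_changed, lines_changed] per pull request,
# and the review effort in minutes that it actually took.
X = [[1, 10], [2, 50], [3, 120], [5, 300], [8, 700], [12, 1500]]
y = [10, 20, 35, 60, 120, 240]

model = LinearRegression().fit(X, y)

# Estimate the review cost of a new pull request with 4 files and 200 lines.
print(model.predict([[4, 200]]))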
00:22:12.129 If you’re a Python developer, libraries like Scikit-learn are excellent for creating and validating models. This library implements many modern algorithms, facilitating the creation of regression and clustering models. However, understanding your data is paramount; without that knowledge, you will encounter many failures.
00:23:39.240 Before diving into machine learning, it's essential to understand the characteristics of your data. For instance, if you feed multiple features into your model, it's vital to identify which of them are actually relevant. When refining a model, a large number of irrelevant features mostly adds noise and confusion.
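One common way to check feature relevance, shown here as a sketch on synthetic data where only a few features carry signal, is to rank them by importance with a tree-based model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data where only a few of the 20 features are informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=4, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features by importance to see which ones are worth keeping.
ranked = sorted(enumerate(forest.feature_importances_), key=lambda f: f[1], reverse=True)
for index, importance in ranked[:5]:
    print(f"feature {index}: {importance:.3f}")
```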
00:24:50.156 Once you identify which features to keep, the next step is to prepare the data for machine learning. You also need to tune the parameters specific to your algorithms and set aside enough data to validate your models. Data preparation is not often emphasized, but it is one of the most critical steps for success.
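A compact way to combine those two steps, sketched here with an assumed scaler-plus-SVM pipeline and a small, arbitrary parameter grid, is cross-validated grid search:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Scale the features, then search over algorithm-specific parameters
# using cross-validation.
pipeline = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```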
00:25:48.184 For machine learning to be effective in your projects, you need to stay aware of how the features you provide influence performance. Many different algorithms can be useful, and understanding how they work is essential for accurate classification and prediction.
00:26:44.450 Lastly, remember that in most applications you won't need to write machine learning algorithms from scratch; you will use existing, well-tested implementations. The question then becomes: which algorithm best matches the problem you are trying to solve? Whichever algorithms you work with, make sure your data is representative and its characteristics fit what the algorithm expects.
00:27:39.160 As noted previously, if your model is misaligned with the problem, it will produce incorrect predictions. Evaluate several algorithms against each other to understand their strengths and weaknesses on your data. If you can't reach a solid result, go back and reexamine how your data was collected and how good it is.
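A simple, hedged sketch of that comparison: run a few candidate algorithms through cross-validation on the same (here synthetic) data before committing to one of them:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=12, random_state=0)

# Compare candidate algorithms with cross-validation on the same data.
candidates = [("logistic regression", LogisticRegression(max_iter=1000)),
              ("random forest", RandomForestClassifier(n_estimators=100))]

for name, model in candidates:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```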
00:28:51.559 Thank you very much for joining the session! I am looking forward to your questions, as well as feedback on my earlier project examples regarding effective classification. Your insights on how to measure performance or identify better features for predictions will be invaluable. Let’s start a discussion!
00:30:27.370 I'll be around during the afterparty if anyone wants to continue this conversation. Thank you!