00:00:13.880
I have several goals for this presentation. One is to talk a little bit about the general trends in machine learning and hopefully get you excited about some of the things happening in this field. The second goal is to communicate that even though there is a lot of academic math involved in the literature of this space, it's actually very easy to get started.
00:00:30.279
A lot of it relies on very simple core insights that we can all use. Lastly, I hope to inspire you to go out and explore these ideas further.
00:00:50.640
So, first of all, what comes to mind when you hear 'machine learning'? Terminator? AI? Chess-playing programs? Roomba?
00:01:02.760
When I ask this question, a lot of people picture complicated mathematical formulas, with linear algebra, calculus, and optimization theory all mashed into one field. That picture isn't wrong, but I think it unnecessarily complicates things.
00:01:31.240
The reason for this is that any course in AI or machine learning, whether at a university or self-taught, often focuses on the algorithm itself. We generally have inputs, a runtime (which may be your CPU, GPU, or whatever), and an algorithm—essentially the core insight of how your machine will learn.
00:01:56.840
Learning in general is hard—this is something we all know intrinsically. Here, you're attempting to teach a machine to learn something, which adds complexity. I believe we should expand this idea for several reasons, and I'll explain why.
00:02:22.640
First, consider the runtime. My experience with machine learning was very theoretical. I took various courses and explored the field independently, only to find that academics often treat the runtime as merely a practical constraint.
00:02:48.000
A survey of the machine learning faculty at most universities will typically show a heavy presence of statisticians and optimization theorists focused on mathematical proofs.
00:03:00.680
While this is interesting, it often overlooks the runtime. For them, the runtime is just a constraint—someone will eventually build a machine that can run it, regardless of whether it requires a terabyte of memory. However, I came to realize that I couldn't run my supposedly great recommendation algorithm on anything but the most trivial datasets.
00:03:39.799
I discovered that my local computer science department housed a fabled machine with 40 terabytes of memory and lots of CPUs. I pursued access to it, spending two months negotiating until I successfully logged in one day, only to find that I only had access to 768 megabytes of memory.
00:04:09.879
What I had wasn't the supercomputer I imagined, but a commodity cluster of 50 machines. I quickly learned that distributed systems come with practical constraints of their own, which led me to explore the challenges they present.
00:04:39.560
Today's cloud computing platforms make a significant difference, but there is still surprisingly little research on these practical aspects. In fact, your knowledge of distributed systems puts you ahead of most academic research.
00:04:55.640
Next is the data input. Until recently, data input has been scarce. For instance, in natural language processing, researchers often worked with datasets of millions of files. Today, however, we have trillions of pages on the web to work with.
00:05:58.440
The Ruby community plays a crucial role in generating terabytes of unstructured and structured data. Yet, we face a conundrum: we have the capability to process huge datasets, but we're also collecting data at an unprecedented rate. So, what does that mean for us?
00:06:44.840
This brings us to an interesting paper published in the early 2000s by a group of researchers from Microsoft titled 'More Input vs. Better Algorithms.' They wanted to determine how throwing orders of magnitude more data at existing algorithms would impact performance.
00:07:58.600
What they found was fascinating; as you increased the input size, the performance of all algorithms improved. For instance, one algorithm, referred to as learner 5, started at about 78% accuracy and performed even better with more data. This phenomenon eventually led to what's termed 'Data-Driven Learning,' emphasizing the importance of having access to large datasets.
00:08:57.840
You can even go so far as to make your data the algorithm itself. Consider a simple example: take a sentence and remove all the spaces. Anyone reading the resulting jumble would struggle to tell how many distinct words it contains. This 'word segmentation' problem is tricky, and it remains an ongoing challenge for many in the field.
00:10:28.200
How do we go about solving this problem? We could build a grammar model, grab a toolkit that already provides a solution, or make an educated guess. The last option could involve estimating the probability of each candidate segmentation based on how frequently its words appear in known data.
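As a sketch of that educated-guess approach, here is a naive unigram segmenter in Ruby. The word counts below are invented for illustration; a real system would estimate them from a large corpus.

```ruby
# Naive probabilistic word segmentation. COUNTS is a hypothetical
# unigram frequency table; in practice you would estimate it from a corpus.
COUNTS = { "now" => 50, "is" => 90, "the" => 120, "time" => 40 }
TOTAL  = COUNTS.values.sum.to_f

def word_prob(word)
  (COUNTS[word] || 0.1) / TOTAL # tiny smoothed probability for unknown words
end

# Returns [words, probability] for the most likely segmentation,
# memoized so the recursion runs in polynomial time.
def segment(text, memo = {})
  return [[], 1.0] if text.empty?
  memo[text] ||= (1..text.length).map { |i|
    rest, p = segment(text[i..-1], memo)
    [[text[0, i]] + rest, word_prob(text[0, i]) * p]
  }.max_by { |_, p| p }
end

words, _prob = segment("nowisthetime")
# words => ["now", "is", "the", "time"]
```

The segmenter simply tries every split point and keeps the one whose words are most probable under the frequency table, which is exactly the "guess from data" idea above.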
00:11:38.919
For example, you could query Google for candidate words and use their hit counts as frequency estimates across the web. Employing such a simple, data-driven approach can yield remarkably accurate results, akin to what Google does with its translation services.
00:12:50.679
This concept demonstrates how algorithms in machine learning often revolve around simple insights. One example of this is learning through compression. Compression, which is essentially about identifying significant concepts within data, can serve as a useful metaphor for understanding machine learning.
00:14:37.880
Let's say you're tasked with predicting whether a certain fruit is tasty. You might hypothesize that the feel and color of the fruit are determining factors. By gathering data and plotting it, you can visually separate the 'tasty' fruits from the 'not tasty' ones using a decision boundary. This is the perceptron algorithm, a basic model for classification in machine learning.
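A minimal perceptron is only a few lines of Ruby. The fruit data, feature scales, and learning rate below are all made up for illustration:

```ruby
# Tiny perceptron: learns a linear boundary over [feel, color] features.
# The fruit examples below are invented for illustration.
def train_perceptron(examples, epochs: 50, lr: 0.1)
  w = [0.0, 0.0]
  b = 0.0
  epochs.times do
    examples.each do |features, label|
      score = w[0] * features[0] + w[1] * features[1] + b
      prediction = score >= 0 ? 1 : -1
      next if prediction == label
      # Nudge the boundary toward the misclassified point
      w[0] += lr * label * features[0]
      w[1] += lr * label * features[1]
      b    += lr * label
    end
  end
  ->(features) { (w[0] * features[0] + w[1] * features[1] + b) >= 0 ? 1 : -1 }
end

tasty     = [[0.9, 0.8], [0.8, 0.9]].map { |f| [f,  1] }
not_tasty = [[0.2, 0.1], [0.1, 0.3]].map { |f| [f, -1] }
classify  = train_perceptron(tasty + not_tasty)
classify.call([0.85, 0.85]) # => 1 (tasty)
```

The core insight really is that simple: whenever the current boundary gets an example wrong, shift it a little toward that example and repeat.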
00:16:21.320
However, perceptrons have limitations and don't work for all datasets. Consider a scenario where you're only working with color as your dataset—you simply cannot draw a linear boundary that effectively separates two classes in that case. The key insight is that you can expand your data space into more dimensions, allowing for a clearer boundary.
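To make that insight concrete, here is a toy Ruby example. With a single 'color' feature, no one threshold separates the classes, but adding a squared-color dimension makes them separable; the data and threshold are invented for illustration:

```ruby
# One-dimensional "color" scores: tasty fruit sit in the middle of the
# range, so no single threshold on color separates the two classes.
data = [[-2, -1], [-1, 1], [1, 1], [2, -1]] # [color, label]

# Expand into a new dimension: color squared. Now tasty (+1) examples
# have small values and not-tasty (-1) examples have large ones.
expanded = data.map { |color, label| [color * color, label] }

threshold = 2.5
predictions = expanded.map { |sq, _| sq < threshold ? 1 : -1 }
labels      = expanded.map { |_, label| label }
predictions == labels # => true
```

Nothing about the data changed; we only looked at it in a higher-dimensional space where a simple boundary suddenly works.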
00:17:36.399
This realization directly connects to support vector machines, where data is thrown into n-dimensional space, helping separate positive from negative examples. Thankfully, Ruby developers have resources at their disposal, such as libraries like libsvm, which is well supported and well suited to straightforward tasks like spam classification.
00:18:41.200
Moving on to another significantly impactful area in machine learning, we have recommendation systems. In essence, you have users and objects (like movies) that get ranked by users. By leveraging linear algebra concepts, we can analyze what a specific user might like based on their preferences, enabling us to predict rankings for unrated objects.
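A minimal sketch of that idea in plain Ruby: measure how similar two users' rating vectors are (cosine similarity is one common choice), then predict an unrated movie from the most similar user who has seen it. The users, movies, and ratings are invented for illustration:

```ruby
# Toy nearest-neighbour recommender. All ratings are invented.
RATINGS = {
  "alice" => { "matrix" => 5, "titanic" => 1, "inception" => 4 },
  "bob"   => { "matrix" => 4, "titanic" => 2, "inception" => 5 },
  "carol" => { "matrix" => 1, "titanic" => 5 } # hasn't seen "inception"
}

# Cosine similarity between two users' rating vectors
def similarity(a, b)
  shared = a.keys & b.keys
  return 0.0 if shared.empty?
  dot  = shared.sum { |m| a[m] * b[m] }
  norm = ->(r) { Math.sqrt(r.values.sum { |v| v * v }) }
  dot / (norm.call(a) * norm.call(b))
end

# Predict a user's rating as the rating given by their most similar
# neighbour who has actually rated the movie.
def predict(user, movie)
  neighbour = RATINGS.keys
    .reject { |u| u == user || !RATINGS[u].key?(movie) }
    .max_by { |u| similarity(RATINGS[user], RATINGS[u]) }
  RATINGS[neighbour][movie]
end

predict("carol", "inception") # => 5 (bob is carol's closest neighbour)
```

Real systems are far more sophisticated, but the linear-algebra core, comparing rows of a user-by-movie matrix, is exactly this.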
00:20:53.600
Matrix decomposition can help us do this. Consider an image, which represents a large matrix of pixel values. With methods like singular value decomposition (SVD), we can effectively compress this information while still roughly preserving the important characteristics of the image. This is a basic yet powerful approach used in computer vision.
00:22:27.440
Similarly, we can apply this concept to recommendation systems by finding significant features within the data. By running SVD on a matrix derived from user-object interactions, you can discover essential correlations that help improve recommendations.
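A hedged sketch of the simplest piece of SVD, using only Ruby's stdlib `matrix`: power iteration recovers the best rank-1 approximation of a matrix, i.e. its single strongest "feature". The ratings matrix is invented and deliberately rank one so the approximation reproduces it:

```ruby
require 'matrix'

# Best rank-1 approximation of a matrix via power iteration on A'A.
# This is the simplest building block of a full SVD.
def rank1_approx(a, iterations = 100)
  v = Vector.elements(Array.new(a.column_count, 1.0))
  ata = a.transpose * a
  iterations.times { v = (ata * v).normalize }
  sigma = (a * v).norm                 # leading singular value
  u = (a * v) / sigma                  # leading left singular vector
  Matrix.build(a.row_count, a.column_count) { |i, j| sigma * u[i] * v[j] }
end

# Hypothetical user-by-movie ratings driven by one "taste" factor
ratings = Matrix[[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
approx  = rank1_approx(ratings)
# approx closely reproduces ratings, since ratings is genuinely rank one
```

Keeping only the top few singular values instead of one is what compresses an image, or surfaces the dominant taste dimensions in a ratings matrix.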
00:23:38.760
Next, we have clustering, an essential machine learning task. Just like humans can visually discern clusters among data points, we can devise algorithms to do just that. The challenge is defining the 'similarity' between data points, as it greatly influences how clusters are formed.
00:24:48.880
For simple strings, you can immediately see similarities, as they share common characters. The challenge lies in determining how to quantify that similarity. By employing compression techniques, you can compare two strings to identify shared similarities based on size reductions resulting from compression.
00:26:07.360
Once you derive a similarity score based on compression, you can effectively cluster huge datasets by measuring degrees of similarity. This approach is powerful because, unlike domain-specific methods, it requires no prior knowledge of the data.
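One concrete way to get such a score is the normalized compression distance, sketched here with Ruby's stdlib Zlib; the sample strings are invented for illustration:

```ruby
require 'zlib'

# Normalized compression distance: strings that share structure compress
# better together than apart. Near 0 = very similar, near 1 = unrelated.
def ncd(a, b)
  ca  = Zlib.deflate(a).bytesize
  cb  = Zlib.deflate(b).bytesize
  cab = Zlib.deflate(a + b).bytesize
  (cab - [ca, cb].min).to_f / [ca, cb].max
end

ruby_a = "def hello; puts 'hello world'; end " * 20
ruby_b = "def goodbye; puts 'goodbye world'; end " * 20
rng    = Random.new(42)
noise  = Array.new(700) { rng.rand(36).to_s(36) }.join

ncd(ruby_a, ruby_b) < ncd(ruby_a, noise) # => true
```

Notice that nothing here knows these strings are Ruby code; the compressor discovers the shared structure on its own, which is exactly why the technique needs no domain knowledge.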
00:27:50.160
In summary, we've discussed runtime, data input, and algorithm performance. Many algorithms may appear complex but have simple core insights. Additionally, ensemble methods demonstrate the power of combining multiple simple models for solving complex problems.
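The ensemble idea can be sketched in a few lines of Ruby: several weak classifiers, each only slightly informative, combined by majority vote. The learners and data are invented for illustration:

```ruby
# Toy ensemble: three weak classifiers combined by majority vote.
WEAK_LEARNERS = [
  ->(x) { x[0] > 0.5 ? 1 : -1 },          # looks only at feature 0
  ->(x) { x[1] > 0.5 ? 1 : -1 },          # looks only at feature 1
  ->(x) { x[0] + x[1] > 1.0 ? 1 : -1 }    # looks at their sum
]

def ensemble_predict(x)
  votes = WEAK_LEARNERS.sum { |learner| learner.call(x) }
  votes > 0 ? 1 : -1
end

ensemble_predict([0.9, 0.05]) # => -1 (two of the three learners vote -1)
```

No single learner is reliable on its own, but as long as their mistakes aren't perfectly correlated, the vote tends to be better than any individual member.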
00:29:24.400
This strategy is exemplified by competitions like the Netflix Prize, where the leading teams blended many models together to achieve their improvements. My final note is to encourage everyone to leverage data-driven learning, pay attention to the runtime, and apply ensemble methods for better performance. Thank you all for your attention!
00:30:53.000
Unfortunately, we don’t have time for questions now, but I encourage you to catch me during the break. Thank you once again!