00:00:06.080
Laura is going to come up now. Fun fact about Laura: she started out studying social and cultural anthropology, and then, as so often happens, she ended up in tech because, well, one needs to pay rent somehow, and rent is getting expensive even in Berlin. So, let's give it up for our final talk of the day! Thank you.
00:01:04.799
Thanks! I haven't even said anything of importance yet. My name is Laura, and I've been studying various topics for the past 11 years; I'm still not done. I've also helped organize Rails Girls Berlin for quite a while. If you're involved in Rails Girls or a similar project, whether as a learner, a coach, or an organizer, please raise your hand. Awesome! That's a lot of you, and thank you for keeping this community growing; it's really important. Give yourselves a round of applause!
00:01:50.399
Now, let's get this ship started. In 1998, Lauryn Hill released her debut album, 'The Miseducation of Lauryn Hill', which has a connection to today's title. Her title references earlier works, most famously Carter G. Woodson's 'The Mis-Education of the Negro', that point out how the American educational system indoctrinated Black communities with white supremacy instead of teaching them about Black history and the Black present. The question of who educates and who produces knowledge is vital: it addresses who holds power over a seemingly universal truth.
00:02:31.840
The Miseducation of Lauryn Hill is still one of my favorite albums today. This title will guide us through the next 30 minutes, or maybe just 25 if I forget to breathe between sentences. I want you to keep in mind how not just humans but also machines gain and produce knowledge, that is, how machines learn. I promise to leave out all the math, so you should all be able to follow.
00:03:05.040
So, let's take one more step back in time, to 1950, when Alan Turing published his paper 'Computing Machinery and Intelligence'. He asked whether machines can think. Defining 'thinking' turned out to be really hard, so Turing tweaked the question: can machines imitate thinking? This is what you might know as the imitation game, in which we test whether a human can tell the difference between interacting with another human and interacting with a machine. While thinking is difficult to define, defining learning seems a bit easier.
00:03:50.400
For the purposes of this talk, I define learning as a process with three steps. The first step is reproducing knowledge, similar to memorizing vocabulary when learning a new language, or a programming language. The second step is remixing knowledge: creating something new out of existing knowledge. And the final step, the most interesting part of acquiring knowledge, is reflection. Here we learn to understand the limits of our knowledge: what it can do, what it cannot do, and what it should do.
00:05:14.320
Now, let's discuss three key elements of machine learning: data, algorithms, and, perhaps surprisingly, humans. Data on its own can reproduce knowledge, and algorithms can remix it, but humans are necessary for the reflection part. So now you know what this talk is all about. Before we dive deeper, I want to give a quick content warning: in the last part of this talk, I'll be discussing hate speech that includes racist and sexual violence. I will let you know when it's time to look away.
00:05:59.760
Let's begin with the reproduction part of machine learning, or of knowledge more generally. Machine learning is quite good at reproducing what we as a human collective already know: it can identify spam messages, make music recommendations, or translate from one language to another. However, machines can also learn the wrong things. For example, Caroline Sinders once described how her music recommendations became skewed after a phase of intensely listening to breakup music, which caused Spotify to suggest Mumford & Sons constantly. To fix this, she ultimately had to delete her account, create a new one, and avoid that band.
00:08:12.240
Beyond these humorous examples, there are more serious cases where reproducing human behavior is harmful. Machines readily learn our biases, as shown by Amazon's experimental recruiting tool, which learned to systematically downgrade applications from women, or by risk-assessment tools used in the justice system. The artist Hito Steyerl suggests that those who own past data 'colonize the future': machine learning relies on historical data to predict future outcomes. Consequently, data alone cannot get us beyond the reproduction level of learning.
00:09:20.240
Now, let's touch on the remixing part of learning, though I don't mean scratching, DJ-style. Before I continue my talk, I want to verify that you are human. Please select all the images with a bike. Since I have no proper interface for this, let's make a fun agreement. Everyone knows what jazz hands are, right? Let's use them! When I show you an image that contains a bike, give me your best jazz hands; if it doesn't, just stay still. Let's begin: does this image contain a bike? How about this one? Did someone bring their bike and hide it in this image? Now, let's check this last picture; that's Rotterdam architecture you should go and see.
00:11:22.159
Congratulations! You're not robots. However, since I was the one asking the questions, we still don't know whether I'm a machine. Let's assume I am. Now that you've classified some bikes for me, I will learn how to classify bikes myself. First, I simplify the image, because I'm not the most powerful machine: I lower the resolution and drop the colors until what's left is enough to identify parts by basic shapes, like horizontal and vertical lines. Humans spot these shapes effortlessly, but I, as a machine, have to compare shapes systematically across the entire image.
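To make that simplification step concrete, here is a minimal sketch in Python using the Pillow imaging library; the file name and target size are illustrative assumptions, not the speaker's actual pipeline.

```python
# Minimal preprocessing sketch: shrink the image and drop its colors so that
# only coarse shapes remain. File name and size are illustrative assumptions.
from PIL import Image

img = Image.open("bike.jpg")     # hypothetical input photo
img = img.convert("L")           # drop colors: reduce to grayscale
img = img.resize((28, 28))       # lower the resolution drastically
pixels = list(img.getdata())     # 784 brightness values in the range 0-255
```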
00:12:37.759
What I do is take one shape and slide it across the image, comparing it against every region in turn. Wherever it matches, I record that in a new image, a so-called feature map. Layering this extraction process is what's known as a convolutional neural network. My goal is to figure out which shapes are significant for identifying a bike. Once I've learned that, I can predict whether a new image contains a bike or not.
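Here is a minimal sketch, in Python with NumPy, of that sliding comparison, which is exactly the operation a convolutional neural network performs; the hand-written kernel (a vertical-line detector) and the random input image are illustrative assumptions.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` across `image`; high output values mark where it matches."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Compare the shape against this patch of the image.
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A tiny "shape" that responds to vertical lines, such as a bike's seat tube.
vertical_line = np.array([[-1, 2, -1],
                          [-1, 2, -1],
                          [-1, 2, -1]])

feature_map = convolve2d(np.random.rand(28, 28), vertical_line)  # toy input
```

In a real convolutional neural network, the kernels are not hand-written like this; they are learned during training, which is how the network discovers which shapes actually matter for bikes.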
00:14:11.360
However, there are problems with this approach. For instance, a study last year set out to trick machine classifiers by generating images that contain the basic shapes of a bike but are nonsensical to humans. So here's the learning point: algorithms do well at reproducing knowledge and extracting shapes, but they often fail to see the bigger picture, fixating on individual elements without recognizing their context.
00:15:29.200
Now, let’s talk about reflection. Imagine if I asked you to identify hate speech instead of bikes. Identifying hate speech is much more complex, involving nuanced understanding rather than simple object recognition. Currently, I’m assisting in a research project analyzing hateful communication on social media and in German news media concerning migration. Our goal is to find methods and software that can recognize hate speech early and suggest de-escalation strategies. Finding hate speech is significantly harder than identifying bikes.
00:17:05.200
In our project, we combine communication science with computer science to tackle this challenge. To identify hate speech, we use what's known as supervised learning, which relies on a labeled dataset indicating for each statement whether it constitutes hate speech. The data is divided into a training set and a testing set. During training, we use convolutional neural networks to recognize patterns that are characteristic of hate speech statements and build a classifier from them.
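As a minimal sketch of that setup, here is a supervised-learning example using scikit-learn; the four toy statements and their labels are invented, and a simple logistic-regression model stands in for the convolutional neural networks mentioned above, just to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy labeled dataset: 1 = hate speech, 0 = not (labels come from human coders).
texts = [
    "go back to where you came from",
    "thanks for the thoughtful reply",
    "people like you do not belong here",
    "see you at the meetup tomorrow",
]
labels = [1, 0, 1, 0]

# Divide the data into a training set and a testing set, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0
)

vectorizer = TfidfVectorizer()
classifier = LogisticRegression()
classifier.fit(vectorizer.fit_transform(X_train), y_train)
print(classifier.score(vectorizer.transform(X_test), y_test))  # test accuracy
```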
00:18:41.600
However, machine classifiers often yield unsatisfactory results, especially early on. We tweak parameters and train again, repeating this until we are satisfied with the classifier's predictive accuracy on the test set. Nonetheless, it's crucial to emphasize that machines cannot recognize their own failures; a human has to scrutinize the output.
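Continuing the toy sketch above (reusing its imports and its `texts` and `labels`), the tweak-and-re-check loop can be partly automated with a grid search over parameter settings; the parameter values here are illustrative, and a human still has to judge whether the best score is actually good enough.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# texts and labels are the toy data from the previous sketch.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=2,  # only 2 folds because the toy dataset is tiny
)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```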
00:20:39.200
An example from our research project: we worked with a large dataset from Wikipedia and Twitter that was labeled not for hate speech but for toxicity. One statement was flagged as toxic by our classifier because it had learned to strongly associate swear words with hate speech, yet in context the statement was merely someone apologizing. Classic surface indicators of hate speech can mislead classifiers in this way.
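This failure mode is easy to reproduce with a deliberately naive classifier; the word list and example sentences below are invented, mild stand-ins, not material from the project's data.

```python
# A deliberately naive "toxicity" check that treats swear words as the signal.
SWEAR_WORDS = {"damn", "hell"}  # mild, invented stand-ins

def naive_is_toxic(text: str) -> bool:
    return any(word in SWEAR_WORDS for word in text.lower().split())

print(naive_is_toxic("i am so damn sorry, that was my fault"))  # True: a flagged apology
print(naive_is_toxic("people like you should just disappear"))  # False: missed hate
```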
00:22:34.720
Moreover, hate speech can occur without any defamatory language; subtle rhetorical questions and ironic remarks often escape detection. Words also shift in meaning over time, which makes it hard for machines and humans alike to keep up with what they signify. Unlike machines, however, humans can adapt their understanding to changing language and context, and that adaptability is fundamental to comprehending hate speech.
00:24:56.000
In our research, we concentrate on defining hate speech clearly: hate speech consists of intentionally discriminatory messages. This definition goes beyond mere emotion and is not tied to particular words; hate can be expressed in rational-sounding ways and must be identified in context. We capture that context methodically, in the form of a coding guide that helps human coders recognize hate speech far more reliably than a machine could.
00:27:07.760
During coding sessions, coders analyze statements, debate their meanings, and refine the codebook for clarity. This collaborative process allows for valuable exchanges and ensures that the classifiers are grounded in how our definitions hold up in practice. Coders learn to recognize hate speech through various indicators, such as the legitimization of violence or dehumanizing comparisons.
00:30:06.720
Now, let me conclude with a quote from Lauryn Hill’s album: 'Consequences are no coincidences.' The knowledge produced by machine learning is no coincidence; it reflects the data we choose to feed into the machine, how we build the algorithm, and ultimately, how we question and reflect on the outcomes. If you program machines or use them, I urge you to be the reflective human in this process.