Things I Learned the Hard Way Building a Search Engine

00:00:06.439 Hello, everyone. My name is Katarzyna Turbiasz-Bugała. Today, I will discuss my experiences building a search engine, particularly focusing on search mechanisms. But before diving into that, let me share a bit about my background. I graduated in biology from a university in Poland and spent around seven years as a geological researcher. Four years ago, I shifted my career to programming, and I am still actively engaged in this field.

00:00:25.529 At the start of my programming journey, I came across a wonderful company, where I'm currently working on an e-commerce app. My primary responsibility is developing the search mechanism for this app. Initially, my task was to replace an existing search mechanism based on PostgreSQL with a new one based on Elasticsearch. As someone without a formal programming education, when I first heard the term Elastic Search, my imposter syndrome spiked. However, determined to succeed, I immersed myself into endless research online and learned a great deal about Elasticsearch.

00:01:19.259 Elasticsearch is a powerful tool that is self-sufficient and customizable in almost every aspect you can imagine. However, mastering it can take a significant amount of time, which is why I’m not focusing on it directly today. Instead, I'll discuss the underlying information retrieval solutions that support Elasticsearch and most other search tools you might encounter.

00:01:34.650 Before delving into what information retrieval is, let’s take a moment to think about the essence of search. Search, in my view, is a communication act between the user and their computer, where the user requests information they're interested in. Successful searches depend on a common language; without it, effective communication cannot happen. For search to work efficiently, we need to figure out how the user can formulate their query in a way that can be easily understood by the computer.

00:02:56.239 One simple solution could be to use filters, checkboxes, and radio buttons in the interface, allowing users to easily navigate to their desired query. However, text search doesn't lend itself to this straightforwardness. The challenges arise because text search is rooted in natural language, which can lead to ambiguous and non-exhaustive queries from users. We often have a collection of unstructured texts which invariably leads us to a significant issue: relevance in text search is not binary. Instead, it presents as a spectrum. In traditional database queries, a document either meets the criteria or it doesn't. In contrast, with text search, we can have varying degrees of relevance.

00:04:58.640 Now, this raises the question: how do we effectively deal with these challenges? This is where information retrieval comes into play, acting like our ‘Superman’ rescue. At its core, information retrieval deals with the representation, storage, organization, and access to various information items such as documents, web pages, catalogs, structured records, and multimedia objects. The primary goal of an information retrieval system is to retrieve documents that are relevant to a user query while minimizing the retrieval of non-relevant documents.

00:06:04.540 To illustrate how information retrieval works, consider a collection of items—be it a set of products in an online shop, text documents, or music pieces. Each item’s textual representation is prepared, referred to as a document, which can be considered a bag of words. On one end, we have users with specific information needs, expressed as queries. Ideally, somewhere in the universe, there exists an item that perfectly meets the user's needs—an item we don't necessarily have access to.

00:07:58.260 So, we must make a bold assumption: the query a user enters into our text field represents some form of ideal item representation. We can then compare this with our collection, using ranking functions to determine similarity, thus producing relevant results. The importance of relevance cannot be overstated; it indicates how well a document satisfies the user's information need.

00:09:03.769 Now, let’s focus specifically on our collection. To make things interesting, we’ll work with quotes from a tech subculture. These quotes will be translated and somewhat 'encrypted' using a quasi-language I created. Please refrain from trying to read them too closely during this presentation; you will have a chance to analyze relevant aspects later, but the goal here is to prevent preconceiving interpretations based on the text itself.

00:10:23.090 In the example at hand, let’s say our user is looking for concepts related to sweets, specifically marshmallows and caramel. To optimize our ranking function without having to read through every text and redefine our solution every time we revisit the function, we introduce an inverted index in a simplified manner. This index consists of a dictionary that contains unique terms present in our documents, and postings pointing to the occurrences of these terms within specific documents.

00:11:49.199 Now, moving on to measuring similarity, this can seem quite abstract. To tackle this concept, we represent text as vectors in multi-dimensional space, where each dimension corresponds to a specific term of interest. Relevance then becomes equated with the similarity of these vectors. For instance, if we consider two texts containing specific words like `information retrieval` and `fun`, we can derive their respective corresponding vectors by either presence or absence of the relevant terms.

00:13:16.600 To compute similarity, we can utilize cosine similarity. This entails performing a dot product of the vectors, where we multiply corresponding coordinates together and sum the results. Creating a similarity matrix enables us to see the best matches for our query, revealing how closely related the documents are to the terms within our query.

00:14:50.230 As we assess the matches, it's noteworthy that repetition of terms can influence relevance. This raises the question: should we include a measure for term frequency? While frequency can enhance relevance, we must also recognize that not all words carry the same semantic weight. Words like pronouns and conjunctions often add little to the semantic meaning and should ideally be excluded from our calculations to prevent skewing results.

00:16:03.500 After filtering out these 'stop words' from our queries, we can further adjust our similarity metrics, refining our results. This approach leads us to products that are more aligned with the user's true interests. Moreover, we can build family words from our identified terms to capture derivative terms such as ‘idealistic’ from ‘ideal.' Leveraging techniques like lemmatization and stemming—two crucial tools in text processing—can allow us to better handle the language's inherent variability.

00:17:15.350 Now, as we wrap our minds around these concepts, if we reconsider our queries while keeping their context and families in mind, we'll likely achieve significantly improved results. For instance, if a user enters a query like `Ruby programming`, documents with the term `programming` in various usages would inherently be more relevant based on occurrence and context, leading to more applicable documents.

00:18:32.380 Additionally, the use of a term’s document frequency, indicating how prevalent a term is across the document set, could also refine our relevance criteria. By incorporating this factor, we can discern which terms contribute more valuable insight to the documents.

00:20:22.130 Jumping into the mathematics, we settle on a concept known as tf-idf—which stands for term frequency-inverse document frequency. This formula helps quantify terms' relevance in contrast to total document counts within our information set. By adjusting our measures and normalizations based on document lengths, we can refine similarity scoring.

00:22:47.600 So, after implementing such measures, our function presents a more nuanced understanding of the documents’ relevance to the user query. Still, can we refine further? Absolutely! The integration of synonyms—exploring semantic similarity through related terms and definitions—could significantly enhance our retrieval mechanisms, allowing for a broader scope of search while maintaining focus.

00:24:55.490 As we refine our queries and responses, we can factor in opportunities to boost documents that exactly match user queries or contain additional relevant terms in the same order. We can even explore standard spelling variations and account for user errors to improve our system's effectiveness. Each of these measures contributes to a more effective mechanism meant to address user needs.

00:25:55.150 Now, as I start to wrap up this presentation, let’s take a moment to summarize key elements I’ve discussed regarding search and information retrieval: First and foremost, relevance is recognized as a spectrum. Secondly, relevance can be conceptualized as the similarity between a user's query and the documents in our repositories, and managing term counts proves to be a complex yet essential part of this process.

00:27:19.610 The crux of effective information retrieval utilizes both tf-idf measurements and length normalization to account for the differing document lengths we encounter. Lastly, while semantics cannot be overlooked, we have the ability to control its influence through measures such as stop-word removal, synonym implementation, and family word capturing. A critical point to remember is that the vector model used here is merely one approach—there are other models like BM25F that might suit your needs for structured data far better.

00:29:19.480 And finally, in the spirit of experimentation that shaped my inquiry into search, I embarked on finding the meaning of life for a user, and while the results weren't what I anticipated, they point towards the ongoing nature of improving understanding in the world of search. As I conclude, I invite any questions you may have regarding my talk or the subject matter I've presented. Thank you!