Is it me you're searching for?
Summarized using AI

by James Kiesel

Is it me you're searching for? is a talk by James Kiesel presented at RubyDay 2015 in Turin, focusing on the implementation and significance of full-text search in PostgreSQL, especially within Ruby on Rails applications. Kiesel opens the talk by addressing gender diversity issues in the tech industry and encourages initiatives to promote inclusivity. He then transitions into the main theme about search functionalities, which is often overlooked in programming discussions despite its critical importance.

The key points discussed throughout the video include:

  • Importance of Search: Searching through data is a common requirement in many applications, and understanding how to implement search effectively is essential for Ruby developers.
  • Challenges of Search: Search is often subjective and context-dependent, making it complex to implement a one-size-fits-all solution. Thus, starting with a basic, functional search and iterating from there is advised.
  • Different Search Solutions:
    • ILIKE Operator: A basic method that allows pattern matching in queries but lacks advanced features like relevance ranking or handling variations in search terms.
    • Third-Party Search Engines: Tools like Elasticsearch and Sphinx provide powerful searching capabilities but add complexity and require duplicating data across systems.
    • PostgreSQL Full-Text Search: A built-in feature of PostgreSQL that allows for effective ranking, sorting, and querying of text data with various advanced functionalities such as custom weights and multi-language support.
  • Search Implementation Mechanics: The talk details how to build search functionality using PostgreSQL's to_tsquery and plainto_tsquery functions, discussing how the data is transformed and ranked to provide relevant results.
  • Example Project: Kiesel presents a demonstration project centered around searching song lyrics, outlining the process of setting up the database schema, implementing the search, and optimizing performance by storing search vectors within the database.
  • Best Practices: He highlights the importance of understanding the specific search needs of an application and suggests starting with simple solutions that can be enhanced as necessary.

In conclusion, Kiesel emphasizes the robust features of PostgreSQL’s full-text search capabilities, advocating for their use in Ruby on Rails applications while also addressing the complexity and evolving demands of search functionalities. The presentation serves as a call to action for developers to consider the intricacies of search in their applications and to leverage the built-in capabilities of PostgreSQL effectively.

00:00:14.599 Hello, ladies and gentlemen. Do we actually have any ladies here at all tonight? Today, we have one, two—I'm sorry, two. This always makes me really sad to see. The state of our industry is currently very low on gender diversity, and I think we deserve better. Thank you guys so much for being here, and if you are in a position to help, please do! Support initiatives like Rails Girls and Girls Who Code. Ask your employers whether they're looking into improving gender diversity. We deserve more diverse insights in our industry and our craft, and the only way to achieve this is to get more different types of people involved.
00:00:41.160 Alright, sorry about that. I'll get off my soapbox and move on to Lionel Richie. Do we all know who Lionel Richie is? I hear some ecstatic chuckles from the audience. The title of this talk is "Is It Me You're Searching For?" It's about full-text search in Postgres, and I'm going to be discussing a bit about the full-text search tools in the database that you're probably already using in your Rails application, which might just be good enough for search functionality in your app.
00:01:06.320 A little bit about me: my name is James, and I work on an open-source decision-making software called Loomio, which is based out of New Zealand. The software is designed to help groups make decisions together in a non-hierarchical way. If you're interested in contributing to open source, I coordinate contributions, so please do get in contact with me; we're very happy to have both beginners and experienced people alike. You can find me on the internet under the handle gdpelican on GitHub or Twitter, though I don't tweet very much, so GitHub is recommended. You can check out every single commit I've made in the past two years. Feel free to shoot me an email at [email protected].
00:01:42.079 So what do we need to know about search? Out of curiosity, how many people here are working on an app that has search functionality? Quite a lot of you! I think it's a really common use case, because as soon as an app accumulates a certain amount of data, we want some way for users to filter and search through it. It's a topic I don't think we talk about very often. When we go to Rails conferences or converse on the internet, we talk a lot about CRUD, performance, APIs, and even Tenderlove's cats for some reason, but you don't hear much conversation about search.
00:02:06.480 Since a vast majority of apps out there can really benefit from a search function, it’s important as Ruby programmers to know at least a little bit about it. We’re going to explore a few different options to integrate search into your app. So, why do we need to know about search? I found that search is really hard. It's subjective, imprecise, and highly dependent on the user's context. If we have one user searching for a term, they may not be looking for the same document as another user searching for that same term.
00:02:35.480 So, when I think about implementing a search solution, I think about starting somewhere decent and iterating from there. I know I am not going to come up with a perfect solution right off the bat—definitely not ever. Google is still working on it, and I’m not going to reproduce Google Search inside my app. Therefore, I really want to emphasize starting with a good enough approach for search and then adding features later as needed.
00:03:04.360 Another thing to consider about search is that it’s very dependent on your particular data. Most of the time, when we're talking about performance, we're addressing the volume of our data, and that definitely affects the performance of our search. However, another aspect that influences search performance is the actual data—its type, size, and the number of fields we're searching across. This varies for each individual use case, so more than in many other domains, we need to be conscious about user testing and recognize that our data is a unique snowflake that deserves special consideration.
00:03:29.800 Now, let's descend from the clouds and discuss some options for full-text search. Who here is familiar with the story of Goldilocks and the Three Bears? One person? Let me tell you the story. Goldilocks and the Three Bears is a children's fairy tale about a little girl with golden locks who finds an empty house in the woods. Normally, three bears live there, but when she arrives, they're not home. She goes inside and finds a table with three bowls of porridge. Goldilocks tastes the first bowl, and it's too cold. She tastes the second bowl, and it's too hot. But when she tries the third bowl, it's just right.
00:04:04.200 In the context of a search implementation, just like Goldilocks sampling the porridge, we have options for different types of search solutions. We want to analyze what we need to implement and then pick the best option for us. Let's start with the 'too cold' porridge, which in the search world is the ILIKE operator. Many of you probably know the ILIKE operator? A show of hands? Okay, good!
00:04:37.120 For those unfamiliar, the ILIKE operator performs a case-insensitive pattern match on a column. For instance, if we have a table of singers and want to select names that match Lionel, we can use the wildcard operators, which allow anything in the string before or after the term Lionel, and it will return any matches. This method is incredibly simple to implement, just one line of code in a WHERE clause, and it works decently for filtering small datasets. We still use it in Loomio at times to filter usernames out of member data, which usually has between 10 and 100 users, and it performs well.
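As a minimal sketch of that one-line filter (the `singers` table and `name` column here are assumed for illustration):

```sql
-- Case-insensitive pattern match; '%' matches any run of characters.
-- Returns 'Lionel Richie', 'lionel', etc., with no notion of relevance.
SELECT name
FROM singers
WHERE name ILIKE '%lionel%';
```

From a Rails app the same filter would typically be written as `Singer.where("name ILIKE ?", "%#{term}%")`, letting ActiveRecord handle quoting.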
00:05:02.560 However, it’s not quite suitable for full-text search. The reason is that it doesn't calculate relevance. As soon as we start getting more than three or four matches, we want to understand how well the match corresponds to our query, and ILIKE has no such capability. It also struggles to determine the intention of the search. For instance, if we search for a phrase with two to three words, we need to match the punctuation and spelling exactly for a document to return. If we're searching for the plural of a word, and the document contains just the singular form, it won't return a match. This makes it a pretty weak option.
00:05:29.760 Now, let's look at our next type of search solution, which is much more powerful: a third-party search engine, typically open source and often built on the Apache Lucene library. These solutions run as separate processes outside of your app and maintain a copy of your search data. When your user performs a search, you send a request to this separate process, and it returns the search results for you. It's like a microservice approach to search, and these solutions offer a lot of powerful functionality such as faceting and aggregations, multiple languages, and custom user queries. Many people use these in production.
00:06:11.200 Examples include Elasticsearch, Sunspot (a Ruby library for Solr), and Sphinx. At the moment, Elasticsearch is the industry standard, while Sphinx is a lighter, easier-to-implement alternative. However, while this option certainly provides excellent search capabilities, there are drawbacks to be aware of. Running search in a separate process essentially means duplicating our data across different databases, which complicates our architecture. We'd be adding another piece of stack to manage and become expert in, whereas learning a few Postgres queries is comparatively manageable.
00:06:51.600 This is a key concern for Loomio: since we're an open-source project, we want it to be extremely easy for anyone to fork our code, download it, and set up their own instance, so adding another dependency to our tech stack is something we try to avoid. If your application relies heavily on search, particularly if you're running a site like Airbnb or Expedia where searching is a primary function, this solution is probably the way to go. But if you don't need that level of capability, it's advisable to choose a simpler option that keeps your architecture simple.
00:07:18.960 That brings us to our last option, the 'just right' porridge: Postgres full-text search. This feature ships with Postgres and has been around for quite some time, since version 8.3, which launched around the time Obama entered office. If you're running a Rails app, you likely already have Postgres in your stack. It allows you to rank and sort across columns and tables, giving you granular control over rankings and search vectors. With a little care, it is definitely fast enough for production use. There's also a gem called pg_search, which I won't cover in detail during this talk, but I'd be happy to share my experiences with it afterward if anyone is interested.
00:07:51.000 Let's discuss the mechanics of Postgres search for a moment. I like to think of it as a three-part machine: the query, the vector, and the rank. The query side works with the text the user types in, the vector side works with the documents held in our data set, and the rank compares the query and vector to determine their relevance to one another. In Postgres, there are two querying functions: to_tsquery and plainto_tsquery. These take strings and convert them into a tsquery, which is essentially an array of lexemes.
00:08:24.080 Here's an interesting aspect: when we build a search query, common words known as stop words, typically the 200 to 250 most frequent words in a language, are dropped. Just because a document contains a stop word doesn't mean it's relevant to our search, so we eliminate these from our queries altogether. The remaining words are also stemmed: if we search for 'Ruby Day', 'Ruby' is normalized to the lexeme 'rubi', so different forms of the same word match one another, enhancing search relevancy.
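As a rough sketch of that normalization, using Postgres's built-in English configuration (outputs shown as comments):

```sql
-- Stop words like 'the' are dropped entirely:
SELECT plainto_tsquery('english', 'the day');
-- → 'day'

-- Remaining words are stemmed to lexemes:
SELECT plainto_tsquery('english', 'Ruby Day');
-- → 'rubi' & 'day'
```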
00:08:55.760 The difference between to_tsquery and plainto_tsquery is crucial. to_tsquery requires queries to be formatted in its own operator syntax, with Boolean operators like AND, OR, and NOT. If you enter 'Ruby Day' as a raw query, it will return a syntax error. plainto_tsquery, on the other hand, is more flexible: it accepts raw user input and joins the terms together with the AND operator. I generally prefer plainto_tsquery, as it tends to suffice for most use cases.
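A quick comparison of the two querying functions (again using the English configuration):

```sql
-- to_tsquery expects its own operator syntax: & (AND), | (OR), ! (NOT).
SELECT to_tsquery('english', 'ruby & !python');
-- → 'rubi' & !'python'

-- Raw user input without operators is rejected:
-- SELECT to_tsquery('english', 'ruby day');  -- ERROR: syntax error in tsquery

-- plainto_tsquery accepts raw input and ANDs the terms together:
SELECT plainto_tsquery('english', 'ruby day');
-- → 'rubi' & 'day'
```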
00:09:26.920 Moving on to the vector, the next step is how we connect to the documents in our database. Through the function to_tsvector, similar to our query functions, we convert a string, in this case our search document, into what's essentially a hash: the keys are the lexemes found in the document, and the values contain positional information. For a longer document, this can be significantly larger, containing many different lexemes.
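A small illustration of that hash-like structure:

```sql
-- Each lexeme maps to the positions where it occurs in the document;
-- stop words ('for', 'a', 'about') are dropped but still counted for position.
SELECT to_tsvector('english', 'Searching for a song about searching');
-- → 'search':1,6 'song':4
```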
00:09:59.920 Now we have our query and vector, both collections of lexemes, and we can use ranking to compare their relevance. In Postgres there are two main steps: first the '@@' match operator, then an actual ranking function. The initial step is quite quick, determining which documents are relevant at all, while the second performs more intensive calculations to rank how relevant those documents are. We apply the '@@' operator between our query and vector, which returns a Boolean indicating whether they match.
00:10:39.520 The final step is where we use ts_rank. It takes a vector and a query and outputs a number representing their relevance. There are two ranking functions available; the second, ts_rank_cd, specifically measures proximity (cover density). It gives extra weight to documents where the search terms appear close to one another, making it a good choice when tightly clustered matches matter in our search.
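Putting the three parts together, a sketch of a ranked query over the talk's example `songs` table (a `lyrics` column is assumed):

```sql
-- @@ cheaply filters to matching rows; ts_rank then scores each match.
SELECT title,
       ts_rank(to_tsvector('english', lyrics), query) AS rank
FROM songs,
     plainto_tsquery('english', 'endless love') AS query
WHERE to_tsvector('english', lyrics) @@ query
ORDER BY rank DESC;
```

Swapping ts_rank for ts_rank_cd here would favor rows where 'endless' and 'love' appear close together.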
00:11:10.520 With the basic framework for conducting searches set, we've covered quite a few methods at a basic level. Postgres also provides additional functionalities worth noting. One important feature is custom weight fields, which come into play especially when searching across multiple columns. For instance, if a model includes a title and a description, a matching word in the title should rank higher than one found in the description.
00:11:44.920 Hence, you can apply different weights to your vectors using the setweight function to tailor the importance of each column in your search results. An example might be a song object where a title match weighs heavily while a match in the lyrics carries less weight. The default weights are relative to one another, so a match weighted 'A' holds much more significance than one weighted 'D'.
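A sketch of weighted vectors for the song example (the `title` and `lyrics` columns follow the talk's demo schema):

```sql
-- Title matches (weight 'A') outrank lyric matches (weight 'B');
-- || concatenates the two weighted vectors into one.
SELECT title,
       ts_rank(
         setweight(to_tsvector('english', coalesce(title,  '')), 'A') ||
         setweight(to_tsvector('english', coalesce(lyrics, '')), 'B'),
         plainto_tsquery('english', 'hello')
       ) AS rank
FROM songs
ORDER BY rank DESC;
```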
00:12:29.680 Another interesting capability of Postgres is its language support. If your content is in a different language, specify the desired language configuration in your to_tsvector and to_tsquery calls. This allows you to search seamlessly across languages if your app caters to a multilingual audience.
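For instance, the 'italian' configuration ships with standard Postgres builds and brings its own stemmer and stop-word list:

```sql
-- Same machinery, different stemming and stop words.
SELECT to_tsvector('italian', 'una canzone');
SELECT plainto_tsquery('italian', 'canzoni');
```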
00:12:59.440 Additionally, there's the ts_headline function, which can be quite useful for displaying results. Instead of presenting users with lengthy excerpts from documents, it returns succinct snippets highlighting where the matches occur within each document, enhancing the user experience. You can even adjust the return format to suit your display needs.
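A sketch of ts_headline against the demo's lyrics column (by default matches are wrapped in `<b>…</b>`; the options string customizes that):

```sql
-- Returns a short fragment of the lyrics with matches highlighted.
SELECT title,
       ts_headline('english', lyrics,
                   plainto_tsquery('english', 'hello'),
                   'StartSel=<mark>, StopSel=</mark>, MaxWords=12')
FROM songs;
```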
00:13:36.080 In summary, we can convert user input into lexeme-based queries, represent stored documents as ordered lexemes, determine which documents match, rank the results by quality, combine multiple weighted vectors, conduct multilingual searches, and extract relevant snippets from matching documents. This array of functionality is quite robust, making it entirely achievable to implement an effective search solution.
00:14:10.080 Throughout this talk, I've largely discussed concepts without diving into code, but let’s touch on a bit of programming. I plan to create a simplistic search engine centered around a database of song lyrics. I will be placing five classic love ballads into a database, and you’ll be able to follow along at home as I have open-sourced this project. The hosting for this is at RubyDay.herokuapp.com.
00:15:00.960 To start, we'll establish a basic database schema with a singer and song model, where each singer has many songs. From there, we'll add seed records, four singers and five songs with attached lyrics, and create a view that displays a table showcasing the title, duration, song name, rank, and headline for matches. We'll define a simple controller that takes a query and returns matching songs based on that view.
00:15:38.520 For the implementation, we'll use a SQL query composed of select statements, joining the two tables and ordering the results. This search query also includes a headline based on the lyrics matching our search terms. One important insight from this implementation: we compute the search vector twice, once for the match and once for the ranking, which can cause performance issues since the same document is parsed twice. Instead, we can optimize by modifying our structure.
00:16:18.960 We can store our vectors directly in a tsvector column in Postgres. This way, the search vector is set only when a record is saved, and each search query uses this precomputed data. This optimization significantly improves search performance by reducing computational overhead.
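One way to sketch that stored-column setup in plain SQL (column, index, and trigger names here are assumed, not taken from the talk's code):

```sql
-- Store the vector once at write time rather than recomputing per query.
ALTER TABLE songs ADD COLUMN search_vector tsvector;

-- Backfill existing rows.
UPDATE songs
SET search_vector = to_tsvector('english', coalesce(lyrics, ''));

-- A GIN index makes @@ lookups against the stored column fast.
CREATE INDEX index_songs_on_search_vector
  ON songs USING gin (search_vector);

-- Postgres's built-in trigger keeps the column current on INSERT/UPDATE.
CREATE TRIGGER songs_search_vector_update
  BEFORE INSERT OR UPDATE ON songs
  FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(search_vector, 'pg_catalog.english', lyrics);
```

In a Rails app the ALTER TABLE and CREATE INDEX statements would normally live in a migration, with the same SQL executed via `execute`.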
00:17:15.680 Our final scope iteration can then leverage this preconstructed data much more efficiently, eliminating the need to duplicate vector calculations on each search. This gives us a clean, efficient, and elegant way to run our searches without excessive load.
00:17:59.520 If you find yourself requiring even more advanced search functionalities, I recommend exploring the implementations of larger application platforms like Discourse, which manage extensive and complex searching capabilities. Their architecture includes features such as advanced user queries, faceting, and multi-language support, catering to diverse user needs.
00:18:44.080 In summary, if you’re trying to build a search feature for your application, it’s essential to assess the specific needs of your app, determine the level of complexity your users require from the search experience, and consider implementing solutions that ease the technical burden while still delivering needed functionality. Thank you for your time!