Open Source Search for Rails

Allison Zadrozny • January 30, 2024 • online

In this presentation at the WNB.rb Meetup, Allison Zadrozny discusses how to effectively structure code to provide relevant search results in a basic Rails application without relying on JavaScript. The session focuses on open-source tools for implementing search functionality, along with best practices and considerations for integration.

Key Points:

  • Introductory Overview:

    • Allison introduces herself as a product engineer and lead designer at Bonsai, sharing the demo project built around a search feature for books.
    • She acknowledges the importance of relevance in search, which many attendees highlighted as a key factor.
  • Building a User Interface (UI):

    • A well-structured UI is essential for different search use cases, such as exploratory searches (Pinterest, YouTube) versus specific searches (Amazon).
    • The UI should guide the design and structure of the search index.
  • Iterative Approach to Search Queries:

    • Good search functionality requires constant iteration over queries and indexing to improve user experience.
    • Useful tools must be in place for quick iteration.
  • Understanding Search Indexing:

    • An index is crucial for efficiently retrieving information; the talk briefly points to Allison's earlier comic, "storysearch.com," which explains indexing concepts.
    • Emphasizes testing against large, well-normalized datasets, citing Kaggle as a source for practice data, with a caution to check each dataset's quality score first.
  • Workflow Steps:

    • Setting up a database with PostgreSQL, followed by integration with search engines like OpenSearch.
    • Highlights the distinctions between indexing in relational databases versus dedicated search databases.
  • Data Handling and Indexing Tools:

    • Allison details how to transform data for searching using Ruby objects and creating indexing jobs through rake tasks.
    • Shares examples of filtering results and managing mappings and features to optimize the index structure according to the UI needs.
  • Search Tools and Integration:

    • Describes the controller setup for handling search requests, the creation of a BookSearcher service, and the importance of clean response handling.
    • Emphasizes the inclusion of presentation aspects through the use of partials for better UI management.
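
The query-building step inside the BookSearcher service can be sketched roughly as follows. This is a minimal illustration, not the code from the demo repo: the module name `BookQuery`, the `generate` method, the page size, and the `:q`, `:genre`, and `:page` parameter names are all assumptions, and the filter assumes `genres` is mapped as a keyword field. It only builds the request body as a plain hash, covering the features the talk names (fuzzy matching, filters with aggregations, highlighting, pagination).

```ruby
# Hypothetical query builder; names and defaults are illustrative assumptions.
module BookQuery
  PER_PAGE = 10 # "10 blue links" per page

  # Turns request params into an OpenSearch query body (a plain Ruby hash).
  def self.generate(params)
    page = [params.fetch(:page, 1).to_i, 1].max

    body = {
      from: (page - 1) * PER_PAGE, # pagination offset
      size: PER_PAGE,
      query: {
        bool: {
          must: [
            {
              multi_match: {
                query: params.fetch(:q),
                fields: %w[title author description],
                fuzziness: "AUTO" # tolerate typos such as "hanger games"
              }
            }
          ],
          filter: []
        }
      },
      # Ask for highlighted fragments of the matched fields.
      highlight: { fields: { title: {}, description: {} } },
      # Bucket counts per genre, used to render the filter sidebar.
      aggs: { genres: { terms: { field: "genres" } } }
    }

    genre = params[:genre].to_s
    body[:query][:bool][:filter] << { term: { genres: genre } } unless genre.empty?
    body
  end
end
```

In a real service the resulting hash would be passed as the `body` of the client's search call; keeping the builder free of I/O makes each query variation easy to inspect while iterating on relevance.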

Conclusion:

  • The session concludes with a call for attendees to utilize and contribute to the open-source repository showcased in the demo, encouraging collaboration and discussion on best practices.
  • The overarching message highlights efficient indexing, letting the UI design drive the index, and the balance between following best practices and shipping features amid tech debt and time constraints.


A tiny how-to primer on the best way to structure your code to deliver relevant search results to users in a basic Rails app without the use of JavaScript. With some code examples and a demo repo, I'll briefly touch on why teams choose open source, best practices for your integration, things to look out for, and further resources.
https://www.wnb-rb.dev/meetups/2024/01/30

WNB.rb Meetup

00:00:00.320 hi um I'm Ally and I'm gonna be talking
00:00:03.639 about search but first let's talk about
00:00:06.200 what we're going to build in this demo
00:00:07.680 and I'm gonna try to do a live situation
00:00:09.639 here so let's do
00:00:13.360 it let's serve it
00:00:17.240 and
00:00:20.240 wait oh this always happens okay here we
00:00:23.080 go open search book search I
00:00:26.279 was like what am I going to do with data
00:00:27.880 all right let's do books let's just do
00:00:29.279 search book
00:00:30.480 so um we've got a little Hunger Games
00:00:34.480 action and there's a bunch of like stuff
00:00:37.480 with Suzanne Collins we've got some
00:00:39.160 highlighting some filtering over here
00:00:42.559 let's say I've had a bad night and I
00:00:44.640 want to do hanger games and we can see
00:00:47.800 that we still have good results um
00:00:51.239 there's a lot more I probably could do
00:00:53.000 there in terms of like you know we can
00:00:55.640 have auto complete or things like that
00:00:57.920 but um you know that would be a lot of
00:01:00.600 hours I was like let's just do this but
00:01:03.600 yeah um this is what we're building
00:01:07.000 and
00:01:08.799 um I'm so glad the live demo worked
00:01:13.000 um and I have some fuzzy match if
00:01:15.880 anyone's worked with open source search
00:01:17.840 which a lot of you have based on the
00:01:21.119 um uh form that I sent out then you know
00:01:25.600 what fuzzy match um is and that's what
00:01:27.720 kind of helps with that non um
00:01:30.560 exact match that you have to usually do
00:01:32.560 with indexes and
00:01:35.520 Postgres the TL;DR is that there is a
00:01:38.799 repo you can go check it out all of the
00:01:42.040 stuff is in there you can run it um the
00:01:46.079 live demo gods are smiling upon me today
00:01:48.719 yes thank you again um go check it out
00:01:51.600 if you'd like I've been working on it
00:01:52.960 for a while um more about me I'm Ally
00:01:56.799 I'm a product engineer and lead designer
00:01:58.680 at Bonsai this is a real photo of me in
00:02:01.759 my life right now just kidding this is a
00:02:04.680 real photo of me in my life right now
00:02:07.719 for all of you uh parents out there um
00:02:10.840 it's pretty crazy so but I'm here we're
00:02:13.560 doing
00:02:15.519 it okay so the question is how do you
00:02:18.319 build meaningful search and a lot of
00:02:20.800 y'all responded when I asked in the
00:02:23.040 forums what is the most important thing
00:02:25.319 for search so many of you were like the
00:02:28.120 right results relevancy and I would say
00:02:31.440 I 100% agree with you um but the first
00:02:35.560 building block is a UI that supports the
00:02:38.519 use case there are many different use
00:02:40.280 cases for search you use something like
00:02:43.000 Pinterest um or YouTube it's an
00:02:45.280 exploratory experience um and there are
00:02:49.480 different ways to interact with your
00:02:52.280 search index and build your index based
00:02:55.080 on a exploratory UI versus a very
00:02:58.800 specific UI that about finding the exact
00:03:01.200 thing that you're looking for which is
00:03:03.080 something like Amazon um text us which I
00:03:07.159 don't know if anyone is here from text
00:03:08.599 us um so the UI is the the important
00:03:12.120 thing and everything builds off of there
00:03:14.680 the second thing is the ability to
00:03:16.280 iterate on things um iterate on your
00:03:19.159 index iterate on your queries to support
00:03:21.480 that user experience and the last thing
00:03:24.040 is if you have to iterate quickly then
00:03:25.640 that means you need good tools to help
00:03:27.360 you iterate quickly and we'll talk about
00:03:29.159 those in a second
00:03:30.760 if you don't know what an index is or um
00:03:34.000 you don't have a lot of experience with
00:03:35.239 search there's actually only a small
00:03:36.840 part of you that didn't have some
00:03:38.680 experience um I would encourage you to
00:03:40.680 go look at story search.com um it's a
00:03:43.760 comic I made um many years ago and it
00:03:48.599 walks
00:03:49.599 through um how to create an index what
00:03:52.720 it
00:03:53.519 is um some like actual structure of an
00:03:58.400 index um and did all the illustrations
00:04:00.720 it was really fun it was quite a long
00:04:02.200 time ago but uh that was always a good
00:04:05.560 one okay back to this so just to caveat
00:04:09.319 before we start the search world is
00:04:11.560 Giant and it's getting larger every day
00:04:14.840 with AI vector search I mean the space
00:04:17.759 is just getting so big um within that
00:04:21.239 world is full-text search plus
00:04:24.440 Rails um and yes I will um have those
00:04:27.919 slides available and share them for y'all
00:04:30.720 and then so like just in even more
00:04:33.000 caveat let's just zoom into this area
00:04:35.400 here like this is what I feel like my
00:04:37.919 knowledge is within this like giant
00:04:40.720 search World um it's like the moon of
00:04:43.479 the earth to the sun like that's that's
00:04:46.680 that's how crazy intense this subject is
00:04:49.680 and like um document retrieval
00:04:52.520 information retrieval it's a really cool
00:04:54.759 very nerdy world and I absolutely love
00:04:58.639 it
00:05:00.280 okay so let's talk about the workflow
00:05:01.800 that you go through in order to build
00:05:03.639 the application that I showed you a
00:05:05.039 second ago first there's a ton of setup
00:05:07.759 and you also need that UI I didn't put
00:05:09.440 that in there because that's kind of a
00:05:10.880 given um but the second thing is that
00:05:13.560 you have a query or relevancy issue you
00:05:16.400 work on that query you run the queries
00:05:18.319 to test it you identify the issue and
00:05:20.479 then you reindex and then you keep on
00:05:22.479 working on the query and then you
00:05:23.880 reindex you work on the query and you
00:05:25.720 reindex so okay you're you're getting the
00:05:27.120 idea here like the biggest and most
00:05:29.120 important part here is you have to
00:05:31.080 reindex a lot you work on these very
00:05:33.639 specific queries so there's a lot of
00:05:36.120 referencing like to documentation of
00:05:38.319 like well how does this career work and
00:05:39.759 how do I have to form this Json piece um
00:05:43.479 and just so you know just to get to that
00:05:45.120 little like part where I have
00:05:47.759 highlighting fuzzy matching pagination
00:05:51.800 filters with aggregations like you know
00:05:55.360 I got maybe like to step eight and I
00:05:57.160 had to keep on working and keep on
00:05:58.520 working and keep on working so it's a
00:06:00.080 lot of
00:06:01.240 iteration okay so good search and good
00:06:04.120 iteration requires good tools let's talk
00:06:06.919 about those tools you've got your
00:06:08.599 database and your data you have the
00:06:10.639 search engine itself um indexing tools
00:06:14.840 searching tools and front-end
00:06:18.520 interpretation let's talk about the data
00:06:20.919 first so I just went to Goodreads and
00:06:23.919 grabbed a giant CSV to index um well
00:06:28.360 first to put it into Postgres and then
00:06:30.360 put it into um elastic search because
00:06:32.880 that's usually like going from Postgres
00:06:35.240 to elastic search is usually what people
00:06:36.599 or SQL um deal with and the important
00:06:40.919 thing to think about is if you're doing
00:06:42.400 like a practice project and you want to
00:06:44.520 find some data um Kaggle is great but I
00:06:48.000 really would um encourage you to look at
00:06:51.199 sort of the metadata about the data set
00:06:53.280 to see you know its score um because
00:06:57.759 sometimes they're incomplete rows
00:07:00.039 or it can be really difficult to deal
00:07:01.879 with certain data sets because they're
00:07:03.360 not really normalized it's a whole
00:07:06.960 subject in um the the search retrieval
00:07:11.599 world is like finding good data sets to
00:07:14.000 work with and then also thinking about
00:07:16.639 you know how much data do I need to test
00:07:19.960 things is a is a whole situation as well
00:07:23.599 I feel like I'm going to give you guys
00:07:24.560 all the like broad knowledge here
00:07:27.080 because you could really drill down and
00:07:29.560 have like a week's worth long of
00:07:30.960 conversation in every topic but there's
00:07:33.160 a really great Haystack talk called um
00:07:35.599 representative query sets for offline
00:07:37.560 testing and I've linked it here and I'll
00:07:39.120 share the slides again but basically the
00:07:42.160 gist
00:07:43.199 is with search um and search technology
00:07:47.879 it's really important to work with large
00:07:50.319 data sets because that's the point of
00:07:52.319 search is that Postgres Postgres is
00:07:54.879 slow after a certain point Postgres
00:07:56.879 indices are slow after a certain point
00:07:58.919 and so you need need to be able to test
00:08:00.360 your queries with large data sets but
00:08:02.479 like the your your indexing is going to
00:08:04.720 take a long time and it's going to take
00:08:06.800 up a lot of space on your computer
00:08:08.159 depending on how you know you're doing
00:08:10.440 like if you're if you're doing elastic
00:08:11.960 search or open search locally or if you're
00:08:14.280 like deploying something in cloud it's
00:08:15.479 gonna be really expensive so it's a
00:08:17.520 whole topic that talk is a really great
00:08:20.400 starting point if you're interested in
00:08:22.479 that but I just grabbed Kaggle it's like
00:08:25.159 there was like 27,000 rows it was like
00:08:27.360 really small for a search set and even
00:08:29.840 did even smaller than that so okay so
00:08:32.599 let's set up the database we've got our
00:08:34.080 Postgres yay um I did want to make a note
00:08:38.039 about the index search because a lot of
00:08:39.640 us do that we do like ILIKE searches on
00:08:42.159 an index and the thing is is that you
00:08:44.600 get to a certain point where that
00:08:46.040 becomes
00:08:47.160 non-scalable and you really need to move
00:08:50.399 your search concerns out of the read and
00:08:54.240 write situation of a relational database
00:08:56.519 into a specific type of data structure
00:08:59.200 that is is meant for search and
00:09:01.440 eventually as things scale the most
00:09:03.320 important thing is being able to create
00:09:05.040 indices test those indices against one
00:09:07.600 another and deploy them and so it's this
00:09:09.920 whole workflow of um in the production
00:09:12.480 area that you really just want to
00:09:13.880 separate from your schema from your um
00:09:16.720 production database you want to you want
00:09:18.480 those to be separate so that's why when
00:09:21.360 I people ask me about Postgres or index
00:09:23.920 search for whatever like relational
00:09:25.760 database they choose I'm like well
00:09:27.959 it's a different thing relational
00:09:30.160 databases are meant for long-term
00:09:31.880 storage search databases are
00:09:35.079 meant for um quick iteration and
00:09:37.839 throwing away and recreating pretty
00:09:40.240 quickly
00:09:41.959 so all right I needed to get this in I
00:09:44.959 downloaded a
00:09:46.120 CSV um from Kaggle and then I just
00:09:49.640 created a rake task um I'm going through
00:09:52.440 each row I probably could have done this
00:09:54.279 a lot like using like bulk or some other
00:09:57.279 stuff but I was like okay let's just get
00:09:58.440 it done um and I checked if it's new or
00:10:03.279 not so I wouldn't be writing uh multiple
00:10:05.519 copies of something and then yeah just
00:10:07.920 running rake import books that took
00:10:10.519 quite a long time um so yeah we've got
00:10:13.760 our we've got our data like ready to go
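
A minimal sketch of the row-by-row import described above, assuming a CSV with hypothetical `title`, `author`, and `genres` columns (the real Kaggle file's headers are not shown in the talk). To stay runnable without Rails, an in-memory `BookStore` stands in for the `Book` model; in the app this logic would sit inside the rake task and write through ActiveRecord.

```ruby
require "csv"

# In-memory stand-in for the books table (an assumption for illustration).
class BookStore
  attr_reader :rows

  def initialize
    @rows = {}
  end

  # Imports each CSV row once; returns how many new books were created.
  def import_csv(csv_text)
    created = 0
    CSV.parse(csv_text, headers: true) do |row|
      title = row["title"].to_s.strip
      # Skip incomplete rows and anything already imported,
      # so re-running the import never writes multiple copies.
      next if title.empty? || @rows.key?(title)

      @rows[title] = {
        title: title,
        author: row["author"],
        genres: row["genres"].to_s.split("|") # assumed pipe-separated genres
      }
      created += 1
    end
    created
  end
end
```

Checking for an existing record on every row is simple but slow on large files, which matches the talk's remark that the import took a long time; a batched insert would be the usual next step.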
00:10:16.240 so the next thing is the search engine I
00:10:18.320 chose open search which is um from uh
00:10:22.519 Amazon and it's a fork of elastic search
00:10:26.680 once the um the license changed from
00:10:31.120 MIT or no the Apache license to a new
00:10:34.959 like weird elastic license so it's this
00:10:37.839 whole like thing in the search World um
00:10:42.000 there's some drama it's quite fun
00:10:45.639 um so anyway we're using open search
00:10:47.920 today and I work at Bonsai we host open
00:10:50.959 search this is why I know a lot about
00:10:52.600 this stuff um so I grabbed uh some
00:10:55.279 credentials from a cluster that I'm
00:10:57.680 using for testing and I put it in my .env
00:11:02.079 file um you can do it in your Rails
00:11:04.240 credentials file as well um but I just
00:11:06.320 was doing this quick and dirty um and
00:11:08.639 then I consume that URL by setting up a
00:11:11.120 new open search uh client which is also
00:11:13.680 set up in my Gemfile um and then this
00:11:17.639 default uh is the localhost 9200 that's
00:11:21.760 if you have it locally that's like the
00:11:23.440 standard host um for open search or
00:11:26.200 elastic search if you're running it on
00:11:27.440 your local machine all right so we've
00:11:29.959 got the search engine we've got our data
00:11:31.720 let's move on to getting that data into
00:11:35.720 um an index so the important thing I'm
00:11:39.320 going to go back a few slides here is
00:11:42.639 that I really should have added this
00:11:44.600 other slide sorry
00:11:46.720 y'all we have a search box 10 blue links
00:11:51.240 this is like the standard in the search
00:11:53.440 design World you've got 10 your top 10
00:11:55.760 results are the first ones that people
00:11:57.600 see after a search
00:11:59.800 and I'm only I'm only worried about the
00:12:01.560 title the author and the description and
00:12:05.760 then it's genre so it's very just a very
00:12:08.519 few fields that I need to get things
00:12:11.200 started for the driving the creation of
00:12:14.279 the index one of the things that people
00:12:17.399 get stuck on is they say okay I have all
00:12:20.120 of these fields on the book I've got the
00:12:22.279 ISBN I've got you know quotes I've got
00:12:26.519 um rating review full text
00:12:29.680 like all these other things and they
00:12:31.880 just like shove it all into the index
00:12:34.600 but the design did not account for any
00:12:36.480 of those things yet so there's no reason
00:12:38.680 to blow up your index to a giant size
00:12:40.920 for things that you're not going to use
00:12:42.079 in your design um so that keeping that
00:12:45.519 in mind the design builds the index not
00:12:48.760 the other way around um I'm going to go
00:12:51.360 quickly back to my other slide
00:12:58.120 here
00:12:59.680 so I've got my indexing tools I need to
00:13:02.040 get data
00:13:03.240 in one of the things that we've done
00:13:06.720 over well that I've learned over the
00:13:08.600 years is how to use plain old Ruby
00:13:10.920 objects POROs to help with the process
00:13:14.680 of getting data into um our search
00:13:18.800 engine and then getting it out and one
00:13:21.120 of the things that we use is a search
00:13:23.279 model um module and uh it helps with
00:13:27.760 some like mappings declarations and
00:13:30.240 index features we include this in a
00:13:35.320 actual model that we want to search so
00:13:38.240 for example in the book
00:13:43.000 model I include the search model so we
00:13:45.880 have all those
00:13:49.920 um features including some um things
00:13:54.079 that add it to this searchable models
00:13:56.920 class I and it has a few different
00:13:59.279 methods here the index name which we'll
00:14:01.639 use later in an indexing job the
00:14:04.959 mappings which help with the different
00:14:06.600 sort of like search tools that you want
00:14:08.680 to use later for querying index
00:14:11.399 features which are the actual fields and
00:14:13.959 their values um and then the index
00:14:16.920 action the reason why I'm calling it
00:14:19.199 index features will come to be
00:14:21.839 more clear in a second so in the book
00:14:25.199 I'm overwriting the index features and
00:14:27.279 I'm sending the things that I need for
00:14:28.800 that UI I know I want to show the author
00:14:31.440 or search by the author so I'm going to
00:14:33.480 have the author I know I want to use the
00:14:35.279 description to search with to sort of
00:14:37.240 see if I can like use things like Peeta
00:14:41.079 Katniss whatever the query might be to
00:14:43.959 have that show up because that's not in
00:14:45.199 the title but it might be helpful the
00:14:47.240 genres which we'll use to filter um the
00:14:50.600 rating which might be useful for scoring
00:14:52.920 and then also the title of course so
00:14:55.480 those are the features of the search and
00:14:57.320 that is actually a very specific term
00:14:59.040 term in um information retrieval world
00:15:03.279 is like if something is being used to
00:15:05.880 search a data set um it's a feature that
00:15:09.079 we use to search on so the author is a
00:15:12.680 field that is a feature that we are
00:15:14.320 searching
00:15:15.720 on there's a lot of like terminology
00:15:18.079 that I had to get used to in this world
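
A minimal sketch of the search model module described above, written as plain Ruby so it runs without Rails. The method names `index_name`, `mappings`, and `index_features` follow the talk; the `searchable_models` registry, the default implementations, the `Struct`-based `Book`, and the field types in the mapping are assumptions made for illustration and may differ from the demo repo.

```ruby
# Mixin that makes a model searchable and registers it for reindexing.
module SearchModel
  # Every class that includes the module is remembered here,
  # so an indexing task can loop over all searchable models.
  def self.searchable_models
    @searchable_models ||= []
  end

  def self.included(base)
    searchable_models << base
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Default index name derived from the class name, e.g. Book -> "books".
    def index_name
      "#{name.downcase}s"
    end

    # Default: no explicit field mappings.
    def mappings
      { properties: {} }
    end
  end

  # Default: nothing is indexed until the model says what the UI needs.
  def index_features
    {}
  end
end

# Stand-in for the ActiveRecord Book model.
Book = Struct.new(:id, :title, :author, :description, :genres, :rating, :isbn,
                  keyword_init: true) do
  include SearchModel

  # Assumed field types: text for full-text search, keyword for filtering.
  def self.mappings
    {
      properties: {
        title: { type: "text" },
        author: { type: "text" },
        description: { type: "text" },
        genres: { type: "keyword" },
        rating: { type: "float" }
      }
    }
  end

  # Only the features the UI uses are indexed; isbn is deliberately left out.
  def index_features
    { title: title, author: author, description: description,
      genres: genres, rating: rating }
  end
end
```

Leaving `isbn` out of `index_features` reflects the point made earlier in the talk: the design drives the index, so fields the UI does not use stay in the database only.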
00:15:20.560 um okay back to
00:15:27.959 it so we're including this model into
00:15:30.839 our book and we're creating a rake task
00:15:35.199 um with the namespace of open search
00:15:37.959 and for each model search model that I
00:15:40.199 include that module with I want to
00:15:42.519 create an indexing job uh for that
00:15:46.199 model so it's a nice way to not have to
00:15:48.720 like re like rewrite every single thing
00:15:51.519 out like index the books index the you
00:15:54.360 know genres or whatever it may be in
00:15:57.399 that indexing job we have um a perform
00:16:01.040 action which first deletes any um index
00:16:05.040 that exists with that same name so if I
00:16:06.759 already have an index called books in my
00:16:08.440 cluster I want to remove it because
00:16:10.000 we're reindexing I want to create the
00:16:12.480 new indices I'm going to put up I'm
00:16:14.600 going to push up the mappings and then
00:16:16.639 I'm going to bulk insert the
00:16:18.440 records and this batch size is just 50 I
00:16:21.680 could make it bigger or smaller
00:16:23.920 depending on you know the needs um it
00:16:27.480 also creates some nice functions to sort
00:16:30.399 of think about how long that indexing
00:16:31.959 job is taking and then for the next
00:16:34.160 batch size I might increase it or
00:16:36.000 decrease it depending on you know how
00:16:38.040 that um like inflight request is
00:16:41.199 going so yeah we do rake open search
00:16:43.560 reindex and now I've got records in my
00:16:46.240 index okay so the indexing tools are
00:16:48.720 done let's move on to searching tools
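
The indexing job's perform step described above (delete the old index, create a new one, push the mappings, bulk insert in batches of 50) can be sketched as follows. The class names and the shape of `records` are hypothetical. The calls on `client` are modeled on what the opensearch-ruby client is understood to expose (`indices.delete`, `indices.create`, `indices.put_mapping`, `bulk`) and should be checked against that gem's documentation; a small recording fake is used here so the sketch runs without a cluster.

```ruby
# Fake client that records calls instead of contacting a cluster (illustration only).
class RecordingClient
  attr_reader :calls

  def initialize
    @calls = []
  end

  # The real client groups index management under `indices`; the fake returns itself.
  def indices
    self
  end

  def delete(index:, ignore_unavailable: false)
    @calls << [:delete, index]
  end

  def create(index:)
    @calls << [:create, index]
  end

  def put_mapping(index:, body:)
    @calls << [:put_mapping, index]
  end

  def bulk(body:)
    @calls << [:bulk, body.size / 2] # two entries (action + document) per record
  end
end

# Rebuilds one index from scratch.
class IndexingJob
  BATCH_SIZE = 50

  # records: array of hashes like { id: 1, features: { title: "..." } }
  def initialize(client:, index_name:, mappings:, records:)
    @client = client
    @index_name = index_name
    @mappings = mappings
    @records = records
  end

  def perform
    # 1. Drop any index that already has this name, since this is a full reindex.
    @client.indices.delete(index: @index_name, ignore_unavailable: true)
    # 2. Create the new index and 3. push up the mappings.
    @client.indices.create(index: @index_name)
    @client.indices.put_mapping(index: @index_name, body: @mappings)
    # 4. Bulk insert the records, one request per batch.
    @records.each_slice(BATCH_SIZE) do |batch|
      body = batch.flat_map do |record|
        [{ index: { _index: @index_name, _id: record[:id] } }, record[:features]]
      end
      @client.bulk(body: body)
    end
  end
end
```

Passing the client in makes the job easy to exercise in isolation; the adaptive batch sizing mentioned in the talk is not shown here.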
00:16:51.440 the searching tools they I mean there's
00:16:53.319 some Basics right we have to have a
00:16:54.480 controller action um we're calling it
00:16:57.480 search and books the index and we're
00:17:01.560 going to see if there if there's a
00:17:03.439 params with a q present we're going
00:17:06.439 to run this service called book Searcher
00:17:08.919 new with the params and we're going to
00:17:10.880 respond with the different formats
00:17:12.959 available to
00:17:14.559 us in the book Searcher I'm basically
00:17:18.079 rewriting things and pulling stuff out
00:17:20.120 of my action so that I don't have this
00:17:22.199 Giant action with all these different
00:17:23.919 things that I'm doing um there's a lot
00:17:26.640 of stuff that I didn't include in this
00:17:28.559 slide because it's actually like a much
00:17:30.520 larger file but the biggest and most
00:17:32.280 important one is that I have a run
00:17:33.919 action a run method where I am creating
00:17:38.080 this response variable and I'm using the
00:17:40.320 open search client to search with a body
00:17:43.840 with a method called generate query
00:17:46.000 which is a giant method within the book
00:17:49.120 book Searcher um service up here that
00:17:52.120 takes the params does a lot of if block
00:17:54.640 logic to figure out the best query for
00:17:58.559 the user to send to open search because
00:18:00.880 that's actually the biggest part is
00:18:02.520 figuring out all the different things
00:18:04.880 that you have to use with your inputs to
00:18:06.640 create this big ass J oh sorry JSON
00:18:09.840 object in the open search um and
00:18:14.640 then I'm creating um a book Searcher
00:18:19.360 response with the response from the open
00:18:22.400 search client um and the reason why I'm
00:18:24.559 doing that will be more clear in a
00:18:25.799 second and this book Searcher response I
00:18:29.120 I take the response from the open
00:18:31.400 search client and consume it and then I
00:18:33.919 have a bunch of like getter methods to
00:18:36.840 parse the
00:18:38.000 response
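
A minimal sketch of such a response wrapper, assuming the raw response is a hash in the usual search response shape (`hits.total.value`, `hits.hits`, `aggregations`). The class name follows the talk; the getter names, the `Hit` struct, and the `genres` aggregation name are assumptions rather than the demo repo's actual code.

```ruby
# Wraps the raw search response so views never dig through nested JSON.
class BookSearcherResponse
  Hit = Struct.new(:id, :score, :source, :highlight)

  def initialize(raw)
    @raw = raw
  end

  # Total number of matching documents (0 when the response is empty).
  def total
    @raw.dig("hits", "total", "value").to_i
  end

  # One Hit per result, with the stored fields and any highlighted fragments.
  def hits
    Array(@raw.dig("hits", "hits")).map do |hit|
      Hit.new(hit["_id"], hit["_score"], hit["_source"] || {}, hit["highlight"] || {})
    end
  end

  # [genre, count] pairs for rendering the filter list.
  def genre_buckets
    Array(@raw.dig("aggregations", "genres", "buckets")).map do |bucket|
      [bucket["key"], bucket["doc_count"]]
    end
  end
end
```

With a wrapper like this, a view can iterate `response.hits` and read `hit.source["title"]` instead of chaining `dig` calls on the raw hash.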
00:18:39.760 um yes I like the budget I've never
00:18:43.000 heard that
00:18:44.440 before um and just some really nice
00:18:47.400 little readers so that I can consume
00:18:49.360 that in the views and not have to do dig
00:18:52.960 you know and like go all the way into oh
00:18:55.840 I did just invent it yay that's great um
00:18:59.200 and dig into all of that Json um I just
00:19:02.640 I hate seeing that in views and I like
00:19:04.640 things to be nice and clean when I'm
00:19:06.360 accessing things so that's why I've got
00:19:08.039 this book Searcher
00:19:10.039 response so it means something like this
00:19:12.320 with like response dig hits hits each do
00:19:14.960 blah blah blah I have response. hits
00:19:17.919 each do Etc and I can keep on like
00:19:21.400 working down that
00:19:23.600 road okay the last piece is the front
00:19:26.400 end interpretation which you see a
00:19:28.280 little bit here here um but I've got my
00:19:32.360 title I've got a little form and then
00:19:34.960 I've got displaying the results that
00:19:37.600 come back this actually can be a lot
00:19:40.240 more complicated when you get into the
00:19:42.960 deep end of it and so that's why
00:19:45.360 partials are really freaking important
00:19:48.400 here um I've got a search form as a
00:19:50.880 partial each hit as a partial the
00:19:53.440 filters as a partial it's partials all
00:19:55.600 the way down um and there's like way
00:19:58.080 more that I could do to when this is
00:19:59.919 fully built out as more of a production
00:20:02.559 app um okay that is a
00:20:06.600 lot um and I feel like most of the thing
00:20:09.840 that I was doing is just creating this
00:20:11.159 repo as a resource um and I'm gonna
00:20:14.440 continue to work on it and it's open
00:20:16.880 source please use it please ask me
00:20:18.440 questions on it if you'd like um Fork it
00:20:21.799 uh tell me you're so silly why would you
00:20:23.640 do a book Searcher response you should do
00:20:25.240 this instead I don't know whatever um
00:20:27.480 and I will say as a final caveat like you
00:20:30.080 all the things in here the the
00:20:32.360 different search tools the rake tasks
00:20:34.640 the way that I'm doing the indexing job
00:20:35.960 with a search model whatever it may be
00:20:38.200 the these are best practices using the
00:20:40.679 POROs um having something to handle your
00:20:43.520 query response however there's also like
00:20:46.520 tech debt time constraints multiple
00:20:49.400 rewrites weird things going on with your
00:20:51.600 team and there's the thing that ends up
00:20:53.720 shipping and that's the most important
00:20:55.720 part what are we actually shipping and
00:20:58.400 like what are we getting in front of
00:20:59.840 users so I'd like to just offer that as
00:21:02.640 a final
00:21:04.120 close
00:21:05.799 questions