00:00:00.320
hi um I'm Ally and I'm gonna be talking
00:00:03.639
about search but first let's talk about
00:00:06.200
what we're going to build in this demo
00:00:07.680
and I'm gonna try to do a live situation
00:00:09.639
here so let's do
00:00:13.360
it let's serve it
00:00:17.240
and
00:00:20.240
wait oh this always happens okay here we
00:00:23.080
go OpenSearch book search I
00:00:26.279
was like what am I going to do with data
00:00:27.880
all right let's do books let's just do
00:00:29.279
search book
00:00:30.480
so um we've got a little Hunger Games
00:00:34.480
action and there's a bunch of like stuff
00:00:37.480
with Suzanne Collins we've got some
00:00:39.160
highlighting some filtering over here
00:00:42.559
let's say I've had a bad night and I
00:00:44.640
want to do hanger games and we can see
00:00:47.800
that we still have good results um
00:00:51.239
there's a lot more I probably could do
00:00:53.000
there in terms of like you know we can
00:00:55.640
have auto complete or things like that
00:00:57.920
but um you know that would be a lot of
00:01:00.600
hours I was like let's just do this but
00:01:03.600
yeah um this is what we're building
00:01:07.000
and
00:01:08.799
um I'm so glad the live demo worked
00:01:13.000
um and I have some fuzzy match if
00:01:15.880
anyone's worked with open source search
00:01:17.840
which a lot of you have based on the
00:01:21.119
um uh form that I sent out then you know
00:01:25.600
what fuzzy match um is and that's what
00:01:27.720
kind of helps with that non um
00:01:30.560
exact match that you have to usually do
00:01:32.560
with indexes and
00:01:35.520
Postgres the TL;DR is that there is a
00:01:38.799
repo you can go check it out all of the
00:01:42.040
stuff is in there you can run it um the
00:01:46.079
live demo gods are smiling upon me today
00:01:48.719
yes thank you again um go check it out
00:01:51.600
if you'd like I've been working on it
00:01:52.960
for a while um more about me I'm Ally
00:01:56.799
I'm a product engineer and lead designer
00:01:58.680
at Bonsai this is a real photo of me in
00:02:01.759
my life right now just kidding this is a
00:02:04.680
real photo of me in my life right now
00:02:07.719
for all of you uh parents out there um
00:02:10.840
it's pretty crazy so but I'm here we're
00:02:13.560
doing
00:02:15.519
it okay so the question is how do you
00:02:18.319
build meaningful search and a lot of
00:02:20.800
y'all responded when I asked in the
00:02:23.040
forums what is the most important thing
00:02:25.319
for search so many of you were like the
00:02:28.120
right results relevancy and I would say
00:02:31.440
I 100% agree with you um but the first
00:02:35.560
building block is a UI that supports the
00:02:38.519
use case there are many different use
00:02:40.280
cases for search you use something like
00:02:43.000
Pinterest um or YouTube it's an
00:02:45.280
exploratory experience um and there are
00:02:49.480
different ways to interact with your
00:02:52.280
search index and build your index based
00:02:55.080
on an exploratory UI versus a very
00:02:58.800
specific UI that's about finding the exact
00:03:01.200
thing that you're looking for which is
00:03:03.080
something like Amazon um text us which I
00:03:07.159
don't know if anyone is here from text
00:03:08.599
us um so the UI is the the important
00:03:12.120
thing and everything builds off of there
00:03:14.680
the second thing is the ability to
00:03:16.280
iterate on things um iterate on your
00:03:19.159
index iterate on your queries to support
00:03:21.480
that user experience and the last thing
00:03:24.040
is if you have to iterate quickly then
00:03:25.640
that means you need good tools to help
00:03:27.360
you iterate quickly and we'll talk about
00:03:29.159
those in a second
00:03:30.760
if you don't know what an index is or um
00:03:34.000
you don't have a lot of experience with
00:03:35.239
search there's actually only a small
00:03:36.840
part of you that didn't have some
00:03:38.680
experience um I would encourage you to
00:03:40.680
go look at story search.com um it's a
00:03:43.760
comic I made um many years ago and it
00:03:48.599
walks
00:03:49.599
through um how to create an index what
00:03:52.720
it
00:03:53.519
is um some like actual structure of an
00:03:58.400
index um and did all the illustrations
00:04:00.720
it was really fun it was quite a long
00:04:02.200
time ago but uh that was always a good
00:04:05.560
one okay back to this so just to caveat
00:04:09.319
before we start the search world is
00:04:11.560
giant and it's getting larger every day
00:04:14.840
with AI vector search I mean the space
00:04:17.759
is just getting so big um within that
00:04:21.239
world is full-text search plus
00:04:24.440
Rails um and yes I will um have those
00:04:27.919
slides available and share them for y'all
00:04:30.720
and then so like just in even more
00:04:33.000
caveat let's just zoom into this area
00:04:35.400
here like this is what I feel like my
00:04:37.919
knowledge is within this like giant
00:04:40.720
search world um it's like the moon of
00:04:43.479
the earth to the sun like that's that's
00:04:46.680
that's how crazy intense this subject is
00:04:49.680
and like um document retrieval
00:04:52.520
information retrieval it's a really cool
00:04:54.759
very nerdy world and I absolutely love
00:04:58.639
it
00:05:00.280
okay so let's talk about the workflow
00:05:01.800
that you go through in order to build
00:05:03.639
the application that I showed you a
00:05:05.039
second ago first there's a ton of setup
00:05:07.759
and you also need that UI I didn't put
00:05:09.440
that in there because that's kind of a
00:05:10.880
given um but the second thing is that
00:05:13.560
you have a query or relevancy issue you
00:05:16.400
work on that query you run the queries
00:05:18.319
to test it you identify the issue and
00:05:20.479
then you reindex and then you keep on
00:05:22.479
working on the query and then you re
00:05:23.880
index you work on the query and you re
00:05:25.720
index so okay you're you're getting the
00:05:27.120
idea here like the biggest and most
00:05:29.120
important part here is you have to
00:05:31.080
reindex a lot you work on these very
00:05:33.639
specific queries so there's a lot of
00:05:36.120
referencing like to documentation of
00:05:38.319
like well how does this query work and
00:05:39.759
how do I have to form this JSON piece um
00:05:43.479
and just so you know just to get to that
00:05:45.120
little like part where I have
00:05:47.759
highlighting fuzzy matching pagination
00:05:51.800
filters with aggregations like you know
00:05:55.360
I got maybe like to step eight and I
00:05:57.160
had to keep on working and keep on
00:05:58.520
working and keep on working so it's a
00:06:00.080
lot of
00:06:01.240
iteration okay so good search and good
00:06:04.120
iteration requires good tools let's talk
00:06:06.919
about those tools you've got your
00:06:08.599
database and your data you have the
00:06:10.639
search engine itself um indexing tools
00:06:14.840
searching tools and front end
00:06:18.520
interpretation let's talk about the data
00:06:20.919
first so I just went to Goodreads and
00:06:23.919
grabbed a giant CSV to index um well
00:06:28.360
first to put it into Postgres and then
00:06:30.360
put it into um Elasticsearch because
00:06:32.880
that's usually like going from Postgres
00:06:35.240
to Elasticsearch is usually what people
00:06:36.599
or SQL um deal with and the important
00:06:40.919
thing to think about is if you're doing
00:06:42.400
like a practice project and you want to
00:06:44.520
find some data um Kaggle is great but I
00:06:48.000
really would um encourage you to look at
00:06:51.199
sort of the metadata about the data set
00:06:53.280
to see you know its score um because
00:06:57.759
sometimes there are incomplete rows
00:07:00.039
or it can be really difficult to deal
00:07:01.879
with certain data sets because they're
00:07:03.360
not really normalized it's a whole
00:07:06.960
subject in um the the search retrieval
00:07:11.599
world is like finding good data sets to
00:07:14.000
work with and then also thinking about
00:07:16.639
you know how much data do I need to test
00:07:19.960
things is a is a whole situation as well
00:07:23.599
I feel like I'm going to give you guys
00:07:24.560
all the like broad knowledge here
00:07:27.080
because you could really drill down and
00:07:29.560
have like a week's worth of
00:07:30.960
conversation in every topic but there's
00:07:33.160
a really great Haystack talk called um
00:07:35.599
representative query sets for offline
00:07:37.560
testing and I've linked it here and I'll
00:07:39.120
share the slides again but basically the
00:07:42.160
gist
00:07:43.199
is with search um and search technology
00:07:47.879
it's really important to work with large
00:07:50.319
data sets because that's the point of
00:07:52.319
search is that Postgres is
00:07:54.879
slow after a certain point Postgres
00:07:56.879
indices are slow after a certain point
00:07:58.919
and so you need need to be able to test
00:08:00.360
your queries with large data sets but
00:08:02.479
like your indexing is going to
00:08:04.720
take a long time and it's going to take
00:08:06.800
up a lot of space on your computer
00:08:08.159
depending on how you know you're doing
00:08:10.440
like if you're doing Elasticsearch
00:08:11.960
or OpenSearch locally or if you're
00:08:14.280
like deploying something in the cloud it's
00:08:15.479
gonna be really expensive so it's a
00:08:17.520
whole topic that talk is a really great
00:08:20.400
starting point if you're interested in
00:08:22.479
that but I just grabbed Kaggle it's like
00:08:25.159
there was like 27,000 rows it was like
00:08:27.360
really small for a search set and I even
00:08:29.840
did even smaller than that so okay so
00:08:32.599
let's set up the database we've got our
00:08:34.080
Postgres yay um I did want to make a note
00:08:38.039
about the index search because a lot of
00:08:39.640
us do that we do like ILIKE searches on
00:08:42.159
an index and the thing is is that you
00:08:44.600
get to a certain point where that
00:08:46.040
becomes
00:08:47.160
non-scalable and you really need to move
00:08:50.399
your search concerns out of the read and
00:08:54.240
write situation of a relational database
00:08:56.519
into a specific type of data structure
00:08:59.200
that is is meant for search and
00:09:01.440
eventually as things scale the most
00:09:03.320
important thing is being able to create
00:09:05.040
indices test those indices against one
00:09:07.600
another and deploy them and so it's this
00:09:09.920
whole workflow of um in the production
00:09:12.480
area that you really just want to
00:09:13.880
separate from your schema from your um
00:09:16.720
production database you want to you want
00:09:18.480
those to be separate so that's why when
00:09:21.360
people ask me about Postgres or index
00:09:23.920
search for whatever like relational
00:09:25.760
database they've chosen I'm like well
00:09:27.959
it's a different thing relational
00:09:30.160
databases are meant for long-term
00:09:31.880
storage search databases are
00:09:35.079
meant for um quick iteration and
00:09:37.839
throwing away and recreating pretty
00:09:40.240
quickly
00:09:41.959
so all right I needed to get this in I
00:09:44.959
downloaded a
00:09:46.120
CSV um from Kaggle and then I just
00:09:49.640
created a rake task um I'm going through
00:09:52.440
each row I probably could have done this
00:09:54.279
a lot like using like bulk or some other
00:09:57.279
stuff but I was like okay let's just get
00:09:58.440
it done um and I checked if it's new or
00:10:03.279
not so I wouldn't be writing uh multiple
00:10:05.519
copies of something and then yeah just
00:10:07.920
running rake import books that took
00:10:10.519
quite a long time um so yeah we've got
00:10:13.760
our we've got our data like ready to go
00:10:16.240
so the next thing is the search engine I
00:10:18.320
chose OpenSearch which is um from uh
00:10:22.519
Amazon and it's a fork of Elasticsearch
00:10:26.680
once the um the license changed from
00:10:31.120
MIT or no the Apache license to a new
00:10:34.959
like weird elastic license so it's this
00:10:37.839
whole like thing in the search World um
00:10:42.000
there's some drama it's quite fun
00:10:45.639
um so anyway we're using OpenSearch
00:10:47.920
today and I work at Bonsai we host
00:10:50.959
OpenSearch this is why I know a lot about
00:10:52.600
this stuff um so I grabbed uh some
00:10:55.279
credentials from a cluster that I'm
00:10:57.680
using for testing and I put it in my .env
00:11:02.079
file um you can do it in your Rails
00:11:04.240
credentials file as well um but I just
00:11:06.320
was doing this quick and dirty um and
00:11:08.639
then I consume that URL by setting up a
00:11:11.120
new OpenSearch uh client which is also
00:11:13.680
set up in my Gemfile um and then this
00:11:17.639
default uh is the localhost 9200 that's
00:11:21.760
if you have it locally that's like the
00:11:23.440
standard host um for OpenSearch or
00:11:26.200
Elasticsearch if you're running it on
00:11:27.440
your local machine all right so we've
00:11:29.959
got the search engine we've got our data
00:11:31.720
let's move on to getting that data into
00:11:35.720
um an index so the important thing I'm
00:11:39.320
going to go back a few slides here is
00:11:42.639
that I really should have added this
00:11:44.600
other slide sorry
00:11:46.720
y'all we have a search box 10 Blue Links
00:11:51.240
this is like the standard in the search
00:11:53.440
design World you've got 10 your top 10
00:11:55.760
results are the first ones that people
00:11:57.600
see after a search
00:11:59.800
and I'm only I'm only worried about the
00:12:01.560
title the author and the description and
00:12:05.760
then its genre so it's just a very
00:12:08.519
few fields that I need to get things
00:12:11.200
started for the driving the creation of
00:12:14.279
the index one of the things that people
00:12:17.399
get stuck on is they say okay I have all
00:12:20.120
of these fields on the book I've got the
00:12:22.279
ISBN I've got you know quotes I've got
00:12:26.519
um rating review full text
00:12:29.680
like all these other things and they
00:12:31.880
just like shove it all into the index
00:12:34.600
but the design did not account for any
00:12:36.480
of those things yet so there's no reason
00:12:38.680
to blow up your index to a giant size
00:12:40.920
for things that you're not going to use
00:12:42.079
in your design um so that keeping that
00:12:45.519
in mind the design builds the index not
00:12:48.760
the other way around um I'm going to go
00:12:51.360
quickly back to my other slide
00:12:58.120
here
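To make "the design builds the index" concrete, here is a minimal sketch of mappings limited to the fields the UI uses (title, author, description, genre). The constant name, the helper `index_document`, the field types, and the sample record are all assumptions for illustration, not code from the repo:

```ruby
require "json"

# Hypothetical mappings: only the fields the design calls for.
# text fields are analyzed for full-text search; keyword is kept as an
# exact value, which suits filtering and aggregations on genres.
BOOK_MAPPINGS = {
  properties: {
    title:       { type: "text" },
    author:      { type: "text" },
    description: { type: "text" },
    genres:      { type: "keyword" }
  }
}.freeze

# Trim a full record down to the mapped fields before indexing, so
# unused columns (ISBN, quotes, reviews) never inflate the index.
def index_document(record)
  record.slice(*BOOK_MAPPINGS[:properties].keys)
end

# Made-up sample record; the isbn and description values are placeholders.
full_record = {
  title: "The Hunger Games",
  author: "Suzanne Collins",
  description: "placeholder description",
  genres: ["Young Adult"],
  isbn: "0000000000",
  quotes: []
}

puts JSON.generate(index_document(full_record))
```

If the design later grows a feature that needs another field, it gets added to the mappings and the index is rebuilt, rather than being indexed up front just in case.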
00:12:59.680
so I've got my indexing tools I need to
00:13:02.040
get data
00:13:03.240
in one of the things that we've done
00:13:06.720
over well that I've learned over the
00:13:08.600
years is how to use plain old Ruby
00:13:10.920
objects POROs to help with the process
00:13:14.680
of getting data into um our search
00:13:18.800
engine and then getting it out and one
00:13:21.120
of the things that we use is a search
00:13:23.279
model um module and uh it helps with
00:13:27.760
some like mappings declarations and
00:13:30.240
index features we include this in an
00:13:35.320
actual model that we want to search so
00:13:38.240
for example in the book
00:13:43.000
model I include the search model so we
00:13:45.880
have all those
00:13:49.920
um features including some um things
00:13:54.079
that add it to this searchable models
00:13:56.920
class and it has a few different
00:13:59.279
methods here the index name which we'll
00:14:01.639
use later in an indexing job the
00:14:04.959
mappings which help with the different
00:14:06.600
sort of like search tools that you want
00:14:08.680
to use later for querying index
00:14:11.399
features which are the actual fields and
00:14:13.959
their values um and then the index
00:14:16.920
action the reason why I'm calling it
00:14:19.199
index features will come to be
00:14:21.839
more clear in a second so in the book
00:14:25.199
I'm overwriting the index features and
00:14:27.279
I'm sending the things that I need for
00:14:28.800
that UI I know I want to show the author
00:14:31.440
or search by the author so I'm going to
00:14:33.480
have the author I know I want to use the
00:14:35.279
description to search with to sort of
00:14:37.240
see if I can like use things like Peeta
00:14:41.079
Katniss whatever the query might be to
00:14:43.959
have that show up because that's not in
00:14:45.199
the title but it might be helpful the
00:14:47.240
genres which we'll use to filter um the
00:14:50.600
rating which might be useful for scoring
00:14:52.920
and then also the title of course so
00:14:55.480
those are the features of the search and
00:14:57.320
that is actually a very specific term
00:14:59.040
in um the information retrieval world
00:15:03.279
is like if something is being used to
00:15:05.880
search a data set um it's a feature that
00:15:09.079
we use to search on so the author is a
00:15:12.680
field that is a feature that we are
00:15:14.320
searching
00:15:15.720
on there's a lot of like terminology
00:15:18.079
that I had to get used to in this world
00:15:20.560
um okay back to
00:15:27.959
it so we're including this model into
00:15:30.839
our book and we're creating a rake task
00:15:35.199
um with the namespace of opensearch
00:15:37.959
and for each model search model that I
00:15:40.199
include that module with I want to
00:15:42.519
create an indexing job uh for that
00:15:46.199
model so it's a nice way to not have to
00:15:48.720
like re like rewrite every single thing
00:15:51.519
out like index the books index the you
00:15:54.360
know genres or whatever it may be in
00:15:57.399
that indexing job we have um a perform
00:16:01.040
action which first deletes any um index
00:16:05.040
that exists with that same name so if I
00:16:06.759
already have an index called books in my
00:16:08.440
cluster I want to remove it because
00:16:10.000
we're reindexing I want to create the
00:16:12.480
new indices I'm going to put up I'm
00:16:14.600
going to push up the mappings and then
00:16:16.639
I'm going to bulk insert the
00:16:18.440
records and this batch size is just 50 I
00:16:21.680
could make it bigger or smaller
00:16:23.920
depending on you know the needs um it
00:16:27.480
also creates some nice functions to sort
00:16:30.399
of think about how long that indexing
00:16:31.959
job is taking and then for the next
00:16:34.160
batch size I might increase it or
00:16:36.000
decrease it depending on you know how
00:16:38.040
that um like in-flight request is
00:16:41.199
going so yeah we do rake opensearch:reindex
00:16:43.560
and now I've got records in my
00:16:46.240
index okay so the indexing tools are
00:16:48.720
done let's move on to searching tools
00:16:51.440
the searching tools they I mean there's
00:16:53.319
some Basics right we have to have a
00:16:54.480
controller action um we're calling it
00:16:57.480
search and books the index and we're
00:17:01.560
going to see if there if there's a
00:17:03.439
params with a q present we're going
00:17:06.439
to run this service called BookSearcher
00:17:08.919
new with the params and we're going to
00:17:10.880
respond with the different formats
00:17:12.959
available to
00:17:14.559
us in the BookSearcher I'm basically
00:17:18.079
rewriting things and pulling stuff out
00:17:20.120
of my action so that I don't have this
00:17:22.199
Giant action with all these different
00:17:23.919
things that I'm doing um there's a lot
00:17:26.640
of stuff that I didn't include in this
00:17:28.559
slide because it's actually like a much
00:17:30.520
larger file but the biggest and most
00:17:32.280
important one is that I have a run
00:17:33.919
action a run method where I am creating
00:17:38.080
this response variable and I'm using the
00:17:40.320
OpenSearch client to search with a body
00:17:43.840
with a method called generate query
00:17:46.000
which is a giant method within the
00:17:49.120
BookSearcher um service up here that
00:17:52.120
takes the params does a lot of if block
00:17:54.640
logic to figure out the best query for
00:17:58.559
the user to send to OpenSearch because
00:18:00.880
that's actually the biggest part is
00:18:02.520
figuring out all the different things
00:18:04.880
that you have to use with your inputs to
00:18:06.640
create this big ass J oh sorry JSON
00:18:09.840
object for OpenSearch um and
00:18:14.640
then I'm creating um a BookSearcher
00:18:19.360
response with the response from the
00:18:22.400
OpenSearch client um and the reason why I'm
00:18:24.559
doing that will be more clear in a
00:18:25.799
second and this BookSearcher response I
00:18:29.120
take the response from the OpenSearch
00:18:31.400
client and consume it and then I
00:18:33.919
have a bunch of like getter methods to
00:18:36.840
parse the
00:18:38.000
response
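A minimal sketch of what such a response wrapper could look like as a plain old Ruby object. The class name follows the talk, but the method names (`total`, `hits`, `sources`, `genre_buckets`) and the hand-written sample response are assumptions; the Hash shape (hits.hits, _source, aggregations.*.buckets) is the standard search response layout:

```ruby
# Hypothetical PORO: wraps the raw Hash returned by a search call so
# views can call small readers instead of digging through nested JSON.
class BookSearcherResponse
  def initialize(raw)
    @raw = raw
  end

  # Total number of matching documents, 0 if the key is missing.
  def total
    @raw.dig("hits", "total", "value") || 0
  end

  # The array of hit Hashes, empty if the search returned nothing.
  def hits
    @raw.dig("hits", "hits") || []
  end

  # Just the indexed documents, without score or highlight metadata.
  def sources
    hits.map { |hit| hit["_source"] }
  end

  # Buckets from a terms aggregation assumed to be named "genres".
  def genre_buckets
    @raw.dig("aggregations", "genres", "buckets") || []
  end
end

# Hand-written sample shaped like a search response, for illustration only.
raw = {
  "hits" => {
    "total" => { "value" => 1 },
    "hits" => [{ "_source" => { "title" => "The Hunger Games" } }]
  },
  "aggregations" => {
    "genres" => { "buckets" => [{ "key" => "Young Adult", "doc_count" => 1 }] }
  }
}

response = BookSearcherResponse.new(raw)
response.hits.each { |hit| puts hit.dig("_source", "title") }
```

The nil guards mean an empty or malformed response yields empty collections, so a view can iterate without extra checks.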
00:18:39.760
um yes I like the budget I've never
00:18:43.000
heard that
00:18:44.440
before um and just some really nice
00:18:47.400
little readers so that I can consume
00:18:49.360
that in the views and not have to do dig
00:18:52.960
you know and like go all the way into oh
00:18:55.840
I did just invent it yay that's great um
00:18:59.200
and dig into all of that JSON um I just
00:19:02.640
I hate seeing that in views and I like
00:19:04.640
things to be nice and clean when I'm
00:19:06.360
accessing things so that's why I've got
00:19:08.039
this BookSearcher
00:19:10.039
response so instead of something like this
00:19:12.320
with like response.dig hits hits each do
00:19:14.960
blah blah blah I have response.hits
00:19:17.919
each do Etc and I can keep on like
00:19:21.400
working down that
00:19:23.600
road okay the last piece is the front
00:19:26.400
end interpretation which you see a
00:19:28.280
little bit here here um but I've got my
00:19:32.360
title I've got a little form and then
00:19:34.960
I've got displaying the results that
00:19:37.600
come back this actually can be a lot
00:19:40.240
more complicated when you get into the
00:19:42.960
deep end of it and so that's why
00:19:45.360
partials are really freaking important
00:19:48.400
here um I've got a search form as a
00:19:50.880
partial each hit as a partial the
00:19:53.440
filters as a partial it's partials all
00:19:55.600
the way down um and there's like way
00:19:58.080
more that I could do when this is
00:19:59.919
fully built out as more of a production
00:20:02.559
app um okay that is a
00:20:06.600
lot um and I feel like most of the thing
00:20:09.840
that I was doing is just creating this
00:20:11.159
repo as a resource um and I'm gonna
00:20:14.440
continue to work on it and it's open
00:20:16.880
source please use it please ask me
00:20:18.440
questions on it if you'd like um Fork it
00:20:21.799
uh tell me you're so silly why would you
00:20:23.640
do a BookSearcher response you should do
00:20:25.240
this instead I don't know whatever um
00:20:27.480
and I will say as a final caveat like you
00:20:30.080
know all the things in here the the
00:20:32.360
different search tools the rake tasks
00:20:34.640
the way that I'm doing the indexing job
00:20:35.960
with a search model whatever it may be
00:20:38.200
these are best practices using the
00:20:40.679
POROs um having something to handle your
00:20:43.520
query response however there's also like
00:20:46.520
tech debt time constraints multiple
00:20:49.400
rewrites weird things going on with your
00:20:51.600
team and there's the thing that ends up
00:20:53.720
shipping and that's the most important
00:20:55.720
part what are we actually shipping and
00:20:58.400
like what are we getting in front of
00:20:59.840
users so I'd like to just offer that as
00:21:02.640
a final
00:21:04.120
close
00:21:05.799
questions