RubyConf 2022

Data indexing with RGB (Ruby, Graphs and Bitmaps)

In this talk, we will go on a journey through Zappi’s data history and how we are using Ruby, a graph database, and a bitmap store to build a unique data engine. A journey that starts with the problem of a disconnected data set and serialised data frames, and ends with the solution of an in-memory index. We will explore how we used RedisGraph to model the relationships in our data, connecting semantically equal nodes. Then delve into how a query layer was used to index a bitmap store and, in turn, led to us being able to interrogate our entire dataset orders of magnitude faster than before.

00:00:00.000 ready for takeoff
00:00:16.920 all right all right all right hi
00:00:18.840 everyone I'm super excited to be here
00:00:21.359 this morning at the first day of
00:00:23.460 rubyconf are you guys all excited
00:00:25.980 yeah nice you can feel the fresh
00:00:29.960 first morning of the conference energy
00:00:32.940 I'm Benji I'm a software engineer at
00:00:35.219 Zappi I'm originally from Cape Town in
00:00:37.620 South Africa but I currently live in
00:00:39.600 London just a little bit about me I love
00:00:42.660 traveling I love being in nature I love
00:00:45.000 cooking and eating all kinds of food so
00:00:46.980 you can imagine how much I've enjoyed
00:00:48.600 Houston
00:00:49.500 I also love technology coding and data
00:00:52.500 so if you're interested in any of these
00:00:54.420 come grab me afterwards for a chat
00:00:57.600 I'm going to be talking to you today
00:00:59.219 about a fun project that we've been
00:01:01.320 working on for the past year in the
00:01:03.359 Zappi X team which is Zappi's R&D unit
00:01:07.200 the project is all about a custom data
00:01:09.479 indexing system that we built using Ruby
00:01:11.820 graphs and bitmaps I hope you enjoy
00:01:15.540 before we dive into the nitty-gritty
00:01:17.340 here's just a quick overview of what
00:01:19.080 we'll be running through this morning
00:01:20.700 we'll start off with a bit of background
00:01:22.439 information and I'll paint a little
00:01:24.299 picture of the world before we had RGB
00:01:27.420 spoiler alert it was black and white
00:01:29.939 had to make that joke but we'll dig into
00:01:33.600 some of the problems that we had which
00:01:35.280 included how we're applying context onto
00:01:37.380 our data how we were storing our data
00:01:39.780 and how we lacked connections between
00:01:41.759 our data points
00:01:43.740 I'll then touch on
00:01:47.100 what we needed and run through a quick
00:01:48.479 demo of what we came up with and after
00:01:50.640 that we'll get a bit nerdy and talk
00:01:52.500 about the composition of the measure
00:01:53.939 store and dive into some of the
00:01:55.619 technical details around the bitmaps and
00:01:57.960 graphs
00:01:59.040 and to finish off with I'll run through
00:02:00.720 some of the metrics of the final
00:02:01.799 solution and what the next steps are for
00:02:03.840 the project
00:02:05.280 I hope that sounds good let's get
00:02:07.500 cracking so at Zappi we're all about
00:02:10.140 collecting survey data we've got suites
00:02:12.900 of research products which have
00:02:14.819 collections of questions or surveys on
00:02:17.040 them and on those we perform some
00:02:19.020 modeling to get useful insights so we're
00:02:22.020 usually testing stimulus such as videos
00:02:24.239 or images and we do the whole thing from
00:02:27.000 ensuring that the right people answer
00:02:28.620 the survey and that they're rewarded for
00:02:30.840 answering that survey all the way to
00:02:33.120 executing the IP in our computation
00:02:35.520 engine to drive the insights that
00:02:37.739 get displayed in the charts to our users
00:02:40.379 so we'll run through what that looks
00:02:42.300 like in our system real quick
00:02:44.280 so we've got our respondents that
00:02:46.860 answer the questions
00:02:48.959 in the survey
00:02:50.580 and these then get turned into what we
00:02:52.860 call measures now a measure is a digital
00:02:55.440 representation of a reading from the
00:02:57.780 real world and I might use question and
00:03:00.660 measure interchangeably throughout the
00:03:02.340 talk but that's just because our reading
00:03:04.680 from the real world is coming in
00:03:06.540 from survey questions
00:03:08.760 these questions get passed
00:03:10.860 into our reporting platform where we can
00:03:13.140 start doing some modeling on them
00:03:14.819 through our in-house computation engine
00:03:16.680 called Quattro which essentially
00:03:19.560 allows us to use Python's pandas through
00:03:22.319 Ruby so that we can store and perform
00:03:24.300 operations on these measures through the
00:03:27.120 form of a data frame our CTO Brendan
00:03:29.879 goes into the real reasons why we did
00:03:31.680 this in his talk from rubyconf in 2014
00:03:33.739 but at least for me the best reason is
00:03:36.360 that we just love Ruby
00:03:39.239 so these modeled measures then get
00:03:41.220 stored into our SQL database in the form
00:03:43.560 of serialized or pickled as they call
00:03:45.599 them in Python data frames and when our
00:03:48.720 users come into the platform they can
00:03:50.640 select the surveys that they're
00:03:51.900 interested in and then they can dive
00:03:53.640 into the various charts that we provide
00:03:55.319 when pulling the data for these charts
00:03:57.959 out we're fetching the respondent level
00:03:59.760 data from SQL and this data is mostly
00:04:02.700 pre-aggregated and cached and then we
00:04:05.700 can do some additional computation on it
00:04:07.860 to derive the useful Insight that gets
00:04:10.019 shown in the chart
00:04:12.000 so these charts allow for filtering of
00:04:13.860 respondents so that you can get a better
00:04:15.239 understanding of how different
00:04:16.799 demographics respond to your stimuli
00:04:18.720 and they can also give our users
00:04:20.639 benchmarks to compare their numbers
00:04:22.199 against
00:04:23.520 the platform at the moment is incredibly
00:04:25.860 good at this type of analysis where a
00:04:28.259 user has selected a subset of their
00:04:29.880 studies and they want to do some kind of
00:04:31.919 cross-comparison between them and the
00:04:35.340 architecture of the platform is also
00:04:37.139 really good at storing the dependencies
00:04:39.000 behind the models and that we're
00:04:41.220 Computing and the computation engine is
00:04:43.320 optimized for processing
00:04:45.060 these models and their dependencies
00:04:48.540 but we want more we want to query all of
00:04:52.139 our data and in real time so we want to
00:04:55.500 store the connections and
00:04:56.940 relationships between the different data
00:04:59.160 points that we run
00:05:00.780 all of this is so that we can do
00:05:02.940 meta-analysis over the whole data set so
00:05:05.520 that we can get an even deeper
00:05:06.720 understanding of the data
00:05:09.120 in our platform
00:05:10.919 so as you would imagine nothing is ever
00:05:12.960 that simple when you want all the things
00:05:15.479 so let's take a look at some of the
00:05:17.460 problems that we're facing that were
00:05:19.020 stopping us from getting there
00:05:22.020 the first problem that we needed to
00:05:24.120 consider is context
00:05:25.800 so when fetching all of the data for
00:05:27.900 something we need to make sure that that
00:05:29.880 data that we're fetching
00:05:31.740 actually represents the same thing and
00:05:34.440 the best way to think about this is
00:05:35.820 through an example so let's consider the
00:05:38.520 case where we want to find out how the
00:05:40.680 brand Yamaha is doing for a particular
00:05:43.139 metric something like ease of use so how
00:05:45.900 easy was this thing to use
00:05:48.840 so if we wanted to write a query for
00:05:50.820 this we'd say get me all of the data
00:05:53.100 where the measure is ease of use and the
00:05:55.919 brand is Yamaha
00:05:57.300 and we get all this data back and we
00:05:59.699 plot the distribution of it and we're
00:06:01.620 like hang on there's something funky in
00:06:03.240 the data we've got these two bumps where
00:06:05.580 some people thought that it was easy to
00:06:07.440 use and others thought that it was hard
00:06:09.180 to use which for the sake of this
00:06:11.100 example is unexpected
00:06:13.620 so we take a second to think about it
00:06:15.419 and we realize oh hang on a second
00:06:16.919 Yamaha make motorcycles and they also
00:06:19.560 make pianos this could be what's causing
00:06:22.080 this anomaly where motorcycles could be
00:06:24.180 pretty hard to use and pianos are maybe
00:06:26.759 pretty easy to use so now we need to
00:06:29.580 know in what context a given measure was
00:06:31.800 asked in that survey was it in the
00:06:33.840 vehicle category or was it in the
00:06:35.520 musical instruments category and this
00:06:37.860 additional level of context is really
00:06:39.479 key when running this meta-analysis and
00:06:42.360 for me the ease of use for both of these
00:06:44.699 categories would probably be zero I'd
00:06:46.620 crash the motorbike and probably flip
00:06:48.720 the piano out of Rage
00:06:50.940 but let's get back to it so
00:06:53.100 the next problem that we had is storage
00:06:54.960 so as I mentioned before we're storing
00:06:57.840 all of our data as serialized data
00:07:00.180 frames in SQL and we're also storing
00:07:03.000 each measure at the survey level so if
00:07:05.220 we run four surveys all with the same
00:07:07.560 measure we would end up with four
00:07:09.300 serialized measures which would need to
00:07:11.340 be concatenated together in order to get
00:07:14.580 back the data for those four measures
00:07:16.500 combined
00:07:18.419 so to put this into some code for a
00:07:20.819 given measure we need to go through each
00:07:22.500 of the surveys Fetch and deserialize the
00:07:25.259 measure and ultimately concatenate them
00:07:27.479 together at the end and you put this
00:07:29.580 onto a database where you've got tens of
00:07:31.560 thousands of surveys and that's a lot of
00:07:33.419 deserialization and concatenation so
00:07:36.000 it's pretty slow to put it lightly
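That slow path can be sketched in a few lines of Ruby. This is a hypothetical reconstruction with illustrative names (the real system unpickles pandas data frames via Quattro; here `Marshal` dumps of arrays stand in for serialized frames):

```ruby
# Hypothetical sketch of the old per-survey path: for a single measure we
# fetch, deserialize, and concatenate once per survey -- O(surveys)
# deserializations before we can look at any combined data.
def deserialize(blob)
  Marshal.load(blob) # stand-in for unpickling a data frame
end

def fetch_measure(measure_name, surveys)
  surveys
    .map { |survey| deserialize(survey.fetch(measure_name)) }
    .reduce(:+) # concatenate the per-survey frames at the end
end

# Two surveys, each storing its own serialized copy of the same measure.
surveys = [
  { "ease of use" => Marshal.dump([7, 5]) },
  { "ease of use" => Marshal.dump([8, 7]) }
]
fetch_measure("ease of use", surveys) # => [7, 5, 8, 7]
```

With tens of thousands of surveys, it is the per-survey deserialize step in the `map` that dominates, which is exactly the cost described above.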
00:07:39.539 the last thing that we had to consider
00:07:41.099 is the connecting of different data
00:07:43.259 points and we call this harmonization
00:07:45.680 harmonization is all about knowing what
00:07:48.120 data should be treated the same
00:07:50.039 and we can break this down into three
00:07:52.440 broad categories so you've got the
00:07:54.479 measure or the question the stimuli or
00:07:57.300 the context in which something was asked
00:07:58.979 and the audience or the people that
00:08:01.020 you're asking so we'll run through a
00:08:03.000 couple of examples of this
00:08:05.160 for measure-based harmonization let's
00:08:07.560 say we've got two questions uh question
00:08:10.319 one asked on a scale of one to ten how
00:08:12.539 easy is this to use
00:08:14.580 and question two asked after that
00:08:17.340 amazing experience how how would you
00:08:19.500 rate its ease of use out of 10. and
00:08:22.259 let's say that these two
00:08:23.879 measures mean the same thing so we get
00:08:26.160 this little bi-directional arrow
00:08:28.080 indicating that they're the same now
00:08:30.300 when fetching all of the data for
00:08:31.620 question one I should get the data for
00:08:33.360 both question one and question two
00:08:34.979 together and these measures are now what
00:08:37.380 we'd call harmonized
00:08:39.000 it's worth mentioning that these
00:08:41.339 probably shouldn't have been considered
00:08:42.599 the same because the second question is
00:08:45.000 putting in some additional bias so this
00:08:47.279 might skew the results so in fact
00:08:49.320 question two can only call itself the
00:08:51.660 same as question one but not the other
00:08:53.700 way around so now we have a one
00:08:55.860 directional arrow and question two
00:08:58.560 should have this data both question one
00:09:00.240 and question two and question one should
00:09:02.100 only have question one
00:09:04.080 so it's a lot of questions
00:09:06.360 um the same principle applies to other
00:09:07.980 factors of harmonization where in the
00:09:10.080 case of the stimuli you would treat the
00:09:11.760 metadata of the study as the question in
00:09:14.040 this example so something like this
00:09:15.959 where in one survey we asked motorcycles
00:09:18.720 and in another one we asked motorbikes
00:09:20.700 we know that these two things should be
00:09:22.380 the same thing going forward so their
00:09:24.300 data can be harmonized
00:09:28.140 so we needed something special something
00:09:30.240 that will allow us to query our entire
00:09:32.160 data set with both context and
00:09:34.320 harmonization and it also needs to be
00:09:36.720 fast like real time fast
00:09:39.600 so before I continue I forgot to mention
00:09:42.420 at the start that there's a prize for
00:09:43.860 whoever can guess the number of times I've
00:09:45.720 said data in this talk so come grab me
00:09:48.000 afterwards
00:09:49.320 but let's bring it back in we've gone
00:09:50.940 through what the world looked like
00:09:52.320 before the measure store and hopefully
00:09:54.240 we have an idea of the kinds of problems
00:09:56.160 that we were facing but now that's over
00:09:58.440 let's get ready for the cool stuff
00:10:01.080 welcome to the measure store and as the
00:10:03.779 name suggests we're storing measures the
00:10:06.779 measure store was built with the core
00:10:08.279 principle that the API needs to be very
00:10:10.440 simple and easy to understand when
00:10:12.720 making a request to the measure store
00:10:14.160 you can give it three things you can
00:10:16.019 give it the context or the scope
00:10:18.660 you can give it the measure and the
00:10:20.940 dimensions that you're interested in if
00:10:22.680 you're interested in them that last
00:10:23.940 one's optional but let's apply that to
00:10:26.040 the Yamaha example
00:10:27.779 so you can see over here that we're
00:10:29.640 scoping the query with measure_store.scoped
00:10:31.980 and we're scoping it to ask for surveys
00:10:34.680 that have been asked in a category of
00:10:36.720 motorcycle and brand of Yamaha and we're
00:10:40.260 also asking for the ease of use measure
00:10:42.240 at the end of it as seen in those Square
00:10:44.459 braces
00:10:46.320 and over here we're doing the same thing
00:10:48.300 but we're saying that we're only
00:10:49.800 interested in people that answered
00:10:51.300 with a rating between 7 and 10 so 7 8
00:10:54.839 9 10.
00:10:56.040 we'll dig into this a bit further after
00:10:57.779 the demo but you can also perform basic
00:10:59.880 operations on these queries such as
00:11:01.560 counting the number of respondents or
00:11:03.540 printing out the indexes
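To make that interface shape concrete, here is a toy in-memory stand-in written against what the talk describes (hypothetical class and data, not Zappi's implementation): scopes and measure dimensions each map to a Set of respondent ids, playing the role of bitmaps.

```ruby
require "set"

# Toy sketch of the measure store API: scoped(context)[measure, dimensions].
class ToyMeasureStore
  Query = Struct.new(:ids, :measures) do
    # Ask for a measure, optionally restricted to particular dimensions.
    def [](measure_name, dimensions = nil)
      dims = measures.fetch(measure_name)
      dims = dims.select { |d, _| dimensions.include?(d) } if dimensions
      answered = dims.values.reduce(Set.new) { |a, b| a | b } # OR over dimensions
      ids & answered                                          # AND with the scope
    end
  end

  def initialize(scopes:, measures:)
    @scopes   = scopes
    @measures = measures
  end

  # Scope the query: intersect the respondent sets for every condition given.
  def scoped(**conditions)
    ids = conditions.map { |attr, value| @scopes.fetch([attr, value]) }
                    .reduce { |a, b| a & b }
    Query.new(ids, @measures)
  end
end

store = ToyMeasureStore.new(
  scopes: {
    [:category, "motorcycle"] => Set[1, 2, 3],
    [:brand, "Yamaha"]        => Set[2, 3, 4]
  },
  measures: {
    "ease of use" => { 7 => Set[1, 2], 8 => Set[3], 9 => Set[4] }
  }
)

yamaha_bikes = store.scoped(category: "motorcycle", brand: "Yamaha")
yamaha_bikes["ease of use"]          # every rating for the scoped respondents
yamaha_bikes["ease of use", 7..10]   # only ratings between 7 and 10
```

The shapes mirror the slides: the scope narrows who you are asking about, the square brackets pick the measure, and the optional dimensions argument narrows which answers count.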
00:11:05.700 so having seen a little bit of how to
00:11:07.440 interface with the measure store let's
00:11:09.300 see it in action
00:11:11.820 and so for the first demo we're going to
00:11:14.519 see how a particular measure has
00:11:16.019 performed across all of the countries
00:11:17.579 that we run ads in so the measure that
00:11:19.620 we're going to look at is ad
00:11:20.820 distinctiveness on a scale of one to
00:11:22.920 five how distinctive was this ad
00:11:25.380 so what we'll do is we'll Loop over all
00:11:27.600 the countries that we run ads in and
00:11:29.339 we'll get back the distribution for that
00:11:31.680 measure on a given country and I haven't
00:11:33.660 shown it in this code snippet but we'll
00:11:35.760 print these out in an ASCII chart so
00:11:37.620 that we can visualize the data
00:11:40.380 um and here we go we can see it running
00:11:43.200 and there you get back all of the data
00:11:46.320 and so for the sake of this example
00:11:48.000 let's say we just want to compare the
00:11:49.620 United States versus the UK
00:11:52.019 so here is the data for the United
00:11:53.820 States and we can see that it's all come
00:11:56.100 back in 17 milliseconds this is for
00:11:59.160 800,000 respondents and we can see that the
00:12:01.380 trend is pretty good and that ads in the
00:12:03.060 state seem to be pretty distinctive
00:12:04.800 about 240,000 people are giving it a
00:12:07.620 full 5 out of five and only a few people
00:12:09.959 giving it a one or two out of five
00:12:13.019 now let's take a look at the UK and this
00:12:15.839 has come back in just six milliseconds
00:12:17.760 but there are slightly fewer respondents
00:12:19.680 with only 180,000 people in the studies
00:12:23.040 that have been run in the UK so we
00:12:25.019 can already tell that we've run more ads
00:12:26.459 in the US than in the UK but the
00:12:29.459 result has got an interesting Trend we
00:12:31.140 can immediately see that there are fewer
00:12:33.060 people giving the ad five out of five
00:12:35.160 and it's pretty flat between three and
00:12:37.440 five and quite a few people giving
00:12:39.839 it a two out of five so the ads in the
00:12:42.300 UK are less distinctive than
00:12:44.820 those in the states
00:12:46.500 or more likely the people in
00:12:48.360 the UK are slightly more pessimistic
00:12:50.100 than people in the states which is
00:12:52.079 probably due to the weather
00:12:55.740 um but now let's take a look at another
00:12:57.180 type of analysis which involves Crossing
00:12:59.579 of two different measures and this is
00:13:01.620 really popular when trying to see how
00:13:03.300 two measures will relate to each other
00:13:05.459 and what we'll do is we'll cross the
00:13:07.260 persuasion measure with the
00:13:09.779 watched full ad measure and this will
00:13:11.880 tell us if people who watch the full ad
00:13:13.560 find it more persuasive
00:13:16.200 so watched full ad only has two dimensions
00:13:18.480 it's either a yes or a no and we'll Loop
00:13:20.940 over those and do this cross product
00:13:22.860 between persuasion and that dimension
00:13:26.639 and we'll run that real quick
00:13:29.339 and we get the data back we get two
00:13:31.019 histograms
00:13:32.279 but let's dig into that so here's the
00:13:34.620 distribution for persuasion and those
00:13:36.540 who watch the full ad and this data came
00:13:39.180 back in 23 milliseconds and it's doing a
00:13:41.579 cross between 1.5 million respondents
00:13:44.220 who were asked the
00:13:46.860 persuasion question and 1.3 million
00:13:49.320 respondents who watched the full ad
00:13:51.420 and here's the same thing for those who
00:13:53.519 didn't watch the full ad which only took
00:13:55.139 eight milliseconds but that's just
00:13:57.240 because there's far less respondents and
00:13:59.519 we can see that generally those who
00:14:00.899 watched the full ad found it more
00:14:02.579 persuasive which we probably could have
00:14:04.380 predicted
00:14:05.279 because they actually watched the whole
00:14:06.839 thing
00:14:08.040 but the last quick demo that I'll do
00:14:10.079 is for harmonization and a classic in
00:14:12.420 this problem is looking at regions and
00:14:15.180 over here we've got all the regions for
00:14:16.920 the United States and we can see that
00:14:18.959 there's a bit of a translations issue
00:14:20.459 we've got studies that have been
00:14:22.079 run in Spanish but also in English so
00:14:24.779 let's harmonize this data so that we can
00:14:27.139 analyze them all together and to start
00:14:29.880 off with we'll look at the counts before
00:14:31.860 harmonization for the Northeast region
00:14:34.019 and Noreste and please excuse my
00:14:36.839 pronunciation I don't really speak
00:14:38.760 Spanish
00:14:40.500 but we've got 360,000
00:14:42.360 respondents from the Northeast
00:14:44.279 region and 89,000 from Noreste
00:14:47.639 and we can go and harmonize that by
00:14:50.820 using this semantically same EQ function
00:14:53.880 for semantic equality and for the sake
00:14:56.579 of this example we're going to give it
00:14:58.199 the bi-directional truth so that means
00:15:00.240 the bi-directional arrow
00:15:03.480 and we run that and we can see that when
00:15:05.880 you count the number of respondents for
00:15:07.440 that they both have the same thing
00:15:10.079 so these these two points are now
00:15:11.760 harmonized and we'll do the same for
00:15:13.860 South and Sur
00:15:15.899 um so we can see the counts before doing
00:15:17.820 it over here
00:15:19.860 and we can do this harmonization but
00:15:22.079 this time we're not going to say that
00:15:23.339 it's bi-directional
00:15:24.720 which is the default option as you
00:15:26.940 can see and we run that and here we get
00:15:30.300 the data back and we can see that South
00:15:32.160 now includes Sur but Sur doesn't include
00:15:35.339 South so the data has been harmonized
00:15:37.800 in one direction
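The two behaviours in that demo can be sketched with toy data (illustrative sets, hypothetical helper, not the real store): directed "equals" edges decide whose respondents a value also includes when you count.

```ruby
require "set"

# Toy respondent sets per region value (illustrative, not the real counts).
DATA = {
  "South" => Set[1, 2, 3],
  "Sur"   => Set[4, 5]
}

# Directed harmonization edges: value => other values whose data it includes.
EDGES = Hash.new { |h, k| h[k] = [] }

def count(value, data, edges)
  ids = data[value].dup
  edges[value].each { |other| ids |= data[other] }
  ids.size
end

count("South", DATA, EDGES) # => 3, before harmonization

# One-directional harmonization: South now includes Sur's data...
EDGES["South"] << "Sur"
count("South", DATA, EDGES) # => 5
# ...but Sur still only has its own respondents.
count("Sur", DATA, EDGES)   # => 2
```

A bi-directional harmonization, as with Northeast and Noreste, would simply add the reverse edge as well, so both counts end up equal.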
00:15:40.199 all right so we've seen the measure
00:15:43.019 store in action and we understand a bit
00:15:44.940 about the API and how to interact with
00:15:46.860 it now let's go into what's happening
00:15:49.139 under the hood
00:15:50.459 so we can break down the measure store
00:15:52.199 into two key components you've got the
00:15:54.899 part that stores the raw data and you've
00:15:56.940 got the part that stores the context
00:15:58.620 onto that data
00:16:00.420 so we'll start off by talking about the
00:16:02.160 way that we store the data and the best
00:16:03.959 way to think about the raw data store is
00:16:05.940 as an index with a bunch of columns so
00:16:09.000 in our application the index is
00:16:11.100 respondents and the columns are going to
00:16:13.740 be the measures broken down into
00:16:15.240 dimensions
00:16:16.560 so each Dimension is a bitmap and if a
00:16:20.040 respondent answered yes for a particular
00:16:21.779 Dimension they'll get a one and if not
00:16:24.060 they'll get a zero so let's take a look
00:16:26.579 at what that looks like so we've got
00:16:28.199 four respondents and they were all asked
00:16:30.120 the ease of use question and we know
00:16:32.279 that this question is a 10 point scale
00:16:33.899 so you've got 10 Dimensions going across
00:16:36.420 respondent number one answered with a seven so
00:16:39.060 they've got a one for seven and zero
00:16:41.339 for everything else
00:16:43.139 and respondent two answered with a five so
00:16:45.000 their row looks like this and respondent
00:16:47.519 three gave an eight and four gave a
00:16:49.560 seven
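That column layout can be sketched with Ruby Integers standing in for the bitmaps (purely illustrative; the real store keeps these in Redis): one bitmap per dimension, where bit i is set if respondent i gave that rating.

```ruby
# respondent id => rating on the 10-point ease-of-use scale
answers = { 1 => 7, 2 => 5, 3 => 8, 4 => 7 }

# Build one bitmap (Integer) per dimension from the answers.
dimensions = Hash.new(0)
answers.each { |respondent, rating| dimensions[rating] |= (1 << respondent) }

format("%05b", dimensions[7]) # => "10010" -- respondents 1 and 4 answered 7
format("%05b", dimensions[5]) # => "00100" -- respondent 2 answered 5
```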
00:16:50.880 uh so there we have it this is how we
00:16:52.920 store our ease of use measure for all of
00:16:55.740 our respondents and when we first
00:16:58.259 implemented this it was really fast to
00:17:00.180 return all of the data for a measure but
00:17:02.699 we noticed that it was getting quite big
00:17:03.899 in terms of size and we can see it in
00:17:06.179 this example that we need 10 bits to
00:17:08.579 store a single respondent which is
00:17:10.919 arguably better than 32 or 64 but we
00:17:15.299 still wanted to get this more compressed
00:17:16.740 this data is really sparse and we found
00:17:18.959 a compression algorithm called roaring
00:17:20.880 bitmaps which essentially allows us to
00:17:23.520 discard the zeros so our bitmap actually
00:17:26.220 looks something like this
00:17:28.799 so now we only need one bit to store a
00:17:31.140 single respondent which is pretty neat
00:17:33.419 and so we've optimized the store to get
00:17:35.700 all of the data for a given measure
00:17:37.380 regardless of the survey so no more
00:17:39.720 iteration through surveys no more
00:17:41.460 deserialization and no more
00:17:43.440 concatenation to pull it all together
00:17:45.660 yay
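Roaring bitmaps are really a container-based format (sorted arrays for sparse chunks, packed bits for dense ones), but the core space win on sparse data can be sketched with a hypothetical class that stores only the positions of the set bits:

```ruby
# Sketch of the sparse idea behind roaring compression: instead of one bit
# per possible position (mostly zeros), keep a sorted array of set positions.
class SparseBitmap
  attr_reader :positions

  def initialize(positions = [])
    @positions = positions.sort.uniq
  end

  def get(pos)
    @positions.bsearch { |p| p >= pos } == pos
  end

  def cardinality
    @positions.size
  end

  # AND and OR become intersection and union of the stored positions.
  def &(other)
    SparseBitmap.new(@positions & other.positions)
  end

  def |(other)
    SparseBitmap.new(@positions | other.positions)
  end
end

# Dimension "7" of ease of use: out of many respondents only 1 and 4 are set,
# so we store two positions instead of one bit per possible respondent.
dim_seven = SparseBitmap.new([1, 4])
dim_seven.get(4)      # => true
dim_seven.cardinality # => 2
```

The actual roaring format is cleverer than this (it switches container type per 64K chunk depending on density), but the "discard the zeros" intuition is the same.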
00:17:47.400 but let's take a look at what
00:17:48.780 that would be like if we did in fact
00:17:50.340 have some surveys
00:17:52.919 so we'll add two surveys into the mix
00:17:54.720 we've got respondents one and two who
00:17:57.059 have answered survey number one and
00:17:58.919 three and four who have answered survey
00:18:00.660 number two
00:18:01.740 in terms of the query that we saw
00:18:03.419 earlier these would be the scope part of
00:18:06.000 their query so we've added two bitmaps
00:18:08.520 one for each survey and if we want to
00:18:10.200 fetch the data for survey number one for
00:18:12.720 the ease of use measure we can fetch all
00:18:14.940 of the respondents that answered survey
00:18:16.380 number one and we can take the ease of
00:18:19.080 use measure and do a bitwise and between
00:18:21.299 those two so it would look something
00:18:22.919 like this
00:18:24.480 and voila we have all of the data for
00:18:27.179 survey number one and the ease of use
00:18:28.980 measure and all of that in a few CPU
00:18:31.200 cycles and the same applies for survey
00:18:33.780 number two
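In code, that survey-scoped fetch is a single bitwise AND. Here Ruby Integers stand in for the bitmaps (bit i = respondent i; values mirror the four-respondent example, purely illustrative):

```ruby
survey_one = 0b00110 # respondents 1 and 2 answered survey one
survey_two = 0b11000 # respondents 3 and 4 answered survey two
ease_seven = 0b10010 # respondents 1 and 4 answered 7 on ease of use

survey_one & ease_seven # => 0b00010 -- respondent 1 only
survey_two & ease_seven # => 0b10000 -- respondent 4 only
```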
00:18:35.760 um and this principle goes for any type
00:18:37.559 of data that we're storing in the bitmap
00:18:39.780 store so we can combine different
00:18:41.700 measures or Scopes using these bit Ops
00:18:46.140 but now that we understand the way that
00:18:47.400 we store the data let's take a look
00:18:49.440 at how we connect it all together
00:18:52.200 so initially when thinking about the
00:18:54.299 problem of storing the relationships
00:18:56.160 between the data points we looked at
00:18:58.200 using a many-to-many relationship in a
00:19:00.059 SQL table
00:19:01.320 but when we started to think about it we
00:19:03.240 realized that this wasn't the right tool
00:19:04.799 for the job and the main reason for that
00:19:07.320 was because we needed to do what we call
00:19:09.419 multi-hop traversal so it turned into a
00:19:12.059 graph problem and to show you what I
00:19:13.919 mean by this I'll show you the structure
00:19:16.140 of the graph
00:19:17.760 and the first node that I'll introduce
00:19:20.039 is called the scope node and it
00:19:22.320 essentially stores an entity attribute
00:19:24.360 value triple as well as a storage key so
00:19:27.299 in this case the entity is survey the
00:19:29.820 attribute is category and
00:19:32.039 the value is motorbike and we can do this
00:19:34.440 for the other variants of the scope so
00:19:36.539 we've got motorcycle and two wheel we
00:19:39.660 can now put an equals Edge between them
00:19:41.400 to indicate that they should be
00:19:42.720 harmonized and it's a one directional
00:19:44.760 Arrow as you can see
00:19:46.799 so now when I want to fetch all of the
00:19:48.780 data for the scope motorbike I now
00:19:52.500 know to fetch that of motorcycle and for
00:19:54.840 two wheels even though there isn't this
00:19:56.520 direct link between motorbike on the
00:19:58.740 left and two wheel on the right and
00:20:00.539 that's just multi-hop traversal
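Multi-hop traversal is just reachability along the directed equals edges. A minimal sketch over a hypothetical adjacency hash (not RedisGraph) shows motorbike reaching two wheel through motorcycle with no direct edge between them:

```ruby
# Directed "equals" edges: each key harmonizes to the values it points at.
EQUALS = {
  "motorbike"  => ["motorcycle"],
  "motorcycle" => ["two wheel"],
  "two wheel"  => []
}

# Breadth-first walk collecting every scope reachable over equals edges.
def harmonized_scopes(start, edges)
  seen  = [start]
  queue = [start]
  until queue.empty?
    edges.fetch(queue.shift, []).each do |neighbour|
      next if seen.include?(neighbour)
      seen  << neighbour
      queue << neighbour
    end
  end
  seen
end

harmonized_scopes("motorbike", EQUALS) # => ["motorbike", "motorcycle", "two wheel"]
harmonized_scopes("two wheel", EQUALS) # => ["two wheel"] -- edges only go one way
```

In the real system this walk is expressed as a graph query instead of hand-rolled Ruby, but the semantics (follow one-directional equals edges to any depth) are the same.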
00:20:03.900 um and at this point it's worth
00:20:05.100 mentioning that we're using a property
00:20:06.960 graph so we can store the node type
00:20:09.179 which is scope and we can store some
00:20:11.460 properties on it which is
00:20:13.679 that hash of entity attribute value
00:20:16.080 and the storage key is a pointer to the
00:20:18.600 bitmap that stores the respondents for
00:20:20.940 that particular scope a bit more on this
00:20:23.580 to come
00:20:24.780 and the next node that we have is the
00:20:27.120 measure node which has a value property
00:20:29.340 to say what the measure name is so ease
00:20:32.220 of use and we can scope this measure by
00:20:34.799 adding a scoped Edge to indicate that
00:20:37.020 this measure has been asked in the
00:20:38.820 context of the connected scopes
00:20:42.200 and a measure node needs to have
00:20:44.520 measurements and these measurements have
00:20:46.980 a storage key and these storage Keys
00:20:49.799 point to the bitmaps that hold the raw
00:20:51.780 data for that particular Dimension or
00:20:54.120 measurement and in the example of the
00:20:56.400 bitmaps that we saw earlier each of
00:20:58.500 these would be a single column of the
00:21:00.600 dimensions for the ease of use measure
00:21:03.419 the measurement nodes capture a measure
00:21:05.940 so we can add that edge to them and
00:21:08.580 we've got this captures edge
00:21:11.700 and finally we've got the constant node
00:21:13.919 which contains the details of the
00:21:15.780 measurement and this has a value
00:21:17.580 property to say what the dimension was
00:21:19.740 so 5 6 or 7.
00:21:22.320 and these are connected
00:21:25.679 to the measurement nodes through a
00:21:27.000 measures Edge
00:21:28.320 and these measures edges have a
00:21:30.659 dimension property to indicate the type
00:21:32.460 of constant that it is so in this case
00:21:34.200 it was a choice
00:21:36.299 so sorry that's a lot to take in
00:21:38.820 um and if I lost you in that process let
00:21:40.620 me quickly summarize how it all connects
00:21:42.840 together
00:21:44.039 so we have the graph with the nodes and
00:21:46.919 edges which store the Scopes measures
00:21:49.500 and the relationships between them and
00:21:52.080 then we've got the bitmaps which hold
00:21:53.580 the raw data so when I run a query in
00:21:56.280 the measure store with the scope of
00:21:58.020 category as motorbike
00:21:59.880 the graph will go and fetch that scope
00:22:02.820 and all of its harmonized Scopes which
00:22:05.460 are connected by that equals Edge and
00:22:07.740 from these Scopes we can take that
00:22:09.600 storage key property which is the
00:22:11.520 pointer to those bitmaps
00:22:13.980 and there we have our Scopes and that
00:22:16.620 part of the query is sorted so now we
00:22:19.200 need to grab the measure which is the
00:22:21.000 ease of use measure so we find the
00:22:23.280 measure node which has the value ease of
00:22:25.320 use and we grab all of its measurements
00:22:28.080 and from those measurements we can get
00:22:30.419 the bitmaps from the storage key
00:22:32.700 and there we have all the data that's
00:22:35.580 needed to fetch our final
00:22:38.640 measure
00:22:40.260 um so to get that final measure we'll do
00:22:42.480 a bitwise or of all the scope bitmaps
00:22:45.120 and a bitwise or over all the measure
00:22:47.340 bitmaps and do an and between those to
00:22:49.919 get the final result
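That final assembly (OR the harmonized scope bitmaps, OR the measure's dimension bitmaps, AND the two together) could be sketched as, again with Integers standing in for roaring bitmaps and all values illustrative:

```ruby
# bit i = respondent i
motorbike  = 0b0000_0111                          # scope bitmap
motorcycle = 0b0001_1000                          # harmonized scope bitmap
ease_dims  = [0b0010_0101, 0b0001_0010]           # one bitmap per dimension

scope_bits   = [motorbike, motorcycle].reduce(:|) # OR over harmonized scopes
measure_bits = ease_dims.reduce(:|)               # OR over the measure's dimensions
result       = scope_bits & measure_bits          # AND: in scope AND answered

format("%08b", result) # => "00010111"
```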
00:22:52.679 so you can see that we're using the
00:22:54.720 graph to determine what data to fetch
00:22:56.760 and we fetch the corresponding data from
00:22:59.039 that bitmap store
00:23:00.600 and there we have it our data index
00:23:02.880 using Ruby graphs and bitmaps
00:23:05.820 but I haven't mentioned how it's running
00:23:07.620 in production or what tools we're using
00:23:09.960 to implement it so let me introduce to
00:23:12.480 you our good friend redis
00:23:14.460 and because of how compressed the data
00:23:16.620 is with that roaring compression
00:23:18.539 and more on this in the next section
00:23:19.860 we're able to store both the graph and
00:23:22.260 the bitmaps in memory through redis and
00:23:25.140 redis provides us with a really useful
00:23:27.179 graph database module called
00:23:29.520 RedisGraph which is what we're
00:23:32.640 using to store those nodes and edges
00:23:35.299 RedisGraph supports openCypher
00:23:37.679 which is one of the more popular query
00:23:39.480 languages for graph databases and so
00:23:42.179 this could be translatable to other
00:23:43.880 graph technologies
00:23:46.860 and for the bitmap store we're using
00:23:48.419 a module called redis-roaring which
00:23:51.360 was written by aviggiano
00:23:53.760 um and we've been really impressed by
00:23:55.260 how fast the whole system has been
00:23:56.880 running so let's take a look at some of
00:23:59.520 the performance metrics
00:24:01.860 and to give a rough Benchmark for the
00:24:04.919 for the measure store I've fetched all
00:24:07.020 of the age measures for one of our
00:24:08.940 products where we've got 3,608 surveys
00:24:12.620 and this has a total of 1.5 million
00:24:15.419 respondents I ran this on the current
00:24:17.520 reporting platform which is doing that
00:24:19.440 whole fetching of the measure and
00:24:22.100 deserializing and concatenating of them
00:24:24.179 and it came back in 90 seconds
00:24:27.059 and I did the same thing in the measure
00:24:28.740 store which is scoping the query on the
00:24:30.780 product and fetching the age bitmap and
00:24:33.120 it took
00:24:34.760 495 milliseconds which is a 180x speed
00:24:38.220 up
00:24:39.059 I also did a test to see how fast it
00:24:40.980 could return all of the data for the age
00:24:43.260 measure across all of our products and
00:24:45.900 this just took nine milliseconds which
00:24:47.760 is for about 5 million respondents and
00:24:49.799 it's so much faster because we don't
00:24:51.780 have to Traverse the graph to find those
00:24:53.880 Scopes we can just go and fetch that
00:24:55.559 data straight from the bitmap store
00:24:58.919 and to compare the storage benefits of
00:25:01.140 the measure store I loaded a subset of
00:25:03.900 40 studies from the current reporting
00:25:05.940 platform into a SQL database and they
00:25:09.000 took up about 3.6 gigs and when I loaded
00:25:12.539 that same data into the measure store
00:25:13.980 the graph came up to 23.3 megabytes and
00:25:17.760 the bitmaps came up to just 4.46
00:25:20.159 megabytes so a total of 27 megabytes
00:25:23.580 and that's about a 128x
00:25:26.880 compression ratio
00:25:30.360 um so what's next for the measure store
00:25:32.520 well we've just graduated the project
00:25:34.200 out of its R&D phase and we're starting
00:25:36.779 to scale it up with a few products in
00:25:38.820 mind uh and in terms of our main
00:25:41.279 priorities right now we're doing some
00:25:42.779 battle testing to see how it will behave
00:25:45.059 under extreme load
00:25:46.620 and at the same time we've got folks who
00:25:48.539 are looking into how we can use the
00:25:49.860 measure store as a back end for modeling
00:25:52.380 data through the bitmaps and we hope
00:25:54.900 that one day we'll be able to open
00:25:55.980 source it so that if you also need to
00:25:57.960 harmonize your data with context you can
00:26:00.179 use it too
00:26:01.559 thank you for listening to me go on
00:26:03.179 about graphs and bitmaps and data enjoy
00:26:05.760 the rest of the conference