RubyConf 2022

Data indexing with RGB (Ruby, Graphs and Bitmaps)

In this talk, we will go on a journey through Zappi’s data history and how we are using Ruby, a graph database, and a bitmap store to build a unique data engine. A journey that starts with the problem of a disconnected data set and serialised data frames, and ends with the solution of an in-memory index. We will explore how we used RedisGraph to model the relationships in our data, connecting semantically equal nodes. Then delve into how a query layer was used to index a bitmap store and, in turn, led to us being able to interrogate our entire dataset orders of magnitude faster than before.

00:00:00.000 ready for takeoff
00:00:16.920 all right all right all right hi
00:00:18.840 everyone I'm super excited to be here
00:00:21.359 this morning at the first day of
00:00:23.460 rubyconf are you guys all excited
00:00:25.980 yeah nice you can feel the fresh
00:00:29.960 first morning of the conference energy
00:00:32.940 I'm Benji I'm a software engineer at
00:00:35.219 Zappi I'm originally from Cape Town in
00:00:37.620 South Africa but I currently live in
00:00:39.600 London just a little bit about me I love
00:00:42.660 traveling I love being in nature I love
00:00:45.000 cooking and eating all kinds of food so
00:00:46.980 you can imagine how much I've enjoyed
00:00:48.600 Houston
00:00:49.500 I also love technology coding and data
00:00:52.500 so if you're interested in any of these
00:00:54.420 come grab me afterwards for a chat
00:00:57.600 I'm going to be talking to you today
00:00:59.219 about a fun project that we've been
00:01:01.320 working on for the past year in the
00:01:03.359 Zappi X team which is Zappi's R&D unit
00:01:07.200 the project is all about a custom data
00:01:09.479 indexing system that we built using Ruby
00:01:11.820 graphs and bitmaps I hope you enjoy
00:01:15.540 before we dive into the nitty-gritty
00:01:17.340 here's just a quick overview of what
00:01:19.080 we'll be running through this morning
00:01:20.700 we'll start off with a bit of background
00:01:22.439 information and I'll paint a little
00:01:24.299 picture of the world before we had RGB
00:01:27.420 spoiler alert it was black and white
00:01:29.939 had to make that joke but we'll dig into
00:01:33.600 some of the problems that we had which
00:01:35.280 included how we're applying context onto
00:01:37.380 our data how we were storing our data
00:01:39.780 and how we lacked connections between
00:01:41.759 our data points
00:01:43.740 I'll then touch on
00:01:47.100 what we needed and run through a quick
00:01:48.479 demo of what we came up with and after
00:01:50.640 that we'll get a bit nerdy and talk
00:01:52.500 about the composition of the measure
00:01:53.939 store and dive into some of the
00:01:55.619 technical details around the bitmaps and
00:01:57.960 graphs
00:01:59.040 and to finish off with I'll run through
00:02:00.720 some of the metrics of the final
00:02:01.799 solution and what the next steps are for
00:02:03.840 the project
00:02:05.280 I hope that sounds good let's get
00:02:07.500 cracking so at Zappi we're all about
00:02:10.140 collecting survey data we've got suites
00:02:12.900 of research products which have
00:02:14.819 collections of questions or surveys on
00:02:17.040 them and on those we perform some
00:02:19.020 modeling to get useful insights so we're
00:02:22.020 usually testing stimulus such as videos
00:02:24.239 or images and we do the whole thing from
00:02:27.000 ensuring that the right people answer
00:02:28.620 the survey and that they're rewarded for
00:02:30.840 answering that survey all the way to
00:02:33.120 executing the IP in our computation
00:02:35.520 engine to drive the insights that
00:02:37.739 get displayed in the charts to our users
00:02:40.379 so we'll run through what that looks
00:02:42.300 like in our system real quick
00:02:44.280 so we've got our respondents that
00:02:46.860 answer the questions
00:02:48.959 in the survey
00:02:50.580 and these then get turned into what we
00:02:52.860 call measures now a measure is a digital
00:02:55.440 representation of a reading from the
00:02:57.780 real world and I might use question and
00:03:00.660 measure interchangeably throughout the
00:03:02.340 talk but that's just because our reading
00:03:04.680 from the real world is coming in
00:03:06.540 from survey questions
00:03:08.760 these questions get passed
00:03:10.860 into our reporting platform where we can
00:03:13.140 start doing some modeling on them
00:03:14.819 through our in-house computation engine
00:03:16.680 called Quattro which essentially
00:03:19.560 allows us to use Python's pandas through
00:03:22.319 Ruby so that we can store and perform
00:03:24.300 operations on these measures through the
00:03:27.120 form of a data frame our CTO Brendan
00:03:29.879 goes into the real reasons why we did
00:03:31.680 this in his talk from rubyconf in 2014
00:03:33.739 but at least for me the best reason is
00:03:36.360 that we just love Ruby
00:03:39.239 so these modeled measures then get
00:03:41.220 stored into our SQL database in the form
00:03:43.560 of serialized or pickled as they call
00:03:45.599 them in Python data frames and when our
00:03:48.720 users come into the platform they can
00:03:50.640 select the surveys that they're
00:03:51.900 interested in and then they can dive
00:03:53.640 into the various charts that we provide
00:03:55.319 when pulling the data for these charts
00:03:57.959 out we're fetching the respondent level
00:03:59.760 data from SQL and this data is mostly
00:04:02.700 pre-aggregated and cached and then we
00:04:05.700 can do some additional computation on it
00:04:07.860 to derive the useful Insight that gets
00:04:10.019 shown in the chart
00:04:12.000 so these charts allow for filtering of
00:04:13.860 respondents so that you can get a better
00:04:15.239 understanding of how different
00:04:16.799 demographics respond to your stimuli
00:04:18.720 and they can also give our users
00:04:20.639 benchmarks to compare their numbers
00:04:22.199 against
00:04:23.520 the platform at the moment is incredibly
00:04:25.860 good at this type of analysis where a
00:04:28.259 user has selected a subset of their
00:04:29.880 studies and they want to do some kind of
00:04:31.919 cross-comparison between them and the
00:04:35.340 architecture of the platform is also
00:04:37.139 really good at storing the dependencies
00:04:39.000 behind the models and that we're
00:04:41.220 Computing and the computation engine is
00:04:43.320 optimized for processing
00:04:45.060 these models and their dependencies
00:04:48.540 but we want more we want to query all of
00:04:52.139 our data and in real time so we want to
00:04:55.500 store the connections and
00:04:56.940 relationships between the different data
00:04:59.160 points that we run
00:05:00.780 all of this is so that we can do
00:05:02.940 meta-analysis over the whole data set so
00:05:05.520 that we can get an even deeper
00:05:06.720 understanding of the data
00:05:09.120 in our platform
00:05:10.919 so as you would imagine nothing is ever
00:05:12.960 that simple when you want all the things
00:05:15.479 so let's take a look at some of the
00:05:17.460 problems that we're facing that were
00:05:19.020 stopping us from getting there
00:05:22.020 the first problem that we needed to
00:05:24.120 consider is context
00:05:25.800 so when fetching all of the data for
00:05:27.900 something we need to make sure that that
00:05:29.880 data that we're fetching
00:05:31.740 actually represents the same thing and
00:05:34.440 the best way to think about this is
00:05:35.820 through an example so let's consider the
00:05:38.520 case where we want to find out how the
00:05:40.680 brand Yamaha is doing for a particular
00:05:43.139 metric something like ease of use so how
00:05:45.900 easy was this thing to use
00:05:48.840 so if we wanted to write a query for
00:05:50.820 this we'd say get me all of the data
00:05:53.100 where the measure is ease of use and the
00:05:55.919 brand is Yamaha
00:05:57.300 and we get all this data back and we
00:05:59.699 plot the distribution of it and we're
00:06:01.620 like hang on there's something funky in
00:06:03.240 the data we've got these two bumps where
00:06:05.580 some people thought that it was easy to
00:06:07.440 use and others thought that it was hard
00:06:09.180 to use which for the sake of this
00:06:11.100 example is unexpected
00:06:13.620 so we take a second to think about it
00:06:15.419 and we realize oh hang on a second
00:06:16.919 Yamaha make motorcycles and they also
00:06:19.560 make pianos this could be what's causing
00:06:22.080 this anomaly where motorcycles could be
00:06:24.180 pretty hard to use and pianos are maybe
00:06:26.759 pretty easy to use so now we need to
00:06:29.580 know in what context a given measure was
00:06:31.800 asked in that survey was it in the
00:06:33.840 vehicle category or was it in the
00:06:35.520 musical instruments category and this
00:06:37.860 additional level of context is really
00:06:39.479 key when running this meta-analysis and
00:06:42.360 for me the ease of use for both of these
00:06:44.699 categories would probably be zero I'd
00:06:46.620 crash the motorbike and probably flip
00:06:48.720 the piano out of Rage
00:06:50.940 but let's get back to it so
00:06:53.100 the next problem that we had is storage
00:06:54.960 so as I mentioned before we're storing
00:06:57.840 all of our data as serialized data
00:07:00.180 frames in SQL and we're also storing
00:07:03.000 each measure at the survey level so if
00:07:05.220 we run four surveys all with the same
00:07:07.560 measure we would end up with four
00:07:09.300 serialized measures which would need to
00:07:11.340 be concatenated together in order to get
00:07:14.580 back the data for those four measures
00:07:16.500 combined
00:07:18.419 so to put this into some code for a
00:07:20.819 given measure we need to go through each
00:07:22.500 of the surveys Fetch and deserialize the
00:07:25.259 measure and ultimately concatenate them
00:07:27.479 together at the end and you put this
00:07:29.580 onto a database where you've got tens of
00:07:31.560 thousands of surveys and that's a lot of
00:07:33.419 deserialization and concatenation so
00:07:36.000 it's pretty slow to put it lightly
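That slow path can be sketched in a few lines of Ruby. This is a hypothetical reconstruction with illustrative names (the real system unpickles pandas data frames via Quattro; here `Marshal` dumps of arrays stand in for serialized frames):

```ruby
# Hypothetical sketch of the old per-survey path: for a single measure we
# fetch, deserialize, and concatenate once per survey -- O(surveys)
# deserializations before we can look at any combined data.
def deserialize(blob)
  Marshal.load(blob) # stand-in for unpickling a data frame
end

def fetch_measure(measure_name, surveys)
  surveys
    .map { |survey| deserialize(survey.fetch(measure_name)) }
    .reduce(:+) # concatenate the per-survey frames at the end
end

# Two surveys, each storing its own serialized copy of the same measure.
surveys = [
  { "ease of use" => Marshal.dump([7, 5]) },
  { "ease of use" => Marshal.dump([8, 7]) }
]
fetch_measure("ease of use", surveys) # => [7, 5, 8, 7]
```

With tens of thousands of surveys, it is the per-survey deserialize step in the `map` that dominates, which is exactly the cost described above.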
00:07:39.539 the last thing that we had to consider
00:07:41.099 is the connecting of different data
00:07:43.259 points and we call this harmonization
00:07:45.680 harmonization is all about knowing what
00:07:48.120 data should be treated the same
00:07:50.039 and we can break this down into three
00:07:52.440 broad categories so you've got the
00:07:54.479 measure or the question the stimuli or
00:07:57.300 the context in which something was asked
00:07:58.979 and the audience or the people that
00:08:01.020 you're asking so we'll run through a
00:08:03.000 couple of examples of this
00:08:05.160 for measure-based harmonization let's
00:08:07.560 say we've got two questions uh question
00:08:10.319 one asked on a scale of one to ten how
00:08:12.539 easy is this to use
00:08:14.580 and question two asked after that
00:08:17.340 amazing experience how how would you
00:08:19.500 rate its ease of use out of 10. and
00:08:22.259 let's say that these two
00:08:23.879 measures mean the same thing so we get
00:08:26.160 this little bi-directional arrow
00:08:28.080 indicating that they're the same now
00:08:30.300 when fetching all of the data for
00:08:31.620 question one I should get the data for
00:08:33.360 both question one and question two
00:08:34.979 together and these measures are now what
00:08:37.380 we'd call harmonized
00:08:39.000 it's worth mentioning that these
00:08:41.339 probably shouldn't have been considered
00:08:42.599 the same because the second question is
00:08:45.000 putting in some additional bias so this
00:08:47.279 might skew the results so in fact
00:08:49.320 question two can only call itself the
00:08:51.660 same as question one but not the other
00:08:53.700 way around so now we have a one
00:08:55.860 directional arrow and question two
00:08:58.560 should have this data both question one
00:09:00.240 and question two and question one should
00:09:02.100 only have question one
00:09:04.080 so it's a lot of questions
00:09:06.360 um the same principle applies to other
00:09:07.980 factors of harmonization where in the
00:09:10.080 case of the stimuli you would treat the
00:09:11.760 metadata of the study as the question in
00:09:14.040 this example so something like this
00:09:15.959 where in one survey we asked motorcycles
00:09:18.720 and in another one we asked motorbikes
00:09:20.700 we know that these two things should be
00:09:22.380 the same thing going forward so their
00:09:24.300 data can be harmonized
00:09:28.140 so we needed something special something
00:09:30.240 that will allow us to query our entire
00:09:32.160 data set with both context and
00:09:34.320 harmonization and it also needs to be
00:09:36.720 fast like real time fast
00:09:39.600 so before I continue I forgot to mention
00:09:42.420 at the start that there's a prize for
00:09:43.860 whoever can guess the number of times I've
00:09:45.720 said data in this talk so come grab me
00:09:48.000 afterwards
00:09:49.320 but let's bring it back in we've gone
00:09:50.940 through what the world looked like
00:09:52.320 before the measure store and hopefully
00:09:54.240 we have an idea of the kinds of problems
00:09:56.160 that we were facing but now that's over
00:09:58.440 let's get ready for the cool stuff
00:10:01.080 welcome to the measure store and as the
00:10:03.779 name suggests we're storing measures the
00:10:06.779 measure store was built with the core
00:10:08.279 principle that the API needs to be very
00:10:10.440 simple and easy to understand when
00:10:12.720 making a request to the measure store
00:10:14.160 you can give it three things you can
00:10:16.019 give it the context or the scope
00:10:18.660 you can give it the measure and the
00:10:20.940 dimensions that you're interested in if
00:10:22.680 you're interested in them that last
00:10:23.940 one's optional but let's apply that to
00:10:26.040 the Yamaha example
00:10:27.779 so you can see over here that we're
00:10:29.640 scoping the query with measure_store.scoped
00:10:31.980 and we're scoping it to ask for surveys
00:10:34.680 that have been asked in a category of
00:10:36.720 motorcycle and brand of Yamaha and we're
00:10:40.260 also asking for the ease of use measure
00:10:42.240 at the end of it as seen in those Square
00:10:44.459 braces
00:10:46.320 and over here we're doing the same thing
00:10:48.300 but we're saying that we're only
00:10:49.800 interested in people that answered
00:10:51.300 with a rating between 7 and 10 so 7 8
00:10:54.839 9 10.
00:10:56.040 we'll dig into this a bit further after
00:10:57.779 the demo but you can also perform basic
00:10:59.880 operations on these queries such as
00:11:01.560 counting the number of respondents or
00:11:03.540 printing out the indexes
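To make that interface shape concrete, here is a toy in-memory stand-in written against what the talk describes (hypothetical class and data, not Zappi's implementation): scopes and measure dimensions each map to a Set of respondent ids, playing the role of bitmaps.

```ruby
require "set"

# Toy sketch of the measure store API: scoped(context)[measure, dimensions].
class ToyMeasureStore
  Query = Struct.new(:ids, :measures) do
    # Ask for a measure, optionally restricted to particular dimensions.
    def [](measure_name, dimensions = nil)
      dims = measures.fetch(measure_name)
      dims = dims.select { |d, _| dimensions.include?(d) } if dimensions
      answered = dims.values.reduce(Set.new) { |a, b| a | b } # OR over dimensions
      ids & answered                                          # AND with the scope
    end
  end

  def initialize(scopes:, measures:)
    @scopes   = scopes
    @measures = measures
  end

  # Scope the query: intersect the respondent sets for every condition given.
  def scoped(**conditions)
    ids = conditions.map { |attr, value| @scopes.fetch([attr, value]) }
                    .reduce { |a, b| a & b }
    Query.new(ids, @measures)
  end
end

store = ToyMeasureStore.new(
  scopes: {
    [:category, "motorcycle"] => Set[1, 2, 3],
    [:brand, "Yamaha"]        => Set[2, 3, 4]
  },
  measures: {
    "ease of use" => { 7 => Set[1, 2], 8 => Set[3], 9 => Set[4] }
  }
)

yamaha_bikes = store.scoped(category: "motorcycle", brand: "Yamaha")
yamaha_bikes["ease of use"]          # every rating for the scoped respondents
yamaha_bikes["ease of use", 7..10]   # only ratings between 7 and 10
```

The shapes mirror the slides: the scope narrows who you are asking about, the square brackets pick the measure, and the optional dimensions argument narrows which answers count.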
00:11:05.700 so having seen a little bit of how to
00:11:07.440 interface with the measure store let's
00:11:09.300 see it in action
00:11:11.820 and so for the first demo we're going to
00:11:14.519 see how a particular measure has
00:11:16.019 performed across all of the countries
00:11:17.579 that we run ads in so the measure that
00:11:19.620 we're going to look at is ad
00:11:20.820 distinctiveness on a scale of one to
00:11:22.920 five how distinctive was this ad
00:11:25.380 so what we'll do is we'll Loop over all
00:11:27.600 the countries that we run ads in and
00:11:29.339 we'll get back the distribution for that
00:11:31.680 measure on a given country and I haven't
00:11:33.660 shown it in this code snippet but we'll
00:11:35.760 print these out in an ASCII chart so
00:11:37.620 that we can visualize the data
00:11:40.380 um and here we go we can see it running
00:11:43.200 and there you get back all of the data
00:11:46.320 and so for the sake of this example
00:11:48.000 let's say we just want to compare the
00:11:49.620 United States versus the UK
00:11:52.019 so here is the data for the United
00:11:53.820 States and we can see that it's all come
00:11:56.100 back in 17 milliseconds this is for
00:11:59.160 800,000 respondents and we can see that the
00:12:01.380 trend is pretty good and that ads in the
00:12:03.060 state seem to be pretty distinctive
00:12:04.800 about 240,000 people are giving it a
00:12:07.620 full 5 out of five and only a few people
00:12:09.959 giving it a one or two out of five
00:12:13.019 now let's take a look at the UK and this
00:12:15.839 has come back in just six milliseconds
00:12:17.760 but there are slightly fewer respondents
00:12:19.680 with only 180,000 people in the studies
00:12:23.040 that have been run in the UK so we
00:12:25.019 can already tell that we've run more ads
00:12:26.459 in the US than in the UK but the
00:12:29.459 result has got an interesting Trend we
00:12:31.140 can immediately see that there are fewer
00:12:33.060 people giving the ad five out of five
00:12:35.160 and it's pretty flat between three and
00:12:37.440 five and quite a few people giving
00:12:39.839 it a two out of five so the ads in the
00:12:42.300 UK are less distinctive than
00:12:44.820 those in the states
00:12:46.500 or more likely the people in
00:12:48.360 the UK are slightly more pessimistic
00:12:50.100 than people in the states which is
00:12:52.079 probably due to the weather
00:12:55.740 um but now let's take a look at another
00:12:57.180 type of analysis which involves Crossing
00:12:59.579 of two different measures and this is
00:13:01.620 really popular when trying to see how
00:13:03.300 two measures will relate to each other
00:13:05.459 and what we'll do is we'll cross the
00:13:07.260 persuasion measure with the
00:13:09.779 watched full ad measure and this will
00:13:11.880 tell us if people who watch the full ad
00:13:13.560 find it more persuasive
00:13:16.200 so watched full ad only has two dimensions
00:13:18.480 it's either a yes or a no and we'll Loop
00:13:20.940 over those and do this cross product
00:13:22.860 between persuasion and that dimension
00:13:26.639 and we'll run that real quick
00:13:29.339 and we get the data back we get two
00:13:31.019 histograms
00:13:32.279 but let's dig into that so here's the
00:13:34.620 distribution for persuasion and those
00:13:36.540 who watch the full ad and this data came
00:13:39.180 back in 23 milliseconds and it's doing a
00:13:41.579 cross between 1.5 million respondents
00:13:44.220 who were asked the
00:13:46.860 persuasion question and 1.3 million
00:13:49.320 respondents who watched the full ad
00:13:51.420 and here's the same thing for those who
00:13:53.519 didn't watch the full ad which only took
00:13:55.139 eight milliseconds but that's just
00:13:57.240 because there's far less respondents and
00:13:59.519 we can see that generally those who
00:14:00.899 watched the full ad found it more
00:14:02.579 persuasive which we probably could have
00:14:04.380 predicted
00:14:05.279 because they actually watched the whole
00:14:06.839 thing
00:14:08.040 but the last quick demo that I'll do
00:14:10.079 is for harmonization and a classic in
00:14:12.420 this problem is looking at regions and
00:14:15.180 over here we've got all the regions for
00:14:16.920 the United States and we can see that
00:14:18.959 there's a bit of a translations issue
00:14:20.459 we've got studies that have been
00:14:22.079 run in Spanish but also in English so
00:14:24.779 let's harmonize this data so that we can
00:14:27.139 analyze them all together and to start
00:14:29.880 off with we'll look at the counts before
00:14:31.860 harmonization for the Northeast region
00:14:34.019 and Noreste and please excuse my
00:14:36.839 pronunciation I don't really speak
00:14:38.760 Spanish
00:14:40.500 but we've got 360,000
00:14:42.360 respondents from the Northeast
00:14:44.279 region and 89,000 from Noreste
00:14:47.639 and we can go and harmonize that by
00:14:50.820 using this semantically same EQ function
00:14:53.880 for semantic equality and for the sake
00:14:56.579 of this example we're going to give it
00:14:58.199 the bi-directional truth so that means
00:15:00.240 the bi-directional arrow
00:15:03.480 and we run that and we can see that when
00:15:05.880 you count the number of respondents for
00:15:07.440 that they both have the same thing
00:15:10.079 so these these two points are now
00:15:11.760 harmonized and we'll do the same for
00:15:13.860 South and Sur
00:15:15.899 um so we can see the counts before doing
00:15:17.820 it over here
00:15:19.860 and we can do this harmonization but
00:15:22.079 this time we're not going to say that
00:15:23.339 it's bi-directional
00:15:24.720 which is the default option as you
00:15:26.940 can see and we run that and here we get
00:15:30.300 the data back and we can see that South
00:15:32.160 now includes Sur but Sur doesn't include
00:15:35.339 South so the data has been harmonized
00:15:37.800 in one direction
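The two behaviours in that demo can be sketched with toy data (illustrative sets, hypothetical helper, not the real store): directed "equals" edges decide whose respondents a value also includes when you count.

```ruby
require "set"

# Toy respondent sets per region value (illustrative, not the real counts).
DATA = {
  "South" => Set[1, 2, 3],
  "Sur"   => Set[4, 5]
}

# Directed harmonization edges: value => other values whose data it includes.
EDGES = Hash.new { |h, k| h[k] = [] }

def count(value, data, edges)
  ids = data[value].dup
  edges[value].each { |other| ids |= data[other] }
  ids.size
end

count("South", DATA, EDGES) # => 3, before harmonization

# One-directional harmonization: South now includes Sur's data...
EDGES["South"] << "Sur"
count("South", DATA, EDGES) # => 5
# ...but Sur still only has its own respondents.
count("Sur", DATA, EDGES)   # => 2
```

A bi-directional harmonization, as with Northeast and Noreste, would simply add the reverse edge as well, so both counts end up equal.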
00:15:40.199 all right so we've seen the measure
00:15:43.019 store in action and we understand a bit
00:15:44.940 about the API and how to interact with
00:15:46.860 it now let's go into what's happening
00:15:49.139 under the hood
00:15:50.459 so we can break down the measure store
00:15:52.199 into two key components you've got the
00:15:54.899 part that stores the raw data and you've
00:15:56.940 got the part that stores the context
00:15:58.620 onto that data
00:16:00.420 so we'll start off by talking about the
00:16:02.160 way that we store the data and the best
00:16:03.959 way to think about the raw data store is
00:16:05.940 as an index with a bunch of columns so
00:16:09.000 in our application the index is
00:16:11.100 respondents and the columns are going to
00:16:13.740 be the measures broken down into
00:16:15.240 dimensions
00:16:16.560 so each Dimension is a bitmap and if a
00:16:20.040 respondent answered yes for a particular
00:16:21.779 Dimension they'll get a one and if not
00:16:24.060 they'll get a zero so let's take a look
00:16:26.579 at what that looks like so we've got
00:16:28.199 four respondents and they were all asked
00:16:30.120 the ease of use question and we know
00:16:32.279 that this question is a 10 point scale
00:16:33.899 so you've got 10 Dimensions going across
00:16:36.420 respondent number one answered with a seven so
00:16:39.060 they've got a one for seven and zero
00:16:41.339 for everything else
00:16:43.139 and respondent two answered with a five so
00:16:45.000 their row looks like this and respondent
00:16:47.519 three gave an eight and four gave a
00:16:49.560 seven
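That column layout can be sketched with Ruby Integers standing in for the bitmaps (purely illustrative; the real store keeps these in Redis): one bitmap per dimension, where bit i is set if respondent i gave that rating.

```ruby
# respondent id => rating on the 10-point ease-of-use scale
answers = { 1 => 7, 2 => 5, 3 => 8, 4 => 7 }

# Build one bitmap (Integer) per dimension from the answers.
dimensions = Hash.new(0)
answers.each { |respondent, rating| dimensions[rating] |= (1 << respondent) }

format("%05b", dimensions[7]) # => "10010" -- respondents 1 and 4 answered 7
format("%05b", dimensions[5]) # => "00100" -- respondent 2 answered 5
```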
00:16:50.880 uh so there we have it this is how we
00:16:52.920 store our ease of use measure for all of
00:16:55.740 our respondents and when we first
00:16:58.259 implemented this it was really fast to
00:17:00.180 return all of the data for a measure but
00:17:02.699 we noticed that it was getting quite big
00:17:03.899 in terms of size and we can see it in
00:17:06.179 this example that we need 10 bits to
00:17:08.579 store a single respondent which is
00:17:10.919 arguably better than 32 or 64 but we
00:17:15.299 still wanted to get this more compressed
00:17:16.740 this data is really sparse and we found
00:17:18.959 a compression algorithm called roaring
00:17:20.880 bitmaps which essentially allows us to
00:17:23.520 discard the zeros so our bitmap actually
00:17:26.220 looks something like this
00:17:28.799 so now we only need one bit to store a
00:17:31.140 single respondent which is pretty neat
00:17:33.419 and so we've optimized the store to get
00:17:35.700 all of the data for a given measure
00:17:37.380 regardless of the survey so no more
00:17:39.720 iteration through surveys no more
00:17:41.460 deserialization and no more
00:17:43.440 concatenation to pull it all together
00:17:45.660 yay
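Roaring bitmaps are really a container-based format (sorted arrays for sparse chunks, packed bits for dense ones), but the core space win on sparse data can be sketched with a hypothetical class that stores only the positions of the set bits:

```ruby
# Sketch of the sparse idea behind roaring compression: instead of one bit
# per possible position (mostly zeros), keep a sorted array of set positions.
class SparseBitmap
  attr_reader :positions

  def initialize(positions = [])
    @positions = positions.sort.uniq
  end

  def get(pos)
    @positions.bsearch { |p| p >= pos } == pos
  end

  def cardinality
    @positions.size
  end

  # AND and OR become intersection and union of the stored positions.
  def &(other)
    SparseBitmap.new(@positions & other.positions)
  end

  def |(other)
    SparseBitmap.new(@positions | other.positions)
  end
end

# Dimension "7" of ease of use: out of many respondents only 1 and 4 are set,
# so we store two positions instead of one bit per possible respondent.
dim_seven = SparseBitmap.new([1, 4])
dim_seven.get(4)      # => true
dim_seven.cardinality # => 2
```

The actual roaring format is cleverer than this (it switches container type per 64K chunk depending on density), but the "discard the zeros" intuition is the same.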
00:17:47.400 but let's take a look at what
00:17:48.780 that would be like if we did in fact
00:17:50.340 have some surveys
00:17:52.919 so we'll add two surveys into the mix
00:17:54.720 we've got respondents one and two who
00:17:57.059 have answered survey number one and
00:17:58.919 three and four who have answered survey
00:18:00.660 number two
00:18:01.740 in terms of the query that we saw
00:18:03.419 earlier these would be the scope part of
00:18:06.000 their query so we've added two bitmaps
00:18:08.520 one for each survey and if we want to
00:18:10.200 fetch the data for survey number one for
00:18:12.720 the ease of use measure we can fetch all
00:18:14.940 of the respondents that answered survey
00:18:16.380 number one and we can take the ease of
00:18:19.080 use measure and do a bitwise and between
00:18:21.299 those two so it would look something
00:18:22.919 like this
00:18:24.480 and voila we have all of the data for
00:18:27.179 survey number one and the ease of use
00:18:28.980 measure and all of that in a few CPU
00:18:31.200 cycles and the same applies for survey
00:18:33.780 number two
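In code, that survey-scoped fetch is a single bitwise AND. Here Ruby Integers stand in for the bitmaps (bit i = respondent i; values mirror the four-respondent example, purely illustrative):

```ruby
survey_one = 0b00110 # respondents 1 and 2 answered survey one
survey_two = 0b11000 # respondents 3 and 4 answered survey two
ease_seven = 0b10010 # respondents 1 and 4 answered 7 on ease of use

survey_one & ease_seven # => 0b00010 -- respondent 1 only
survey_two & ease_seven # => 0b10000 -- respondent 4 only
```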
00:18:35.760 um and this principle goes for any type
00:18:37.559 of data that we're storing in the bitmap
00:18:39.780 store so we can combine different
00:18:41.700 measures or Scopes using these bit Ops
00:18:46.140 but now that we understand the way that
00:18:47.400 we store the data let's take a look
00:18:49.440 at how we connect it all together
00:18:52.200 so initially when thinking about the
00:18:54.299 problem of storing the relationships
00:18:56.160 between the data points we looked at
00:18:58.200 using a many-to-many relationship in a
00:19:00.059 SQL table
00:19:01.320 but when we started to think about it we
00:19:03.240 realized that this wasn't the right tool
00:19:04.799 for the job and the main reason for that
00:19:07.320 was because we needed to do what we call
00:19:09.419 multi-hop traversal so it turned into a
00:19:12.059 graph problem and to show you what I
00:19:13.919 mean by this I'll show you the structure
00:19:16.140 of the graph
00:19:17.760 and the first node that I'll introduce
00:19:20.039 is called the scope node and it
00:19:22.320 essentially stores an entity attribute
00:19:24.360 value triple as well as a storage key so
00:19:27.299 in this case the entity is survey the
00:19:29.820 attribute is category and
00:19:32.039 the value is motorbike and we can do this
00:19:34.440 for the other variants of the scope so
00:19:36.539 we've got motorcycle and two wheel we
00:19:39.660 can now put an equals Edge between them
00:19:41.400 to indicate that they should be
00:19:42.720 harmonized and it's a one directional
00:19:44.760 Arrow as you can see
00:19:46.799 so now when I want to fetch all of the
00:19:48.780 data for the scope motorbike I now
00:19:52.500 know to fetch that of motorcycle and for
00:19:54.840 two wheels even though there isn't this
00:19:56.520 direct link between motorbike on the
00:19:58.740 left and two wheel on the right and
00:20:00.539 that's just multi-hop traversal
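Multi-hop traversal is just reachability along the directed equals edges. A minimal sketch over a hypothetical adjacency hash (not RedisGraph) shows motorbike reaching two wheel through motorcycle with no direct edge between them:

```ruby
# Directed "equals" edges: each key harmonizes to the values it points at.
EQUALS = {
  "motorbike"  => ["motorcycle"],
  "motorcycle" => ["two wheel"],
  "two wheel"  => []
}

# Breadth-first walk collecting every scope reachable over equals edges.
def harmonized_scopes(start, edges)
  seen  = [start]
  queue = [start]
  until queue.empty?
    edges.fetch(queue.shift, []).each do |neighbour|
      next if seen.include?(neighbour)
      seen  << neighbour
      queue << neighbour
    end
  end
  seen
end

harmonized_scopes("motorbike", EQUALS) # => ["motorbike", "motorcycle", "two wheel"]
harmonized_scopes("two wheel", EQUALS) # => ["two wheel"] -- edges only go one way
```

In the real system this walk is expressed as a graph query instead of hand-rolled Ruby, but the semantics (follow one-directional equals edges to any depth) are the same.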
00:20:03.900 um and at this point it's worth
00:20:05.100 mentioning that we're using a property
00:20:06.960 graph so we can store the node type
00:20:09.179 which is scope and we can store some
00:20:11.460 properties on it which is
00:20:13.679 that hash of entity attribute value
00:20:16.080 and the storage key is a pointer to the
00:20:18.600 bitmap that stores the respondents for
00:20:20.940 that particular scope a bit more on this
00:20:23.580 to come
00:20:24.780 and the next node that we have is the
00:20:27.120 measure node which has a value property
00:20:29.340 to say what the measure name is so ease
00:20:32.220 of use and we can scope this measure by
00:20:34.799 adding a scoped Edge to indicate that
00:20:37.020 this measure has been asked in the
00:20:38.820 context of the connected scopes
00:20:42.200 and a measure node needs to have
00:20:44.520 measurements and these measurements have
00:20:46.980 a storage key and these storage Keys
00:20:49.799 point to the bitmaps that hold the raw
00:20:51.780 data for that particular Dimension or
00:20:54.120 measurement and in the example of the
00:20:56.400 bitmaps that we saw earlier each of
00:20:58.500 these would be a single column of the
00:21:00.600 dimensions for the ease of use measure
00:21:03.419 the measurement nodes capture a measure
00:21:05.940 so we can add that edge to them and
00:21:08.580 we've got this captures edge
00:21:11.700 and finally we've got the constant node
00:21:13.919 which contains the details of the
00:21:15.780 measurement and this has a value
00:21:17.580 property to say what the dimension was
00:21:19.740 so 5 6 or 7.
00:21:22.320 and these are connected
00:21:25.679 to the measurement nodes through a
00:21:27.000 measures Edge
00:21:28.320 and these measures edges have a
00:21:30.659 dimension property to indicate the type
00:21:32.460 of constant that it is so in this case
00:21:34.200 it was a choice
00:21:36.299 so sorry that's a lot to take in
00:21:38.820 um and if I lost you in that process let
00:21:40.620 me quickly summarize how it all connects
00:21:42.840 together
00:21:44.039 so we have the graph with the nodes and
00:21:46.919 edges which store the Scopes measures
00:21:49.500 and the relationships between them and
00:21:52.080 then we've got the bitmaps which hold
00:21:53.580 the raw data so when I run a query in
00:21:56.280 the measure store with the scope of
00:21:58.020 category as motorbike
00:21:59.880 the graph will go and fetch that scope
00:22:02.820 and all of its harmonized Scopes which
00:22:05.460 are connected by that equals Edge and
00:22:07.740 from these Scopes we can take that
00:22:09.600 storage key property which is the
00:22:11.520 pointer to those bitmaps
00:22:13.980 and there we have our Scopes and that
00:22:16.620 part of the query is sorted so now we
00:22:19.200 need to grab the measure which is the
00:22:21.000 ease of use measure so we find the
00:22:23.280 measure node which has the value ease of
00:22:25.320 use and we grab all of its measurements
00:22:28.080 and from those measurements we can get
00:22:30.419 the bitmaps from the storage key
00:22:32.700 and there we have all the data that's
00:22:35.580 needed to fetch our final
00:22:38.640 measure
00:22:40.260 um so to get that final measure we'll do
00:22:42.480 a bitwise or of all the scope bitmaps
00:22:45.120 and a bitwise or over all the measure
00:22:47.340 bitmaps and do an and between those to
00:22:49.919 get the final result
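That final assembly (OR the harmonized scope bitmaps, OR the measure's dimension bitmaps, AND the two together) could be sketched as, again with Integers standing in for roaring bitmaps and all values illustrative:

```ruby
# bit i = respondent i
motorbike  = 0b0000_0111                          # scope bitmap
motorcycle = 0b0001_1000                          # harmonized scope bitmap
ease_dims  = [0b0010_0101, 0b0001_0010]           # one bitmap per dimension

scope_bits   = [motorbike, motorcycle].reduce(:|) # OR over harmonized scopes
measure_bits = ease_dims.reduce(:|)               # OR over the measure's dimensions
result       = scope_bits & measure_bits          # AND: in scope AND answered

format("%08b", result) # => "00010111"
```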
00:22:52.679 so you can see that we're using the
00:22:54.720 graph to determine what data to fetch
00:22:56.760 and we fetch the corresponding data from
00:22:59.039 that bitmap store
00:23:00.600 and there we have it our data index
00:23:02.880 using Ruby graphs and bitmaps
00:23:05.820 but I haven't mentioned how it's running
00:23:07.620 in production or what tools we're using
00:23:09.960 to implement it so let me introduce to
00:23:12.480 you our good friend redis
00:23:14.460 and because of how compressed the data
00:23:16.620 is with that roaring compression
00:23:18.539 and more on this in the next section
00:23:19.860 we're able to store both the graph and
00:23:22.260 the bitmaps in memory through redis and
00:23:25.140 redis provides us with a really useful
00:23:27.179 graph database module called
00:23:29.520 RedisGraph which is what we're
00:23:32.640 using to store those nodes and edges
00:23:35.299 RedisGraph supports openCypher
00:23:37.679 which is one of the more popular query
00:23:39.480 languages for graph databases and so
00:23:42.179 this could be translatable to other
00:23:43.880 graph technologies
00:23:46.860 and for the bitmap store we're using
00:23:48.419 a module called redis-roaring which
00:23:51.360 was written by aviggiano
00:23:53.760 um and we've been really impressed by
00:23:55.260 how fast the whole system has been
00:23:56.880 running so let's take a look at some of
00:23:59.520 the performance metrics
00:24:01.860 and to give a rough Benchmark for the
00:24:04.919 for the measure store I've fetched all
00:24:07.020 of the age measures for one of our
00:24:08.940 products where we've got 3,608 surveys
00:24:12.620 and this has a total of 1.5 million
00:24:15.419 respondents I ran this on the current
00:24:17.520 reporting platform which is doing that
00:24:19.440 whole fetching of the measure and
00:24:22.100 deserializing and concatenating of them
00:24:24.179 and it came back in 90 seconds
00:24:27.059 and I did the same thing in the measure
00:24:28.740 store which is scoping the query on the
00:24:30.780 product and fetching the age bitmap and
00:24:33.120 it took
00:24:34.760 495 milliseconds which is a 180x speed
00:24:38.220 up
00:24:39.059 I also did a test to see how fast it
00:24:40.980 could return all of the data for the age
00:24:43.260 measure across all of our products and
00:24:45.900 this just took nine milliseconds which
00:24:47.760 is for about 5 million respondents and
00:24:49.799 it's so much faster because we don't
00:24:51.780 have to Traverse the graph to find those
00:24:53.880 Scopes we can just go and fetch that
00:24:55.559 data straight from the bitmap store
00:24:58.919 and to compare the storage benefits of
00:25:01.140 the measure store I loaded a subset of
00:25:03.900 40 studies from the current reporting
00:25:05.940 platform into a SQL database and they
00:25:09.000 took up about 3.6 gigs and when I loaded
00:25:12.539 that same data into the measure store
00:25:13.980 the graph came up to 23.3 megabytes and
00:25:17.760 the bitmaps came up to just 4.46
00:25:20.159 megabytes so a total of 27 megabytes
00:25:23.580 and that's about a 128x
00:25:26.880 compression ratio
00:25:30.360 um so what's next for the measure store
00:25:32.520 well we've just graduated the project
00:25:34.200 out of its R&D phase and we're starting
00:25:36.779 to scale it up with a few products in
00:25:38.820 mind uh and in terms of our main
00:25:41.279 priorities right now we're doing some
00:25:42.779 battle testing to see how it will behave
00:25:45.059 under extreme load
00:25:46.620 and at the same time we've got folks who
00:25:48.539 are looking into how we can use the
00:25:49.860 measure store as a back end for modeling
00:25:52.380 data through the bitmaps and we hope
00:25:54.900 that one day we'll be able to open
00:25:55.980 source it so that if you also need to
00:25:57.960 harmonize your data with context you can
00:26:00.179 use it too
00:26:01.559 thank you for listening to me go on
00:26:03.179 about graphs and bitmaps and data enjoy
00:26:05.760 the rest of the conference