00:00:16.920
all right all right all right hi
00:00:18.840
everyone I'm super excited to be here
00:00:21.359
this morning at the first day of
00:00:23.460
RubyConf are you guys all excited
00:00:25.980
yeah nice you can feel the fresh first
00:00:29.960
first morning of the conference energy
00:00:32.940
I'm Benji I'm a software engineer at
00:00:35.219
Zappy I'm originally from Cape Town in
00:00:37.620
South Africa but I currently live in
00:00:39.600
London just a little bit about me I love
00:00:42.660
traveling I love being in nature I love
00:00:45.000
cooking and eating all kinds of food so
00:00:46.980
you can imagine how much I've enjoyed
00:00:48.600
Houston
00:00:49.500
I also love technology coding and data
00:00:52.500
so if you're interested in any of these
00:00:54.420
come grab me afterwards for a chat
00:00:57.600
I'm going to be talking to you today
00:00:59.219
about a fun project that we've been
00:01:01.320
working on for the past year in the
00:01:03.359
Zappy X team which is Zappy's R&D unit
00:01:07.200
the project is all about a custom data
00:01:09.479
indexing system that we built using Ruby
00:01:11.820
graphs and bitmaps I hope you enjoy
00:01:15.540
before we dive into the nitty-gritty
00:01:17.340
here's just a quick overview of what
00:01:19.080
we'll be running through this morning
00:01:20.700
we'll start off with a bit of background
00:01:22.439
information and I'll paint a little
00:01:24.299
picture of the world before we had RGB
00:01:27.420
spoiler alert it was black and white
00:01:29.939
had to make that joke but we'll dig into
00:01:33.600
some of the problems that we had which
00:01:35.280
included how we were applying context onto
00:01:37.380
our data how we were storing our data
00:01:39.780
and how we lacked connections between
00:01:41.759
our data points
00:01:43.740
I'll then touch on
00:01:47.100
what we needed and run through a quick
00:01:48.479
demo of what we came up with and after
00:01:50.640
that we'll get a bit nerdy and talk
00:01:52.500
about the composition of the measure
00:01:53.939
store and dive into some of the
00:01:55.619
technical details around the bitmaps and
00:01:57.960
graphs
00:01:59.040
and to finish off with I'll run through
00:02:00.720
some of the metrics of the final
00:02:01.799
solution and what the next steps are for
00:02:03.840
the project
00:02:05.280
I hope that sounds good let's get
00:02:07.500
cracking so at Zappy we're all about
00:02:10.140
collecting survey data we've got suites
00:02:12.900
of research products which have
00:02:14.819
collections of questions or surveys on
00:02:17.040
them and on those we perform some
00:02:19.020
modeling to get useful insights so we're
00:02:22.020
usually testing stimulus such as videos
00:02:24.239
or images and we do the whole thing from
00:02:27.000
ensuring that the right people answer
00:02:28.620
the survey and that they're rewarded for
00:02:30.840
answering that survey all the way to
00:02:33.120
executing the IP in our computation
00:02:35.520
engine to drive the insights that
00:02:37.739
get displayed in the charts to our users
00:02:40.379
so we'll run through what that looks
00:02:42.300
like in our system real quick
00:02:44.280
so we've got our respondents that answer
00:02:46.860
the
00:02:48.959
questions in the survey
00:02:50.580
and these then get turned into what we
00:02:52.860
call measures now a measure is a digital
00:02:55.440
representation of a reading from the
00:02:57.780
real world and I might use question and
00:03:00.660
measure interchangeably throughout the
00:03:02.340
talk but that's just because our reading
00:03:04.680
from the real world is coming in
00:03:06.540
from survey questions
00:03:08.760
but these questions get passed
00:03:10.860
into our reporting platform where we can
00:03:13.140
start doing some modeling on them
00:03:14.819
through our in-house computation engine
00:03:16.680
called Quattro which essentially
00:03:19.560
allows us to use Python's pandas through
00:03:22.319
Ruby so that we can store and perform
00:03:24.300
operations on these measures through the
00:03:27.120
form of a data frame our CTO Brendan
00:03:29.879
goes into the real reasons why we did
00:03:31.680
this in his talk from RubyConf in 2014
00:03:33.739
but at least for me the best reason is
00:03:36.360
that we just love Ruby
00:03:39.239
so these modeled measures then get
00:03:41.220
stored into our SQL database in the form
00:03:43.560
of serialized (or pickled, as they call
00:03:45.599
them in Python) data frames and when our
00:03:48.720
users come into the platform they can
00:03:50.640
select the surveys that they're
00:03:51.900
interested in and then they can dive
00:03:53.640
into the various charts that we provide
00:03:55.319
when pulling the data for these charts
00:03:57.959
out we're fetching the respondent level
00:03:59.760
data from SQL and this data is mostly
00:04:02.700
pre-aggregated and cached and then we
00:04:05.700
can do some additional computation on it
00:04:07.860
to derive the useful Insight that gets
00:04:10.019
shown in the chart
00:04:12.000
so these charts allow for filtering of
00:04:13.860
respondents so that you can get a better
00:04:15.239
understanding of how different
00:04:16.799
demographics respond to your stimuli
00:04:18.720
and they can also give our users
00:04:20.639
benchmarks to compare their numbers
00:04:22.199
against
00:04:23.520
the platform at the moment is incredibly
00:04:25.860
good at this type of analysis where a
00:04:28.259
user has selected a subset of their
00:04:29.880
studies and they want to do some kind of
00:04:31.919
cross-comparison between them and the
00:04:35.340
architecture of the platform is also
00:04:37.139
really good at storing the dependencies
00:04:39.000
behind the models that we're
00:04:41.220
computing and the computation engine is
00:04:43.320
optimized for processing
00:04:45.060
these models and their dependencies
00:04:48.540
but we want more we want to query all of
00:04:52.139
our data in real time so we want to
00:04:55.500
store the connections and
00:04:56.940
relationships between the different data
00:04:59.160
points that we run
00:05:00.780
all of this is so that we can do
00:05:02.940
meta-analysis over the whole data set so
00:05:05.520
that we can get an even deeper
00:05:06.720
understanding of the data
00:05:09.120
in our platform
00:05:10.919
so as you would imagine nothing is ever
00:05:12.960
that simple when you want all the things
00:05:15.479
so let's take a look at some of the
00:05:17.460
problems that we're facing that were
00:05:19.020
stopping us from getting there
00:05:22.020
the first problem that we needed to
00:05:24.120
consider is context
00:05:25.800
so when fetching all of the data for
00:05:27.900
something we need to make sure that that
00:05:29.880
data that we're fetching
00:05:31.740
actually represents the same thing and
00:05:34.440
the best way to think about this is
00:05:35.820
through an example so let's consider the
00:05:38.520
case where we want to find out how the
00:05:40.680
brand Yamaha is doing for a particular
00:05:43.139
metric something like ease of use so how
00:05:45.900
easy was this thing to use
00:05:48.840
so if we wanted to write a query for
00:05:50.820
this we'd say get me all of the data
00:05:53.100
where the measure is ease of use and the
00:05:55.919
brand is Yamaha
00:05:57.300
and we get all this data back and we
00:05:59.699
plot the distribution of it and we're
00:06:01.620
like hang on there's something funky in
00:06:03.240
the data we've got these two bumps where
00:06:05.580
some people thought that it was easy to
00:06:07.440
use and others thought that it was hard
00:06:09.180
to use which for the sake of this
00:06:11.100
example is unexpected
00:06:13.620
so we take a second to think about it
00:06:15.419
and we realize oh hang on a second
00:06:16.919
Yamaha make motorcycles and they also
00:06:19.560
make pianos this could be what's causing
00:06:22.080
this anomaly where motorcycles could be
00:06:24.180
pretty hard to use and pianos are maybe
00:06:26.759
pretty easy to use so now we need to
00:06:29.580
know in what context a given measure was
00:06:31.800
asked in that survey was it in the
00:06:33.840
vehicle category or was it in the
00:06:35.520
musical instruments category and this
00:06:37.860
additional level of context is really
00:06:39.479
key when running this meta-analysis and
00:06:42.360
for me the ease of use for both of these
00:06:44.699
categories would probably be zero I'd
00:06:46.620
crash the motorbike and probably flip
00:06:48.720
the piano out of Rage
00:06:50.940
but let's get back to it so
00:06:53.100
the next problem that we had is storage
00:06:54.960
so as I mentioned before we're storing
00:06:57.840
all of our data as serialized data
00:07:00.180
frames in SQL and we're also storing
00:07:03.000
each measure on the survey level so if
00:07:05.220
we run four surveys all with the same
00:07:07.560
measure we would end up with four
00:07:09.300
serialized measures which would need to
00:07:11.340
be concatenated together in order to get
00:07:14.580
back the data for those four measures
00:07:16.500
combined
00:07:18.419
so to put this into some code for a
00:07:20.819
given measure we need to go through each
00:07:22.500
of the surveys Fetch and deserialize the
00:07:25.259
measure and ultimately concatenate them
00:07:27.479
together at the end and if you put this
00:07:29.580
onto a database where you've got tens of
00:07:31.560
thousands of surveys and that's a lot of
00:07:33.419
deserialization and concatenation so
00:07:36.000
it's pretty slow to put it lightly
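The slow path just described could look roughly like this minimal Ruby sketch; the table and helper names are hypothetical stand-ins, and JSON stands in for Python's pickled data frames:

```ruby
# Hypothetical sketch of the old per-survey path: each survey stores
# its own serialized copy of the measure, so we fetch, deserialize and
# concatenate per survey. SERIALIZED_MEASURES is a stub, not real storage.
require "json"

SERIALIZED_MEASURES = {
  1 => JSON.dump([7, 5]),  # survey 1: two respondents' ratings
  2 => JSON.dump([8, 7]),  # survey 2: two more respondents
}

def fetch_measure(survey_ids)
  survey_ids
    .map { |id| JSON.parse(SERIALIZED_MEASURES.fetch(id)) }  # fetch + deserialize
    .reduce([], :concat)                                     # concatenate at the end
end

fetch_measure([1, 2])  # => [7, 5, 8, 7]
```

With tens of thousands of surveys, that map/deserialize/concat loop is exactly the cost being described.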
00:07:39.539
the last thing that we had to consider
00:07:41.099
is the connecting of different data
00:07:43.259
points and we call this harmonization
00:07:45.680
harmonization is all about knowing what
00:07:48.120
data should be treated the same
00:07:50.039
and we can break this down into three
00:07:52.440
broad categories so you've got the
00:07:54.479
measure or the question the stimuli or
00:07:57.300
the context in which something was asked
00:07:58.979
and the audience or the people that
00:08:01.020
you're asking so we'll run through a
00:08:03.000
couple of examples of this
00:08:05.160
for measure-based harmonization let's
00:08:07.560
say we've got two questions: question
00:08:10.319
one asked on a scale of one to ten how
00:08:12.539
easy is this to use
00:08:14.580
and question two asked after that
00:08:17.340
amazing experience how would you
00:08:19.500
rate its ease of use out of 10. and
00:08:22.259
let's say that these two
00:08:23.879
measures mean the same thing so we get
00:08:26.160
this little bi-directional arrow
00:08:28.080
indicating that they're the same now
00:08:30.300
when fetching all of the data for
00:08:31.620
question one I should get the data for
00:08:33.360
both question one and question two
00:08:34.979
together and these measures are now what
00:08:37.380
we'd call harmonized
00:08:39.000
it's worth mentioning that these
00:08:41.339
probably shouldn't have been considered
00:08:42.599
the same because the second question is
00:08:45.000
putting in some additional bias so this
00:08:47.279
might skew the results so in fact
00:08:49.320
question two can only call itself the
00:08:51.660
same as question one but not the other
00:08:53.700
way around so now we have a one
00:08:55.860
directional arrow and question two
00:08:58.560
should have the data for both question one
00:09:00.240
and question two and question one should
00:09:02.100
only have question one
00:09:04.080
so it's a lot of questions
00:09:06.360
um the same principle applies to other
00:09:07.980
factors of harmonization where in the
00:09:10.080
case of the stimuli you would treat the
00:09:11.760
metadata of the study as the question in
00:09:14.040
this example so something like this
00:09:15.959
where in one survey we asked motorcycles
00:09:18.720
and in another one we asked motorbikes
00:09:20.700
we know that these two things should be
00:09:22.380
the same thing going forward so their
00:09:24.300
data can be harmonized
00:09:28.140
so we needed something special something
00:09:30.240
that will allow us to query our entire
00:09:32.160
data set with both context and
00:09:34.320
harmonization and it also needs to be
00:09:36.720
fast like real time fast
00:09:39.600
so before I continue I forgot to mention
00:09:42.420
at the start that there's a prize to
00:09:43.860
whoever can guess the number of times I've
00:09:45.720
said data in this talk so come grab me
00:09:48.000
afterwards
00:09:49.320
but let's bring it back in we've gone
00:09:50.940
through what the world looked like
00:09:52.320
before the measure store and hopefully
00:09:54.240
we have an idea of the kinds of problems
00:09:56.160
that we were facing but now that's over
00:09:58.440
let's get ready for the cool stuff
00:10:01.080
welcome to the measure store and as the
00:10:03.779
name suggests we're storing measures the
00:10:06.779
measure store was built with the core
00:10:08.279
principle that the API needs to be very
00:10:10.440
simple and easy to understand when
00:10:12.720
making a request to the measure store
00:10:14.160
you can give it three things you can
00:10:16.019
give it the context or the scope
00:10:18.660
you can give it the measure and the
00:10:20.940
dimensions that you're interested in if
00:10:22.680
you're interested in them that last
00:10:23.940
one's optional but let's apply that to
00:10:26.040
the Yamaha example
00:10:27.779
so you can see over here that we're
00:10:29.640
scoping the query with measure store.scoped
00:10:31.980
and we're scoping it to ask for surveys
00:10:34.680
that have been asked in a category of
00:10:36.720
motorcycle and brand of Yamaha and we're
00:10:40.260
also asking for the ease of use measure
00:10:42.240
at the end of it as seen in those Square
00:10:44.459
braces
00:10:46.320
and over here we're doing the same thing
00:10:48.300
but we're saying that we're only
00:10:49.800
interested in people that answered
00:10:51.300
with a rating between 7 and 10 so 7 8
00:10:54.839
9 10.
00:10:56.040
we'll dig into this a bit further after
00:10:57.779
the demo but you can also perform basic
00:10:59.880
operations on these queries such as
00:11:01.560
counting the number of respondents or
00:11:03.540
printing out the indexes
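The query interface just described could be sketched in Ruby like this; the class and method names here are guesses from the talk, not Zappy's real API:

```ruby
# Hypothetical sketch of the measure store's query interface:
# scope first, then pick a measure, with an optional dimension range.
class MeasureStore
  # data: { [category, brand] => { measure => { respondent_id => rating } } }
  def initialize(data)
    @data = data
  end

  def scoped(category:, brand:)
    Scope.new(@data.fetch([category, brand], {}))
  end
end

class Scope
  def initialize(measures)
    @measures = measures
  end

  # scope[:ease_of_use] fetches a measure; an optional dimension range
  # keeps only respondents whose rating falls inside it.
  def [](measure, dimensions = nil)
    rows = @measures.fetch(measure, {})
    dimensions ? rows.select { |_, rating| dimensions.include?(rating) } : rows
  end
end

store = MeasureStore.new(
  ["motorcycle", "Yamaha"] => { ease_of_use: { 1 => 7, 2 => 5, 3 => 8 } }
)
scope = store.scoped(category: "motorcycle", brand: "Yamaha")
scope[:ease_of_use].size         # all respondents asked ease of use
scope[:ease_of_use, 7..10].size  # only those who rated between 7 and 10
```

The shape mirrors the two queries from the slides: a scope, a measure in square braces, and an optional dimension filter.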
00:11:05.700
so having seen a little bit of how to
00:11:07.440
interface with the measure store let's
00:11:09.300
see it in action
00:11:11.820
and so for the first demo we're going to
00:11:14.519
see how a particular measure has
00:11:16.019
performed across all of the countries
00:11:17.579
that we run ads in so the measure that
00:11:19.620
we're going to look at is ad
00:11:20.820
distinctiveness on a scale of one to
00:11:22.920
five how distinctive was this ad
00:11:25.380
so what we'll do is we'll loop over all
00:11:27.600
the countries that we run ads in and
00:11:29.339
we'll get back the distribution for that
00:11:31.680
measure on a given country and I haven't
00:11:33.660
shown it in this code snippet but we'll
00:11:35.760
print these out in an ASCII chart so
00:11:37.620
that we can visualize the data
00:11:40.380
um and here we go we can see it running
00:11:43.200
and there you get back all of the data
00:11:46.320
and so for the sake of this example
00:11:48.000
let's say we just want to compare the
00:11:49.620
United States versus the UK
00:11:52.019
so here is the data for the United
00:11:53.820
States and we can see that it's all come
00:11:56.100
back in 17 milliseconds this is for
00:11:59.160
800,000 respondents and we can see that the
00:12:01.380
trend is pretty good and that ads in the
00:12:03.060
state seem to be pretty distinctive
00:12:04.800
about 240,000 people are giving it a
00:12:07.620
full 5 out of five and only a few people
00:12:09.959
giving it a one or two out of five
00:12:13.019
now let's take a look at the UK and this
00:12:15.839
has come back in just six milliseconds
00:12:17.760
but there are slightly fewer respondents
00:12:19.680
with only 180,000 people in the studies
00:12:23.040
that have been run in the UK so we
00:12:25.019
can already tell that we've run more ads
00:12:26.459
in the in the US than in the UK but the
00:12:29.459
result has got an interesting Trend we
00:12:31.140
can immediately see that there are fewer
00:12:33.060
people giving the ad five out of five
00:12:35.160
and it's pretty flat between three and
00:12:37.440
five and quite a few people giving
00:12:39.839
it a two out of five so the ads in the
00:12:42.300
UK are less distinctive than
00:12:44.820
those in the states
00:12:46.500
or more likely the people in
00:12:48.360
the UK are slightly more pessimistic
00:12:50.100
than people in the states which is
00:12:52.079
probably due to the weather
00:12:55.740
um but now let's take a look at another
00:12:57.180
type of analysis which involves Crossing
00:12:59.579
of two different measures and this is
00:13:01.620
really popular when trying to see how
00:13:03.300
two measures will relate to each other
00:13:05.459
and what we'll do is we'll cross the
00:13:07.260
persuasion measure with the
00:13:09.779
watched full ad measure and this will
00:13:11.880
tell us if people who watch the full ad
00:13:13.560
find it more persuasive
00:13:16.200
so watched full ad only has two dimensions
00:13:18.480
it's either a yes or a no and we'll loop
00:13:20.940
over those and do this cross product
00:13:22.860
between persuasion and that dimension
00:13:26.639
and we'll run that real quick
00:13:29.339
and we get the data back we get two
00:13:31.019
histograms
00:13:32.279
but let's dig into that so here's the
00:13:34.620
distribution for persuasion and those
00:13:36.540
who watch the full ad and this data came
00:13:39.180
back in 23 milliseconds and it's doing a
00:13:41.579
cross between 1.5 million respondents
00:13:44.220
who were asked the
00:13:46.860
persuasion question and 1.3 million
00:13:49.320
respondents who watched the full ad
00:13:51.420
and here's the same thing for those who
00:13:53.519
didn't watch the full ad which only took
00:13:55.139
eight milliseconds but that's just
00:13:57.240
because there are far fewer respondents and
00:13:59.519
we can see that generally those who
00:14:00.899
watched the full ad found it more
00:14:02.579
persuasive which we probably could have
00:14:04.380
predicted
00:14:05.279
because they actually watched the whole
00:14:06.839
thing
00:14:08.040
but the last Quick demo that I'll do is
00:14:10.079
is for harmonization and a classic in
00:14:12.420
this problem is looking at regions and
00:14:15.180
over here we've got all the regions for
00:14:16.920
the United States and we can see that
00:14:18.959
there's a bit of a translation issue
00:14:20.459
we've got studies that have been
00:14:22.079
run in Spanish but also in English so
00:14:24.779
let's harmonize this data so that we can
00:14:27.139
analyze them all together and to start
00:14:29.880
off with we'll look at the counts before
00:14:31.860
harmonization for the Northeast region
00:14:34.019
and Noreste and please excuse my
00:14:36.839
pronunciation I don't really speak
00:14:38.760
Spanish
00:14:40.500
but we've got
00:14:42.360
360,000 respondents from the Northeast
00:14:44.279
region and 89,000 from Noreste
00:14:47.639
and we can go and harmonize that by
00:14:50.820
using this semantically same EQ function
00:14:53.880
for semantic equality and for the sake
00:14:56.579
of this example we're going to give it
00:14:58.199
the bi-directional truth so that means
00:15:00.240
the two-directional arrow
00:15:03.480
and we run that and we can see that when
00:15:05.880
you count the number of respondents for
00:15:07.440
that they both have the same thing
00:15:10.079
so these these two points are now
00:15:11.760
harmonized and we'll do the same for
00:15:13.860
South and Sur
00:15:15.899
um so we can see the counts before doing
00:15:17.820
it over here
00:15:19.860
and we can do this harmonization but
00:15:22.079
this time we're not going to say that
00:15:23.339
it's bi-directional
00:15:24.720
which is the default option as you
00:15:26.940
can see and we run that and here we get
00:15:30.300
the data back and we can see that South
00:15:32.160
now includes Sur but Sur doesn't include
00:15:35.339
South so the data has been harmonized
00:15:37.800
in One Direction
00:15:40.199
all right so we've seen the measure
00:15:43.019
store in action and we understand a bit
00:15:44.940
about the API and how to interact with
00:15:46.860
it now let's go into what's happening
00:15:49.139
under the hood
00:15:50.459
so we can break down the measure store
00:15:52.199
into two key components you've got the
00:15:54.899
part that stores the raw data and you've
00:15:56.940
got the part that stores the context
00:15:58.620
onto that data
00:16:00.420
so we'll start off by talking about the
00:16:02.160
way that we store the data and the best
00:16:03.959
way to think about the raw data store is
00:16:05.940
as an index with a bunch of columns so
00:16:09.000
in our application the index is
00:16:11.100
respondents and the columns are going to
00:16:13.740
be the measures broken down into
00:16:15.240
dimensions
00:16:16.560
so each Dimension is a bitmap and if a
00:16:20.040
respondent answered yes for a particular
00:16:21.779
Dimension they'll get a one and if not
00:16:24.060
they'll get a zero so let's take a look
00:16:26.579
at what that looks like so we've got
00:16:28.199
four respondents and they were all asked
00:16:30.120
the ease of use question and we know
00:16:32.279
that this question is a 10 point scale
00:16:33.899
so you've got 10 Dimensions going across
00:16:36.420
respondent number one answered seven so
00:16:39.060
they've got a one for seven and zero
00:16:41.339
for everything else
00:16:43.139
and respondent two answered five so
00:16:45.000
their row looks like this and respondent
00:16:47.519
three gave an eight and four gave a
00:16:49.560
seven
00:16:50.880
uh so there we have it this is how we
00:16:52.920
store our ease of use measure for all of
00:16:55.740
our respondents and when we first
00:16:58.259
implemented this it was really fast to
00:17:00.180
return all of the data for a measure but
00:17:02.699
we noticed that it was getting quite big
00:17:03.899
in terms of size and we can see it in
00:17:06.179
this example that we need 10 bits to
00:17:08.579
store a single respondent which is
00:17:10.919
arguably better than 32 or 64 but we
00:17:15.299
still wanted to get this more compressed
00:17:16.740
this data is really sparse and we found
00:17:18.959
a compression algorithm called roaring
00:17:20.880
bitmaps which essentially allows us to
00:17:23.520
discard the zeros so our bitmap actually
00:17:26.220
looks something like this
00:17:28.799
so now we only need one bit to store a
00:17:31.140
single respondent which is pretty neat
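The intuition behind that compression can be sketched with plain Ruby arrays; this is a simplification for illustration (the real Roaring format uses per-chunk containers and run-length encoding, not a single sorted array):

```ruby
# A sparse dimension bitmap over respondents: instead of materializing
# every zero, keep only the sorted ids of respondents whose bit is set.
dense_dim_seven = [1, 0, 0, 1]  # respondents 0 and 3 answered "7"

sparse_dim_seven = dense_dim_seven.each_index.select { |i| dense_dim_seven[i] == 1 }
sparse_dim_seven               # => [0, 3] -- two entries, no stored zeros

# Set operations stay cheap on the sparse form:
survey_one = [0, 1]            # respondent ids in survey one
sparse_dim_seven & survey_one  # intersection, like a bitwise AND => [0]
```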
00:17:33.419
and so we've optimized the store to get
00:17:35.700
all of the data for a given measure
00:17:37.380
regardless of the survey so no more
00:17:39.720
iteration through surveys no more
00:17:41.460
deserialization and no more
00:17:43.440
concatenation to pull it all together
00:17:45.660
yay
00:17:47.400
but let's take a look at what
00:17:48.780
that would be like if we did in fact
00:17:50.340
have some surveys
00:17:52.919
so we'll add two surveys into the mix
00:17:54.720
we've got respondents one and two who
00:17:57.059
have answered survey number one and
00:17:58.919
three and four who have answered survey
00:18:00.660
number two
00:18:01.740
in terms of the query that we saw
00:18:03.419
earlier these would be the scope part of
00:18:06.000
their query so we've added two bitmaps
00:18:08.520
one for each survey and if we want to
00:18:10.200
fetch the data for survey number one for
00:18:12.720
the ease of use measure we can fetch all
00:18:14.940
of the respondents that answered survey
00:18:16.380
number one and we can take the ease of
00:18:19.080
use measure and do a bitwise and between
00:18:21.299
those two so it would look something
00:18:22.919
like this
00:18:24.480
and voila we have all of the data for
00:18:27.179
survey number one and the ease of use
00:18:28.980
measure and all of that in a few CPU
00:18:31.200
cycles and the same applies for survey
00:18:33.780
number two
00:18:35.760
um and this principle goes for any type
00:18:37.559
of data that we're storing in the bitmap
00:18:39.780
store so we can combine different
00:18:41.700
measures or Scopes using these bit Ops
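Those bit ops can be sketched with plain Ruby integers standing in for bitmaps (bit i set means respondent i answered); the real store keeps Roaring bitmaps in Redis instead:

```ruby
# Survey-scoped fetch of a measure via bitwise operations.
ease_of_use_seven = 0b1001  # respondents 0 and 3 answered seven
ease_of_use_five  = 0b0010  # respondent 1 answered five
survey_one        = 0b0011  # respondents 0 and 1 answered survey one

# All ease-of-use data is the OR across its dimension bitmaps;
# the survey scope is then applied with a single bitwise AND.
measure  = ease_of_use_seven | ease_of_use_five
in_scope = measure & survey_one
format("%04b", in_scope)  # => "0011" -- respondents 0 and 1
```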
00:18:46.140
but now that we understand the way that
00:18:47.400
we store in the data let's take a look
00:18:49.440
at how we connect it all together
00:18:52.200
so initially when thinking about the
00:18:54.299
problem of storing the relationships
00:18:56.160
between the data points we looked at
00:18:58.200
using a many-to-many relationship in a
00:19:00.059
SQL table
00:19:01.320
but when we started to think about it we
00:19:03.240
realized that this wasn't the right tool
00:19:04.799
for the job and the main reason for that
00:19:07.320
was because we needed to do what we call
00:19:09.419
multi-hop traversal so it turned into a
00:19:12.059
graph problem and to show you what I
00:19:13.919
mean by this I'll show you the structure
00:19:16.140
of the graph
00:19:17.760
and the first node that I'll introduce
00:19:20.039
is called the scope node and it
00:19:22.320
essentially stores an entity attribute
00:19:24.360
value triple as well as a storage key so
00:19:27.299
in this case the entity is survey the
00:19:29.820
attribute is category and
00:19:32.039
the value is motorbike and we can do this
00:19:34.440
for the other variants of the scope so
00:19:36.539
we've got motorcycle and two wheel we
00:19:39.660
can now put an equals Edge between them
00:19:41.400
to indicate that they should be
00:19:42.720
harmonized and it's a one directional
00:19:44.760
Arrow as you can see
00:19:46.799
so now when I want to fetch all of the
00:19:48.780
data for the scope motorbike I now
00:19:52.500
know to fetch that of motorcycle and for
00:19:54.840
two wheels even though there isn't this
00:19:56.520
direct link between motorbike on the
00:19:58.740
left and two wheel on the right and
00:20:00.539
that's just multi-hop traversal
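That traversal can be sketched in a few lines of Ruby; the edge table here is illustrative, not the RedisGraph schema:

```ruby
# Multi-hop traversal over one-directional "equals" edges: fetching a
# scope also pulls in everything reachable from it, even without a
# direct edge (motorbike -> motorcycle -> two wheel).
EQUALS_EDGES = {
  "motorbike"  => ["motorcycle"],
  "motorcycle" => ["two wheel"],
  "two wheel"  => [],
}

def harmonized_scopes(scope, seen = [])
  return seen if seen.include?(scope)  # guard against cycles
  seen << scope
  EQUALS_EDGES.fetch(scope, []).each { |nxt| harmonized_scopes(nxt, seen) }
  seen
end

harmonized_scopes("motorbike")  # => ["motorbike", "motorcycle", "two wheel"]
harmonized_scopes("two wheel")  # => ["two wheel"] -- arrows are one-directional
```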
00:20:03.900
um and at this point it's worth
00:20:05.100
mentioning that we're using a property
00:20:06.960
graph so we can store the node type
00:20:09.179
which is scope and we can store some
00:20:11.460
properties on that which is
00:20:13.679
that hash of entity attribute value
00:20:16.080
and the storage key is a pointer to the
00:20:18.600
bitmap that stores the respondents for
00:20:20.940
that particular scope a bit more on this
00:20:23.580
to come
00:20:24.780
and the next node that we have is the
00:20:27.120
measure node which has a value property
00:20:29.340
to say what the measure name is so ease
00:20:32.220
of use and we can scope this measure by
00:20:34.799
adding a scoped Edge to indicate that
00:20:37.020
this measure has been asked in the
00:20:38.820
context of the connected scopes
00:20:42.200
and a measure node needs to have
00:20:44.520
measurements and these measurements have
00:20:46.980
a storage key and these storage Keys
00:20:49.799
point to the bitmaps that hold the raw
00:20:51.780
data for that particular Dimension or
00:20:54.120
measurement and in the example of the
00:20:56.400
bitmaps that we saw earlier each of
00:20:58.500
these would be a single column of the
00:21:00.600
dimensions for the ease of use measure
00:21:03.419
the measurement nodes capture a measure
00:21:05.940
so we can add that edge to them and
00:21:08.580
we've got this captures edge
00:21:11.700
and finally we've got the constant node
00:21:13.919
which contains the details of the
00:21:15.780
measurement and this has a value
00:21:17.580
property to say what the dimension was
00:21:19.740
so 5 6 or 7.
00:21:22.320
and these are connected
00:21:25.679
to the measurement nodes through a
00:21:27.000
measures Edge
00:21:28.320
and these measures edges have a
00:21:30.659
dimension property to indicate the type
00:21:32.460
of constant that it is so in this case
00:21:34.200
it was a choice
00:21:36.299
so sorry that's a lot to take in
00:21:38.820
um and if I lost you in that process let
00:21:40.620
me quickly summarize how it all connects
00:21:42.840
together
00:21:44.039
so we have the graph with the nodes and
00:21:46.919
edges which store the Scopes measures
00:21:49.500
and the relationships between them and
00:21:52.080
then we've got the bitmaps which hold
00:21:53.580
the raw data so when I run a query in
00:21:56.280
the measure store with the scope of
00:21:58.020
category as motorbike
00:21:59.880
the graph will go and fetch that scope
00:22:02.820
and all of its harmonized Scopes which
00:22:05.460
are connected by that equals Edge and
00:22:07.740
from these Scopes we can take that
00:22:09.600
storage key property which is the
00:22:11.520
pointer to those bitmaps
00:22:13.980
and there we have our Scopes and that
00:22:16.620
part of the query is sorted so now we
00:22:19.200
need to grab the measure which is the
00:22:21.000
ease of use measure so we find the
00:22:23.280
measure node which has the value ease of
00:22:25.320
use and we grab all of its measurements
00:22:28.080
and from those measurements we can get
00:22:30.419
the bitmaps from the storage key
00:22:32.700
and there we have all the data that's
00:22:35.580
needed to fetch our final
00:22:38.640
measure
00:22:40.260
um so to get that final measure we'll do
00:22:42.480
a bitwise or of all the scope bitmaps
00:22:45.120
and a bitwise or over all the measure
00:22:47.340
bitmaps and do an and between those to
00:22:49.919
get the final result
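That final combination step can be sketched the same way, with integers standing in for Roaring bitmaps and made-up example values:

```ruby
# OR the harmonized scope bitmaps together, OR the measure's dimension
# bitmaps together, then AND the two results.
scope_bitmaps   = [0b0011, 0b0100]          # e.g. motorbike + a harmonized scope
measure_bitmaps = [0b1001, 0b0010, 0b0100]  # the measure's dimension columns

scopes  = scope_bitmaps.reduce(0, :|)
measure = measure_bitmaps.reduce(0, :|)
result  = scopes & measure
format("%04b", result)  # => "0111" -- in-scope respondents with data
```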
00:22:52.679
so you can see that we're using the
00:22:54.720
graph to determine what data to fetch
00:22:56.760
and we fetch the corresponding data from
00:22:59.039
that bitmap store
00:23:00.600
and there we have it our data index
00:23:02.880
using Ruby graphs and bitmaps
00:23:05.820
but I haven't mentioned how it's running
00:23:07.620
in production or what tools we're using
00:23:09.960
to implement it so let me introduce to
00:23:12.480
you our good friend redis
00:23:14.460
and because of how compressed the data
00:23:16.620
is with that roaring compression
00:23:18.539
and more on this in the next section
00:23:19.860
we're able to store both the graph and
00:23:22.260
the bitmaps in memory through redis and
00:23:25.140
redis provides us with a really useful
00:23:27.179
graph database module called
00:23:29.520
RedisGraph which is what we're
00:23:32.640
using to store those nodes and edges
00:23:35.299
RedisGraph supports openCypher
00:23:37.679
which is one of the more popular query
00:23:39.480
languages for graph databases and so
00:23:42.179
this could be translatable to other
00:23:43.880
graph technologies
00:23:46.860
um and for the bitmap store we're using
00:23:48.419
a module called redis-roaring which
00:23:51.360
was written by aviggiano
00:23:53.760
um and we've been really impressed by
00:23:55.260
how fast the whole system has been
00:23:56.880
running so let's take a look at some of
00:23:59.520
the performance metrics
00:24:01.860
and to give a rough Benchmark for the
00:24:04.919
for the measure store I've fetched all
00:24:07.020
of the age measures for one of our
00:24:08.940
products where we've got 3,608 surveys
00:24:12.620
and this has a total of 1.5 million
00:24:15.419
respondents I ran this on the current
00:24:17.520
reporting platform which is doing that
00:24:19.440
whole fetching of the measure and
00:24:22.100
deserializing and concatenating of them
00:24:24.179
and it came back in 90 seconds
00:24:27.059
and I did the same thing in the measure
00:24:28.740
store which is scoping the query on the
00:24:30.780
product and fetching the age bitmap and
00:24:33.120
it took
00:24:34.760
495 milliseconds which is a 180x speed
00:24:38.220
up
00:24:39.059
I also did a test to see how fast it
00:24:40.980
could return all of the data for the age
00:24:43.260
measure across all of our products and
00:24:45.900
this just took nine milliseconds which
00:24:47.760
is for about 5 million respondents and
00:24:49.799
it's so much faster because we don't
00:24:51.780
have to Traverse the graph to find those
00:24:53.880
Scopes we can just go and fetch that
00:24:55.559
data straight from the bitmap store
00:24:58.919
and to compare the storage benefits of
00:25:01.140
the measure store I loaded a subset of
00:25:03.900
40 studies from the current reporting
00:25:05.940
platform into a SQL database and they
00:25:09.000
took up about 3.6 gigs and when I loaded
00:25:12.539
that same data into the measure store
00:25:13.980
the graph came up to 23.3 megabytes and
00:25:17.760
the bitmaps came up to just 4.46
00:25:20.159
megabytes so about 27 megabytes in
00:25:23.580
total and that's about a 128x
00:25:26.880
compression ratio
00:25:30.360
um so what's next for the measure store
00:25:32.520
well we've just graduated the project
00:25:34.200
out of its r d phase and we're starting
00:25:36.779
to scale it up with a few products in
00:25:38.820
mind and in terms of our main
00:25:41.279
priorities right now we're doing some
00:25:42.779
battle testing to see how it will behave
00:25:45.059
under extreme load
00:25:46.620
and at the same time we've got folks who
00:25:48.539
are looking into how we can use the
00:25:49.860
measure store as a back end for modeling
00:25:52.380
data through the bitmaps and we hope
00:25:54.900
that one day we'll be able to open
00:25:55.980
source it so that if you also need to
00:25:57.960
harmonize your data with context you can
00:26:00.179
use it too
00:26:01.559
thank you for listening to me go on
00:26:03.179
about graphs and bitmaps and data enjoy
00:26:05.760
the rest of the conference