00:00:00.000
Ready for takeoff.
00:00:16.920
All right, all right, all right. Hi everyone, I'm super excited to be here this morning at the first day of RubyConf. Are you guys all excited?
00:00:18.840
Yeah, nice! You can feel the fresh energy of the first morning of the conference.
00:00:21.359
I'm Benji, I'm a software engineer at Zappy. I'm originally from Cape Town, South Africa, but I currently live in London.
00:00:23.460
Just a little bit about me: I love traveling, being in nature, cooking, and eating all kinds of food, so you can imagine how much I've enjoyed Houston.
00:00:25.980
I also love technology, coding, and data. So if you're interested in any of these topics, come grab me afterwards for a chat.
00:00:29.960
Today, I'm going to talk to you about a fun project that we've been working on for the past year in the Zappy X team, which is Zappy's R&D unit.
00:00:32.940
The project is all about a custom data indexing system that we built using Ruby, Graphs, and Bitmaps. I hope you enjoy it.
00:00:35.280
Before we dive into the nitty-gritty, here's a quick overview of what we'll be covering this morning.
00:00:37.380
We'll start with some background information, painting a picture of the world before we had RGB (Ruby, Graphs, and Bitmaps). Spoiler alert: it was black and white.
00:00:39.780
I had to make that joke! We'll dig into some of the problems we faced, such as how we applied context to our data, how we stored it, and how we lacked connections between our data points.
00:00:41.759
I'll also touch on what we needed and run through a quick demo of the solution we came up with. After that, we'll get a bit nerdy and talk about the composition of the measure store, diving into some technical details about the Bitmaps and Graphs.
00:00:45.000
Finally, I'll run through some metrics of the final solution and discuss the next steps for the project.
00:00:46.980
I hope that sounds good! Let's get cracking.
00:00:49.500
At Zappy, we're all about collecting survey data. We have suites of research products that contain collections of questions or surveys, and we perform some modeling to obtain useful insights.
00:00:52.500
Typically, we test stimuli such as videos or images. We handle everything from ensuring that the right people answer the surveys to rewarding them for participating.
00:00:55.000
Then, we execute the IP in our computation engine to drive insights, which are displayed in charts for our users.
00:00:57.600
Let's quickly run through what that looks like in our system.
00:01:01.320
We gather responses from participants who answer questions in the surveys. These responses are transformed into what we call measures. A measure represents a digital reading from the real world.
00:01:05.500
I might use "question" and "measure" interchangeably throughout the talk, but that's just because our readings from the real world come from survey questions.
00:01:09.300
These questions are then passed into our reporting platform where we start doing some modeling on them using our in-house computation engine, called Quattro.
00:01:10.860
Quattro essentially allows us to utilize Python's pandas through Ruby, enabling us to store and perform operations on these measures in the form of a data frame.
00:01:14.520
Our CTO, Brendan, explains the real reasons behind this in his talk from RubyConf in 2014; but for me, the best reason is that we simply love Ruby.
00:01:16.680
These modeled measures are then stored in our SQL database as serialized or 'pickled' data frames, as they call them in Python.
00:01:19.800
When our users access the platform, they can select the surveys they're interested in and dive into the various charts that we provide.
00:01:23.520
When pulling data for these charts, we're fetching respondent-level data from SQL, which is usually pre-aggregated and cached.
00:01:25.680
We then perform additional computations to derive the useful insights displayed in the charts.
00:01:28.200
These charts allow filtering of respondents, providing a clearer understanding of how different demographics respond to the stimuli.
00:01:31.020
They also give users benchmarks to compare their numbers against.
00:01:34.920
At the moment, the platform excels at this type of analysis, where a user selects a subset of their studies and wants to conduct cross-comparisons.
00:01:36.480
The architecture of the platform stores the dependencies behind the models we compute, and the computation engine is optimized to process these models and their dependencies.
00:01:39.960
However, we want more—we want to query all of our data in real time, establishing connections and relationships between different data points.
00:01:42.780
All of this is aimed at enabling meta-analysis over the entire dataset, allowing us to gain a deeper understanding of the data in our platform.
00:01:45.300
As you can imagine, nothing is ever that simple when you want all the things, so let's take a look at some of the problems we're facing that are stopping us from getting there.
00:01:48.120
The first problem we needed to consider is context.
00:01:50.700
When fetching all of the data, we need to ensure that it accurately represents the same thing.
00:01:53.640
The best way to illustrate this is through an example: suppose we want to evaluate how the brand Yamaha is doing in terms of a particular metric like 'ease of use.'
00:01:56.280
We would write a query asking for all the data where the measure is 'ease of use' and the brand is 'Yamaha.'
00:01:59.700
After retrieving this data, we might notice some unexpected anomalies.
00:02:03.120
For example, some people rated it easy to use, while others found it hard to use.
00:02:05.040
It turns out this could be due to Yamaha manufacturing both motorcycles and pianos.
00:02:07.560
Hence, distinguishing the context—whether the measure was asked in the vehicle category or the musical instrument category—is crucial for running meaningful meta-analyses.
00:02:09.679
To me, the ease of use for both categories would probably be zero; I’d likely crash the motorbike and flip the piano out of rage!
00:02:12.000
But now, let's return to the topic.
00:02:14.520
The next problem we faced was storage.
00:02:17.520
As I mentioned earlier, we store all our data as serialized data frames in SQL.
00:02:20.700
We also store each measure at the survey level, meaning if we conduct four surveys containing the same measure, we end up with four serialized measures.
00:02:23.520
These measures then need to be concatenated together to retrieve combined data.
00:02:26.520
Calculating this in code means we have to go through each of the surveys, fetch and deserialize the measures, and combine them at the end.
00:02:29.100
And when you put this in a database with tens of thousands of surveys, that's a lot of deserialization and concatenation, which makes it quite slow.
00:02:32.220
The last problem we need to tackle is the connection between different data points—a process we call harmonization.
00:02:34.920
Harmonization involves identifying which data should be treated as equivalent.
00:02:37.200
We can break this down into three broad categories: the measure or question, the stimuli, or the context in which the measure was asked, and the audience—who you're asking.
00:02:40.560
Let’s run through a couple of examples.
00:02:42.840
In the case of measure-based harmonization, let’s say we have two questions: question one asks, 'On a scale of one to ten, how easy is this to use?' and question two asks, 'After an amazing experience, how would you rate its ease of use out of ten?'
00:02:46.320
If we conclude that these two measures mean the same, we would establish a bi-directional arrow indicating their equivalence.
00:02:50.040
So, when fetching data for question one, we would also receive data for question two.
00:02:52.920
These measures would now be considered harmonized.
00:02:55.740
It's important to note that question two might introduce additional bias, skewing results.
00:02:58.260
In fact, question two can only consider itself equivalent to question one, and not vice versa.
00:03:01.860
The same principle applies to the harmonization of other factors, such as stimuli, where we treat metadata of the study as equivalent.
00:03:04.680
Let’s say in one survey, we ask about 'motorcycles,' while in another survey, we ask about 'motorbikes.' We know these refer to the same entity, allowing us to harmonize this data.
00:03:08.220
Thus, we needed something special. We needed a system that allowed us to query our entire dataset with both context and harmonization.
00:03:11.580
It also needed to be fast—real-time fast.
00:03:14.340
Before continuing, I forgot to mention at the start that there's a prize for anyone who can guess how many times I've said 'data' in this talk, so come grab me afterwards!
00:03:16.800
Bringing it back, we've discussed what the world looked like before the measure store, and hopefully, we now understand the kinds of problems we were facing.
00:03:20.580
Now that we've covered the challenges, let's get ready for the cool stuff—the measure store!
00:03:23.940
As the name suggests, we're storing measures.
00:03:26.520
The measure store was built on the core principle that the API needs to be straightforward and easy to understand.
00:03:29.160
When making a request to the measure store, you provide three elements: the context/scope, the measure, and the dimensions of interest (the last one is optional).
00:03:32.760
Let’s apply that to the Yamaha example: we scope the query.
00:03:36.360
We're scoping to request surveys within the motorcycle category by the brand Yamaha and then asking for the 'ease of use' measure.
00:03:39.960
We can also filter the respondents interested in those who rated it between 7 and 10.
00:03:43.680
Now that we've seen a little about how to interface with the measure store, let’s see it in action.
00:03:46.680
In the first demo, we'll examine how a particular measure performed across all the countries where we run ads.
00:03:48.900
The measure we'll be looking at is 'ad distinctiveness' on a scale of one to five.
00:03:50.820
We’ll loop over all the countries, retrieving the distribution for that measure.
00:03:53.520
I won't get into specifics here, but we'll print our visualization in an ASCII chart.
00:03:54.540
As we run the demo, you'll see how fast it generates the data.
00:03:56.959
Let’s compare the United States versus the UK.
00:03:59.760
Here’s the data for the United States, returning it in just 17 milliseconds for 800,000 respondents.
00:04:01.740
The results show a strong trend indicating that ads in the States are considered distinctive.
00:04:03.600
About 240,000 respondents rated it a full 5 out of 5, while few scored it a 1 or 2.
00:04:06.240
Let’s now take a look at the UK.
00:04:09.000
That came back in just 6 milliseconds, but we had slightly fewer respondents, 180,000.
00:04:10.680
This suggests we’ve run more ads in the U.S. than in the UK.
00:04:13.200
Interestingly, the trend shows fewer people in the UK rated the ad as 5 out of 5.
00:04:15.360
It's interesting to note that respondents from the UK appear more critical than those from the States, possibly due to the weather!
00:04:18.060
Now, let’s analyze another type of relationship by crossing two different measures.
00:04:20.760
This process is popular when analyzing how two measures relate to each other.
00:04:23.520
We'll cross the 'persuasion' measure with the 'watched full ad' measure and see if those who watched the ad found it more persuasive.
00:04:26.340
The 'watched full ad' has two dimensions: it's either a yes or a no.
00:04:29.520
We’ll loop through these responses and perform the cross-product between persuasion and that dimension.
00:04:31.560
As we run that, the data is returned, and we can see two histograms.
00:04:34.080
Here's the distribution for persuasion and the responses from those who watched the full ad.
00:04:36.300
This data returned in 23 milliseconds, encompassing 1.5 million respondents.
00:04:39.060
Here's how it looks for those who didn't watch the full ad, returning in just 8 milliseconds.
00:04:40.920
Interestingly, those who watched the full ad tended to find it more persuasive, as anticipated.
00:04:43.920
Now let's tackle a classic harmonization problem focusing on regions.
00:04:46.560
We have all the regions for the United States, some studies being conducted in English and others in Spanish.
00:04:48.420
To harmonize this data, we'll analyze the counts before harmonization for the Northeast region and 'norester.'
00:04:50.520
We have 360,000 respondents from the Northeast region and 89,000 from 'norester'.
00:04:53.040
Now we'll harmonize these results using the semantic equality function.
00:04:55.079
For this example, we'll enable bi-directional harmony.
00:04:57.360
After running that, we find both regions now count as the same.
00:04:59.040
We’ll also need to apply the same logic for the 'South' region, where we can see the counts before harmonization.
00:05:02.580
This time, however, we’ll harmonize without bi-directional relationships.
00:05:05.300
The result shows that while 'South' now incorporates 'Sur', 'Sur' does not include 'South'.
00:05:07.380
The state has therefore been harmonized in only one direction.
00:05:09.540
Having seen the measure store in action, we now understand some of the API's functionality and how to interact with it.
00:05:12.000
Let's move on to what happens under the hood.
00:05:15.120
We can break down the measure store into two key components: the section that stores the raw data and the part that maintains the contextual relationships of that data.
00:05:17.760
First, we’ll discuss how the raw data is stored. The best way to conceptualize this is as an index with a variety of columns.
00:05:20.760
In our application, the index is made up of the respondents, while the columns correspond to the measures broken down into dimensions.
00:05:22.260
Each Dimension is represented as a bitmap.
00:05:24.420
If a respondent answers 'yes' for a dimension, they'll receive a '1'; if 'no,' a '0.'
00:05:26.640
We can illustrate this with a few respondents asked about 'ease of use' with a 10-point scale.
00:05:29.910
For each respondent, their corresponding bitmap would show a '1' at the point they rated their experience.
00:05:32.580
For example, respondent number one rated it a 7, so their bitmap would look like this: [0, 0, 0, 0, 0, 1, 0, 0, 0, 0].
00:05:35.520
Respondent number two rated it a 5 with a similar bitmap representation.
00:05:37.950
This approach efficiently stores measures for our respondents, giving us quick retrieval options.
00:05:40.620
However, we noticed the size of data began to grow significantly, implying we needed a better storage approach.
00:05:43.200
We found a compression algorithm called roaring bitmaps, which allows us to discard excess zeros.
00:05:45.300
With this, we only require one bit to store a respondent, optimizing our storage solutions.
00:05:49.260
We have enhanced our measure store to ensure fast access to all relevant data.
00:05:52.680
This obviates the need for cumbersome iterations across surveys, deserialization processes, or concatenation of data.
00:05:56.220
Now, let’s visualize what this might look like if we had multiple surveys.
00:05:59.700
Imagine we have two surveys: respondents one and two fill out survey number one, while respondents three and four fill out survey number two.
00:06:04.380
When querying for the data from survey number one about 'ease of use,' we can simply retrieve the indexed bitmaps corresponding to that survey.
00:06:07.080
Performing a bitwise AND operation on the two bitmaps will return the relevant responses.
00:06:09.300
Thus, we effectively obtain all necessary data from survey number one.
00:06:11.760
This applies equally for survey number two, thereby enhancing flexibility.
00:06:14.820
With this understanding of our storage methodology, let's explore how we connect the data points.
00:06:18.300
Initially, we considered using a many-to-many relationship within an SQL table to store those connections.
00:06:21.840
However, we soon realized this approach wouldn't suffice, as we needed to perform what is called 'multi-hop traversal.'
00:06:25.200
This requirement transformed our problem into one of graph processing.
00:06:27.120
Let me illustrate this structure using nodes.
00:06:30.720
The first node I'll introduce is the scope node, which represents an entity-attribute-value triple and a storage key.
00:06:32.640
In this scenario, the entity is the survey, the attribute is the category, and the value is 'motorbike.' We can establish connections between variants of scope, such as 'motorcycle' and 'two wheel,' using a one-directional edge.
00:06:37.680
This property allows us to apply harmonization. Therefore, when querying the data for the scope 'motorbike,' we can also account for 'motorcycle' and 'two wheel.'
00:06:40.620
We're using a property graph, allowing us to store the node type (which is 'scope') with associated properties.
00:06:43.620
Next up is the measure node, which includes a value property to denote the measure name, for example, 'ease of use.'
00:06:46.200
We can scope this measure by adding an edge to indicate that the question is associated with the connected scopes.
00:06:48.840
A measure node must have several measurements which reference the storage key pointing to the raw data bitmaps.
00:06:52.500
Referring to earlier, each measurement corresponds to a specific dimension for the 'ease of use' measure.
00:06:56.280
Afterward, we have the constant node, which delineates the requirements of each measurement with value properties indicating the measurement dimension, such as 5, 6, or 7.
00:06:59.940
These are linked to measurement nodes using measurement edges, which indicate the type.
00:07:03.240
I know this is a lot to digest, so let me summarize how it all interconnects.
00:07:06.300
We have nodes and edges in the graph, storing the scopes, measures, and their relationships.
00:07:09.600
Additionally, we house the bitmaps which hold the raw data.
00:07:12.420
When we run a query in the measure store, the system identifies the relevant scopes and connectively fetches the associated bitmaps.
00:07:15.600
And there we have it—our data indexed using Ruby, graphs, and bitmaps!
00:07:18.660
Now, let's discuss how this operates in production and the tools we use.
00:07:21.960
Meet our trusted friend, Redis!
00:07:24.540
Thanks to the compression we achieve through roaring bitmaps, we can store both the graph and the bitmaps in memory using Redis.
00:07:27.480
Furthermore, Redis provides a powerful graph database module called RedisGraph, which allows us to manage and store our nodes and edges.
00:07:31.800
RedisGraph supports openCypher, a popular graph query language, making it easier for scalability.
00:07:34.500
For the bitmap store, we utilize a module called Redis Roaring, developed by Aviciano.
00:07:37.080
The entire system has performed exceptionally well, so let's look at some performance metrics.
00:07:39.720
I ran tests on a measure store by fetching all age measures for one of our products.
00:07:42.540
With 3,608 surveys and a total of 1.5 million respondents, the previous reporting platform delivered results in 90 seconds.
00:07:45.180
In contrast, the measure store executed the same query—scoping on product and fetching the age bitmap—in just 495 milliseconds; that's a speed up of 180x!
00:07:48.840
I also tested the ability to extract data for the age measure across all products, achieving a stellar processing time of merely 9 milliseconds.
00:07:52.100
This was for an evaluation of about 5 million respondents—certainly much faster, as we’re not traversing the graph to locate scopes; we access the bitmap store directly.
00:08:04.920
When it comes to evaluating the storage advantages, I compared a subset of 40 studies loaded into an SQL database.
00:08:08.880
That took up roughly 3.6 gigs of space compared to our measure store data, with the graph totaling 23.3 megabytes and bitmaps at just 4.46 megabytes.
00:08:12.360
Overall, this amounts to around 27 megabytes total and an impressive compression ratio of 128x.
00:08:15.000
So what's next for the measure store?
00:08:18.720
Well, we've just graduated the project out of research and development, and we’re moving forward to scale it with a few products in mind.
00:08:22.800
Our immediate priorities involve thorough battle testing to see how the system behaves under extreme loads.
00:08:25.920
Simultaneously, we have team members investigating the potential of leveraging the measure store as a backend for modeling data through the bitmaps.
00:08:28.680
We aspire to eventually open-source it to share the ability to harmonize data with context.
00:08:31.680
Thank you for listening to me discuss graphs, bitmaps, and data. Enjoy the rest of the conference!