Data indexing with RGB (Ruby, Graphs and Bitmaps)

by Benjamin Lewis

In this talk delivered at RubyConf 2022, Benjamin Lewis from Zappy presents the development of a custom data indexing system called RGB (Ruby, Graphs, and Bitmaps), aimed at improving the accessibility and speed of querying survey data. The presentation details the journey beginning from a disjointed data set with serialized data frames to an in-memory index that allows real-time querying of data.

Key Points Discussed:

Background Context: Benjamin introduces the necessity for a custom measure store by outlining the limitations of the existing system which utilized serialized data frames stored in SQL, hindering real-time analysis and connection between data points.
Main Challenges:
- Context: Challenges in ensuring data semantic accuracy while querying specific measures.
- Storage: Inefficiencies due to repeated data storage across multiple surveys, leading to slow data retrieval.
- Harmonization: The need for establishing equivalencies between different measures and stimuli to enable meaningful comparisons.
The Measure Store: The solution introduced, which allows for straightforward querying involving context, measures, and dimensions. The measure store is built to handle large volumes of data and provides a user-friendly API.
Demonstrations: Benjamin illustrates the effectiveness of the measure store through live examples, showcasing its speed and efficiency in querying large datasets (e.g., querying 800,000 respondents in 17 milliseconds).
Storage Optimization: Discussion about the use of roaring bitmaps for efficient data storage and retrieval, significantly reducing the size of data stored compared to traditional SQL databases.
Graph Database Utilization: The integration of RedisGraph for managing semantic relationships in data, enabling complex queries and connections between data points.
Performance Metrics: Benjamin presents impressive performance improvements, with queries completing in a fraction of the time previously required, highlighting a 180x speed increase in some queries.
Future Directions: Zappy plans to scale the system, conduct further testing, and eventually open-source the measure store to share the harmonization capabilities with the wider community.

Conclusions:

Benjamin concludes by emphasizing the importance of having a robust data indexing system that combines Ruby, graphs, and bitmaps to overcome traditional data querying limitations. He encourages sharing and engaging further on the topic at the conference.

Through this innovative solution, Zappy aims to leverage the full potential of their datasets, allowing for rapid and contextually aware insights.

00:00:00.000 Ready for takeoff.

00:00:16.920 All right, all right, all right. Hi everyone, I'm super excited to be here this morning at the first day of RubyConf. Are you guys all excited?

00:00:18.840 Yeah, nice! You can feel the fresh energy of the first morning of the conference.

00:00:21.359 I'm Benji, I'm a software engineer at Zappy. I'm originally from Cape Town, South Africa, but I currently live in London.

00:00:23.460 Just a little bit about me: I love traveling, being in nature, cooking, and eating all kinds of food, so you can imagine how much I've enjoyed Houston.

00:00:25.980 I also love technology, coding, and data. So if you're interested in any of these topics, come grab me afterwards for a chat.

00:00:29.960 Today, I'm going to talk to you about a fun project that we've been working on for the past year in the Zappy X team, which is Zappy's R&D unit.

00:00:32.940 The project is all about a custom data indexing system that we built using Ruby, Graphs, and Bitmaps. I hope you enjoy it.

00:00:35.280 Before we dive into the nitty-gritty, here's a quick overview of what we'll be covering this morning.

00:00:37.380 We'll start with some background information, painting a picture of the world before we had RGB (Ruby, Graphs, and Bitmaps). Spoiler alert: it was black and white.

00:00:39.780 I had to make that joke! We'll dig into some of the problems we faced, such as how we applied context to our data, how we stored it, and how we lacked connections between our data points.

00:00:41.759 I'll also touch on what we needed and run through a quick demo of the solution we came up with. After that, we'll get a bit nerdy and talk about the composition of the measure store, diving into some technical details about the Bitmaps and Graphs.

00:00:45.000 Finally, I'll run through some metrics of the final solution and discuss the next steps for the project.

00:00:46.980 I hope that sounds good! Let's get cracking.

00:00:49.500 At Zappy, we're all about collecting survey data. We have suites of research products that contain collections of questions or surveys, and we perform some modeling to obtain useful insights.

00:00:52.500 Typically, we test stimuli such as videos or images. We handle everything from ensuring that the right people answer the surveys to rewarding them for participating.

00:00:55.000 Then, we execute the IP in our computation engine to drive insights, which are displayed in charts for our users.

00:00:57.600 Let's quickly run through what that looks like in our system.

00:01:01.320 We gather responses from participants who answer questions in the surveys. These responses are transformed into what we call measures. A measure represents a digital reading from the real world.

00:01:05.500 I might use "question" and "measure" interchangeably throughout the talk, but that's just because our readings from the real world come from survey questions.

00:01:09.300 These questions are then passed into our reporting platform where we start doing some modeling on them using our in-house computation engine, called Quattro.

00:01:10.860 Quattro essentially allows us to utilize Python's pandas through Ruby, enabling us to store and perform operations on these measures in the form of a data frame.

00:01:14.520 Our CTO, Brendan, explains the real reasons behind this in his talk from RubyConf in 2014; but for me, the best reason is that we simply love Ruby.

00:01:16.680 These modeled measures are then stored in our SQL database as serialized or 'pickled' data frames, as they call them in Python.

00:01:19.800 When our users access the platform, they can select the surveys they're interested in and dive into the various charts that we provide.

00:01:23.520 When pulling data for these charts, we're fetching respondent-level data from SQL, which is usually pre-aggregated and cached.

00:01:25.680 We then perform additional computations to derive the useful insights displayed in the charts.

00:01:28.200 These charts allow filtering of respondents, providing a clearer understanding of how different demographics respond to the stimuli.

00:01:31.020 They also give users benchmarks to compare their numbers against.

00:01:34.920 At the moment, the platform excels at this type of analysis, where a user selects a subset of their studies and wants to conduct cross-comparisons.

00:01:36.480 The architecture of the platform stores the dependencies behind the models we compute, and the computation engine is optimized to process these models and their dependencies.

00:01:39.960 However, we want more—we want to query all of our data in real time, establishing connections and relationships between different data points.

00:01:42.780 All of this is aimed at enabling meta-analysis over the entire dataset, allowing us to gain a deeper understanding of the data in our platform.

00:01:45.300 As you can imagine, nothing is ever that simple when you want all the things, so let's take a look at some of the problems we're facing that are stopping us from getting there.

00:01:48.120 The first problem we needed to consider is context.

00:01:50.700 When fetching all of the data, we need to ensure that it accurately represents the same thing.

00:01:53.640 The best way to illustrate this is through an example: suppose we want to evaluate how the brand Yamaha is doing in terms of a particular metric like 'ease of use.'

00:01:56.280 We would write a query asking for all the data where the measure is 'ease of use' and the brand is 'Yamaha.'

00:01:59.700 After retrieving this data, we might notice some unexpected anomalies.

00:02:03.120 For example, some people rated it easy to use, while others found it hard to use.

00:02:05.040 It turns out this could be due to Yamaha manufacturing both motorcycles and pianos.

00:02:07.560 Hence, distinguishing the context—whether the measure was asked in the vehicle category or the musical instrument category—is crucial for running meaningful meta-analyses.

00:02:09.679 To me, the ease of use for both categories would probably be zero; I’d likely crash the motorbike and flip the piano out of rage!

00:02:12.000 But now, let's return to the topic.

00:02:14.520 The next problem we faced was storage.

00:02:17.520 As I mentioned earlier, we store all our data as serialized data frames in SQL.

00:02:20.700 We also store each measure at the survey level, meaning if we conduct four surveys containing the same measure, we end up with four serialized measures.

00:02:23.520 These measures then need to be concatenated together to retrieve combined data.

00:02:26.520 Calculating this in code means we have to go through each of the surveys, fetch and deserialize the measures, and combine them at the end.

00:02:29.100 And when you put this in a database with tens of thousands of surveys, that's a lot of deserialization and concatenation, which makes it quite slow.

00:02:32.220 The last problem we need to tackle is the connection between different data points—a process we call harmonization.

00:02:34.920 Harmonization involves identifying which data should be treated as equivalent.

00:02:37.200 We can break this down into three broad categories: the measure or question, the stimuli, or the context in which the measure was asked, and the audience—who you're asking.

00:02:40.560 Let’s run through a couple of examples.

00:02:42.840 In the case of measure-based harmonization, let’s say we have two questions: question one asks, 'On a scale of one to ten, how easy is this to use?' and question two asks, 'After an amazing experience, how would you rate its ease of use out of ten?'

00:02:46.320 If we conclude that these two measures mean the same, we would establish a bi-directional arrow indicating their equivalence.

00:02:50.040 So, when fetching data for question one, we would also receive data for question two.

00:02:52.920 These measures would now be considered harmonized.

00:02:55.740 It's important to note that question two might introduce additional bias, skewing results.

00:02:58.260 In fact, question two can only consider itself equivalent to question one, and not vice versa.

00:03:01.860 The same principle applies to the harmonization of other factors, such as stimuli, where we treat metadata of the study as equivalent.

00:03:04.680 Let’s say in one survey, we ask about 'motorcycles,' while in another survey, we ask about 'motorbikes.' We know these refer to the same entity, allowing us to harmonize this data.

00:03:08.220 Thus, we needed something special. We needed a system that allowed us to query our entire dataset with both context and harmonization.

00:03:11.580 It also needed to be fast—real-time fast.

00:03:14.340 Before continuing, I forgot to mention at the start that there's a prize for anyone who can guess how many times I've said 'data' in this talk, so come grab me afterwards!

00:03:16.800 Bringing it back, we've discussed what the world looked like before the measure store, and hopefully, we now understand the kinds of problems we were facing.

00:03:20.580 Now that we've covered the challenges, let's get ready for the cool stuff—the measure store!

00:03:23.940 As the name suggests, we're storing measures.

00:03:26.520 The measure store was built on the core principle that the API needs to be straightforward and easy to understand.

00:03:29.160 When making a request to the measure store, you provide three elements: the context/scope, the measure, and the dimensions of interest (the last one is optional).

00:03:32.760 Let’s apply that to the Yamaha example: we scope the query.

00:03:36.360 We're scoping to request surveys within the motorcycle category by the brand Yamaha and then asking for the 'ease of use' measure.

00:03:39.960 We can also filter the respondents interested in those who rated it between 7 and 10.

00:03:43.680 Now that we've seen a little about how to interface with the measure store, let’s see it in action.

00:03:46.680 In the first demo, we'll examine how a particular measure performed across all the countries where we run ads.

00:03:48.900 The measure we'll be looking at is 'ad distinctiveness' on a scale of one to five.

00:03:50.820 We’ll loop over all the countries, retrieving the distribution for that measure.

00:03:53.520 I won't get into specifics here, but we'll print our visualization in an ASCII chart.

00:03:54.540 As we run the demo, you'll see how fast it generates the data.

00:03:56.959 Let’s compare the United States versus the UK.

00:03:59.760 Here’s the data for the United States, returning it in just 17 milliseconds for 800,000 respondents.

00:04:01.740 The results show a strong trend indicating that ads in the States are considered distinctive.

00:04:03.600 About 240,000 respondents rated it a full 5 out of 5, while few scored it a 1 or 2.

00:04:06.240 Let’s now take a look at the UK.

00:04:09.000 That came back in just 6 milliseconds, but we had slightly fewer respondents, 180,000.

00:04:10.680 This suggests we’ve run more ads in the U.S. than in the UK.

00:04:13.200 Interestingly, the trend shows fewer people in the UK rated the ad as 5 out of 5.

00:04:15.360 It's interesting to note that respondents from the UK appear more critical than those from the States, possibly due to the weather!

00:04:18.060 Now, let’s analyze another type of relationship by crossing two different measures.

00:04:20.760 This process is popular when analyzing how two measures relate to each other.

00:04:23.520 We'll cross the 'persuasion' measure with the 'watched full ad' measure and see if those who watched the ad found it more persuasive.

00:04:26.340 The 'watched full ad' has two dimensions: it's either a yes or a no.

00:04:29.520 We’ll loop through these responses and perform the cross-product between persuasion and that dimension.

00:04:31.560 As we run that, the data is returned, and we can see two histograms.

00:04:34.080 Here's the distribution for persuasion and the responses from those who watched the full ad.

00:04:36.300 This data returned in 23 milliseconds, encompassing 1.5 million respondents.

00:04:39.060 Here's how it looks for those who didn't watch the full ad, returning in just 8 milliseconds.

00:04:40.920 Interestingly, those who watched the full ad tended to find it more persuasive, as anticipated.

00:04:43.920 Now let's tackle a classic harmonization problem focusing on regions.

00:04:46.560 We have all the regions for the United States, some studies being conducted in English and others in Spanish.

00:04:48.420 To harmonize this data, we'll analyze the counts before harmonization for the Northeast region and 'norester.'

00:04:50.520 We have 360,000 respondents from the Northeast region and 89,000 from 'norester'.

00:04:53.040 Now we'll harmonize these results using the semantic equality function.

00:04:55.079 For this example, we'll enable bi-directional harmony.

00:04:57.360 After running that, we find both regions now count as the same.

00:04:59.040 We’ll also need to apply the same logic for the 'South' region, where we can see the counts before harmonization.

00:05:02.580 This time, however, we’ll harmonize without bi-directional relationships.

00:05:05.300 The result shows that while 'South' now incorporates 'Sur', 'Sur' does not include 'South'.

00:05:07.380 The state has therefore been harmonized in only one direction.

00:05:09.540 Having seen the measure store in action, we now understand some of the API's functionality and how to interact with it.

00:05:12.000 Let's move on to what happens under the hood.

00:05:15.120 We can break down the measure store into two key components: the section that stores the raw data and the part that maintains the contextual relationships of that data.

00:05:17.760 First, we’ll discuss how the raw data is stored. The best way to conceptualize this is as an index with a variety of columns.

00:05:20.760 In our application, the index is made up of the respondents, while the columns correspond to the measures broken down into dimensions.

00:05:22.260 Each Dimension is represented as a bitmap.

00:05:24.420 If a respondent answers 'yes' for a dimension, they'll receive a '1'; if 'no,' a '0.'

00:05:26.640 We can illustrate this with a few respondents asked about 'ease of use' with a 10-point scale.

00:05:29.910 For each respondent, their corresponding bitmap would show a '1' at the point they rated their experience.

00:05:32.580 For example, respondent number one rated it a 7, so their bitmap would look like this: [0, 0, 0, 0, 0, 1, 0, 0, 0, 0].

00:05:35.520 Respondent number two rated it a 5 with a similar bitmap representation.

00:05:37.950 This approach efficiently stores measures for our respondents, giving us quick retrieval options.

00:05:40.620 However, we noticed the size of data began to grow significantly, implying we needed a better storage approach.

00:05:43.200 We found a compression algorithm called roaring bitmaps, which allows us to discard excess zeros.

00:05:45.300 With this, we only require one bit to store a respondent, optimizing our storage solutions.

00:05:49.260 We have enhanced our measure store to ensure fast access to all relevant data.

00:05:52.680 This obviates the need for cumbersome iterations across surveys, deserialization processes, or concatenation of data.

00:05:56.220 Now, let’s visualize what this might look like if we had multiple surveys.

00:05:59.700 Imagine we have two surveys: respondents one and two fill out survey number one, while respondents three and four fill out survey number two.

00:06:04.380 When querying for the data from survey number one about 'ease of use,' we can simply retrieve the indexed bitmaps corresponding to that survey.

00:06:07.080 Performing a bitwise AND operation on the two bitmaps will return the relevant responses.

00:06:09.300 Thus, we effectively obtain all necessary data from survey number one.

00:06:11.760 This applies equally for survey number two, thereby enhancing flexibility.

00:06:14.820 With this understanding of our storage methodology, let's explore how we connect the data points.

00:06:18.300 Initially, we considered using a many-to-many relationship within an SQL table to store those connections.

00:06:21.840 However, we soon realized this approach wouldn't suffice, as we needed to perform what is called 'multi-hop traversal.'

00:06:25.200 This requirement transformed our problem into one of graph processing.

00:06:27.120 Let me illustrate this structure using nodes.

00:06:30.720 The first node I'll introduce is the scope node, which represents an entity-attribute-value triple and a storage key.

00:06:32.640 In this scenario, the entity is the survey, the attribute is the category, and the value is 'motorbike.' We can establish connections between variants of scope, such as 'motorcycle' and 'two wheel,' using a one-directional edge.

00:06:37.680 This property allows us to apply harmonization. Therefore, when querying the data for the scope 'motorbike,' we can also account for 'motorcycle' and 'two wheel.'

00:06:40.620 We're using a property graph, allowing us to store the node type (which is 'scope') with associated properties.

00:06:43.620 Next up is the measure node, which includes a value property to denote the measure name, for example, 'ease of use.'

00:06:46.200 We can scope this measure by adding an edge to indicate that the question is associated with the connected scopes.

00:06:48.840 A measure node must have several measurements which reference the storage key pointing to the raw data bitmaps.

00:06:52.500 Referring to earlier, each measurement corresponds to a specific dimension for the 'ease of use' measure.

00:06:56.280 Afterward, we have the constant node, which delineates the requirements of each measurement with value properties indicating the measurement dimension, such as 5, 6, or 7.

00:06:59.940 These are linked to measurement nodes using measurement edges, which indicate the type.

00:07:03.240 I know this is a lot to digest, so let me summarize how it all interconnects.

00:07:06.300 We have nodes and edges in the graph, storing the scopes, measures, and their relationships.

00:07:09.600 Additionally, we house the bitmaps which hold the raw data.

00:07:12.420 When we run a query in the measure store, the system identifies the relevant scopes and connectively fetches the associated bitmaps.

00:07:15.600 And there we have it—our data indexed using Ruby, graphs, and bitmaps!

00:07:18.660 Now, let's discuss how this operates in production and the tools we use.

00:07:21.960 Meet our trusted friend, Redis!

00:07:24.540 Thanks to the compression we achieve through roaring bitmaps, we can store both the graph and the bitmaps in memory using Redis.

00:07:27.480 Furthermore, Redis provides a powerful graph database module called RedisGraph, which allows us to manage and store our nodes and edges.

00:07:31.800 RedisGraph supports openCypher, a popular graph query language, making it easier for scalability.

00:07:34.500 For the bitmap store, we utilize a module called Redis Roaring, developed by Aviciano.

00:07:37.080 The entire system has performed exceptionally well, so let's look at some performance metrics.

00:07:39.720 I ran tests on a measure store by fetching all age measures for one of our products.

00:07:42.540 With 3,608 surveys and a total of 1.5 million respondents, the previous reporting platform delivered results in 90 seconds.

00:07:45.180 In contrast, the measure store executed the same query—scoping on product and fetching the age bitmap—in just 495 milliseconds; that's a speed up of 180x!

00:07:48.840 I also tested the ability to extract data for the age measure across all products, achieving a stellar processing time of merely 9 milliseconds.

00:07:52.100 This was for an evaluation of about 5 million respondents—certainly much faster, as we’re not traversing the graph to locate scopes; we access the bitmap store directly.

00:08:04.920 When it comes to evaluating the storage advantages, I compared a subset of 40 studies loaded into an SQL database.

00:08:08.880 That took up roughly 3.6 gigs of space compared to our measure store data, with the graph totaling 23.3 megabytes and bitmaps at just 4.46 megabytes.

00:08:12.360 Overall, this amounts to around 27 megabytes total and an impressive compression ratio of 128x.

00:08:15.000 So what's next for the measure store?

00:08:18.720 Well, we've just graduated the project out of research and development, and we’re moving forward to scale it with a few products in mind.

00:08:22.800 Our immediate priorities involve thorough battle testing to see how the system behaves under extreme loads.

00:08:25.920 Simultaneously, we have team members investigating the potential of leveraging the measure store as a backend for modeling data through the bitmaps.

00:08:28.680 We aspire to eventually open-source it to share the ability to harmonize data with context.

00:08:31.680 Thank you for listening to me discuss graphs, bitmaps, and data. Enjoy the rest of the conference!