Benji Lewis

Data Indexing with RGB (Ruby, Graphs and Bitmaps)

A talk from RubyConfTH 2023, held in Bangkok, Thailand on October 6-7, 2023.
Find out more and register for updates for our next conference at https://rubyconfth.com/

RubyConf TH 2023

00:00:07.000 I'm in Bangkok for the first time. I'm Benji, a software engineer at Zappy.
00:00:09.519 Originally, I'm from Cape Town in South Africa but currently living in London.
00:00:12.400 A little bit about me: I love traveling, being in nature, cooking, and enjoying different kinds of food. I hope you can imagine how much I'm loving Bangkok.
00:00:18.760 I also have a strong passion for technology, coding, and data. If you're interested in any of these topics, come and grab me at the afterparty for a chat.
00:00:26.279 Today, I’ll be talking about a fun project I worked on last year with our Zappy X team, the research and development unit of Zappy.
00:00:34.879 In this team, we are able to explore some intriguing moonshot ideas that we usually wouldn't have the time to pursue.
00:00:36.879 Before diving into the project, I want to give credit to our Chief Innovation Officer at Zappy, Dave Burch, for his leadership and determination on this project.
00:00:45.079 This project is a custom data indexing system built using Ruby, graphs, and bitmaps, which Dave conceptualized.
00:00:50.879 I hope you enjoy it! Here’s a quick overview of what we’ll be covering this morning. We'll start with some background information, painting a picture of the landscape before we had this system.
00:01:06.000 Then we’ll discuss some problems we faced. This includes how we were storing context in our data, how we were managing our data, and how we lacked connections between different data points.
00:01:18.760 After that, we’ll get a bit technical as I go through details about the bitmaps and the graph.
00:01:21.720 To conclude, I’ll present some performance metrics and architecture overviews. Sounds good? Let’s get cracking!
00:01:25.240 At Zappy, we focus on collecting survey data. We have suites of research products on our platform that hold collections of questions or surveys.
00:01:35.240 From these surveys, we perform modeling to extract useful insights, which our customers use for decision-making in advertising.
00:01:44.880 We usually test stimuli such as videos or images. We handle everything from ensuring that the right people answer the surveys and get rewarded, down to executing algorithms through our computation engine.
00:01:53.600 Let’s explore what this process looks like in our system. We have respondents answering questions in our surveys.
00:01:58.240 These questions are transformed into what we call measures. A measure is a digital representation of something in the real world.
00:02:03.480 I might use the terms question and measure interchangeably, as at this moment, we’re only collecting readings from the real world through our questions.
00:02:14.120 Once these questions are input into our reporting platform, we model the data through our in-house computation engine called Quattro.
00:02:26.200 Quattro allows us to leverage Python's pandas through Ruby, enabling us to store and perform operations on these measures as pandas data frames.
00:02:32.520 Our CTO, Brendan, explains the underlying reasons for this choice in his talk from RubyConf in 2014. For me, it simply comes down to our love for Ruby.
00:02:40.720 These modeled measures are stored in our SQL database as serialized data frames, or 'pickled' data frames as they're called in Python. When users come to our platform, they can select the surveys of interest and dive into various charts.
00:02:50.720 When fetching data for these charts, we retrieve the respondent-level data from the SQL database, which is mostly pre-transformed or cached. We then perform additional computations to provide meaningful insights.
00:03:02.320 The charts allow for filtering, enabling us to analyze how different demographics responded to stimuli. They also give us norms or benchmarks, helping people gauge whether things are performing well.
00:03:10.480 As it stands, the platform excels in this type of analysis, allowing users to precisely define which subset of studies they wish to analyze and compare.
00:03:22.320 The architecture is also robust at storing the dependencies behind the computational models, and Quattro is optimized for processing these models and their dependencies.
00:03:32.160 However, we wanted more. Our goal was to query all of our data in real-time, and also to store the connections and relationships between different data points to create a richer dataset for providing more useful insights.
00:03:47.400 As you might imagine, achieving this isn’t simple. Let's review some of the problems we faced.
00:03:55.040 The first issue was 'context'. When fetching data, we needed to ensure that all data points represented the same concept. For example, if we want to know how the brand Yamaha performs on a particular metric like 'ease of use', we need to ensure the data we retrieve is relevant.
00:04:07.000 If a query were to retrieve all data where the measure is 'ease of use' and the brand is Yamaha, we may return an odd distribution if the dataset includes measures from motorcycles and pianos.
00:04:19.680 Knowing the context in which a given measure was asked is crucial for accurate meta-analysis. Was the survey focusing on vehicles or musical instruments?
00:04:31.480 The second problem was related to storage. We store our data as serialized data frames in SQL, but we're storing each measure on a survey level. If we've run four surveys with the same measure, we'd have four separate serialized measures.
00:04:46.040 At the end of the day, we need to concatenate these measures together to obtain a comprehensive dataset.
00:04:55.360 This requires iterating through all the surveys to fetch and deserialize that measure, which becomes cumbersome when dealing with substantial datasets ranging from 10,000 to 100,000 surveys.
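To make that cost concrete, here is a minimal sketch of what the pre-measure-store path amounts to. It is illustrative only: the survey objects, the measures/serialized_measure helpers, and the use of JSON in place of the real pickle-based serialization are all stand-ins, not the actual platform code.

    require "json"  # stand-in for the real (pickle-based) serialization

    # Every survey that contains the measure must be fetched and deserialized
    # before the per-survey frames can be concatenated into one dataset.
    def ease_of_use_across(surveys)
      surveys
        .select { |survey| survey.measures.include?("ease_of_use") }
        .map    { |survey| JSON.parse(survey.serialized_measure("ease_of_use")) } # deserialize each one
        .reduce([]) { |all_rows, rows| all_rows + rows }                          # concatenate
    end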
00:05:10.640 The final issue we faced was connecting different data points, or what we refer to as harmonization. Harmonization is about understanding which data should be treated equivalently.
00:05:20.560 We can break harmonization down into three broad categories: the measure (or question asked), the stimulus (the context), and the audience (who answered the question).
00:05:36.520 For measure-based harmonization, consider this scenario: one survey asks, 'On a scale of 1 to 10, how easy is this to use?' In another survey, we may ask, 'After that amazing experience, what would you rate its ease of use out of 10?' If both questions essentially mean the same thing, they should be treated equivalently.
00:05:45.200 However, it's noteworthy that the second question introduces a bias by implying that the experience was amazing, which those answering the first question might not agree with. We can then adjust our harmonization approach accordingly.
00:06:03.160 Now, in order to query our entire dataset considering both context and harmonization, we needed a solution that was also fast, real-time fast.
00:06:16.080 Let’s step back and summarize where we've been. We’ve explored the challenges we encountered before the use of the measure store, and now let's introduce the innovative solution: the measure store itself.
00:06:40.000 The measure store was built with the core principle that the API needs to be simple and easy to understand. To make requests to the measure store, you can provide three pieces of information: the context or scope, the measure, and any dimensional filters you might want to include.
00:07:04.640 Let’s look at how this applies to our Yamaha example. In this query, we scope the request to surveys categorized as motorcycles and brand Yamaha, asking for the ease of use measure.
00:07:14.080 The magic happens when we include dimensions! If we want to filter down to only those responses that rated ease of use between 7 and 10, we can add those specific dimensions.
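Expressed in Ruby, that request might look roughly like the sketch below. The MeasureStore.query method, its keyword arguments, and the result object are assumptions made for illustration, not the gem's actual API.

    # Hypothetical shape of a measure store request: a scope (context),
    # a measure, and optional dimension filters.
    result = MeasureStore.query(
      scope:      { category: "motorcycles", brand: "Yamaha" },  # context the data was collected in
      measure:    "ease_of_use",                                  # which question we care about
      dimensions: (7..10).to_a                                    # only responses rated 7 to 10
    )

    result.count  # number of matching respondents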
00:07:34.880 Okay, we’ve seen how to interface with the measure store. Now, let’s visualize it in action. I intended to perform a live demo, but I didn’t trust the internet connection here. So, instead, we’ll go through some GIFs.
00:07:55.760 In the first demo, we’ll examine how a particular measure performed across the different countries where we've run advertisements. The measure in question is ad distinctiveness, which may be formulated as asking respondents, 'On a scale of 1 to 5, how distinctive was this ad?'
00:08:25.600 We’ll loop over the countries where we've run ads and retrieve the distribution of the measure for each respective country. I won’t delve into the snippets, but we’ll also generate a chart to visualize the results.
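The loop driving that demo is conceptually just this. It is a sketch: the country list and the distribution helper are placeholders, not the real snippet shown in the talk.

    # For each country, scope the request to ads run in that country and
    # pull the response distribution for the ad distinctiveness measure.
    countries = %w[US GB DE TH]  # placeholder list

    distributions = countries.to_h do |country|
      result = MeasureStore.query(
        scope:   { country: country },
        measure: "ad_distinctiveness"
      )
      [country, result.distribution]  # e.g. counts per 1-5 rating
    end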
00:08:45.920 Here’s a GIF showing our results in real-time. Running the command produced results within 17 milliseconds for 700,000 respondents, indicating that the trend is favorable. The ads in the United States appear to be quite distinctive.
00:09:07.120 Now, let’s check the data from the United Kingdom, which returned in just 6 milliseconds. However, there were fewer respondents, about 155,000, which means we have far more ad distinctiveness data from the US.
00:09:31.360 From this, we can infer that advertisements in the UK are perceived as less distinctive than those in the US, or perhaps that audiences from the UK possess a more critical eye, which can often be attributed to the weather!
00:09:57.040 Now, we’ll dive into another type of analysis that involves cross-referencing two different measures. This is particularly popular for seeing how two measures relate to one another, such as measuring persuasion against ad watchfulness.
00:10:12.560 If a respondent watched the entire ad, do they find it more persuasive or not? The watchfulness measure only contains two responses: either a yes or a no. We’ll iterate through those options, testing the relationship between them.
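A sketch of that cross-measure loop, again with an assumed API shape; treating the watchfulness answer as part of the scope is just one plausible way to express the cross-reference.

    # Split respondents by whether they watched the full ad (yes/no), then
    # look at how each group answered the persuasion question.
    persuasion_by_watchfulness = %w[yes no].to_h do |watched|
      result = MeasureStore.query(
        scope:   { ad_watchfulness: watched },  # guessed shape for the cross-reference
        measure: "persuasion"
      )
      [watched, result.distribution]
    end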
00:10:43.600 Running that query yields results that come back to us in just 23 milliseconds for around 1.5 million respondents who were surveyed regarding the persuasion question and the ad watchfulness measure.
00:11:00.960 We’ll do the same process for those who didn’t watch the full ad, which only took 8 milliseconds. That group is much smaller, since respondents are incentivized to watch the full ad.
00:11:15.680 The results illustrate an interesting trend: those who watched the full ad found it more persuasive.
00:11:32.960 The last demonstration I’ll cover is harmonization, which addresses a classic problem: equivalent data being labelled differently across studies. Here we see the regions for the studies we ran across the United States.
00:11:50.560 We need to resolve a translation issue due to some studies being conducted in Spanish. Our task is to harmonize that data to analyze it collectively.
00:12:10.240 We start by looking at the counts from the Northeast region and a region with Spanish-speaking respondents. In the Northeast, there are 360,000 respondents, compared to just 9,000 from the other region.
00:12:28.600 We can execute a semantic equality command, which harmonizes the data and runs the counts again, demonstrating how respondents from both regions can now be included.
00:12:49.440 We’ll replicate this for the Southern and Sur regions, analyzing the counts pre-harmonization. This time, though, we harmonize with a command that is not bidirectional.
00:13:02.920 This means that data from the Southern region can now include responses from the Sur region, marking a one-directional connection.
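Sketching the harmonization commands themselves: the method names and the 'Noreste' label below are hypothetical; what matters is the bidirectional versus one-directional distinction described above.

    # Bidirectional: the Northeast region and its Spanish-language equivalent
    # (placeholder label) are treated as the same region in both directions.
    MeasureStore.semantically_equal("region:Northeast", "region:Noreste")

    # One-directional: queries for the Southern region also include Sur
    # responses, but not the other way around.
    MeasureStore.semantically_includes("region:Southern", "region:Sur")

    # Re-running the count for the Southern region now reflects the
    # harmonized data.
    MeasureStore.query(measure: "region", dimensions: ["Southern"]).count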
00:13:11.000 That covers how the measure store functions from the outside: we’ve seen how the API operates and how connections between data points are made. Now let’s look at what’s going on underneath, which is really the key to the measure store.
00:13:34.600 Let’s review: the measure store architecture consists of components that manage both raw data and data context. We'll begin by examining how we store the actual data.
00:13:45.240 To visualize our data storage, think of an index with multiple columns. In our application, the index represents respondents, and the columns are measures, which are further divided into dimensions.
00:14:00.400 Each dimension is represented as a bitmap. If a respondent answered 'yes' for a specific dimension, their bit will be set, and if not, it will remain unset.
00:14:09.440 For instance, if respondents were asked to rate ease of use on a scale from 1 to 10, we would have 10 dimensions, with only the corresponding dimension for each respondent set to true.
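To make the bitmap idea concrete, here is a tiny illustration using plain Ruby integers rather than our storage code: one bitmap per dimension, respondent i maps to bit i, and a filter is just a bitwise AND followed by a popcount.

    # Respondents 0..7; each dimension of "ease_of_use" gets its own bitmap.
    # Respondents 2 and 5 answered 7, respondent 3 answered 10.
    ease_of_use = {
      7  => 0b00100100,  # bits 2 and 5 set
      10 => 0b00001000   # bit 3 set
    }
    in_scope = 0b00101110  # which respondents belong to the surveys in scope

    # "Who in scope rated ease of use 7?" is a bitwise AND...
    matches = ease_of_use[7] & in_scope
    # ...and the answer count is a popcount of the result.
    matches.to_s(2).count("1")  # => 2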
00:14:20.080 When we first implemented this storage format, it allowed for fast data retrieval. However, as our dataset rapidly increased in size, we discovered that the bitmaps were sparse, overwhelmingly populated with zeros.
00:14:36.440 We adopted a clever compression scheme known as Roaring bitmap compression, which shrinks our bitmaps dramatically by eliminating those runs of zeros.
00:14:50.960 This optimization means we only need a single bit to store information for respondents linked to a measure or dimension.
00:15:03.320 Thus, our data storage is now efficient for querying a specific measure without the need for iteration or deserialization of large datasets.
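The intuition behind the compression is that a sparse bitmap is described far more cheaply by the positions of its set bits than by every individual bit. This toy sketch shows the idea only; Roaring's real containers are considerably more sophisticated.

    # A bitmap over a million respondents where only three answered "10"
    # is almost entirely zeros...
    set_bits = [12, 48_731, 990_004]

    # ...so storing the sorted positions (one of Roaring's container styles)
    # takes a handful of integers instead of roughly 125 KB of mostly-zero bytes.
    def member?(set_bits, respondent_id)
      set_bits.bsearch { |bit| bit >= respondent_id } == respondent_id
    end

    member?(set_bits, 48_731)  # => true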
00:15:16.960 Now, let’s explore how we connect this data. Initially, we modelled the relationships between data points as many-to-many associations in a PostgreSQL database.
00:15:30.960 However, we realized that a different structure was necessary to handle multi-hop relationships, making it apparent that a graph would be an ideal solution.
00:15:45.760 Let me briefly outline the graph structure we employed. The first node introduced is the scope node, and we opted for a property graph where nodes and edges can contain associated properties.
00:16:02.960 Each scope node includes properties like entity, attribute, and value, allowing us to connect measures and their scopes, ensuring we can harmonize metrics accordingly.
00:16:20.680 Next, we have the measure node, which holds the measure’s name. This node is also scoped by adding an edge to represent the context in which it has been asked, enabling us to create a consistent measurement system.
00:16:50.760 We continue this pattern, adding nodes and edges (including the semantic-equality links used for harmonization) until every measure, scope, and dimension is explicitly connected, giving us robust links between related data points.
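In a property-graph query language like openCypher (which the graph database we use speaks; more on the stack in a moment), the scope and measure nodes described above can be pictured with queries along these lines. The labels, property names, and graph key are illustrative, not our production schema.

    require "redis"
    redis = Redis.new

    # Create a scope node (entity/attribute/value) and a measure node, then
    # connect the measure to the context it was asked in.
    redis.call("GRAPH.QUERY", "measure_store", <<~CYPHER)
      CREATE (s:Scope {entity: 'survey', attribute: 'category', value: 'motorcycles'}),
             (m:Measure {name: 'ease_of_use'}),
             (m)-[:SCOPED_BY]->(s)
    CYPHER

    # Later, find every measure asked in that context.
    redis.call("GRAPH.QUERY", "measure_store",
      "MATCH (m:Measure)-[:SCOPED_BY]->(s:Scope {value: 'motorcycles'}) RETURN m.name")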
00:17:05.600 Wrapping up, we have the capability to execute queries in the measure store through scoped measures to find corresponding data based on user queries.
00:17:15.960 The last detail I want to share is regarding the technology stack implementing this measure store. We rely heavily on Redis, utilizing its graph database module to enhance our storage capabilities.
00:17:47.440 Thanks to how compressed our dataset became, we can store both the graph and the bitmaps in memory on Redis. We're currently using RedisGraph to manage our nodes and edges.
00:18:07.680 For the bitmap storage, we utilize a Redis module called 'Redis Roaring', which is designed specifically for storing and operating on Roaring-compressed bitmaps.
00:18:21.760 The measure store gem is integrated into all client applications. In these apps, it's just a matter of configuring the location of the Redis instance; the gem takes care of issuing the query commands.
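A hypothetical sketch of that client-side setup; the configure block and option names are assumptions, since the talk doesn't show the gem's actual configuration API.

    # Point the gem at the Redis instance that has the RedisGraph and
    # Redis Roaring modules loaded (hypothetical configuration API).
    MeasureStore.configure do |config|
      config.redis_url = ENV.fetch("MEASURE_STORE_REDIS_URL", "redis://localhost:6379/0")
    end

    # After that, queries from any client app go through the same interface:
    MeasureStore.query(scope: { country: "TH" }, measure: "age").count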
00:18:38.800 We have been pleasantly surprised by how fast Redis is. To illustrate this, let’s review some of the performance benchmarks from our measure store.
00:19:01.600 I sought to retrieve all age measures from one of our products, which included 3,700 surveys, resulting in a total of 1.56 million respondents. On our reporting platform, fetching this data took a frustrating 90 seconds.
00:19:22.880 However, when I repeated the request using the measure store, it was completed in just 495 milliseconds—a stunning improvement of 180 times.
00:19:38.720 To benchmark further, I measured how quickly we could return all data for just the age measure, which took a mere 9 milliseconds for 3.6 million respondents.
00:19:51.760 This efficiency stemmed from avoiding iterative processes, emphasizing the speed benefits of the measure store.
00:20:04.480 In terms of storage, I compared existing SQL database measures to those in the measure store. I loaded a subset of 40 studies into SQL, spanning 3.5 gigabytes, while the same data in the measure store occupied only 28 megabytes.
00:20:16.560 This represents a staggering compression ratio of 128 times, showcasing how efficient the measure store is for handling large datasets.
00:20:43.920 In closing, I want to share what lies ahead for the measure store. We’ve recently graduated this project from the R&D phase, forming several teams focused on taking these core concepts to production.
00:20:55.040 Our goal is to ensure we do not jeopardize our data integrity while leveraging its power in memory on Redis. These developments will culminate in a new reporting platform that we have been actively working on.
00:21:12.320 Personally, I am refining the data engine and modeling aspects to ensure the computation engine can function effectively with this new data storage system.
00:21:38.640 Thank you for listening to my insights on graphs, bitmaps, and data indexing. Enjoy the rest of the conference!