RailsConf 2017

Reporting on Rails - ActiveRecord and ROLAP Working Together

Reporting on Rails - ActiveRecord and ROLAP Working Together

by Tony Drake

Summary

The video titled "Reporting on Rails - ActiveRecord and ROLAP Working Together" by Tony Drake, presented at RailsConf 2017, explores the integration of ActiveRecord, a key component of Rails, with Relational Online Analytical Processing (ROLAP) to build efficient reporting and dashboard systems for web applications.

Key Points Discussed:

  • Defining Reporting Needs: The journey of creating dashboards starts by identifying the questions stakeholders want answered from the data. The speaker emphasizes that understanding these needs from the outset helps prevent wasted resources on irrelevant data queries.
  • Understanding OLAP: OLAP is introduced as a standard for data warehousing and analytics. Drake explains that ROLAP is particularly useful for dynamic queries against traditional SQL databases. This allows applications to generate insightful reports from both historical and transactional data.
  • Fact and Dimension Modeling:
    • Fact Tables: Represent the primary data of interest (e.g., support tickets or billing records).
    • Dimension Tables: Used to slice and dice the data through relationships and attributes (e.g., grouping by cost centers or carriers).
  • Star Schema: The use of a star schema for organizing database tables optimizes reporting. This structure helps avoid complex joins and duplicate row issues common with has-many relationships.
  • Implementation Challenges with ActiveRecord: ActiveRecord is sufficient for simple data queries but has limitations. Issues discussed include difficulties in grouping and aggregating data without complex queries or logical workarounds.
  • Active Reporting DSL: To address these challenges, Drake introduces a lightweight Domain-Specific Language (DSL) named "active reporting" that utilizes ActiveRecord to streamline the reporting process. This DSL allows developers to ask intricate questions of their database while receiving clean and organized results.
  • Key Features of Active Reporting:
    • It retains the context of OLAP terminology (e.g., metrics, filters) and allows for dynamic report generation based on user input.
    • Filters and dimensions are whitelisted to maintain safety against erroneous user inputs.
  • Best Practices: The video concludes with best practices for database reporting, including:
    • Avoiding complex joins
    • Utilizing indexing effectively
    • Considering read-only database slaves for better performance as the application scales.

Conclusion

The presentation highlights the importance of translating business needs into actionable data insights. By integrating ActiveRecord with ROLAP concepts and employing the active reporting DSL, developers can create flexible and efficient reporting solutions that enhance decision-making processes.

This approach allows teams to quickly pivot as requirements evolve while maintaining the integrity and performance of their data queries.

00:00:12.349 All right, let's get this party started! Whoo! So, this is Reporting on Rails—ActiveRecord and ROLAP working together.
00:00:21.029 My name is Tony, and I am a senior developer at Mobe in Indiana. We manage corporate cellphone accounts for Fortune 500 companies, everything from bill optimization to device procurement, including tier one tech support.
00:00:33.930 I work specifically on the billing and reporting team, where I handle the over 1 million devices that we have under management, along with all the billing data that goes back several years.
00:00:46.500 What I essentially do is take information from various sources—carrier billing, tickets to a call center, and orders placed from carriers—shove them into Postgres, sprinkle some magic on top, and out pops pretty charts and graphs that our clients love. Yes!
00:01:00.829 To better explain what reporting is, I can use my current experience with Mobe. Way back, around 2012, ownership came to the dev team and uttered the dreaded "D" word—they wanted a bunch of dashboards to report on all the data we have.
00:01:23.460 Not just one dashboard, but dashboards for our client administrators, dashboards for our internal support staff, and dashboards for ownership to actually see how the company is doing. Oh, and they wanted user-defined filtering—being able to slice and dice data any way they wanted—along with other bits of scope creep that arose over the years.
00:01:46.200 So, where do we begin on this? The dev team laid out a couple of notes: we didn't want bulk and bloat in our result set. This means we didn't want actual ActiveRecord objects coming back from the database, as they can be quite large, and we didn't really need all that data.
00:02:00.829 What we really wanted was generic, uniform data coming back from whatever system we built. This means a plain old bunch of arrays with rows of hashes containing the information we want, so we can call JSON on it, shove it into the flavor of the month’s JavaScript front-end framework for charts, and we'd be on our way.
00:02:19.260 So, what is reporting, anyway? I'm going to assume that most of our apps run against a relational database. A relational database holds the data, but data is completely worthless to you and me—humans can't really work with data; computers do.
00:02:40.200 What humans can work with is information. Reporting and analytics take data and convert it into information so humans can make decisions from it. More importantly, reporting answers questions, and this is probably the most important step in preparing your app to generate dashboards: figuring out upfront what questions you want to answer.
00:03:17.849 This means going to every primary user and asking, "What do you want to know?" or "What information do you need?" Because if you don't do that, you'll start randomly throwing queries against Postgres or MySQL, and you'll just end up throwing spaghetti at the wall without yielding a usable result.
00:03:54.260 Here's some examples of questions that we want to answer for each of the three stakeholder groups at Mobe. We want to answer all of these questions and more effectively.
00:04:07.169 Fortunately, there is an industry standard term called OLAP (Online Analytical Processing) that is built for data warehousing and analytics. In traditional OLAP products, everything is rolled up into memory in the form of data cubes, where all the information is pre-sliced and diced, pre-grouped together, allowing for very fast queries.
00:04:40.530 OLAP commonly deals with aggregates such as counts, maximums, minimums, and averages. Typically you don't deal with individual rows from your data; you care about the grand picture. However, traditional OLAP is often tied into Oracle products and MySQL products, which are enterprise-level solutions, and when you think enterprise, you think money. Unfortunately, we didn't have money, so we needed a better way: ROLAP to the rescue.
00:05:13.440 ROLAP (Relational Online Analytical Processing) runs with SQL, which is the language our normal databases talk. It also allows for dynamic queries to be generated on the fly, letting us get any information we want very quickly with just a little setup.
00:05:31.770 A nice feature of ROLAP is that it can work with both historical data (like billing information) and transactional data (like support tickets that come through our support center), which changes daily. ROLAP can handle both types of data, while traditional OLAP is more suited for historical information that doesn’t change.
00:06:03.650 When working with reporting, you want to think in OLAP terms, not just simple SQL. Let’s go through some terminology, starting with the fact model, also known as a fact table. This is the primary table that holds the information you want to extract from the database.
00:06:55.030 In Rails, the fact model is typically a standard model. By looking at the questions our users want to answer, we can define what a fact model is. For our purposes, the support tickets table and the lines table represent our fact models.
00:07:22.839 A dimension is a way to take your data and slice and dice it into various chunks, commonly through relationships with other tables in the database. These often correspond to foreign keys linking to other tables, but can also be columns on the fact model itself.
00:07:46.460 For example, in Mobe, we have a cost center living on our lines table, so we can group by this cost center. We also have a carriers table, allowing us to join it to get additional information.
00:08:01.720 When focusing on reporting questions, I always look for the word "by." For instance, if I want to sum up support tickets by type, I would group by type. If I have active lines of service, I would group by carrier, joining against the carriers' table.
00:08:32.859 Next, we have dimension hierarchy. This allows for navigating up and down a hierarchy of data. A common example is dates; if I want to group orders by day, I can group by day. However, I might want to go up a level and request everything grouped by month, quarter, or year.
00:09:05.010 In Mobe, for example, devices comprise various hierarchies, including model number, manufacturer, operating system, and wireless technology, allowing for different levels of grouping.
00:09:30.080 Dimension members, or dimension labels, are actual pieces of information that you display in a report. For instance, if a line has a carrier ID, we can group by that. However, to present it in a table or a pie chart, we should use human-readable labels like the name of the carrier.
00:09:56.250 Another concept is filters. While OLAP typically is set up for static queries, ROLAP allows us to use a 'where' clause in SQL to further refine our data set. In SQL, filters correspond to the 'where' clause, and in Rails, this is achieved through the 'where' method, standard ActiveRecord scopes, or the Ransack gem.
00:10:48.580 Now, a measure in OLAP terms refers to any aggregate function, such as average, sum, max, min, or count. A measure could be the sum of total charges for an order; it is the column that you plug into the aggregate function.
00:11:32.260 On the other hand, the metric is essentially the report itself. The metric represents the result of the question you want answered. In SQL, this typically consists of the entirety of the query, while in Rails, it encompasses everything built with ActiveRecord to execute against the database.
00:12:28.280 I know that’s a lot of information, but here's how everything ties together. In this case, the entire metric is the question we want to answer. The sum for mobile charges could represent a fact model, with the last billing period as a filter and grouping by cost center as a dimension.
00:12:56.560 Another thing to consider is that the more filters and dimensions you add, the more complex your queries become. This is essentially the gist of ROLAP, paired with its SQL and Rails equivalents.
00:13:36.480 So, how do you implement this? First, your data must be organized in a way conducive to reporting. The most common schema is called a star schema, which is simpler and easier to visualize than the snowflake schema.
00:14:00.340 Essentially, when you chart all your tables, you will have fact models in the center, branching out into each possible dimension.
00:14:07.639 However, one important note is that reporting on "has many" relationships can be quite difficult. While it's technically possible, using the 'distinct' keyword can lead to invalid SQL in databases like Postgres due to how aggregations affect row duplication.
00:14:30.860 To illustrate this with Mobe as an example, we identified three fact tables: support tickets, line of service, and a row on a billing statement, along with various dimensions they can relate to. Notice that in the lines table we have a carrier field that can relate to another carrier table.
00:15:17.720 To avoid multiple joins whenever possible, in ROLAP, we often want to denormalize things by putting data directly on the fact model instead of relying on multiple joins. This approach makes it easier to filter and group data.
00:15:56.790 Now, creating a date dimension is essential. Instead of making a date column on support tickets, we create a created_at ID that links to a date dimension. This allows us to execute queries such as, 'Give me all support tickets broken down by quarter,' by joining the date dimension and grouping by the quarter.
00:16:30.140 Since each row in the date dimension includes several dates broken down into parts, you can navigate through hierarchies as you see fit. ActiveRecord can indeed handle all these functionalities.
00:16:59.420 However, the underlying reason I'm here discussing this is that, while ActiveRecord provides the basis for constructing ROLAP queries, it does have limitations. For example, there's no effective way to programmatically group by aggregate columns without manually specifying them.
00:17:43.910 Moreover, ActiveRecord isn't particularly adept at describing models in ROLAP terms, like stating that these fields are dimensions, nor does it have an efficient method for storing pre-made queries.
00:18:02.480 What could we do instead? One option would be to hard-code all of our queries, but this would make it difficult to define custom 'where' clauses and joins. This brings us to the solution we implemented.
00:18:35.170 We ended up defining our own reporting library, which I recently adapted into an open-source gem called Active Reporting. This provides a DSL-like system that allows you to define queries using ROLAP terms.
00:19:00.830 This lightweight DSL interacts with ActiveRecord to acquire information about the database and tables, helps build the necessary queries, and directly executes them. Instead of returning ActiveRecord objects, you receive simple arrays of hashes containing the data you need.
00:19:24.030 This keeps the integration with ActiveRecord lightweight, with only a couple of new methods added to the model. While I haven't yet built a demo app, the gem's version 0.1.1 is released, and the supporting DSL is largely operational.
00:19:41.760 Every model in your app that you want to report on should have a corresponding fact model. By naming the fact model appropriately, you ensure that it links to the correct model and separates reporting functionality from standard model logic.
00:20:03.530 Each fact model defines its own dimensions, allowing for a whitelist of dimensions for user input. This provides flexibility in changing how we group and filter reports without cluttering the main model.
00:20:25.750 This user interface can allow for dynamic reports; for example, if I want my lines grouped by carrier account instead of carrier, I can change that easily in the reporting layer.
00:20:50.380 Dimension filters work as where clauses, and the fact model contains a whitelist of acceptable filters so users don’t call unsafe methods. This precaution ensures control over what can be modified, maintaining data integrity.
00:21:29.560 The execution within the gem involves creating a metric that defines the question we want answered. The metric is associated with a fact model and can include dimensions and filters applied dynamically as needed.
00:22:01.190 This allows us to transform a simple metric into a more complex one as user input can change it. Then, you take that metric and shove it into a report object, generating the SQL query and executing it.
00:22:28.790 The report object allows merging in user input for dynamic reports, while the SQL generated targets those specifications. The power of ROLAP shines here because it supports defining where clauses on the fly.
00:22:54.300 Here's an example of the SQL we built, which includes selecting specific columns, aggregating total charges, and joining on dimensions while applying filters. At the end of the process, we get back simple arrays of hashes.
00:23:19.150 Ultimately, reporting is about acquiring the aggregates of information we care about, not merely returning rows of data. In conclusion, I'd like to share some pro tips for database reporting.
00:23:42.750 First, avoid 'double jumps' in queries as much as possible; try to normalize only when necessary. For instance, if you want active lines by carrier, keep the ID directly on the table to reduce complexity.
00:24:15.580 Another method to counteract challenges with 'has many' relationships is through counter caches—whether via a built-in plugin or pre-computed values handled by background jobs.
00:24:37.500 Make sure to index wisely—if you plan to dimension by something, index the foreign key. Use efficient filtering columns based on user input for faster retrieval.
00:25:00.760 As your application grows, consider read-only replication for reporting databases to offload read operations from the master. In particular, Postgres or sharding can help manage workload efficiently.
00:25:31.080 Thank you for listening! Feel free to check out the gem on GitHub, copy the slides if you're interested, and reach out to me on Twitter if you have any questions.