Lightweight Business Intelligence with Ruby, Rails, and MongoDB

by Coraline Ada Ehmke

In the talk "Lightweight Business Intelligence with Ruby, Rails, and MongoDB" by Coraline Ada Ehmke, the speaker discusses an agile approach to business intelligence (BI) that enables companies to effectively utilize data for decision-making without relying on expensive data warehousing solutions. Ehmke emphasizes the importance of real-time data access, especially for businesses that must adapt quickly in a competitive landscape.

Key Points:
- Definition of Business Intelligence: Business intelligence is about organizing and analyzing mission-critical knowledge within a company to support historical perspective and real-time decision-making.
- Three Pillars of Technology: Ehmke identifies three key components that support business technology: infrastructure, applications, and data. She notes that while developers often focus on applications, engaging with data can significantly enhance business outcomes.
- Challenges in Traditional BI Approaches: Businesses often default to using transactional databases for reporting, which can create performance issues and outdated information. Ehmke criticizes the reliance on consultants to build complex data warehousing systems, which may lead to a disconnect from real-time data needs.
- Lightweight BI System: Ehmke advocates for a lightweight BI system that can be developed iteratively using existing developer resources. This system focuses on addressing data as an asset and creating real-time decision support tailored to business needs.
- Collaboration with Stakeholders: Developing effective BI solutions requires working closely with business stakeholders to identify critical questions that the data should answer, fostering a thorough understanding of both data structures and business motivations.
- Schema Design Focused on Facts: Ehmke emphasizes the importance of designing databases around meaningful facts rather than objects, which allows for better reporting and insights.
- Technology Recommendations: Ehmke endorses using familiar technologies, particularly Ruby, for BI development due to its Test-Driven Development capabilities and ease of deployment. She highlights the importance of keeping APIs flexible to allow multiple uses of data and suggests listening to data streams in real-time for dynamic BI solutions.

Conclusions:
Ehmke concludes that business intelligence does not have to be an insurmountable challenge. With the right approach, iterative development, and a focus on meaningful collaboration, companies can harness their data effectively to drive real value. She encourages developers to engage actively with their data and to evaluate their BI systems continuously so that they keep meeting changing organizational needs.

00:00:19.930 Alright, hi everybody! I am Coraline Ada Ehmke. I work at Apartments.com, and we're hiring. I'm a principal developer there. Like our previous speaker, I did not have a Computer Science degree; in fact, I'm a college dropout. But I'm an autodidact, which means that I know the word "autodidact." Basically, I taught myself everything that I care about. One more thing: I'm from Chicago. There are no fewer than five developers in Chicago who do Ruby, and I'm the one with the fuchsia hair.
00:00:50.270 So, what is business intelligence and what do I mean by it? This is what we're going to be talking about today: basically, the stages of adoption to business intelligence, whether or not we can build it ourselves, and whether we should build it ourselves. Of course, I have to start with the definition, right?
00:01:14.390 Business intelligence means taking mission-critical knowledge inside your company, organizing it to provide a historical perspective, and using it for real-time decision-making. It sounds pretty straightforward. It's reporting, essentially. Sometimes it's called data science, sometimes it's called data warehousing, and there are all sorts of names for it. In the end, it's about taking the data that is essential to running your company, making sure it's accessible to the people who need it, and putting it in a form that will support the business in making decisions moving forward.
00:01:36.920 I like to talk about the three pillars of technology supporting business: technology that's in business, which we're not always partners in, and we sort of take for granted now. A software developer said, "If you're a company, of course you have a development team," and of course you have these other things. So, I think there are three main parts of how technology supports business: the first is infrastructure, which includes everything from the hardware that the applications are running on to the accounting system, the HR system, and all the stuff that I find quite boring.
00:02:01.350 The second pillar is applications, and as developers, this is where we spend most of our time. We enjoy writing applications that deliver business value. These applications are what let the business scale and hopefully attract customers. The third pillar is data. For a developer, data is often an afterthought; we use our data stores to store objects that we need in our applications. Once we're done with those objects, we often stop thinking about the data. Maybe you have a report to build for the accounting team or something similar, but once it's built, you might not think about it again.
00:02:34.470 This is recognized as a flaw, especially as businesses grow larger. Typically, the business will want to address how to collect and use the massive amounts of data that have built up and how to actually turn it into an asset. When developers make friends with data, though, they can turn it into something useful. I want to talk about how you go about adopting business intelligence systems, and I'll do it in three acts.
00:03:02.570 The first two acts are how most companies typically approach it, and unfortunately, they often stop after the second act, throwing up their hands in despair and wandering off in search of enlightenment. The third act is the one that I hope all of you will build upon.
00:03:52.880 Now, this is a Fiji mermaid. I don't know if you're familiar with it. In the late nineteenth and early twentieth centuries, there were many traveling sideshows and circus displays. The Fiji mermaid was basically the torso of a monkey sewn onto the tail of a fish, with some papier-mâché to make the transition smoother. The legend was that it was caught by a fisherman in Fiji and is now on display for everyone to see. A lot of people believed in it. So, if you're the kind of person who believes in the Fiji mermaid, you're probably also the type who thinks you can report straight out of your transactional database.
00:04:28.970 Everyone has done this at some point in time. Don't feel embarrassed about it; I have done it many, many times. What happens when you do this? Our transactional data stores are built for transactions. We have distinct tables for each of our objects, with complex relationships between them, possibly including join tables or models. When it comes time to report on those, you get some really gnarly SQL, which brings with it performance issues. One of the workshops is actually addressing this topic a little later, so you might want to check into that.
00:05:36.980 To get around the performance issues, you might decide that you can write better SQL than someone else. Good luck! What you'll find is that when you're doing reporting, you're impacting your production resources. Your servers will slow down, so you might conclude that you should only run these reports at 1 AM. I hope you're not a global company, but in the end, you might just throw up your hands, thinking that stakeholders just don't know what they want. You might even give them access to the database, which I consider to be problematic.
00:06:03.820 In the end, after you lose all your data because someone in accounting messed up their connection to an Excel spreadsheet, you're going to go enterprise and bring in specialists. When a company reaches a certain size, they often deploy what I call the blue-and-khaki army. I don't mean to offend anyone in blue and khaki today (it's not all blue-and-khaki people), but they'll bring in consultants to create a data warehousing and enterprise reporting system.
00:06:49.990 What gets built is generally an entirely separate stack from everything else, usually running on a Windows server, often using Java. Nightly background jobs load the data from your transactional database into your data warehouse. Generally, these are run with ETL scripts: Extract, Transform, and Load scripts. They have no tests, and they aren't in source control. I'm sure everything will be just fine.
00:07:27.830 The schema in a data warehouse is optimized for reporting, which is a good thing, right? But it's reporting on day-old data. Maybe that's acceptable if you're sewing, but for me? I want fresh data, and I don't know about you guys. Also, remember that old process called waterfall? Data warehousing requires a waterfall approach. Due to the complexity of everything, you need to do all of your planning upfront. If you get it wrong, you may have to start all over again and pay the consultants for another three to six months.
00:08:01.190 These consultants are not cheap. In the end, you get something really enterprise-oriented and, if Edward Tufte were dead, he'd be spinning in his grave right now. So, don’t trust these guys with your data either. While they may be nice, I'm sure, and yes, they wear khaki and blue, their sandals are a mass demonstration of individuality that I've just never seen before.
00:08:54.840 Most businesses will give up at this point. They have a data warehouse that emails reports once a week, and they forget about it. They forget about the promise of it, but there is a third option, and I call it lightweight BI. When a company realizes that their data is an asset and too important to outsource, it may be time to revisit the build versus buy decision. The promise is real-time data, put together by people who understand the data. People who wrote the structures that the data is based on can actually provide real-time decision support using your existing development team and stack.
00:09:43.830 You won't have to maintain those Java packages on the Windows servers because everything is built by your existing dev team. You also get the ability to change your mind, because you're investing in people who understand how it works and can adapt when your business needs change. You don’t have to pay the army to come back and fix things for you. You can be iterative and agile. You won’t have to design every single report and enforce an entire schema from the start.
00:10:13.050 This is what being iterative and agile is all about. That's not what business intelligence or data warehousing has traditionally been about. You can trust your developers. I realize it's hard to do because data warehousing, business intelligence, and data science sound like big, scary things. The concepts may be a little difficult to grasp at first, but in the end, it's something that any competent team of developers can wrap their heads around.
00:10:53.890 They can deliver value in this domain and ultimately enhance your company by going through that process. So maybe I’ve piqued your interest in the idea of lightweight business intelligence. How do you actually go about getting started?
00:11:25.570 You want to collaborate, and I don't mean the sort of collaboration we regularly do as agile developers, where we work with our stakeholders and agree on things, then run within an iteration. I mean sitting down, getting out of your comfort zone, stepping away from technology for a while, and figuring out what the business is really about. What are the important pieces of data? They're not your user model; they're your customer.
00:11:55.110 So, work hand in hand with the people who will be served by the data that you present to them. This is not only to understand their needs, but it also allows you to communicate back to them what sorts of things are possible. They may not know every aspect of what's being recorded. For example, they might not realize that we're automatically recording the last login time. You can actually ask how loyal your customers are by observing how many times they log in.
00:12:38.730 This data exists and may be of value to them, so this is a two-way conversation. Next, formulate your questions. I like to think data is there to answer questions. Some people see it as a historical record or an audit trail, but I think its main role is to answer questions. If you don't know what your questions are, there's no way you can find the answer.
00:13:04.930 Work with your stakeholders to determine what questions you want to answer with the data. Generally, these are the same questions managers are asking their employees every day. For instance, how many signups did we get? Or is the lifecycle of our customers getting shorter or longer for onboarding purposes? As a developer, you'll start creating a solution by thinking about the data you already have. You might consider: if I took data from one place and merged it with data from another, what conclusions could I draw?
00:13:47.310 You want to take the inferences you make and turn them into facts. A fact has a specific meaning in data warehousing and business intelligence. I think it can be somewhat convoluted. We all know what a fact is; a fact is an answer to a question—a truthful answer to a question. If you didn't state your questions, you won't be able to discern where the facts lie. Thus, you should base your database schema on these facts.
00:14:22.830 You're not storing objects and state anymore; you're storing facts. Ideally, you have a table or document collection that answers one question, which means you'll want to denormalize your data. We're afraid of denormalization because we're taught from an early age not to denormalize, but to be successful in reporting it may actually be helpful. It's fine for data to be stored multiple times in different places as long as the facts stored are necessary. Skip the clever graphs. In my experience, graphs can be pretty distractions that deliver minimal value. The more ink that's used on a graph, the less informative it usually is.
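A minimal Ruby sketch of the fact-oriented storage described above. The question ("how many signups per day, by plan?"), the `SignupFact` struct, and all field names are assumptions made for illustration; they do not come from the talk. Each row repeats the plan name and date rather than pointing at other tables, so the question is answered in one pass over one collection.

```ruby
require "date"

# Hypothetical denormalized fact rows: everything needed to answer
# "how many signups per day, by plan?" is stored on each row.
SignupFact = Struct.new(:customer_id, :plan_name, :signed_up_on, keyword_init: true)

FACTS = [
  SignupFact.new(customer_id: 1, plan_name: "basic",   signed_up_on: Date.new(2014, 2, 20)),
  SignupFact.new(customer_id: 2, plan_name: "premium", signed_up_on: Date.new(2014, 2, 20)),
  SignupFact.new(customer_id: 3, plan_name: "basic",   signed_up_on: Date.new(2014, 2, 21)),
].freeze

# One pass, no joins: group by the two dimensions stored on the row.
def signups_by_day_and_plan(facts)
  facts.group_by { |fact| [fact.signed_up_on, fact.plan_name] }
       .transform_values(&:count)
end

signups_by_day_and_plan(FACTS).each do |(day, plan), count|
  puts "#{day} #{plan}: #{count}"
end
```

The trade-off is deliberate duplication: the plan name lives on every fact row, so a report never has to join back to the transactional schema.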
00:15:45.270 I prefer columns of data side by side if you’re doing a comparison. When not comparing, grouping data logically can communicate your ideas; it invites you to look at them and draw conclusions. With graphs, you can easily overlook the importance of the data itself.
00:16:25.040 So, here's a recipe for success in business intelligence. Your mileage may vary. Many people will have their opinions on what technologies you should use, but the important thing is that you choose technology you're comfortable with—something you know inside and out. This does not need to be the only way. I love Ruby! I think you all do too; Ruby is awesome because it's built for Test-Driven Development (TDD). Remember when I mentioned ETL jobs and that they typically have no tests? With Ruby, you can test the assumptions about how the data is working and how the facts are being collected.
00:17:02.640 Ruby is straightforward to deploy thanks to platforms like Heroku, which the previous speaker discussed. Ruby excels in data munging, and I often say that my experience as a Perl hacker significantly improved my data manipulation skills in Ruby. Great visualization libraries exist for Ruby, though many these days are on the JavaScript side. Nevertheless, they are straightforward to integrate, allowing you to create aesthetically pleasing charts that communicate information effectively.
00:17:47.200 I want to talk briefly about something called a statistical model. Essentially, with this model, a calculation is run every time data changes—not on demand. This means the system will be very fast, anticipating the user's question and having the answer available almost instantly. We're not discussing 24-hour delays; at worst, you might have a 30-second delay. Stop thinking about data storage as a way to throw around objects. Instead, think about it as storing information that answers questions, focusing on the denormalized schema, which optimizes for reporting.
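The idea of running the calculation when data changes, rather than when someone asks, can be sketched in a few lines of Ruby. The `SignupTally` class and its method names are hypothetical and stand in for whatever store a real system would use. Writing updates the answer; reading is only a lookup.

```ruby
# Hypothetical running tally: the answer to "how many signups on this day?"
# is updated whenever data changes, so reading it is a hash lookup.
class SignupTally
  def initialize
    @counts = Hash.new(0)
  end

  # Called once per incoming change (here, a signup on a given day).
  def record_signup(day)
    @counts[day] += 1
  end

  # Reads never recompute anything.
  def signups_on(day)
    @counts[day]
  end
end

tally = SignupTally.new
tally.record_signup("2014-02-20")
tally.record_signup("2014-02-20")
tally.record_signup("2014-02-21")
puts tally.signups_on("2014-02-20")
```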
00:18:39.740 ETL processes can also be a bit fragile. They typically run every 24 hours, but now there's no reason they can't be performed on-demand or streamed in real-time. One of the workshop topics I'll cover involves actual implementations of those elements. Let's say you have a service-oriented architecture, which I call a small app ecosystem.
00:19:10.270 In this ecosystem, you maintain a messaging server in the middle. You can have your data collection application listen to every message, recording it and performing calculations on data changes as a result of incoming data. This ensures that when someone checks a report, it's up-to-date; relevant calculations have already been run and updated.
00:19:50.220 These calculations are easily testable as you have inputs and outputs. This eliminates the need for someone to analyze terabytes of data and declare it "okay." The calculations run in the background, making data ready when you need it. You've probably heard that in SQL, you store the recipe, not the cake; I personally find happiness in storing both the recipe and the cake since you can't eat a recipe book.
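As a rough illustration of a collector listening to every message, here is an in-process Ruby sketch. `MessageBus` and `Collector` are hypothetical names, the message fields are invented, and a real ecosystem would use an actual messaging server rather than an array of blocks. The collector keeps each raw message and also updates a precomputed total, which makes the calculation a plain input-to-output function that is easy to test.

```ruby
# Hypothetical in-process stand-in for the messaging server: every
# subscriber sees every published message.
class MessageBus
  def initialize
    @subscribers = []
  end

  def subscribe(&handler)
    @subscribers << handler
  end

  def publish(message)
    @subscribers.each { |handler| handler.call(message) }
  end
end

# The data collection app: records each message, then updates the
# precomputed revenue total for purchases.
class Collector
  attr_reader :log, :revenue_by_day

  def initialize(bus)
    @log = []
    @revenue_by_day = Hash.new(0)
    bus.subscribe { |message| handle(message) }
  end

  private

  def handle(message)
    @log << message
    return unless message[:type] == "purchase"

    @revenue_by_day[message[:day]] += message[:amount_cents]
  end
end

bus = MessageBus.new
collector = Collector.new(bus)
bus.publish(type: "signup",   day: "2014-02-20", user_id: 7)
bus.publish(type: "purchase", day: "2014-02-20", amount_cents: 1_500)
bus.publish(type: "purchase", day: "2014-02-20", amount_cents: 2_500)
puts collector.revenue_by_day["2014-02-20"]
```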
00:20:32.740 In this model, you'll want to build APIs for everything. You never know where the data might be used. Don’t just develop a collection application and storage database for your reporting application. You might discover a need to showcase real-time data on your website, such as how many transactions per minute occur on your site.
00:21:15.790 It's crucial to ensure your APIs lend themselves to novel uses of the data, which is a significant aspect of software development—finding innovative ways to leverage code. Experiment at home, as it's not as daunting as it may seem.
00:21:39.670 Now, let me walk you through an example of a lightweight BI system. I like to give my projects mythological names because when I was in school as an English major, I was fascinated by comparative mythology and religion. I don't find naming a difficult problem, but I know others might question the significance of the names I choose.
00:21:44.970 This is a typical small app ecosystem. On the left side, we have Rails applications, possibly some non-Rails applications. In the middle, we have APIs, a messaging queue, and on the far right, our transactional data store, using PostgreSQL, and an event data store that records everything that happens.
00:22:28.540 Whenever an event is triggered, these APIs send messages to the messaging queue. The event API captures a copy, writing it to the event data store for events like user sign-ups, purchases, account closures, or preference changes. Ultimately, every action in your ecosystem needs to be recorded since data storage is inexpensive.
00:23:05.550 This event model emphasizes storing events rather than business objects. I gave this talk previously at Ruby Midwest and Windy City Rails, where someone approached me saying they attempted to create a similar system but failed. When I probed about their approach, they mentioned trying to alter an existing event, suggesting they were trying to go back in time and change history, a situation famous for causing problems in science fiction.
00:23:41.650 So again, don’t think in terms of business objects; think in terms of events, which do not change. Here’s an example of an event model. In this model, the schema is clearly declared, making it easier to traverse and analyze data.
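One way to picture "events do not change" is an append-only store. The `EventStore` class, the event names, and the fields below are all assumptions for illustration and are not the speaker's implementation. A change of plan is recorded as a new event, and attempts to edit history raise an error.

```ruby
# Hypothetical append-only store: events can be added and read, never edited.
class EventStore
  def initialize
    @events = []
  end

  def append(name:, details:, at: Time.now)
    event = { name: name, details: details.dup.freeze, at: at }.freeze
    @events << event
    event
  end

  # Hand out a frozen copy so callers cannot reorder or remove history.
  def all
    @events.dup.freeze
  end
end

store = EventStore.new
store.append(name: "customer.signed_up", details: { user_id: 7, plan: "basic" })
# A change of plan is a new event, not an edit to the signup.
store.append(name: "customer.changed_plan", details: { user_id: 7, plan: "premium" })

begin
  store.all.first[:name] = "customer.deleted"
rescue FrozenError => e
  puts "history is read-only: #{e.class}"
end
```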
00:24:18.660 Let’s say you create an event labeled as the customer signing up. The application within the ecosystem responsible for triggering this action will label it along with details encapsulated as a hash. You can input any kind of information as key-value pairs, enabling you to search for records with extensive versatility. For example, you could search by user ID, first name, or invoke behaviors based on arbitrary details.
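A small Ruby sketch of a labeled event with a free-form details hash, and a search over arbitrary key-value pairs. The sample events, the field names, and the `find_events` helper are hypothetical; a document store could do this kind of matching on embedded fields for you, but plain Ruby keeps the example self-contained.

```ruby
# Hypothetical events: a label plus a free-form hash of details.
EVENTS = [
  { label: "customer signed up", details: { user_id: 1, first_name: "Ada",   plan: "basic" } },
  { label: "customer signed up", details: { user_id: 2, first_name: "Grace", plan: "premium" } },
  { label: "purchase",           details: { user_id: 1, amount_cents: 1_500 } },
].freeze

# Return the events whose details include every given key-value pair.
def find_events(events, criteria)
  events.select do |event|
    criteria.all? { |key, value| event[:details][key] == value }
  end
end

puts find_events(EVENTS, user_id: 1).map { |event| event[:label] }.inspect
puts find_events(EVENTS, first_name: "Grace").length
```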
00:25:07.420 One of the winning features of this event model is its ability to provide exploratory filters using user-friendly UI elements. On the left side, you can find a variety of filtering options—like searching by first name—allowing you to interact with data, regardless of how the schema is structured.
00:25:52.660 You can save these filters, representing a collection of criteria, and create groups of users based on those matching criteria. This formula applies to events to showcase trends or behaviors, which can be analyzed side by side without introducing any confusing graphs.
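Saved filters and the user groups built from them might look something like the following Ruby sketch. `SavedFilter`, `user_group`, and the sample data are illustrative assumptions, not the system shown in the talk. Each filter is a named set of criteria, and the results are printed as side-by-side columns rather than as a graph.

```ruby
# Hypothetical saved filter: a named set of criteria over event details.
SavedFilter = Struct.new(:name, :criteria) do
  def matches?(event)
    criteria.all? { |key, value| event[:details][key] == value }
  end
end

SAMPLE_EVENTS = [
  { label: "customer signed up", details: { user_id: 1, plan: "basic" } },
  { label: "customer signed up", details: { user_id: 2, plan: "premium" } },
  { label: "customer signed up", details: { user_id: 3, plan: "premium" } },
].freeze

# A group is the set of user IDs whose events match the filter.
def user_group(filter, events)
  events.select { |event| filter.matches?(event) }
        .map { |event| event[:details][:user_id] }
        .uniq
end

filters = [
  SavedFilter.new("Basic plan",   { plan: "basic" }),
  SavedFilter.new("Premium plan", { plan: "premium" }),
]

# Print the groups as columns of data.
filters.each do |filter|
  puts format("%-14s %d users", filter.name, user_group(filter, SAMPLE_EVENTS).size)
end
```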
00:26:30.000 To conclude, business intelligence is emphatically not rocket science, but it does require significant endurance and possibly multiple attempts to master it. If someone suggests to you that it's impossible, consider my experience—I've built such systems successfully.
00:27:06.310 If you are struggling, don't discredit the process; keep persevering. Just like with any programming challenge, you will encounter failures along the way. Use the tools you’re familiar with to extract maximum value, and maintain genuine collaboration with your team throughout this journey.
00:27:31.060 Don’t overlook the potential of iterative development, advocating for small progress steps, questioning all assumptions, and ensuring accuracy in every endeavor. Remember: make friends with your data—the better your understanding, the more value you can deliver to your organization.
00:27:54.530 Thank you very much. You can follow me on Twitter @CoralineAda, and on GitHub. Thank you all for your attention!
Explore all talks recorded at Big Ruby 2014