So who wants to be a Munger?

by Dana Grey

In "So who wants to be a Munger?", presented at LoneStarRuby Conf 2009, Dana Grey discusses why data munging and careful data management matter in Ruby applications. Dana, who moved from a corporate role handling large volumes of data to a Ruby-focused one, stresses the need to organize and manipulate data to meet business demands. She outlines her 'rule of three' for data munging: reading, transforming, and outputting data. Key points include:

Importance of Data Munging: In a data-driven world, companies rely on structured reports and organized data to function efficiently.
The Rule of Three: The process Dana follows includes reading data into a manageable construct, transforming it (or 'munging'), and outputting it in a comprehensible format.
Separation of Reading and Munging: Emphasizing the distinction between reading data and munging it, Dana illustrates that maintaining separation allows for greater flexibility in data handling.
Creating a Munger Class: Dana discusses developing a Munger class to streamline the input and output processes, enhancing control over data manipulation.
Types of Data: The presentation addresses the differences between structured and unstructured data, with a focus on the challenges of managing unstructured information.
Real-World Example: She shares her experience working with a complicated report containing repetitive headers and misaligned columns. Dana describes strategies to clean and manage this data effectively, showcasing the need for thoughtful data processing.
Using Methods for Data Management: Methods such as 'unpack' extract and organize the relevant fields from fixed-width data, which makes later analysis easier.

In conclusion, Dana argues that the right processes, applied with a language like Ruby, significantly ease the burden of handling data. Effective data munging improves clarity and helps clients draw actionable insights from their data. Understanding data structures and how to manage them emerges as a vital skill in data-dependent businesses.

00:00:21.680 Okay, so who am I? I'm kind of a new face to Ruby. I'm Dana, just Dana—I don't get a last name, so y'all better remember that. Before I came to Ruby, I spent eight years in the corporate world.

00:00:30.240 I was responsible for handling an incredibly large amount of ridiculous data. I did that every day for eight years and decided I had enough of that, so I left the corporate world to come to Ruby, where I get to manage a ridiculous amount of insane data every day. I develop Rails applications; some of you might know my wacky husband, James Gray, who may also be known as Jake. He's just some guy.

00:00:48.480 So, why is data munging important? I mean, it can seem like a boring task, and it’s one of those insidious things that you have to do over and over again. The fact is, we live in a very data-driven society, and companies thrive on reports—they just love data. If you don't provide them with enough data, they start to go into some kind of weird data shock and keep coming back for more.

00:01:07.740 Clients have data that they want to access; they need it to be organized in databases or some other kind of structured format. So, it’s really important to know what data you have, what you need to do with it, and how you need to get it there.

00:01:23.939 There are kind of three parts to this. The more you know about your information, the more you understand your final output and what you need to do with it. This knowledge allows you to manipulate that data easily, which can make your life a lot simpler.

00:01:37.920 I'm going to discuss the process that I live by after years of working with data. I use the rule of three in data munging: you read data into some construct—and for me, I generally follow the principle that if it can be understood, you’re good. Then, I transform the data—I mung it, mix it up, and change it around. Finally, I output that transformed data using a format that can be understood.

00:01:57.360 So, that's my rule of thumb for managing and handling data. We're going to start with rule number one: reading, which is really the hardest concept to grasp.

00:02:10.020 Here's a really basic and simple munging script that includes all three parts: the output file, the input file, and the transformation. This script opens a file to write to, opens a file to read from, capitalizes the information, and then outputs it back out. While it's simple, it can get complicated when your client or boss comes back and says, 'Oh yeah, we got this other report that doesn’t look anything like the one you just worked on. We need it in this format, and oh, by the way, the format sucks. You destroyed this program, and now it doesn’t work.' So you have to write another one.
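A script like the one described might look roughly like the sketch below. The method name `naive_munge` and the use of `upcase` for "capitalizes" are assumptions made for illustration; the slide itself is not reproduced in the transcript.

```ruby
# Hypothetical one-shot munging script, wrapped in a method so it can be
# pointed at any pair of paths. Reading, transforming, and writing are all
# tangled together in a single loop.
def naive_munge(in_path, out_path)
  File.open(out_path, "w") do |out|        # open a file to write to
    File.open(in_path) do |input|          # open a file to read from
      input.each_line do |line|
        out.puts line.upcase               # transform and output in one step
      end
    end
  end
end
```

Because the three steps share one loop, a new report format or a new destination means rewriting the whole thing.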

00:02:56.459 The first rule is: don't confuse reading with munging. They are not the same thing, and they should be separate processes. Reading information means you may or may not be reading data from the same kind of file. You could have various files that pull data from different places into one. You then transform it and spit it into one file, two files, or a database—whatever your output construct might be. By separating reading and munging, you gain the ability to handle these two ideas independently.

00:03:37.980 To make it simpler, I will create a method called munge, which is my actual transformation. I will pass in my input and output. I won't define those yet, but I will munge it. Since I control the munging process, I know what the input is going to be and how the output will work, allowing me to add each input to my transformation.

00:04:07.980 The flexibility I've gained becomes clear. I've got exactly the same task happening here in two different ways: the first one prints it out to the screen, while the second one writes to a file. I didn’t have to rewrite my munging method. All of a sudden, I've gained a lot of flexibility.
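A minimal sketch of that idea, assuming the transformation is an upcase and that the output only needs to respond to `<<` (both are guesses, since the slides are not part of the transcript):

```ruby
require "stringio"

# Hypothetical transformation: upcase each line and append it to the output.
# `input` only needs to respond to #each; `output` only needs to respond to #<<.
def munge(input, output)
  input.each { |line| output << line.upcase }
  output
end

lines = ["i love ruby\n", "munging is fun\n"]

# Same transformation, two destinations: the screen ...
munge(lines, $stdout)

# ... and an in-memory buffer standing in for a file opened for writing.
buffer = StringIO.new
munge(lines, buffer)
```

The `munge` method never changes; only the objects passed in do. An open `File`, a `StringIO`, or a plain `Array` would all work as the output here.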

00:04:54.780 Since I control my system, each input doesn't have to be about writing to a file. I love Ruby! In fact, I didn't write that to a file; and if that's not interesting enough, I also work with Java.

00:05:40.020 You can control your input and output processes. You can write outputs as you need. Now, let’s take it a little further—let’s reach our ultimate munging power. I'm going to make a class called Munger and I’ll pass in that class both input and output, defining them later.

00:06:11.520 In my munge method, I’ll yield to this method what I want to do. I'll check to make sure that the input isn’t nil, and I shouldn't have to return to this munge method again, because I'm yielding my block to this method. You can then see how it plays out in action. I create a new Munger object, pass in the file I want to read from, and the output file I want to write to. I call munge on that object, and in this case, I'm manipulating the data in that file.
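One way such a class could be written is sketched below. The nil check is interpreted here as skipping nil records, which is an assumption, and in-memory `StringIO` objects stand in for the files mentioned in the talk.

```ruby
require "stringio"

# Hypothetical Munger: holds an input and an output, and yields each record
# to a block so the caller supplies the transformation.
class Munger
  def initialize(input, output)
    @input  = input    # anything that responds to #each
    @output = output   # anything that responds to #<<
  end

  def munge
    @input.each do |record|
      next if record.nil?          # assumed meaning of "input isn't nil"
      @output << yield(record)     # the block does the actual munging
    end
    @output
  end
end

input  = StringIO.new("alpha\nbeta\n")
output = StringIO.new
Munger.new(input, output).munge { |line| line.upcase }
puts output.string
```

Since the transformation lives in the block, the class itself should not need to change when the munging rules do.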

00:06:42.840 Now, let’s talk about data for a moment. There are so many different kinds of data that I could spend an hour just discussing the data structures you might encounter. But I don’t want to bore you to sleep before dinner, so let’s acknowledge that there are essentially two massive types of data: structured and unstructured.

00:07:21.760 Structured data is record-oriented, and that, in modern society, is the most common type you see—stuff from databases, Excel, etc. Unstructured data exists, but it’s often more challenging to work with because some rules you expect to be true may not hold. When discussing data and pattern matching, understand that about ninety percent of all munging involves pattern matching, which is why I enjoy data munging. I have fun playing with regular expressions.

00:08:02.699 Moving forward, let’s consider a real-world example: a small subset of a 55,000-line report that I used to work with daily in my previous job. Names have been changed to protect the innocent, but this is essentially the data I was dealing with.

00:08:15.300 At first glance, this file appears structured, with columns and headers. However, two headers repeat on each page, and we need to manage that. It’s not too bad at first, right? But then it becomes clear that the columns don’t align perfectly; we have these hierarchical categories that will require attention.

00:09:00.779 To make matters worse, the report doesn’t format thousands correctly, and the columns are too wide, resulting in letters appearing where numbers should be. Now, I can't total it because it's not an integer. So, this is really ugly data. If I said I had to type this data into Excel because my IT department wouldn't provide me with an FTP program, would anyone run screaming from the room? I certainly did for the first year.

00:09:51.720 Eventually, I told them they must get me an FTP program so I could play with it in Excel. I have my Munger ready, and I won’t worry about that part anymore. Now, I need to create a reader for this data. That’s a real chore, but it isn’t impossible. The trick is to break it down and determine what I care about in this report and what I can safely ignore.

00:10:14.039 To start, I can ignore anything with a report total or a subtotal, any blank lines without data, and any line that begins with a dash. This is useless information; we can discard it.
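That filtering step could be sketched as a single predicate. The method name `ignorable?`, the patterns, and the sample lines are all invented for illustration; the actual report is not shown in the transcript.

```ruby
# Hypothetical predicate for lines the reader can discard. The patterns are
# guesses at the report layout, not the real report from the talk.
def ignorable?(line)
  line.strip.empty? ||                         # blank line
    line.match?(/report total|subtotal/i) ||   # total or subtotal row
    line.start_with?("-")                      # line beginning with a dash
end

report = [
  "WIDGETS      100",
  "",
  "----------------",
  "SUBTOTAL     100",
  "GADGETS      250",
  "REPORT TOTAL 350"
]

kept = report.reject { |line| ignorable?(line) }
# kept => ["WIDGETS      100", "GADGETS      250"]
```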

00:10:59.760 And wow! Suddenly, we have a much cleaner piece of data to analyze. While we stripped away quite a bit, the headers remain problematic—they still repeat and we need to extract categorical information from them as well. This indicates that we have two processes required: one to eliminate the headers and one to pull data from them.

00:11:40.600 To tackle this, I'd like to introduce you to the 'unpack' method. It's designed to break up binary data, but it’s fantastic for managing this kind of data since much of it is fixed width. Essentially, 'unpack' takes in a format string, allowing you to tell it what your data looks like.

00:12:28.060 For example, the format string could describe a snack like ‘cookies and cream,’ where there are seven ASCII characters followed by a skip space, followed by three ASCII characters, then another skip space, and followed by five ASCII characters. This method returns an array containing your headers.
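In Ruby's `String#unpack`, that description corresponds to the format string `"A7xA3xA5"`: `A` reads a run of characters and strips trailing spaces, and `x` skips one byte. A small sketch follows; the constant name and the second sample record are made up for illustration.

```ruby
# Seven characters, skip one, three characters, skip one, five characters.
SNACK_FORMAT = "A7xA3xA5"

fields = "cookies and cream".unpack(SNACK_FORMAT)
# fields => ["cookies", "and", "cream"]

# The same idea applied to a hypothetical fixed-width record: a 10-character
# name column followed by a 5-character quantity column.
name, quantity = "Widgets   00042".unpack("A10A5")
# name => "Widgets", quantity => "00042"
```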

00:13:16.740 To implement my reader class here, I’d add a couple of variables: headers and format. I’ll also need a method to parse the headers. To do this, I’ll pass in the first four lines of my report, knowing they don't ever change. I can count on this consistent structure to extract the data I need.

00:14:02.620 From there, I create a format for unpack to pull in the category data effectively. This allows me to gather headers into an array that I can use later. However, this doesn’t eliminate the headers at this point—I still have to tackle them to avoid them reappearing.
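A rough sketch of such a reader is shown below. Everything specific in it is an assumption: the class name `ReportReader`, the layout of the four header lines, the choice of the third line as the one holding the column titles, and the format string.

```ruby
# Hypothetical reader holding the two pieces of state mentioned in the talk:
# the parsed headers and the unpack format.
class ReportReader
  attr_reader :headers

  def initialize(format)
    @format  = format   # unpack format describing the fixed-width columns
    @headers = []
  end

  # Takes the first four lines of the report and unpacks the line that holds
  # the column titles (assumed here to be the third one).
  def parse_headers(lines)
    @headers = lines[2].unpack(@format)
  end
end

# Invented header block: title line, blank line, column titles, rule line.
first_four = [
  "-R SALES REPORT",
  "",
  "REGION".ljust(11) + "PRODUCT".ljust(11) + "AMOUNT",
  "-" * 28
]

reader = ReportReader.new("A10xA10xA6")
reader.parse_headers(first_four)
# reader.headers => ["REGION", "PRODUCT", "AMOUNT"]
```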

00:14:50.679 To manage this, I will set a flag to indicate whether I am currently within a header. Given that the data's structure is static, I can identify the start of a header by looking for a line beginning with ‘dash R’ and stay in header mode until I reach a line that is all dashes. This logic lets me exit header mode and return to processing actual data.
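The header-mode logic could be sketched as follows, assuming that 'dash R' means a line starting with the literal characters `-R` and that the closing line consists only of dashes. The method name and sample lines are invented.

```ruby
# Hypothetical header skipper: drops every line from a "-R" line through the
# next line made up entirely of dashes.
def strip_headers(lines)
  in_header = false
  lines.reject do |line|
    if in_header
      in_header = false if line.strip.match?(/\A-+\z/)  # rule line ends the header
      true                                              # drop it, rule line included
    elsif line.start_with?("-R")
      in_header = true                                  # entering a header block
      true
    else
      false                                             # ordinary data line, keep it
    end
  end
end

pages = [
  "-R PAGE 1",
  "REGION PRODUCT",
  "--------------",
  "east widgets 10",
  "-R PAGE 2",
  "REGION PRODUCT",
  "--------------",
  "west gadgets 20"
]

data = strip_headers(pages)
# data => ["east widgets 10", "west gadgets 20"]
```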

00:15:36.540 Now I can process the rest of my report and, boom, I’ve eliminated a significant amount of irrelevant data, resulting in something I can genuinely work with. But this is merely one of the two parts needed to address our header issue; the second part involves dealing with categories.

LoneStarRuby Conf 2009