00:00:21.680
Okay, so who am I? I'm kind of a new face to Ruby. I'm Dana, just Dana—I don't get a last name, so y'all better remember that. Before I came to Ruby, I spent eight years in the corporate world.
00:00:30.240
I was responsible for handling an incredibly large amount of ridiculous data. I did that every day for eight years and decided I had enough of that, so I left the corporate world to come to Ruby, where I get to manage a ridiculous amount of insane data every day. I develop Rails applications; some of you might know my wacky husband, James Gray, who may also be known as JEG2. He's just some guy.
00:00:48.480
So, why is data munging important? I mean, it can seem like a boring task, and it’s one of those insidious things that you have to do over and over again. The fact is, we live in a very data-driven society, and companies thrive on reports—they just love data. If you don't provide them with enough data, they start to go into some kind of weird data shock and keep coming back for more.
00:01:07.740
Clients have data that they want to access; they need it to be organized in databases or some other kind of structured format. So, it’s really important to know what data you have, what you need to do with it, and how you need to get it there.
00:01:23.939
There are kind of three parts to this. The more you know about your information, the more you understand your final output and what you need to do with it. This knowledge allows you to manipulate that data easily, which can make your life a lot simpler.
00:01:37.920
I'm going to discuss the process that I live by after years of working with data. I use the rule of three in data munging. First, you read data into some construct; for me, the principle is that if it can be understood, you're good. Then I transform the data: I munge it, mix it up, and change it around. Finally, I output that transformed data in a format that can be understood.
00:01:57.360
So, that's my rule of thumb for managing and handling data. We're going to start with rule number one: reading, which is really the hardest concept to grasp.
00:02:10.020
Here's a really basic and simple munging script that includes all three parts: the output file, the input file, and the transformation. This script opens a file to write to, opens a file to read from, capitalizes the information, and then outputs it back out. While it's simple, it can get complicated when your client or boss comes back and says, 'Oh yeah, we got this other report that doesn’t look anything like the one you just worked on. We need it in this format, and oh, by the way, the format sucks. You destroyed this program, and now it doesn’t work.' So you have to write another one.
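The slide itself is not part of the transcript, so here is a minimal sketch of what a script like the one described might look like. The method name `capitalize_file` and the file paths are hypothetical; the point is that reading, transforming, and writing are all tangled together in one place.

```ruby
# Hypothetical reconstruction of a "basic munging script": everything in one step.
def capitalize_file(in_path, out_path)
  File.open(out_path, "w") do |output|    # open a file to write to
    File.open(in_path) do |input|         # open a file to read from
      input.each_line do |line|
        output.puts line.capitalize       # transform and write at the same time
      end
    end
  end
end
```

Because the file handling and the transformation live in the same method, a new report format or a new destination means writing the whole thing again.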
00:02:56.459
The first rule is: don't confuse reading with munging. They are not the same thing, and they should be separate processes. Reading information means you may or may not be reading data from the same kind of file. You could have various files that pull data from different places into one. You then transform it and spit it into one file, two files, or a database—whatever your output construct might be. By separating reading and munging, you gain the ability to handle these two ideas independently.
00:03:37.980
To make it simpler, I will create a method called munge, which is my actual transformation. I will pass in my input and output. I won't define those yet, but I will munge it. Since I control the munging process, I know what the input is going to be and how the output will work, allowing me to add each input to my transformation.
00:04:07.980
The flexibility I've gained becomes clear. I've got exactly the same task happening here in two different ways: the first one prints it out to the screen, while the second one writes to a file. I didn’t have to rewrite my munging method. All of a sudden, I've gained a lot of flexibility.
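A rough sketch of that idea, assuming the input is anything that responds to `each` and the output is anything that responds to `<<`. The sample text and the use of `StringIO` as a stand-in for real files are assumptions made so the example runs on its own.

```ruby
require "stringio"

# Transformation only: it does not know or care where lines come from or go to.
def munge(input, output)
  input.each do |line|
    output << line.capitalize
  end
end

report = "first line\nsecond line\n"   # hypothetical sample data

# Same munge, printed to the screen.
munge(StringIO.new(report), $stdout)

# Same munge, collected in memory. A File opened for writing responds to <<
# as well, so it could be passed here instead.
buffer = StringIO.new
munge(StringIO.new(report), buffer)

# Same munge, collected into an Array, since Array also responds to <<.
lines = []
munge(StringIO.new(report), lines)
```

The `munge` method is written once; only the objects passed in change.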
00:04:54.780
Since I control my system, each input doesn't have to be about writing to a file. I love Ruby! In fact, I didn't write that to a file; and if that's not interesting enough, I also work with Java.
00:05:40.020
You can control your input and output processes. You can write outputs as you need. Now, let’s take it a little further—let’s reach our ultimate munging power. I'm going to make a class called Munger and I’ll pass in that class both input and output, defining them later.
00:06:11.520
In my munge method, I'll yield to a block that says what I want to do. I'll check to make sure that the input isn't nil, and I shouldn't have to touch this munge method again, because the transformation now lives in the block I pass to it. You can then see how it plays out in action. I create a new Munger object, pass in the file I want to read from, and the output file I want to write to. I call munge on that object, and in this case, I'm manipulating the data in that file.
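One way such a class might look, as a sketch rather than the code from the slides. The exact form of the nil check is an assumption, and `StringIO` objects stand in for the input and output files so the example is self-contained.

```ruby
require "stringio"

class Munger
  def initialize(input, output)
    @input = input
    @output = output
  end

  # Yields each input line to the caller's block and writes the block's result.
  def munge
    return if @input.nil?          # assumed form of the "input isn't nil" check
    @input.each do |line|
      @output << yield(line)
    end
  end
end

# Hypothetical usage: the block carries the transformation.
source = StringIO.new("alpha\nbeta\n")
target = StringIO.new
Munger.new(source, target).munge { |line| line.upcase }
```

A different transformation only needs a different block; the class itself stays untouched.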
00:06:42.840
Now, let’s talk about data for a moment. There are so many different kinds of data that I could spend an hour just discussing the data structures you might encounter. But I don’t want to bore you to sleep before dinner, so let’s acknowledge that there are essentially two massive types of data: structured and unstructured.
00:07:21.760
Structured data is record-oriented, and that, in modern society, is the most common type you see—stuff from databases, Excel, etc. Unstructured data exists, but it’s often more challenging to work with because some rules you expect to be true may not hold. When discussing data and pattern matching, understand that about ninety percent of all munging involves pattern matching, which is why I enjoy data munging. I have fun playing with regular expressions.
00:08:02.699
Moving forward, let's consider a real-world example: a small subset of a 55,000-line report that I used to work with daily in my previous job. Names have been changed to protect the innocent, but this is essentially the data I was dealing with.
00:08:15.300
At first glance, this file appears structured, with columns and headers. However, two headers repeat on each page, and we need to manage that. It’s not too bad at first, right? But then it becomes clear that the columns don’t align perfectly; we have these hierarchical categories that will require attention.
00:09:00.779
To make matters worse, the report doesn’t format thousands correctly, and the columns are too wide, resulting in letters appearing where numbers should be. Now, I can't total it because it's not an integer. So, this is really ugly data. If I said I had to type this data into Excel because my IT department wouldn't provide me with an FTP program, would anyone run screaming from the room? I certainly did for the first year.
00:09:51.720
Eventually, I told them they must get me an FTP program so I could play with it in Excel. I have my Munger ready, and I won’t worry about that part anymore. Now, I need to create a reader for this data. That’s a real chore, but it isn’t impossible. The trick is to break it down and determine what I care about in this report and what I can safely ignore.
00:10:14.039
To start, I can ignore anything with a report total or a subtotal, any blank lines without data, and any line that begins with a dash. This is useless information; we can discard it.
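A minimal sketch of that filtering step using regular expressions. The report itself is not shown in the transcript, so the patterns below (including the literal wording "report total" and "subtotal") and the sample lines are assumptions.

```ruby
# Hypothetical patterns for lines that carry no usable data.
IGNORE_PATTERNS = [
  /\A\s*\z/,        # blank lines
  /\A-/,            # lines that begin with a dash
  /report total/i,  # report total lines (assumed wording)
  /subtotal/i       # subtotal lines (assumed wording)
].freeze

# True when the line matches none of the ignore patterns.
def keep?(line)
  IGNORE_PATTERNS.none? { |pattern| line =~ pattern }
end

sample = [
  "Widgets        100\n",
  "\n",
  "------------------\n",
  "Subtotal       100\n",
  "Report Total   100\n"
]
kept = sample.select { |line| keep?(line) }
```

Keeping the patterns in one list makes it easy to add another rule when the next odd line shows up.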
00:10:59.760
And wow! Suddenly, we have a much cleaner piece of data to analyze. While we stripped away quite a bit, the headers remain problematic—they still repeat and we need to extract categorical information from them as well. This indicates that we have two processes required: one to eliminate the headers and one to pull data from them.
00:11:40.600
To tackle this, I'd like to introduce you to the 'unpack' method. It's designed to break up binary data, but it's fantastic for managing this kind of data since much of it is fixed width. Essentially, 'unpack' takes in a format string, allowing you to tell it what your data looks like.
00:12:28.060
For example, the format string could describe a snack like ‘cookies and cream,’ where there are seven ASCII characters followed by a skip space, followed by three ASCII characters, then another skip space, and followed by five ASCII characters. This method returns an array containing your headers.
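In Ruby that description corresponds to the format string `"A7xA3xA5"`, where `A` reads space-padded characters and `x` skips one byte. The second example uses a hypothetical fixed-width row to show that `A` also strips trailing padding.

```ruby
# The example from the talk: 7 characters, skip 1, 3 characters, skip 1, 5 characters.
fields = "cookies and cream".unpack("A7xA3xA5")
p fields  # => ["cookies", "and", "cream"]

# Hypothetical fixed-width row: a 10-character name column, then a 5-character amount.
row = "Widgets   00042"
name, amount = row.unpack("A10A5")
p name    # => "Widgets"
p amount  # => "00042"
```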
00:13:16.740
To implement my reader class here, I’d add a couple of variables: headers and format. I’ll also need a method to parse the headers. To do this, I’ll pass in the first four lines of my report, knowing they don't ever change. I can count on this consistent structure to extract the data I need.
00:14:02.620
From there, I create a format for unpack to pull in the category data effectively. This allows me to gather headers into an array that I can use later. However, this doesn’t eliminate the headers at this point—I still have to tackle them to avoid them reappearing.
00:14:50.679
To manage this, I will set a flag to indicate whether I am currently within a header. Given that the data's structure is static, I can start to identify headers by observing lines beginning with ‘dash R’ and continue until I reach a line that is made up entirely of dashes. This logic lets me exit header mode and return to processing actual data.
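The header-skipping logic might be sketched like this. It assumes "dash R" means a line starting with the literal characters `-R`, and the sample lines are invented for illustration; the real report layout is not shown in the transcript.

```ruby
# Drops every header block: from a line starting with "-R" through the
# next line that consists only of dashes.
def strip_headers(lines)
  in_header = false
  lines.reject do |line|
    if in_header
      in_header = false if line =~ /\A-+\s*\z/  # closing rule ends header mode
      true                                      # still part of the header, drop it
    elsif line.start_with?("-R")
      in_header = true                          # header begins here
      true
    else
      false                                     # ordinary data line, keep it
    end
  end
end

sample = [
  "-REPORT X\n",
  "Page 1\n",
  "Name     Amount\n",
  "---------------\n",
  "Widgets     100\n",
  "-REPORT X\n",
  "---------------\n",
  "Gadgets     200\n"
]
data = strip_headers(sample)
```

In this sketch the header check has to run before any rule that discards lines beginning with a dash, otherwise the start and end markers would be gone before they are seen.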
00:15:36.540
Now I can process the rest of my report and, boom, I’ve eliminated a significant amount of irrelevant data, resulting in something I can genuinely work with. But this is merely one of the two parts needed to address our header issue; the second part involves dealing with categories.