Talks

Beneath the Surface: Regular Expressions in Ruby

Beneath the Surface: Regular Expressions in Ruby

by Nell Shamrell

In the video titled Beneath the Surface: Regular Expressions in Ruby, Nell Shamrell addresses the often intimidating topic of regular expressions (regex) and their integration with Ruby programming. This talk is geared towards demystifying regex by revealing its underlying mechanics and empowering developers to confidently implement regex in their code. Key points discussed include:

  • Understanding Regex: Many developers find regex cryptic and fearsome. Nell relates her personal journey of overcoming this intimidation by learning how regex works logically, helping her harness its capabilities effectively.
  • The Harmony of Ruby and Regex: The presentation emphasizes how Ruby integrates with the Oniguruma and Onigmo regex engines, explaining the parsing of regex into an abstract syntax tree (AST) that is easier for these engines to understand.
  • Finite State Machines (FSM): Nell introduces the concept of FSM, describing how regex travels through character states to match strings. She presents examples, such as a regex to match the word 'force'. This exploration illustrates how regex operates sequentially through states based on character-to-character matching.
  • Advanced Regex Features: The discussion moves into complex regex functionalities like alternation, quantifiers (greedy, lazy, and possessive), and how they affect matching processes. For example, it contrasts greedy quantifiers, which will match as much as possible, with lazy quantifiers, which stop at the first viable match, as well as possessive quantifiers that do not backtrack.
  • Practical Application: Nell shares her experience creating a regex to convert snake_case strings into camelCase, illustrating the iterative process of developing a regex through examples and testing. This section demonstrates the practical utility of regex in real-world coding scenarios.
  • Learning Approach: To handle regex development efficiently, Nell advises using tools like Rubular for testing and iterating on regex designs. She encourages developers to document and test their regex in small pieces before combining them into complete expressions.

In conclusion, Nell reaffirms that while regex is a challenging aspect of programming, gaining a deeper understanding of its core principles through practical application can transform it from a source of fear into a powerful tool. She invites developers to engage with regex through experimentation, positioning it as a valuable skill in their programming arsenal.

00:00:21.760 Okay, so who thinks they understand regular expressions? Who has no clue about regular expressions? Alright, so for our next talk, we have Nell Shamrell speaking about regular expressions. Nell is an engineer at Blue Box Group in Seattle. She's been doing Ruby for about two and a half years, but she's been working with regular expressions for twice that time. I want to thank Mel for being our backup speaker, covering us. Thank you very much for that, Nell. I saw her do a version of this talk last month in Austin at Lone Star and I liked it a lot.
00:01:10.960 So, Nell, take it away. Thank you.
00:01:22.320 My name is Nell, and I have been intimidated by regular expressions. Has anyone else here felt intimidated by them? There we go, I saw just about every hand in that audience go up. I used to look at a regular expression like this and feel a sense of dread in my heart. Now, what this regular expression does is validate Visa credit card numbers, and once I knew that context, I could see the clues and perhaps get an idea of what was going on. But I had no idea how I would ever write something like this.
00:01:40.560 I dreaded whenever I was forced to use a regular expression in my code. Sure, I could vaguely figure out something like a phone number validation or maybe an email validation, but anything beyond that seemed completely out of reach for me. It's human nature to fear what we don't understand. Now, it took time, but once I understood how regular expressions actually work, how they seem to do that magic of finding matches in my string, I realized it was simply a process—a logical process that I could grasp.
00:02:07.200 Then, I knew how to use a regular expression without fear, how to harness their power to match exactly what I wanted in exactly the way I wanted to. I'm here today to share this knowledge with you and help you move beyond your fear by understanding how regular expressions work beneath the surface. When it comes to regex, knowledge truly is power. Today, I'm going to show you how that power can be yours.
00:02:25.040 Ruby and regular expressions work together in harmony, in a symphony of code. If I was going to learn regular expressions anywhere, I'm glad I really came to understand them by using Ruby. What we see in Ruby, however, when we use methods like the match method, is just the tip of a very large iceberg. A lot more goes on beneath the surface in the enigma of the regular expression engine.
00:02:39.440 Let's take a journey beneath that surface. The Onigmo regex engine was introduced in Ruby 2.0. Ruby passes regular expressions and strings to Onigmo, which handles the actual matching. Now, Enigma is actually a fork of the Oniguruma regex engine used in Ruby 1.9. Both of these engines provide all the standard regex features you'll find in any engine, but they handle multibyte characters, such as Japanese text, very well. Now, Enigma added some new features that were introduced with Perl 5. Patrick Shaughnessy has a fantastic article entitled "Exploring Ruby's Regular Expression Algorithm." I'll be tweeting out a link to all the resources I used in this presentation after I finish up, and I encourage you to check it out.
00:03:26.959 Now in this article, he lays out the workflow of Onigmo. When Ruby first passes a regex to Enigma, Enigma reads that regex and parses it into an abstract syntax tree (AST). An abstract syntax tree simply represents some code, in our case a regular expression, in a tree form that is easier for Onigmo to compile. Onigmo then compiles this tree into a series of instructions for the regex engine to execute. Now these instructions can be represented by a finite state machine.
00:04:06.400 Now, what on earth is that? A finite state machine is a mathematical model that shows how something works. It's kind of like a diagram or a map that shows how something goes from one state to another. Let’s clarify this with an example. So let's go ahead and create one. I’m going to create a finite state machine representing a dog—specifically, my parent's dog, Annie. She's a very cute Irish Terrier-Whippet mix and, like most dogs, loves to go in and out of the house all day, every day.
00:04:52.160 Each of these two circles—these two nodes—represent a state that Annie can be in at any given time. She can either be in the state of being in the house or she can be in the state of being out of the house. So, how does she get from one state to another? Well, if she's in the state of being in the house, she can go through her doggy door and transition to the state of being out of the house. Once she gets bored being out of the house, she can go back through that doggy door and be in the state of being in the house.
00:05:47.360 So that's an example of a finite state machine. But even with this example, just the terms 'finite state machine' are quite a mouthful. Let's break them down. The 'machine' in our example is Annie the dog. 'State' means we're modeling a state that the machine can be in. In the case of Annie, she can either be in the state of being in the house or she can be in the state of being out of the house. 'Finite' means there are a limited number of states our machine can be in. States are often limited by physical reality; Annie really can only be in so many states. She can't suddenly be in space or suddenly be underneath the ocean, as much as she might like to try.
00:06:29.600 In a computer, the memory is also not infinite. There's only so much it can process before it crashes. Therefore, the number of states a computer process can be in is limited by physical memory. Now, before I move on, I want to mention that like many dogs, Annie loves to stand halfway in the house and halfway out of the house at the same time. When she does this, she's actually in multiple states simultaneously.
00:07:02.800 There are ways a computer process can be in multiple states simultaneously as well. Now, this is outside the scope of this presentation, but the article 'Regular Expression Matching Can Be Fun and Fast' by Russ Cox, which I'll also link to in my resource notes, provides much more information on an algorithm that allows for this. I highly encourage you to check it out.
00:07:14.240 So let's make a finite state machine for this regex. This regex looks for the word 'force' in any string that I pass to it. When I use a regex in Ruby, I declare my regular expression and my string, which in this case is 'use the force.' I then call match on my regex and pass it that string. After Onigmo reads the regex, it parses it into that abstract syntax tree and compiles it into that series of instructions.
00:08:02.560 So my finite state machine looks like this: A regular expression tries to match a string one character at a time, starting with the very first character in the string. So the first character it's going to see is that capital letter 'U.' Now that doesn't match the path to the next state, so it stays on that starting state. Next, it sees the character lowercase 's.' It still doesn't match, so it stays in that starting state.
00:08:43.680 Now it's going to go through a few different characters. So let’s go ahead and fast forward until we come to that lowercase 'f' in our string. This is where things start to get interesting. A character in my string matches that path, and my finite state machine can now move on to the next state. Then it sees the lowercase 'o,' and sure enough, that matches the path to the next state; same with the 'r,' the 'c,' and the 'e.' And we have a match!
00:09:04.080 Once my finite state machine reaches that final matching state, that means my match was successful. Back in Ruby, Onigmo will pass the information back to Ruby, and Ruby will return a match data object containing our match, which is the word 'force.' Now that was a pretty simple example.
00:09:32.560 Let's try something a little more complicated. Let's try a regular expression that uses alternation. This regular expression will match the capital letter 'Y' followed by the characters 'olk' or the characters 'oda.' I'm providing two alternate ways my regular expression can find a match. In Ruby, I'm going to declare my regex, declare my string, then call match on that string. My string is the word 'Yoda,' and this time my finite state machine looks a little bit different.
00:10:01.760 There are two paths that can lead to a successful match. After it matches that 'Y' in my string, it now has to make a choice: which path should it try first? In the case of alternates, a regex engine will always try the leftmost alternate first. But before it tries that 'olk' path, it's going to save both the point in the string where it is and the state it's on to what's called the backtrack stack.
00:10:27.680 Every time my regex chooses one path over the other, it saves the string and the state just in case the match fails and needs to go back and try the other choice. It's kind of like a 'choose your own adventure;' it's marking a place it could come back to, and it's a good thing it did. As soon as it sees that 'd' in the string, it has no way to get from its current state to the matching state.
00:10:51.040 Now normally, having no path to that finishing matching state would cause my regex to fail. However, because it has something in the backtrack stack, it can backtrack to the point where it shows which path to follow and try the other one. This time, things go better. After it matches that 'o,' it's now able to match the 'd' and then the 'a.' And hurrah! This time we have a match. Back in Ruby, I'll get back my match data object, which contains my match, which is the entire string in this case: 'Yoda.'
00:11:24.560 Now, finite state machines become even more interesting when we use quantifiers. It's easy to look at a regular expression like this with our human brains and see the word 'no' followed by a plus sign. However, Onigmo sees this as the capital letter 'n,' then a lowercase 'o,' and then a plus sign quantifier. That plus sign quantifier means that lowercase 'o' needs to appear one or more times.
00:11:56.640 In Ruby, I'll declare my regex and my string. This time, the string is the word 'no,' famously yelled by Luke Skywalker in 'The Empire Strikes Back.' Exactly that! That was a much better imitation out there in the audience. I'm going to call match on my regex and pass it my string. Now, this is what my finite state machine looks like: It's pretty simple at first; it matches that capital 'N,' then matches that first 'o,' and then my regex has a dilemma.
00:12:50.960 It could return the match here, but it sees another lowercase 'o.' It could also follow that curved 'o' path and loop back to the state matching more of the string. So, which should it choose? It keeps looping and matches that second 'o,' then matches that third 'o,' then the fourth 'o,' and the fifth 'o,' which is the last 'o' in the string, and we have a match! Back in Ruby, I'll get back my match data object containing my match, which is the entire string: all five of those lowercase 'o's.
00:13:34.800 But what if I want the opposite behavior? What if I want to match as little of the string as possible? I would use a lazy quantifier. Lazy quantifiers do the opposite of greedy quantifiers; they match the least number of characters possible. A lazy quantifier uses minimum effort for minimum return—they’re lazy. They do just enough to get the job done and then they stop.
00:14:07.680 I make a quantifier lazy by adding a question mark right after it. The plus sign is the quantifier and the question mark is a modifier on that quantifier that makes the quantifier lazy. So, back in Ruby, I'm going to call match on my regex and pass it my string. It's going to start off the same: it will match that capital 'N,' then match the lowercase 'o,' and then it has that same choice: should it keep looping and match more of the string, or should it return the viable match it has now?
00:14:36.080 Well, because this is a lazy quantifier, it chooses to go ahead and return that match. Back in Ruby, I get my match back, and this time it's only the capital letter 'N' and one lowercase 'o.' This choice—whether to keep looping or to return a match when you have the first viable one—is the essence of greedy and lazy quantifiers. The difference between them is that a greedy quantifier will always choose to keep looping when able, and a lazy quantifier will always return the match as soon as it has a viable one.
00:15:32.720 Even though greedy quantifiers are greedy, they're also reasonable. If a greedy quantifier matches an extra character, but then that character is needed later in the regex to make a successful match, it will give that character back. It will always choose making an overall match over holding onto any extra characters. Now let’s try another example, but this time I'm going to use the star quantifier.
00:16:05.280 Now, I should say briefly that you should always use the star quantifier with caution. When it appears after a character, it means that character can appear any number of times. And any number of times includes zero times. In this example, I placed the star quantifier right after a dot meta character. The dot meta character matches any character, so this regex will match any character appearing any number of times followed by the word 'moon.'
00:16:37.520 In Ruby, I declare my regex and I declare my string. In this case, it's the line 'that's no moon' from 'A New Hope.' Then I call match on my regex and pass it that string. In my finite state machine, it first sees the capital 'T' in my string. That matches the dot meta character path—the any character path—so it's able to move on to the next state.
00:16:58.560 Now, thank you very much! Can we have a round of applause for the sound person? He's been doing a great job this entire time. Once I get to this state, there are two paths it can take. It finds a lowercase 'm' in the string. If it finds a lowercase 'm' in the string, it can follow the path to the next state, or if it finds any character at all, it can follow that second dot meta character path—that any character path—and loop back to that next state.
00:17:40.560 Well, the lowercase 'h' in my string matches that any character path, so it goes ahead and loops back to the same state. It then sees that lowercase 'a' in the string, which also matches the any character path, so it loops again. This will continue throughout the string. So let's go ahead and fast forward till we get to that lowercase 'm' in the string.
00:18:12.280 Now my regex has a dilemma: it can either take that path that matches the lowercase 'm,' or it could take that looped any character path. Which should it choose? Well, that star quantifier on my dot character is greedy, so it's going to keep on looping over that any character path. It does this again for the 'o,' then the other 'o,' then the 'n,' and oh! Suddenly there are no more characters in my string for it to match and it's still not at that finishing matching state.
00:19:06.960 So, now it has to make another choice: should it backtrack and give back some of the characters that dot with a star quantifier matched, or should it declare the match a failure? Well, remember, greedy quantifiers are reasonable. That star quantifier surrenders some of the characters that matched, starting with the most recent one. Now that doesn't make things better.
00:19:54.240 So, it surrenders that 'o.' We still have no match for that lowercase 'm' path. So it surrenders the next 'o.' Still no match. Things are looking grim, but wait! As soon as it backtracks to that lowercase 'm' in the string, we have a match and we can move on to the next state! When it matches that lowercase 'o,' that second lowercase 'o,' and the 'n,' we have a match.
00:20:37.520 Back in Ruby, we get back our match data object containing our match, which is the entire string. Now, with backtracking, we were able to ultimately find a successful match, but backtracking tends to be slow and it uses up a lot of memory. When you hear someone complain about how regular expressions are slow, chances are they're probably complaining about backtracking.
00:21:22.560 Now, it's great when backtracking allows my regex to find an overall match, but when it doesn't, all that backtracking work, all those resources used, are for nothing. When backtracking is too slow, there's a third kind of quantifier that lets you control backtracking in your regex. It's called the possessive quantifier.
00:21:56.640 Possessive quantifiers do not backtrack. These are all or nothing: they either make a match on the first try or they fail the match. So let's look at an example of this. I make a quantifier possessive by adding a plus sign after it. Now this regex matches a capital letter 'N' followed by lowercase 'o' appearing one or more times, and now I've added a second plus sign to that lowercase 'o.' One plus sign is the actual 'one or more' quantifier and the other is a modifier to that quantifier which makes it possessive.
00:22:37.680 Now, after that lowercase 'o,' my regex also looks for lowercase 'w' appearing at least one time. Now, my finite state machine does this: my regex first matches the capital letter 'N,' then it matches that lowercase 'o,' then it matches the next lowercase 'o,' and the next one, and the next one, and uh-oh! I'm back with the same problem: it hasn't yet found that lowercase 'w' appearing one or more times.
00:23:22.720 It didn't find it on the first try. It now has to decide whether it should backtrack, give back some of those extra 'o's that the quantifier matched, and maybe find the lowercase 'w' in them, or it needs to give up and fail. Right now, a possessive quantifier always chooses to fail rather than give back any of the characters that it matched.
00:24:09.840 A possessive quantifier saves both time and memory by making a regex fail faster. You use a possessive quantifier when you know there's a point in your regex where backtracking would be pointless—the match will fail no matter how many permutations it tries. You use possessive quantifiers with caution; they can potentially cause unexpected failures.
00:24:48.240 Generally, the best place I've found to use them is with a smaller sub-expression within my regular expression or when I have a nested quantifier within another quantifier. When used carefully, this can significantly speed up your regular expression matches.
00:25:34.720 So far, I've taken you through the bits and pieces of how a regex works. It's good information to know and great theory to have, but it doesn't explain how to practically use regular expressions in your everyday code. Crafting a regular expression for a specific need is as much an art as it is a science.
00:26:22.560 In this last section, I'm going to take you through crafting a regular expression from scratch for use in real, functioning code. Back in May, Octagram tweeted a regular expression challenge. It was to create some Ruby code using the gsub method and a regex that would convert a snake case string into a camel case string. I was away on vacation and unplugged at this time, so I didn't see this tweet until much later.
00:27:02.240 Now, I'd like to present my solution and take you through how I developed it. The first thing I did was to whiteboard the requirements for my solution. First, I need to be able to find the first letter in any string and capitalize that letter. Next, I need to find any character that follows an underscore and capitalize it. Finally, I need to remove the underscores from my string. These steps will transform a snake case string into a camel case string.
00:27:49.360 So let's start with that first step. I need to find the first letter of my string and capitalize it. Now, I'm a developer who uses test-driven development, and I develop my regular expressions through that same red-green-refactor process. So first I write a spec where I define the basic structure of my program. I'm creating a class called 'Converter.' Then I'm adding a method to that class called 'upcase_cars.' I expect that when I pass a lowercase string to 'upcase_cars', it will return that same string with that first character capitalized.
00:28:24.240 Next, I draft a regular expression to capture that first character in my string. I'm going to use this caret symbol to anchor my match to the beginning of the string. Next, it will need to find that first letter. My first graph of this regex uses the \/w shorthand; this will match any word character. Let's plug this into the 'upcase_cars' method in my actual class. I define the regex, then I use the gsub method on my string and pass it to the regex. Every character that regex finds is then capitalized using the 'upcase' method.
00:29:17.280 So when I run this spec, my spec passes. But there's a problem with this first draft in my regular expression: I want to capitalize the first letter of my string even if there's an underscore in front of it. So, in this spec, I state that when I pass the string 'underscore_method' to the 'upcase_cars' method, the 'm' in my method is lowercase. I expect to receive that string back with the 'm' capitalized.
00:29:36.560 Now when I run this with my current code, my spec fails. Let's take a look at the error message from that spec. I expected to get back a string with the lowercase 'm' capitalized, but I got back the lowercase string instead. Something's not right. There's a problem with the \/w shorthand. It matches all word characters. But in its mind, word characters include all capital letters, all lowercase letters, and underscores.
00:30:17.680 If I pass it a string that starts with an underscore, it will match the underscore, not the first letter. My Ruby code will then call upcase on the underscore and, naturally, nothing will happen. You really can't upcase an underscore. I need to be more specific. Instead of the \/w shorthand, I'm going to use a character class. This character class will match any lowercase letter from a to z, which is exactly what I need to capitalize and nothing more. Furthermore, I'm going to allow for one non-lowercase character at the beginning of the string.
00:31:24.640 The caret in the character class has a different meaning than the one at the very beginning of my regex. When it's in a character class, it negates the characters that are in that class. It means match anything but the characters that are listed. Finally, I'm going to add a question mark after that character class. This means the character class is optional. This regex will match both a string with an underscore at the beginning of it and a string without an underscore at the beginning of it. Now when I use this regex in my code, my spec passes.
00:32:09.440 So I'm ready to move on to the next step. I need to find any character that directly follows an underscore and capitalize that character. Again, I'm going to define a spec. I expect that when I pass the string 'some_underscore_method' with all lowercase letters, it will return that same string but with the letter 's' and the letter 'm' capitalized.
00:32:38.080 I now need a regex that will match both the first lowercase letter of my string and any lowercase letter following an underscore. I take my current regex and add an alternate to it. As it stands right now, it will match the first lowercase letter of a string or any lowercase letter in the string. Now I need to make that alternate specific to letters that directly follow underscores.
00:33:03.920 I add in a lookbehind. A lookbehind defines the context for my match; it means only match a lowercase letter if it directly follows an underscore. When I add this regex to my code and run my spec, my spec passes. This brings me to the final requirement for my solution: remove any underscores from my string.
00:33:51.840 Again, I create a spec for this requirement. I'm going to have a separate method in my case converter class called 'remove_underscores.' When I pass it a string with an underscore in it, I expect to receive that same string back but with the underscore removed. Now my regex for this part is actually pretty easy: I just need to match a literal underscore.
00:34:24.640 So, when I plug this into my code, create that method, and clear my regex, then I pass both the regex and an empty string to the gsub method. This effectively tells gsub to find any underscore in the string and replace it with that empty string, effectively removing it from the overall string. And my spec passes.
00:34:57.440 I now have two separate methods for my solution, but I need one method that I can call to combine the results of both of them. I create another spec, this one for a method called 'snake_to_camel' in my case converter class. I expect when I provide a string with all lowercase letters and an underscore, I expect to receive that same string with the capital letter 'S' and the capital letter 'M' and the underscore removed.
00:35:30.240 My 'snake_to_camel' method will first call the 'upcase_cars' method on the passed-in string, then call 'remove_underscores' on the result of the 'upcase_cars' method. I run my spec, and my spec passes. The code I created is available here on GitHub, and I'll tweet out a link to it after this presentation. I should say there are many different ways of finding a solution to this problem, and I'd love to see more solutions from people in the audience.
00:36:18.080 Life with regex is a journey—a journey that is at first fraught with peril but becomes much easier as you learn and understand what happens beneath the surface. Here are a few tips to help you along your way: Powerful elegant regular expressions are not developed all at once. You need to develop your regex in small pieces and make sure those individual pieces work, then combine the pieces into a larger regular expression.
00:37:08.800 When you write a regex, you're effectively programming in another language—the language of the regex engine. Like any program, regexes need to be developed iteratively; they come in drafts. Whenever I'm crafting a regular expression for use in Ruby, I first develop it using a tool called 'Rubular.' Rubular is a fantastic site that allows you to easily create and test regular expressions against test strings. It's an extremely valuable tool.
00:38:03.760 Now a tip I picked up from Myron Marston on the Rogue's Parlay list is whenever I develop a regular expression in Rubular, I make a permalink of the Rubular page and paste it into a comment in my code. I've started using this in my day-to-day coding, and it makes things so much easier. Whenever I need to come back to a regular expression, it's an extremely valuable tip, and I highly recommend it.
00:38:51.360 Regular expressions are a programming language of their very own. Like any programming language, they can be learned. They're a logical system and process. At their core, they're no different from any program that takes in a certain input and returns a certain result. They're powerful—so powerful that they inspire fear in many of us. But that fear can be overcome through understanding.
00:39:36.480 Here's a call to action: fire up your Ruby environment, experiment with greedy, lazy, and possessive quantifiers, and have fun with them! Explore, move past your fear, and take a dive beneath the surface. I think you'll be amazed at what you find.
00:39:59.680 I'm Nell Shamrell. I work for Blue Box, and we have a table out in the hallway. I'm going to be standing out there for the rest of the day in between sessions. Please feel free to say hi and ask me any questions. Thank you very much! This is a link to all my resources; I'll tweet out that link after the presentation as well. Thank you!