s/regex/DSLs/: What Regex Teaches Us About DSL Design

by Betsy Haibel

The video titled "s/regex/DSLs/: What Regex Teaches Us About DSL Design" presented by Betsy Haibel at RubyConf 2015 delves into the design principles behind regular expressions (regex) and how they can inform the development of Domain-Specific Languages (DSLs). The talk begins with a primer on regex, explaining its structure and functionality, and highlights the balance between usability and aesthetics in DSLs, particularly in the Ruby programming environment.

Key Points Discussed:
- Introduction to Regex: Haibel introduces regex as an indispensable yet often complicated tool in programming, explaining its basic components like wildcards and capture groups.
- Historical Context: The presentation outlines the evolution of regex from a mathematical concept to its implementation in programming languages, emphasizing its utility despite its complexity.
- What are DSLs? Haibel defines DSLs as programming languages tailored for specific problem domains and contrasts them with general-purpose languages.
- Tight Domain Integration: A core principle emphasized is that effective DSLs should deeply integrate with the problem domain they are designed to address. This aspect is illustrated through a micro-DSL example focused on querying Twitter.
- Composability: Another significant principle involves the easy combination of language components. Haibel discusses three common DSL styles in Ruby: class macro DSLs, method chaining, and block structures, each contributing to composability in different ways.
- Challenges of DSL Design: Haibel contrasts good DSL design with poor implementations, underscoring the importance of defined boundaries in domain integration to avoid inappropriate use of structure and functionality.
- Best Practices: The talk concludes with recommendations for creating effective DSLs, such as maintaining focused domain definitions and providing extension APIs to enhance usability without complicating the grammar of the language.

Main Takeaways: The essential lesson from the discussion is that while regex may appear confusing and ugly, its effectiveness and long-standing use in programming offer valuable insights for designing DSLs. By prioritizing functionality over superficial elegance, developers can craft useful and intuitive DSLs that align closely with their specific problem domains.

00:00:14.530 My name is Betsy Haibel, and this afternoon we’re going to be speaking about regexes and specifically their DSL design and what we can learn from it when we’re designing our own DSLs. To keep everyone on the same page, we’ll start with a quick introduction to regular expressions for anyone who’s not familiar with them, or for anyone in the audience who could use a refresher since they haven’t worked with them in a while.

00:00:22.099 Here’s the simplest regex I can think of: it searches a given text for the letters 'd', 'o', 'g' in that order and with no characters between them. So, it’ll match any of these strings here. And here’s a less trivial example: in this one, we use the period as a wildcard to match any character. Since this wildcard matches any character, the regular expression 'b.g' can match the strings 'bog', 'b g', or 'bag', among a lot of other things.
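The two regexes described here can be tried directly in Ruby. This is a minimal sketch; the sample strings are assumptions chosen for illustration, not the ones on the slide.

```ruby
# Literal match: the letters d, o, g in that order, with nothing between them.
literal = /dog/
literal_hits = ["dog", "hot dog", "dogged", "d o g"].select { |s| s.match?(literal) }

# Wildcard match: "." stands in for any single character (except a newline).
wildcard = /b.g/
wildcard_hits = ["bag", "b g", "bg"].select { |s| s.match?(wildcard) }

p literal_hits   # the three strings that contain "dog" unbroken
p wildcard_hits  # "bg" is left out: the wildcard must consume exactly one character
```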

00:00:50.780 There are a lot of other wildcards that can match more specific things as well, such as word characters, whitespace, and even a thing called a word boundary, which is the first or last character of any given word. Both characters and wildcards can be grouped, and if the default groupings aren’t powerful enough, you can specify the number of characters to be matched with other wildcards like '*' and '+'. The specifics matter less right now than the mere fact that there are a lot of things you can do.
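As a rough illustration of those building blocks, the sketch below uses Ruby's `\w` (word character), `\s` (whitespace), and `\b` (word boundary, which in Ruby matches the position at the edge of a word rather than a character), plus the `*` and `+` repetition markers. The sample text is an assumption.

```ruby
text = "big  dogs dig bogs"

words      = text.scan(/\w+/)      # runs of word characters
gaps       = text.scan(/\s+/)      # runs of whitespace
whole_word = text.scan(/\bdig\b/)  # "dig" only where it stands as a complete word

# "*" means zero or more of the preceding item; "+" means one or more.
zero_or_more = "dg dog doog".scan(/do*g/)
one_or_more  = "dg dog doog".scan(/do+g/)

p words, gaps, whole_word, zero_or_more, one_or_more
```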

00:01:16.880 Getting a little more complex, you can use capture groups to single out specific subsets of your match for special treatment and use backreferences to refer to previously captured groups. For example, 'Peter Piper picked a peck of pickled peppers' can be represented within a single regular expression. So we’ve got all of these building blocks, and individually they’re pretty simple. But I’m not going to pretend that all regexes are simple; this for example is an email validation regex that someone, somewhere, for some reason, recommended that other programmers use in production. These simple elements that make up regexes can be combined in ghastly, hieroglyphic-esque ways, and often are.
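The transcript does not preserve the Peter Piper regex from the slide, so the following is only a sketch of the two features named here, capture groups and backreferences, using patterns made up for illustration.

```ruby
# Capture groups: parentheses single out parts of the match for later use.
match = /(\w+) (\w+) picked/.match("Peter Piper picked a peck")
first_name = match[1]
last_name  = match[2]

# Backreference: \1 must repeat exactly what group 1 captured,
# so this pattern finds an accidentally doubled word.
doubled_word = /\b(\w+) \1\b/
has_double = doubled_word.match?("a peck of of pickled peppers")
no_double  = doubled_word.match?("a peck of pickled peppers")

p [first_name, last_name, has_double, no_double]
```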

00:02:20.290 At this point, you may be wondering whether it is possible to learn about designing DSLs, or indeed about designing anything, from something that produces screenfuls of mess and that can’t even fully parse an email address in the process. Because, of course, that email validation regex I just showed you did not actually work. The answer is that regexes are old; like C or shell scripts, regexes are gawky and horrible, and everyone has used them for decades anyway. They are too bloody useful to erase or give up, no matter how much we try to replace them with tools that are aesthetically prettier. Anything that bloody useful has to teach us design lessons, whether its surface seems polished or not. The biggest goal in software design, over and above how elegant things are, is getting the damn thing to work. And regexes, bless them, do that, if nothing else.

00:03:11.470 So, how old are regexes anyway? Their first definition is a mathematical concept that dates back to 1958. They were an outgrowth of set theory used for describing the grammar of regular languages. A decade later, they were implemented as a simple, independent programming language. Note that this first implementation treated them as a programming language in their own right. A few years after that, they began to see wider use when they were embedded into a concrete tool: the UNIX 'grep'. They then became embedded in more and more powerful tools such as 'sed' and 'awk' and were embedded into the programming language Perl in 1987 as a first-class language concept.

00:04:15.100 In other words, regular expressions got a lot more powerful and useful—and therefore a lot more used—when they became a domain-specific language for string processing embedded within a more general-purpose language. In the 28 years since Perl came on the scene, regex implementations have been baked into countless other programming languages. We’re at the point where they’re considered a language feature rather than a language in their own right. Most programmers have forgotten that or never knew. When I frame regex historically like that, contrasting their early days as a programming language in their own right with their modern days as an embedded DSL, it naturally raises the question: what are DSLs anyway?

00:05:02.440 Are they appreciably different from programming languages? I don’t necessarily think that the C2 wiki is an authoritative source; it’s just a place where a lot of smart people have had a number of informed opinions over the years. In a consensus reached through a sheer, stunning amount of debate, they define DSLs as programming languages designed specifically to express solutions to problems in a specific domain. There are a lot of spirited discussions about the merits of this pattern, because two programmers can have three opinions, but all of these programmers with all of these opinions agree that both the potential beauty and the potential horror of DSLs stem from their place as languages in their own right. Because languages are difficult to design.

00:06:01.179 They also talk about whether regexes are actually a DSL. Fascinatingly enough, a lot of people don’t think they’re complex enough to count as a language. To each their own, but I am the kind of person who will die on the hill that CSS and SQL are also programming languages. Regexes have far more complex control structures, even if these control structures are not actually powerful enough to avoid the kinds of email validation regexes we’re talking about or to let you express those ideas in a more concise fashion. But that cautionary tale aside (and it is absolutely what we think of when we think, with fear, of regexes in the wild), most production regexes are a lot closer to those basic examples. And while those may seem dodgy, they’re not necessarily bad. The 'dog' regex is a perfectly valid regular expression and exemplifies one of regexes' genuine intuitive strengths.

00:06:48.960 It’s not a far leap to figure out that a regex containing the letter 'd' will match on the letter 'd'. More generally, we can call this feature of regular expressions tight domain integration. Remember, DSLs are programming languages designed specifically to express solutions to problems in a specific domain. When DSLs tie themselves closely to the corpus and structure specific to a domain, they get a leg up in solving domain-specific problems. This is something that goes a bit deeper than the ordinary programming superpower of naming things; you’re not just importing concepts from the problem domain into your DSL.

00:07:29.660 Instead, you’re replacing the logic of Ruby with the logic of that problem domain. Regexes get to cheat a bit when it comes to this type of domain integration because they are text-processing languages, and they’re written using text. Most DSLs we write don’t get that automatic cheat, but we can achieve this type of domain integration with a little more work. For example, we’re going to build a Twitter language: a small query language that runs targeted Twitter searches. We’ll start with the simplest query possible, which is searching my Twitter feed for photos of my cat. You’ll see why that’s the simplest thing possible in about ten seconds. Right now, you don’t really need a DSL to express that; a simple hash interface would convey my intent clearly, and implementing that interface would be far more straightforward.

00:09:35.190 But what if you want photos of cats in my general social circle? Suddenly, a more complex query language starts to make sense. These two examples are roughly comparable, but when we start to add more complicated logic around the network diagram of my Twitter friends, our 'simple' hash interface starts to look a lot less simple. This hash below would be difficult for the search function to parse and difficult to actually use. It would also be difficult to document and remember. This is happening because we’re defining our API in Ruby terms rather than in our domain’s terms. It starts to look like a bad DSL, actually, and specifically one without tight domain integration.

00:10:42.810 In the first example, by admitting that we were writing a DSL, we were able to maintain a tight focus on the core domain concepts, which ultimately led to a smoother design. Now, you’ll note one thing that I am NOT saying here: a lot of people talk about syntaxes like this as examples of successful API design because they’re ‘English-y’. What’s actually happening, though, is more complex. The two examples we’re going to be looking at in about five seconds are both RSpec from different years of the framework. They’re both, I suppose, English-y in the loose way that we were using the term before; that is to say, they both use English words to name things. Their grammar occasionally causes those English words to flow together in a way that apes an English sentence. The top example is definitely the more English-y of the two; it’s pretty much a sentence in its own right. However, it’s been supplanted by the second style as RSpec has evolved.

00:12:13.720 That isn’t what we’d expect if English-y were always the goal of API design. The second style supplanted the first for a lot of reasons, among them a much cleaner implementation, and in fact it isn’t any harder to work with in practice. The mark of a good DSL isn’t how closely it approaches English; it’s whether it enables programmers to write programs. The RSpec DSL neatly encapsulates domain concepts like test cases and assertions, achieving the same tight, intuitive domain integration that regex achieves by having 'dog' match 'dog'.

00:13:07.580 In RSpec, tight domain integration comes from choosing good names for things; the vocabulary of the DSL makes sense. But languages are made of grammar as well as vocabulary. This brings us to our next big principle of good DSL design, namely composability. If I want to make a regex that searches for either 'dog' or 'cat', the answer is pretty easy. Regex’s grammar is simple and, for the most part, intuitive. Combining and backreferencing is really as complicated as it ever gets since all it’s doing is providing a facility for simple text matching. Because it’s made out of text, it once again gets to cheat; for the most part, it leans on its own structure to develop its grammar.
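The 'dog' or 'cat' search mentioned here is a one-character composition in regex grammar: the `|` alternation operator. A minimal sketch follows; the sample strings and the plural-aware variant are assumptions added for illustration.

```ruby
# Alternation: match either branch.
pets = /dog|cat/
pet_words = ["dog", "cat", "bird"].grep(pets)

# Alternation composes with grouping, repetition, and boundaries:
# (?:...) groups without capturing, "s?" allows an optional plural.
plural_aware = /\b(?:dog|cat)s?\b/
found = "cats and dogs and a bird".scan(plural_aware)

p pet_words, found
```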

00:14:36.579 Since most domains aren’t quite such natural fits for one character after the next, they need to develop more complex composition rules. When we build Ruby DSLs, we are building languages that are implemented in Ruby and that lean on the Ruby parser, and because of that, we’re constrained by Ruby’s grammar in deciding which composition rules to adopt in practice. This leads us toward three basic shapes. The simplest is the class macro DSL, specifically the class-level macro with a lot of configuration options. This sort of example is useful as a top-level hook interface between a library and classes that want to use its features. It’s how a lot of the Rails framework, for example, is expressed, as well as a lot of image attachment libraries.
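A class macro DSL of this shape can be sketched in a few lines. Everything below (`Attachable`, `has_attachment`, the option names and defaults) is hypothetical and does not reproduce any real attachment library's API.

```ruby
# Hypothetical attachment library, for illustration only.
module Attachable
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # The class macro: called in the class body, it merges the caller's
    # options into a default configuration hash and stores the result.
    def has_attachment(name, **options)
      attachments[name] = { max_size: 5_000_000, styles: {} }.merge(options)
    end

    def attachments
      @attachments ||= {}
    end
  end
end

class Profile
  include Attachable
  has_attachment :avatar, styles: { thumb: "100x100" }
end

p Profile.attachments
```

The macro is just a class method that stores a hash, which is why this style is simple to implement and hard to get wrong.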

00:15:32.680 It’s not necessarily that expressive because you can only build concepts with it that can be expressed in a configuration hash, but it’s easy to read, easy to implement, and hard to screw up. The next most complex of the DSL styles we’re going to talk about is method chaining. In this style, you use a series of methods that return 'self' to build up call sequences that define what the object means before you use that object. This is a very common JavaScript DSL structure, but in the Ruby world, I’ve mostly only seen it in test libraries like Mocha mocks or RSpec matchers.

00:16:12.130 Honestly, I wish it were used much more often since it’s designed around the idea of continuously modifying objects. It’s easy to manipulate and reason about, and it can be bent to match a lot of different domain models. In our example Twitter DSL, our composition rules focus on the shapes of the relationships that people have with each other. In Mocha, they focus on the different properties of mock objects. In each case, the grammar that defines how elements can be composed also echoes the domain structure. In other words, tight domain integration matters at both vocabulary and grammar levels of a domain-specific language.
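The Twitter DSL from the slides is not preserved in the transcript, so the sketch below is a guess at what a method-chaining query builder of this kind might look like. The class name, method names, and criteria keys are all assumptions.

```ruby
# Hypothetical query builder; every name here is an assumption for illustration.
class TweetQuery
  attr_reader :criteria

  def initialize
    @criteria = { authors: [], hops: 0, media: nil, topics: [] }
  end

  # Each method records one piece of the query and returns self,
  # which is what allows the calls to be chained.
  def from(handle)
    @criteria[:authors] << handle
    self
  end

  def and_friends(hops: 1)
    @criteria[:hops] = hops
    self
  end

  def with_photos
    @criteria[:media] = :photo
    self
  end

  def about(topic)
    @criteria[:topics] << topic
    self
  end
end

query = TweetQuery.new.from("example_user").and_friends(hops: 2).with_photos.about("cat")
p query.criteria
```

The chain reads in the domain's terms (people, their social circle, media, topics), while the object underneath simply accumulates parameters for whatever eventually runs the search.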

00:17:08.120 The last common Ruby DSL style is the block structure. In its simplest form, the one-level block DSL is a common choice for tiny configuration DSLs. It provides a really pretty interface with a minimum of implementation costs. You can also build nested block DSLs since this style encourages you toward code that takes on a tree or nested structure. It’s a strong choice when the pattern echoes the landscape of that domain. In the Rails routing DSL, for example, the tree-shaped structure echoes the directory structures that web routes visually imitate.
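Both block shapes can be sketched briefly. The `configure` helper and the `RouteTree` class are hypothetical; the nested example only imitates the tree shape of a routing DSL and is not the Rails implementation.

```ruby
# One-level block DSL: the block receives a plain settings object.
Settings = Struct.new(:host, :retries)

def configure
  settings = Settings.new("localhost", 3)
  yield settings
  settings
end

config = configure do |c|
  c.host = "example.test"
end

# Nested block DSL: each block is evaluated against a builder object,
# so nesting blocks builds up a tree of paths.
class RouteTree
  attr_reader :paths

  def initialize(prefix = "", paths = [])
    @prefix = prefix
    @paths = paths
  end

  def scope(segment, &block)
    RouteTree.new("#{@prefix}/#{segment}", @paths).instance_eval(&block)
    self
  end

  def page(name)
    @paths << "#{@prefix}/#{name}"
  end
end

routes = RouteTree.new
routes.instance_eval do
  scope "admin" do
    page "users"
    scope "reports" do
      page "daily"
    end
  end
  page "about"
end

p config.to_a, routes.paths
```

Note that `instance_eval` changes what `self` means inside each block: `page` resolves against a different `RouteTree` at each nesting level, which is invisible at the call site.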

00:18:06.370 This block structure is common in Ruby DSLs. It defines a grammar that feels removed from the ordinary rhythm of Ruby, and so it feels 'DSL-y' in the same way that arranging things in sentences feels English-y. It’s not that hard to implement, necessarily, from the lines of code perspective. But because it relies on passing blocks of code between different contexts, it can sometimes be difficult to reason about when things go wrong. It can be difficult to intuit the context in which any given line of code is executing.

00:19:01.530 This leads to one of my most common frustration points with other people’s DSLs: namely their inappropriate use of block structures. A slide demonstrating my point should appear in a few seconds, but in the interest of time: the abstractions they try to implement with these inappropriate block structures don’t neatly fall into nested structures. Consequently, when I write code that tries to fit what I’m trying to express within this nested structure that doesn’t fit very well, I end up needing to pass around proxies a lot or use a bunch of instance variables or both in order to get things done in a DRY way.

00:20:12.640 Worse yet, because I’m passing around all of these blocks that are evaluated in various contexts that I know very little about immediately, I need to read the DSL library’s code and really know a lot about what context these blocks are being evaluated in. I need to care about the internals in a way that I wouldn’t necessarily need to care in a more conventional abstraction. To be frank, this talk was inspired by a DSL that made me go through that process. It was also designed in a way that wasn’t easy to extend or modify, and so I wound up needing to monkey-patch it a lot. It was a perfect storm of frustration.

00:21:01.910 I was trying to write a talk to figure out why I hated that entire process and the project I was working with on that DSL. I wanted two things from it. I wanted it to be easily extensible with ordinary object-oriented techniques so that I didn’t need to monkey-patch all the time. And I wanted it to allow me to merge blocks of code written in that DSL; in other words, I wanted its grammar to allow for better composability. When I started working on this talk, I figured that those two were the same thing. I really did think that I was going to wind up proving that DSLs were irrelevant, and I was wrong.

00:22:01.840 Here’s why: regexes are made of strings. You can build a regex with Ruby using perfectly ordinary string manipulation. You don’t need to use classes and feel dirty about it the way I did in the regex examples I was showing before. I figured that as long as I was going to say that you could do stuff like this with your DSL, it was going to be perfectly fine; it was pretty great! And this talk was going to just be about how to make it possible to do that stuff. But if we accept that domain-specific languages are just languages, then what actually is the difference between combining regex fragments with Ruby and intermixing Ruby with other languages?
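For reference, this is roughly what building a regex out of Ruby strings looks like. The fragment names and sample text are assumptions; `Regexp.escape`, `Regexp.new`, and `Regexp.union` are standard Ruby.

```ruby
animals = %w[dog cat]

# Ordinary string manipulation: escape each fragment, then join with "|".
alternation = animals.map { |a| Regexp.escape(a) }.join("|")

# The assembled string becomes a regex via Regexp.new or literal interpolation.
from_string  = Regexp.new("\\b(?:#{alternation})s?\\b")
interpolated = /\b(?:#{alternation})s?\b/

# Ruby also ships a helper for the plain alternation case.
union = Regexp.union(animals)

hits = "cats chase dogs".scan(from_string)
p from_string.source, union.source, hits
```

In every one of these forms, Ruby is writing regex source text, which is the boundary crossing the talk goes on to examine.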

00:23:55.410 What’s the difference between the regex with embedded Ruby up top and the JavaScript with embedded Ruby below? There isn’t all that much of one, and if we poke at our instinctive reaction to that JavaScript with embedded Ruby, we can figure out why. In this example, we’re initializing a JavaScript array and then using embedded Ruby to manually build up a set of literal push calls that reassembles a Ruby array in the JavaScript world. When I’ve seen this first example in the wild, and yes, I have seen it in three different production codebases (God help me), it’s generally done in the context of web application views. In other words, the developer was writing that code to transfer an in-memory Ruby array on the server to an array on the client.

00:24:54.920 But of course, there’s another more widely accepted way to do that. It’s the example below: you just write an API endpoint on the server that returns the array and then the client-side JavaScript accesses that using an ordinary Ajax call. In writing the embedded Ruby, we’re ignoring an existing, well-defined interface for transferring information between the client side and the server side. In ignoring that interface, we can figure out what’s going wrong. It’s not just that we’re ignoring the interface, by the way. When I first had this eerie reaction to the array push, I didn’t actually know enough JavaScript to understand there was an accepted way to do that.
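The slide code is not in the transcript, so here is a sketch of the contrast under some assumptions: ERB as the view template, JSON as the payload the endpoint would return, and made-up sample data. Neither ERB nor JSON is named in the talk.

```ruby
require "erb"
require "json"

cats = ["Mittens", "Mr. \"Fluffy\" Paws"]

# Crossing the boundary: Ruby writes JavaScript statement by statement.
# Nothing escapes the values, so the second name produces broken JavaScript.
template = ERB.new(<<~JS)
  var cats = [];
  <% cats.each do |cat| %>
  cats.push("<%= cat %>");
  <% end %>
JS
generated_js = template.result(binding)

# Using the defined interface: serialise the data and let the client parse it.
# In a web application this string would be the body of an API response.
payload = JSON.generate(cats)

puts generated_js
puts payload
```

The first version yields text that only looks like JavaScript, and whether it is valid depends on the data. The second hands the client data in a format both sides already agree on.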

00:26:59.640 But if there’s a defined interface for us to ignore, then that means we must have two objects that the interface is between. In this case, the objects are the Ruby server and the JavaScript client. But we can as easily think about that as the Ruby and the JavaScript. We can think of the languages as kind of objects in the CS meta sense. This is a little easier to understand when we look at the regex example. It’s very clear that the two different objects are the languages themselves. And if a chunk of any given language is an object in its own right, we’re doing something very interesting when we use Ruby to compose a regex or assemble a JavaScript array: we are crossing those object boundaries.

00:28:03.020 Those interpolated Ruby strings are not actually spiritually different from reaching into an instance variable or calling a private method. They are reaching into the JavaScript’s business and messing around with it, which is part of why code generated using this method is so very hard to understand and debug. Suppose we’ve got that mental framework in place. What’s the difference between interpolating Ruby into JavaScript like the example above and interpolating Ruby into RSpec? I know I just said a really weird thing. RSpec is written using Ruby so it sounds funny to talk about interpolating Ruby into RSpec.

00:29:12.440 But again, in order for a DSL to be useful, it needs to be a language in its own right. We need to give it that respect. And we need to accord RSpec that respect; RSpec is kind of weird in this way because it expects you to embed Ruby into it, but expects you to embed this Ruby in specific, coordinated, well-defined places. When you embed Ruby in a place that isn’t one of those, like using an 'each' loop to define a group of similar examples, then you’re crossing language boundaries.

00:30:00.000 It feels icky in the way that it always does and should. If I were to try to use ordinary object-oriented techniques to try to extend RSpec—like I wanted to be able to do with that bad DSL I mentioned earlier—that would also be crossing those boundaries. When was the last time you tried to extend the class that all describe blocks build instances of? For that matter, when was the last time, outside of Sam’s talk earlier, that you thought about the fact that describe blocks instantiate an object?

00:30:28.220 RSpec’s language design successfully hides these implementation details from you, just like a good library and a good language should. You don’t think about C when you’re writing Ruby unless you’re doing serious optimization. More than that, it obscures its own Ruby-ness. When using it, we nearly forget that it was written in Ruby and therefore must be made up of the objects and classes that make up all Ruby implementations. We get to do that because RSpec has removed the need to think about it. Instead of asking users to use ordinary object techniques to extend RSpec, its maintainers provide a defined set of extension APIs, such as the shared example API and the matcher API.

00:31:23.590 For matters connected to the actual purpose of RSpec, namely the structure of example groups, examples, and expectations, you’re expected to still not interfere. In other words, any language’s rules of composition stay within the language. Composability is not about how easy it is to cross language boundaries to do whatever you want; it’s about how easy it is to do what you want in a sensible way while staying within the bounds of the language. That’s great and all, but it doesn’t solve one of the problems I had with that other DSL, a terrible one that I’m deliberately omitting by name.

00:32:16.160 I couldn’t do all the things I wanted to do with it, period. Never mind sensibly while staying within the bounds—that’s why I needed to monkey-patch its internals. How do we avoid that problem in our DSL designs? Well, we can provide a small, defined extension API like RSpec does. That lets us define new words in the language without bending its grammar out of shape. But there’s another way, and I like this one better, and it’s very simple.

00:32:56.960 One of the beautiful things about regular expressions is that they search within text, and they can occasionally replace text. They do not try to do anything more. They do not claim to do anything more. They have chosen one specific problem space and don’t try to solve any other problems. As Stack Overflow's funniest answer is quick to remind us, regular expressions can only parse regular languages, and those are a very small subset of all the languages in the world. They have their limits; they are not a complete parsing engine for anything, especially not HTML. And also, again, not to beat a dead horse: email validations.

00:33:42.230 And that is totally okay because they do not need to do anything but search text. I’m going to call this closed domain integration. It’s not enough to integrate deeply with a domain; you need to go to the limits of that domain and no further. In order to get there, you need the flip side of this coin: namely constraining the domain definition so that you know where those limits are.

00:34:30.490 It’s okay to define these limits with big red placeholder boxes like RSpec does and say ‘user code goes here,’ but you need to have that really specific definition. You need to know where those boxes lie; if you do that, it makes the problem of covering the domain completely one that is even solvable in the first place. So, I’ll start wrapping up now. As Rubyists, we are not going to stop writing DSLs anytime soon. It’s one of the things everyone jokes about us, but actually, it’s our strength because DSLs are very powerful and kind of cool when they’re done right.

00:35:07.200 So, the question then becomes: how do we write the good ones rather than the ones that Aaron is having feelings about? You can treat your DSL like you would any other API. You can expose what people need. You can close off the other stuff. You can stay close to the domain you’re describing and have sensible composition rules. You can keep everything small enough to complete. Getting there, though, is again a very hard problem. While a good DSL is often more usable than a good vanilla library API, a bad DSL is much less usable, as we’ve all experienced, than a bad vanilla API.

00:36:06.780 I’m not saying right now that you’re doomed to screw up; obviously, you’ve seen this talk, and every DSL you design from now on is going to be perfect. But a good DSL is a lot more work than a decent vanilla API, and that’s something you need to respect. You’re going to need to write that decent vanilla API anyway in order to implement the DSL. I suggest that you do that first and figure out if you need more, and let things lie.

00:36:42.160 That’s everything I need to say right now. I’m Betsy Haibel. Again, I’m @soup_muffin on Twitter, which is going to pop up on the screen shortly. I’m very sorry about the AV issues; I’m not entirely sure what’s going on with Google Docs. This talk is going to be up on my website at the URL on the screen shortly after this talk, probably sometime during the lightning talks or dinner, whenever I can get a decent lock on GitHub with the conference internet. I tweet about books, code, my cat, and feminism. I co-organize a meet-up back home called 'Learn Ruby in DC.' This is an informal space for newbies to ask questions and find mentorship.

00:37:47.920 If you are interested in making a meet-up like that in your own hometown, or if you also run a meet-up like that and want to talk shop, then please talk to me. I think it’s a really good model for building the community, and I would love to share nice stories and also pitfalls so that you can avoid them. I work for a great little organization called ActBlue that builds fundraising tech for Democratic candidates and causes, where we focus on small-dollar donations, which is a surprisingly powerful thing.

00:39:05.260 Our average donation size is around thirty dollars, and we’ve raised nearly $815 million over the roughly a decade we’ve been in business. This really helps those donors’ voices be heard in a way that keeps the party accountable to people who only have thirty dollars to spare at a time, which is something that means a lot to me. We’re also committed to building sustainably at the kind of scale that can bring in that much money over time. We have a modern tested stack, and we have a focus on maintaining a culture that, well, my third day was one of our biggest days of all time.

00:39:51.560 Pretty much everyone in my team hip chatted me over the course of that day saying, ‘By the way, it’s the end of the quarter: you’re going to close your laptop at 5:30, you’re going to have dinner, and you’re going to do everything but be on call.’ We’re also hiring Rails, UX, and DevOps people right now, so if the values I just outlined sound good to you, if they resonate, then please talk to me. I’d love to work with you.

00:40:16.220 Many thanks to Nil, Rappin Kenzie, Connor, Chris Hoffman, Tina Waste, and the entire membership of the Arlington Ruby Users Group for invaluable feedback while I was developing this talk.

00:40:23.330 But I personally have not built enough things that actually required a DSL. I really do take the responsibility to go up to those bounds and no further quite seriously. I’ve built some templating stuff that I’m pretty proud of, but other than that, I haven’t worked in any problem spaces that I feel require that level of power. Unfortunately, that’s all been proprietary stuff, so I can’t point you to a GitHub repo.

00:41:11.440 The question from Walter is whether I have any mental litmus test for when something does warrant a DSL. So for that, let’s go back a few slides. If you can see in that second example at the bottom, we’re getting an increasingly complex hash interface.

00:41:30.550 One of the things about that is, as you apply more and more options for what any given library does, that access point—let’s call it that even though it sounds fancy and it’s not a fancy concept—when any given method call is at the front edge of your API, it starts taking a lot of parameters. You should start thinking about ways to encapsulate all of those parameters within an object. A nice simple method-chaining DSL is a great way to actually build that parameter object in a way that’s clean and readable.

00:42:46.250 One of the questions I kind of anticipated getting was someone calling me on the differences between RSpec and MiniTest, because they’re very different stylistically in terms of implementation. But in terms of the way the MiniTest API has evolved over the years and the RSpec DSL has evolved over the years, one of the interesting things is that they’ve actually evolved towards each other. I think that it’s valid to not want something like the full-on Test::Unit 'prefix everything with test' style; that drives me bonkers. And through the years we’ve seen a lot of things, like RSpec, MiniTest’s spec syntax, and Shoulda, that attempt to impose more structure than the plain test case API gives you.

00:43:47.120 You know, there are no hard and fast rules in programming, so these are largely matters of taste. But the outer edges of the RSpec API, with ‘describe’ blocks and ‘it’ blocks, seem to be something that a lot of different libraries ultimately decide works for test cases, even if that’s not where they start out.

00:43:56.020 Cool, wonderful! Well, I will let you all get to the lightning talks. Thank you so much!