Regular Expressions: Amazing and Dangerous

by Martin J. Dürst

The presentation titled "Regular Expressions: Amazing and Dangerous" by Martin J. Dürst focuses on the powerful yet potentially hazardous aspects of regular expressions (regex) as used in Ruby programming. The central theme is the duality of regex: a powerful tool for text processing that also becomes a source of performance problems and vulnerabilities when it is not used judiciously.

Key Points Discussed:

- Motivation for the Talk: The talk is motivated by a feature proposal to add support for regular expression timeouts. Dürst explains how to detect and prevent dangerous regular expressions, which can become extremely slow on certain inputs.
- Dangerous Aspects: Slow regular expressions can be exploited for Regular Expression Denial of Service (ReDoS). The risk is highest when a pattern triggers excessive backtracking.
- Proposed Solutions: Dürst discusses two main approaches: (1) a timeout for regex operations, and (2) a backtrack limit that caps how many times a regex may backtrack before the match is abandoned.
- Background Research: The topic is an active research area, notably covered in James Davis's Ph.D. thesis, from which Dürst draws insights for the presentation.
- Common Problems Encountered: Dürst asks the audience to reflect on their own experience with slow regex operations and points out that introductory regex literature often omits performance pitfalls.
- Examples of Regex Usage: He shows regex in action for string matching, extraction, replacement, and splitting, and mentions its use in Unicode normalization.
- Practical Risks: Dürst warns that patterns written without a clear structure and careful design can cause unexpected performance problems, and cites the well-known joke that solving a problem with regular expressions leaves you with two problems.

Important Conclusions and Takeaways:

Regular expressions, while extremely powerful, can be dangerous if misapplied. Dürst advises:
- Utilize regex carefully, especially in contexts with user input to prevent security vulnerabilities.
- Always examine the structure of regex patterns, using options that make it easier to read, such as the x (extended) option.
- Adopt testing practices to thoroughly validate regex operations and ensure comprehensive coverage.
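
As an illustration of the last two points, here is a minimal sketch of a pattern written with the x option, which lets whitespace and comments appear inside the pattern. The date format, the constant `ISO_DATE`, and the helper `parse_iso_date` are hypothetical and chosen only for this example; they do not come from the talk.

```ruby
# Hypothetical example pattern: a date written like "2021-09-11".
# With the x option, whitespace is ignored and "#" starts a comment.
ISO_DATE = /
  \A
  (?<year>  \d{4} ) -   # four-digit year
  (?<month> \d{2} ) -   # two-digit month
  (?<day>   \d{2} )     # two-digit day
  \z
/x

# Returns [year, month, day] as integers, or nil if the text does not match.
def parse_iso_date(text)
  match = ISO_DATE.match(text)
  return nil unless match

  [match[:year].to_i, match[:month].to_i, match[:day].to_i]
end
```

Tests for such a helper would cover both accepted inputs and inputs that must be rejected, for example a one-digit month or a trailing newline.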

In conclusion, Martin J. Dürst’s talk provides valuable insights into the mechanics of regular expressions in Ruby, urging programmers to appreciate their strengths while being mindful of their potential risks, thereby fostering both effective and secure use of this powerful tool in programming.

00:00:03.360 Good afternoon, everybody, or good morning, or good evening, good night, wherever in the world you are.

00:00:13.759 My name is Martin J. Dürst; in Japanese, just call me Martin. I will be talking about regular expressions, which are both amazing and dangerous.

00:00:20.320 Regular expressions can be extremely convenient, but they can also be very dangerous. My talk today will focus more on the dangerous aspects rather than the amazing ones.

00:00:24.720 So, what's the motivation for this talk? The motivation is a feature proposal to add support for regular expression timeouts. The essential question is how we can detect or prevent dangerous regular expressions, and why are they dangerous? Well, they can be very slow on some inputs. It's not that the regular expressions themselves are slow, but some can get extremely slow, particularly on specific inputs.

00:01:00.480 Currently, there are two ideas for addressing this issue. One is implementing a timeout system specifically for regular expressions. The other is introducing a backtrack limit, which means counting how many times a regular expression has to backtrack and then setting a limit on that number. This is what I'm going to talk about during this presentation.
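
The timeout idea can be sketched as follows. This assumes Ruby 3.2 or later, where `Regexp.new` accepts a `timeout:` keyword and a match that runs too long raises `Regexp::TimeoutError`. The helper name `match_within?` is hypothetical. No comparable public setting for a backtrack limit is shown here, since this sketch covers only the timeout approach.

```ruby
# Assumes Ruby 3.2+ (timeout: keyword and Regexp::TimeoutError).
# match_within? is a hypothetical helper name, not part of Ruby.
# Returns true or false for a completed match, or nil if the time limit was hit.
def match_within?(source, input, seconds: 1.0)
  Regexp.new(source, timeout: seconds).match?(input)
rescue Regexp::TimeoutError
  nil # gave up: the match did not finish within the time limit
end

match_within?('\Aa+b\z', "aaab") # => true
```

Returning nil rather than false keeps "no match" distinct from "could not decide in time", which callers may want to handle differently.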

00:01:45.600 The background involves what is known as Regular Expression Denial of Service, which is an active research topic. It was particularly highlighted in James Davis's Ph.D. thesis, which consists of 456 pages; I've read all of them so you don’t have to! I hope to share some of the insights and findings described within it, along with some work I have done myself based on this thesis.

00:02:51.680 Now, speaking of conventions, if you're a Ruby programmer, you have likely used regular expressions. But did you encounter any problems? What kind of problems? Who here has experienced or at least knows that regular expressions can be very slow? For instance, in an introductory book about regular expressions, I found no mention of the fact that a mistake in a regular expression, or even a lack of careful thought, can lead to very slow performance.

00:04:44.640 Of course, Jeffrey Friedl's famous book about regular expressions does discuss speed as well, but it's a common oversight. A bit about myself: I'm from Switzerland, but sadly, I'm not in sunny Switzerland right now; perhaps the background might suggest that. In actuality, I am in rainy Tokyo.

00:05:19.320 My main contributions to Ruby come from various areas, with one of my upcoming tasks being to update to Unicode 14, expected in about a month or so. Now, why are regular expressions amazing? Regular expressions are a domain-specific language (DSL) well-suited for string processing. While we often talk about domain-specific languages in Ruby itself (with classes, methods, and so on), regular expressions are very specialized languages with distinct syntax.

00:06:22.400 Ruby is one of the languages where regular expressions are frequently used. They are powerful due to their compact and intuitive syntax. If you write a regex matching 'perl' or 'python' or 'pascal', it will literally match those terms. This syntax is concise and intuitive, though it can become complex for more intricate patterns.
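
A minimal sketch of that alternation in Ruby. The constant name `LANGUAGE` and the sample strings are made up for illustration.

```ruby
# Alternation: matches any one of the three literal words.
LANGUAGE = /perl|python|pascal/

LANGUAGE.match?("I learned pascal first")  # => true
LANGUAGE.match?("I write Ruby")            # => false
"perl and python"[LANGUAGE]                # => "perl"
```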

00:07:02.080 Next, let’s look at some examples of regular expressions in action. We have the usual tasks: simple string matching, extracting pieces from longer strings, and replacing parts of strings with different segments. Regular expressions excel at splitting strings into various components, which is one of the fundamental applications of regex in Ruby.
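
The four tasks mentioned here can be sketched with core String methods. The sample record and its field names are hypothetical.

```ruby
record = "name=Martin; city=Tokyo; topic=regexp"

# Matching: does the string contain a city field?
has_city = record.match?(/city=\w+/)          # => true

# Extraction: pull out the captured value of one field.
city = record[/city=(\w+)/, 1]                # => "Tokyo"

# Replacement: rewrite the first part of the string that matches.
masked = record.sub(/name=\w+/, "name=***")   # => "name=***; city=Tokyo; topic=regexp"

# Splitting: break the record into fields, tolerating spaces around ";".
fields = record.split(/\s*;\s*/)              # => ["name=Martin", "city=Tokyo", "topic=regexp"]
```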

00:07:45.760 The next example involves Unicode normalization in Ruby, which uses regular expressions efficiently. This example may seem unusual, but in practice, it works quite well. Many years ago, I provided an implementation of the Unicode bi-directional algorithm in Perl, but you could achieve the same in Ruby with a few lines of code.

00:10:50.720 To showcase a surreal example, consider finding the smallest prime factor of a number greater than one. The operation starts with a string whose length is the number itself, and the regex finds the shortest group of two or more characters that, repeated, fills the whole string. The length of that group is the prime factor.
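
One way to write this trick in Ruby is sketched below. It is a reconstruction of the general technique, not necessarily the code shown on the slide, and the method name `smallest_prime_factor` is an assumption.

```ruby
# Finds the smallest prime factor of n (n > 1) using only a regex match
# on a unary representation of n: a string of n "1" characters.
def smallest_prime_factor(n)
  raise ArgumentError, "n must be an integer greater than 1" unless n.is_a?(Integer) && n > 1

  unary = "1" * n
  # (11+?) lazily tries group lengths 2, 3, 4, ... and \1* requires the rest
  # of the string to consist of whole copies of that group. The first length
  # that works is the smallest divisor of n that is >= 2, which is prime.
  unary[/\A(11+?)\1*\z/, 1].length
end

smallest_prime_factor(15) # => 3
smallest_prime_factor(7)  # => 7
```

The lazy quantifier matters: with a greedy `11+` the match would succeed immediately with the whole string as the group and return n itself.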

00:12:05.680 Let's look at the practicalities of regex. The way a regex is processed can lead to slow performance on larger inputs, especially when the pattern has a complex structure with nested quantifiers. This kind of inefficiency is particularly dangerous and shows how severe regex misuse can be. There's a famous quote: 'Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.' I hope my talk will help you avoid that outcome.
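
To make the nested-quantifier problem concrete, here is a small sketch with two patterns that accept exactly the same strings. The pattern names and sample inputs are hypothetical. On a failing input such as a long run of "a" followed by "b", a backtracking matcher may try a number of ways to split the run that grows exponentially with its length for the nested form, while the flat form has only one way to match. Newer Ruby versions may optimize some such cases, so the inputs here are kept short.

```ruby
NESTED = /\A(a+)+\z/  # nested quantifiers: many ways to split a run of "a"s
FLAT   = /\Aa+\z/     # same set of accepted strings, no nesting

samples = ["", "a", "aaaa", "aab", "b", "a" * 12 + "b"]

# Both patterns agree on every sample, so the nesting adds risk but no power.
agree = samples.all? { |s| NESTED.match?(s) == FLAT.match?(s) }  # => true
```

Rewriting a pattern into a non-nested equivalent like this is one way to remove the ambiguity that drives excessive backtracking.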

00:18:39.120 In summary, regular expressions are powerful but can be perilous without adequate precautions. I recommend being mindful of the contexts in which you use regex, keeping its structure clear with the x (extended) option, and using tests for comprehensive coverage. Avoid accepting regular expressions as input from end users, to prevent exposure to Denial of Service attacks. Finally, I would like to open the floor for any questions regarding today’s presentation. Thank you for listening!