Regular Expressions: Amazing and Dangerous

Many Ruby programmers use regular expressions frequently. They are an amazingly powerful tool for many different kinds of text processing. However, if not used carefully, they can also be dangerous: They may not exactly match what their writer thinks they match, and they may execute very slowly on certain inputs. This talk will help you understand regular expressions better, so that you can make good use of their amazing power while avoiding their dangerous sides. It will also discuss recent changes to Ruby in the area of regular expressions.

RubyKaigi Takeout 2021: https://rubykaigi.org/2021-takeout/presentations/duerst.html

00:00:03.360 Good afternoon, everybody, or good morning, or good evening, good night, wherever in the world you are.
00:00:13.759 My name is Martin J. Dürst; in Japanese, just call me Martin. I will be talking about regular expressions, which are both amazing and dangerous.
00:00:20.320 Regular expressions can be extremely convenient, but they can also be very dangerous. My talk today will focus more on the dangerous aspects than on the amazing ones.
00:00:24.720 So, what's the motivation for this talk? It is a feature proposal to add support for regular expression timeouts. The essential questions are how we can detect or prevent dangerous regular expressions, and why they are dangerous in the first place. Well, they can be very slow on some inputs. It's not that regular expressions in general are slow, but some of them can get extremely slow on specific inputs.
00:01:00.480 Currently, there are two ideas for addressing this issue. One is implementing a timeout system specifically for regular expressions. The other is introducing a backtrack limit, which means counting how many times a regular expression has to backtrack and then setting a limit on that number. This is what I'm going to talk about during this presentation.
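The backtrack-limit idea can be illustrated with a toy model. The sketch below is not how Ruby's regular expression engine is implemented; `ToyBacktracker`, `LimitExceeded`, and `steps_for` are hypothetical names made up for this illustration. It naively matches the pattern `/\A(a+)+\z/` by hand, counts every attempt, and gives up once an optional limit is exceeded.

```ruby
# Toy model only: a hand-written backtracking matcher for /\A(a+)+\z/.
# All names here are hypothetical and exist only for illustration.
class ToyBacktracker
  class LimitExceeded < StandardError; end

  attr_reader :steps

  def initialize(limit: nil)
    @limit = limit   # maximum number of attempts, or nil for "no limit"
    @steps = 0       # attempts made so far
  end

  # Decides whether str, from position pos onward, consists of one or
  # more runs of "a" (i.e. matches /\A(a+)+\z/), trying every split.
  def match?(str, pos = 0)
    @steps += 1
    if @limit && @steps > @limit
      raise LimitExceeded, "gave up after #{@limit} steps"
    end
    return true if pos == str.length && pos > 0

    # Greedy inner a+: take the longest run first, then back off one
    # character at a time and retry the rest of the string.
    run = 0
    run += 1 while str[pos + run] == "a"
    run.downto(1).any? { |len| match?(str, pos + len) }
  end
end

# Number of attempts needed to reject n "a"s followed by a "b".
def steps_for(n)
  matcher = ToyBacktracker.new
  matcher.match?("a" * n + "b")
  matcher.steps
end

steps_for(10)  # => 1024
steps_for(11)  # => 2048, one more character doubles the work

begin
  ToyBacktracker.new(limit: 1000).match?("a" * 20 + "b")
rescue ToyBacktracker::LimitExceeded => e
  puts e.message
end
```

In this model a limit bounds the work regardless of machine speed, whereas a timeout bounds elapsed time; which of the two is more practical is exactly the question the proposal raises.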
00:01:45.600 The background involves what is known as Regular Expression Denial of Service, which is an active research topic. It was particularly highlighted in James Davis's Ph.D. thesis, which consists of 456 pages; I've read all of them so you don’t have to! I hope to share some of the insights and findings described within it, along with some work I have done myself based on this thesis.
00:02:51.680 Now, speaking of conventions, if you're a Ruby programmer, you have likely used regular expressions. But have you encountered any problems, and if so, what kind? Who here has experienced, or at least knows, that regular expressions can be very slow? For instance, in an introductory book about regular expressions, I found no mention of the fact that a mistake in a regular expression, or even a lack of careful thought, can lead to very slow performance.
00:04:44.640 Of course, Jeffrey Friedl's famous book about regular expressions does discuss speed as well, but leaving it out is a common oversight. A bit about myself: I'm from Switzerland, but sadly, I'm not in sunny Switzerland right now, although my background might suggest that I am. In actuality, I am in rainy Tokyo.
00:05:19.320 My main contributions to Ruby come from various areas, with one of my upcoming tasks being to update to Unicode 14, expected in about a month or so. Now, why are regular expressions amazing? Regular expressions are a domain-specific language (DSL) well-suited for stream processing. While we often talk about domain-specific languages in Ruby itself — with classes, methods, and so on — regular expressions are very specialized languages with distinct syntax.
00:06:22.400 Ruby is one of the languages where regular expressions are frequently used. They are powerful due to their compact and intuitive syntax. If you write a regex matching 'perl' or 'python' or 'pascal', it will literally match those terms. This syntax is concise and intuitive, though it can become complex for more intricate patterns.
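As a minimal sketch of the alternation just described (the constant and method names are made up for illustration):

```ruby
# Alternation: the pattern matches any one of the three literal words.
LANGUAGE = /perl|python|pascal/

# True if the text mentions one of the three languages.
def mentions_language?(text)
  LANGUAGE.match?(text)
end

mentions_language?("I learned pascal at school")  # => true
mentions_language?("I mostly write ruby")         # => false
```

Note that the pattern is unanchored, so it also matches inside longer words such as "perlite"; word boundaries (`\b`) would be needed to exclude those.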
00:07:02.080 Next, let’s look at some examples of regular expressions in action. We have the usual tasks: simple string matching, extracting pieces from longer strings, and replacing parts of strings with different segments. Regular expressions excel at splitting strings into various components, which is one of the fundamental applications of regex in Ruby.
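The slides with the speaker's examples are not part of this transcript, so the following is only a sketch of the four usual tasks using standard `String` and `Regexp` methods. The sample strings and helper names are invented for illustration.

```ruby
# A date such as "2021-01-02", with named groups for its parts.
DATE = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/

# 1. Simple matching: does the line contain a date at all?
def dated?(line)
  DATE.match?(line)
end

# 2. Extraction: pull the year out of a longer string (nil if absent).
def year_of(line)
  m = DATE.match(line)
  m && m[:year]
end

# 3. Replacement: rewrite the first date as month/day/year.
#    DATE uses named groups, so the template refers to them by name.
def american_style(line)
  line.sub(DATE, '\k<month>/\k<day>/\k<year>')
end

# 4. Splitting: break a string apart on commas, semicolons, or spaces.
def words(text)
  text.split(/[,;\s]+/)
end

dated?("2021-01-02 sample entry")          # => true
year_of("2021-01-02 sample entry")         # => "2021"
american_style("2021-01-02 sample entry")  # => "01/02/2021 sample entry"
words("ruby, perl;python  pascal")         # => ["ruby", "perl", "python", "pascal"]
```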
00:07:45.760 The next example involves Unicode normalization in Ruby, which uses regular expressions efficiently. This example may seem unusual, but in practice, it works quite well. Many years ago, I provided an implementation of the Unicode bi-directional algorithm in Perl, but you could achieve the same in Ruby with a few lines of code.
00:10:50.720 To showcase a surreal example, consider finding the smallest prime factor of a number greater than one. The operation starts with a string whose length equals the number, and the regex then matches that string as repeated copies of one group; the length of the captured group is the prime factor.
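The exact pattern shown in the talk is not preserved in this transcript. The sketch below uses a well-known formulation of the same trick, so treat the pattern and the method name as assumptions: the number is written in unary as a string of `1`s, and a lazy group of at least two characters must tile the whole string.

```ruby
# Smallest prime factor of n (n > 1), found by a regular expression.
def smallest_prime_factor(n)
  unless n.is_a?(Integer) && n > 1
    raise ArgumentError, "n must be an integer greater than 1"
  end

  unary = "1" * n  # a string whose length is the number itself

  # (11+?) lazily tries group lengths 2, 3, 4, ... in that order;
  # \1* then requires the rest of the string to be whole copies of
  # the group. The first length that works divides n, and because it
  # is the smallest divisor above 1, it is prime.
  m = unary.match(/\A(11+?)\1*\z/)
  m[1].length
end

smallest_prime_factor(15)  # => 3
smallest_prime_factor(35)  # => 5
smallest_prime_factor(97)  # => 97 (97 is prime)
```

This is a curiosity rather than a practical algorithm: the work grows quickly with `n`, which makes it a small-scale preview of the performance problems discussed next.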
00:12:05.680 Let's look at the practicalities of regex. The way a regex is processed can lead to very slow performance on larger inputs, especially when the pattern contains nested quantifiers. This kind of inefficiency is what makes careless regex use dangerous. There's a famous quote: 'Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.' I hope my talk will help you avoid that outcome.
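One way to see why nested quantifiers are risky is that they often describe the same set of strings as a much simpler pattern, only with many more ways to arrive at a match. The sketch below is an illustration, not taken from the talk; the constant and method names are made up. It compares a nested pattern with a flat equivalent on a few short inputs.

```ruby
NESTED = /\A(a+)+\z/  # nested quantifiers: many ways to split a run of "a"s
FLAT   = /\Aa+\z/     # same accepted strings, but only one way to match

# True if both patterns agree on whether str matches.
def same_verdict?(str)
  NESTED.match?(str) == FLAT.match?(str)
end

# Inputs are deliberately kept short so the nested pattern stays fast
# even on an engine that backtracks naively.
SAMPLES = ["", "a", "aaa", "aab", "baa", "a" * 12 + "b"].freeze

SAMPLES.all? { |s| same_verdict?(s) }  # => true
```

When a pattern can be flattened like this, the flat form is usually the safer choice, because a failed match leaves the engine nothing to retry.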
00:18:39.120 In summary, regular expressions are powerful but can be perilous without adequate precautions. I recommend being mindful of the contexts in which you use regular expressions, keeping their structure clear with the x option (extended mode), and writing tests that cover them thoroughly. Avoid accepting regular expressions as input from end users, because that exposes you to Denial of Service attacks. Finally, I would like to open the floor for any questions about today's presentation. Thank you for listening!
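Two of these recommendations can be sketched briefly. The version-number pattern and the helper name are invented for illustration, and the second half assumes that treating end-user input as literal text is acceptable for the application.

```ruby
# Extended mode (the x option): whitespace and comments inside the
# pattern are ignored, so its structure can be laid out and explained.
VERSION = /
  \A
  (?<major> \d+ ) \.   # major version
  (?<minor> \d+ ) \.   # minor version
  (?<patch> \d+ )      # patch level
  \z
/x

VERSION.match?("3.0.2")         # => true
VERSION.match("3.0.2")[:minor]  # => "0"
VERSION.match?("3.0")           # => false

# End-user input is escaped first, so it is searched for as literal
# text and never interpreted as a pattern.
def literal_search(haystack, user_input)
  haystack.match?(/#{Regexp.escape(user_input)}/)
end

literal_search("a+b", "a+")   # => true, the literal text "a+" occurs
literal_search("aaab", "a+")  # => false, "a+" is not read as a quantifier
```

Checks like the ones in the comments above also make natural test cases, including inputs that must not match.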