Regex

Summarized using AI

Just-in-Time Compiling Ruby Regexps on TruffleRuby

Benoit Daloze and Josef Haider • November 08, 2022 • Denver, CO

The video titled Just-in-Time Compiling Ruby Regexps on TruffleRuby, presented by Benoit Daloze and Josef Haider at RubyConf 2021, explores the advancements of TruffleRuby, focusing on its innovative approach to executing Ruby regular expressions (regex) at significantly enhanced speeds. TruffleRuby, which is a high-performance implementation of Ruby using the GraalVM Just-In-Time (JIT) compiler, aims to deliver fast performance while ensuring compatibility with Ruby 2.7 and supporting C extensions.

Key points discussed in the video include:

  • Regex Engine Background: Initially, Ruby used the Onigmo regex engine, based on backtracking, which supports a wide variety of encodings. TruffleRuby started with Joni but faced limitations in performance due to the backtracking nature.
  • Introduction of T-ReX: To address these performance challenges, TruffleRuby adopted T-ReX, a new regex engine utilizing deterministic finite automata (DFA), thus allowing efficient just-in-time compilation of regex patterns into machine code.
  • Mechanism of Finite State Machines: The speakers explain the workings of finite state machines and how they can be effectively mapped to the core features of regular expressions like concatenation, disjunction, infinite quantifiers, and capture groups.
  • Performance Improvements: Benchmark comparisons show that T-ReX outperforms Joni and CRuby, with speed increases ranging from 25 to 40 times faster in micro-benchmarks. The enhanced mechanisms include optimized inlining of Ruby methods with regex logic specific to the pattern used, facilitating considerable performance boosts.
  • Safety and Security: An important advantage of T-ReX is its resilience against regular expression denial of service (ReDoS) attacks, as it matches patterns in linear time, eliminating the risk associated with backtracking engines.
  • Future Enhancements: The discussion also touches on future plans to enhance T-ReX, including support for previously unsupported features such as arbitrary counted quantifiers, but recognizes the complexities involved in certain features like back references and atomic groups.

In conclusion, the talk emphasizes the effectiveness of finite state machines for regex matching over traditional backtracking methods, highlighting the seamless integration allowed by Truffle Ruby and T-ReX. This advancement allows developers to enjoy not only significant performance gains—from microbenchmarks showing 24 to 41 times faster processing but also a robust solution to common regex-related security vulnerabilities. The presenters encourage questions and further discussions on improvements and experiences with regex in Ruby development.

Just-in-Time Compiling Ruby Regexps on TruffleRuby
Benoit Daloze and Josef Haider • November 08, 2022 • Denver, CO

TruffleRuby together with Truffle Regex can now execute Ruby Regexps up to 40 times faster than CRuby! This is possible by just-in-time compiling Ruby Regexps to machine code by using Truffle Regex, a Truffle language for regular expressions. Truffle Regex uses finite-state machines, a much faster alternative to backtracking regexp engines. Because of the unique capability of GraalVM to inline across languages, the Ruby code and the Regexp are optimized and compiled together for ultimate performance.

RubyConf 2021

00:00:11.040 Hello and welcome to our talk on Just-in-Time Compiling Ruby Regexps on TruffleRuby. I'm Benoit Daloze, and I'm leading the Truffle project. Here is Josef Haider, who is the T-Rex creator and maintainer. We are both part of the GraalVM team at Oracle Labs.
00:00:22.880 If you do not know TruffleRuby yet, it's a high-performance Ruby implementation that utilizes the GraalVM JIT compiler, which gives it great performance. It targets full compatibility with Ruby 2.7 for the current release and Ruby 3.0 for the next one. It also includes support for C extensions, so the idea is that it would be a drop-in replacement for CRuby but faster. You can find more information on GitHub, Twitter, and our website.
00:00:54.480 Let's start with some background on regex engines. Initially, Ruby used an array extension called Onigmo, which is a fork of Oniguruma. These two regex engines are backtracking regex engines, and what is special about them is they support many encodings—about 30—while most regex engines only support one or two encodings. This means that almost all of the Ruby encodings are directly supported, so there is no need to transcode the regular expression or the string when matching.
00:01:38.560 TruffleRuby initially used Joni, which is a port of Oniguruma to Java by JRuby developers. It essentially has the same performance, meaning it uses basically the same code, just translated from C to Java. However, TruffleRuby wants to optimize all aspects of Ruby, including small languages such as array pack and string formatting, which TruffleRuby already compiles and handles very well.
00:01:46.479 However, for regular expressions, we have not yet achieved that optimization because we were still using Joni. You cannot just-in-time compile Joni in a meaningful way. It operates with very generic logic, utilizing backtracking—a solution that does not naturally align well with JIT compilation.
00:02:10.880 Thus, what we needed was a new regular expression engine for TruffleRuby that can efficiently compile regex patterns and optimize them effectively. This need led us to adopt T-Rex, an engine designed specifically for this task. With that, I will hand it over to Josef to present T-Rex.
00:02:26.239 Thank you, Benoit. T-Rex is a regular expression engine implemented as a Truffle language and is based on deterministic finite automata, also known as finite state machines. For anyone who is not familiar with state machines, they are essentially a mathematical construct designed to match a set of strings.
00:02:46.720 Such a finite state machine consists of a set of states connected by transitions. Each transition has a set of accepted symbols, and to match a string, one must traverse the finite state machine from the initial state to a final state. In the exemplary state machine on the slide, we have two states: State One, which is the initial state indicated by the start arrow, and State Two, indicated by the double circle as the final state. They are connected by a single transition accepting a single symbol.
00:03:24.000 Regular expressions have evolved significantly and now include features that were previously not present in finite state machines. Therefore, modern regex dialects cannot easily map the complete feature set to finite state machines anymore. Fortunately, the core features still exist and can be efficiently mapped.
00:03:57.760 I will quickly go through the essential features of regular expressions and how they can be mapped to finite state machines. The simplest and most basic are concatenations of successive terms, which translate to chaining states corresponding to separate terms together using transitions. Disjunctions of multiple terms can be mapped to a finite state machine by introducing additional transitions from the preceding state to each beginning state of every branch of the disjunction.
00:04:46.240 In the example on this slide, you can see a direct mapping of the regular expression 'ab' or 'ac' to a state machine. This state machine is non-deterministic because the first state has two transitions accepting the same symbol 'a'. This can't be reasonably executed by the regex matcher, so we must transform the non-deterministic state machine into a deterministic one. T-Rex performs this transformation through a power set construction applicable to any non-deterministic finite automaton, but it can sometimes introduce an exponential number of additional states.
00:05:57.760 Thus, if the number of states exceeds the compiler’s capability to manage, T-Rex will bail out from this power set construction and revert to an indirect matching mode, which creates deterministic states on the fly. While this slower method still maintains linear runtime complexity, it's advisable for performance reasons to leverage the direct method whenever possible. Infinite quantifiers can create infinite loops in regular expressions, and these can simply be represented by adding transitions that form loops in the state machine.
00:07:01.600 Interestingly, for all finite state machine-based regex engines, it is often better to use infinite loops than bounded loops because they can be mapped to a state machine much more easily. The first feature that requires extending the finite state machine model is capture group tracking. T-Rex tracks capture groups through an annotation process that marks all transitions with the capture group boundaries traversed during the matching process.
00:08:03.120 This method incurs overhead proportional to the number of states in the state machine, but it is still less than the exponential overhead found in backtracking engines. Since we cannot delve into all the implementation details of T-Rex, let's quickly discuss a list of supported and unsupported features to provide an idea of T-Rex's capabilities. We have already covered concatenation, disjunction, infinite quantifiers, and capture groups.
00:09:35.520 Character classes are directly supported since they simply map to the accepted symbol sets of individual transitions. Counted quantifiers are partially supported—these represent a counted loop, and T-Rex handles them by unrolling the loop, which results in a number of states proportional to the maximum count specified in the quantifier.
00:09:42.640 We plan to support arbitrary counted quantifiers with actual counting variables in the background in the future, though this feature is currently not implemented. Anchors, including backslash 'a' and 'c', are supported without overhead as they simply add initial and final states. Word boundaries are replaced with equivalent expressions utilizing look-around assertions.
00:09:50.560 Similarly, caret and dollar look-around assertions are fully supported—as are look-ahead assertions, which get merged into the parent expression's automata. Look-behind assertions are partially supported too; T-Rex currently supports look-behind expressions that consist of a single sequence of either literal characters or character classes.
00:11:02.720 Currently, back-references remain unsupported due to their complexity in matching with an automaton. Though we may develop a solution for this in the future, we are not overly optimistic. However, it’s important to note that this does not hinder TruffleRuby's implementation. For example, negative look-arounds are currently also unsupported, and we may add support for that in the future as the implementation is more complex than positive look-arounds.
00:11:13.040 Recursive sub-expression calls will probably never be supported in T-Rex, simply because it is nearly impossible to perfectly mimic backtracking behavior. Other unsupported features include possessive quantifiers, and although we may be able to replicate their semantics in the future, it is likely not worthwhile since possessive quantifiers are usually only used to address catastrophic backtracking behavior in backtracking engines.
00:11:32.720 Conditionals and absent expressions have also not yet been addressed. Now, moving on to the crucial part—how do we just-in-time compile actual regular expressions represented as deterministic finite automata? Typically, when you want to match a regular expression using a deterministic automaton, you would have a loop that always loads the current state and the current character.
00:12:27.840 Next, you would go through the current state’s transitions and check if any transition matches the current character. Upon finding a matching transition, you would then load that transition's target state and continue the loop. To avoid this indirection, we instruct the Graal compiler to compile a separate copy of the loop body for each state in the state machine.
00:12:46.560 On the next slide, you can see an example where we compile a state machine for the regular expression 'a+' followed by either 'b' or 'c'. This state machine is quite simple, consisting only of three states. Following each state label, you can observe the copied loop body that will be included in the compiled code for this particular regular expression.
00:13:29.680 For State 0, the compiler will emit a boundary check for the input string, then a character load or byte load depending on the encoding used. Subsequently, it emits an if-else cascade for every transition, meaning that if the first condition is met, we simply jump to the target state. This process repeats for State 1.
00:13:47.840 Since State 2 has no outgoing transitions, the if-else cascade is removed, leading us only to the return statement. This pseudocode can effectively be compiled by Graal to nearly one-to-one real assembly code, allowing it to be inlined wherever this regular expression matcher is called.
00:14:01.200 Now let's take a look at performance. We utilize the benchmark IPS gem to measure peak performance after warm-up, meaning we measure performance once the relevant methods have been JIT compiled. We compare three implementations: TruffleRuby with T-Rex, TruffleRuby with Joni, and CRuby 2.7 as a baseline.
00:14:48.320 We begin with some microbenchmarks, which are very simple. Here we match the string 'abc' against several regular expressions, which you can see below the graph. We have five regex patterns, two of which do not match the string, while the three in the middle do. Notably, CRuby 2.7 is the baseline, with Joni showing similar performance.
00:15:15.760 However, T-Rex achieves a performance enhancement of 25 times to 40 times faster, representing a significant speedup due to three main reasons: firstly, T-Rex uses a state machine regex engine rather than backtracking; secondly, T-Rex just-in-time compiles the regex; and thirdly, it also compiles the Ruby code, optimizing them together.
00:16:03.680 This compilation process starts with the matching goal, then integrates with the Ruby methods involved—all of which are inlined and optimized by the Graal compiler into machine code. In the case of a backtracking engine, the logic is more generic, but here we are comparing it specifically to the given regex, yielding faster results.
00:17:00.800 In our larger regex benchmarks, we begin with the Liquid parser, a component of the Liquid template language that heavily utilizes regex. Performance is critical when parsing Liquid templates due to the volume of templates available, as caching isn't always feasible. Notably, the Liquid template is parsed to an AST instead of directly transforming into Ruby code for safety.
00:18:49.760 The second benchmark tests a Browser Sniffer gem used to detect browser types, operating systems, and versions based on user agent strings. For the third benchmark, we use Regex Redux, a classic benchmark analyzing 50 megabytes of DNA and RNA sequences through regular expressions without IO operations.
00:19:01.120 Finally, we have the Syslog benchmark, which tests a log line in accordance with the BSD Syslog protocol and RFC standards. The results indicate great success: we see that T-Rex is 2.3 times faster on the Browser Sniffer, 9 times faster on Syslog, and even better performance on Liquid parsing when integrating both Ruby logic and regex.
00:20:18.480 Transitioning to a different subject, let’s discuss Redos—regular expression denial of service attacks—which have surfaced in Rails security reports this year. It's not an uncommon issue, with half of the recent issues found being Redos related, illustrating its frequent occurrence.
00:20:36.600 The advantage of T-Rex is that it is not susceptible to Redos. The reason is that T-Rex always matches in linear time, eliminating the potential for catastrophic backtracking that leads to severe slowdowns.
00:20:59.040 The problem arises when there’s a malicious input string and the regex pattern is poorly designed, leading to excessive backtracking. In backtracking engines, this can deteriorate to an exponential run time, but that concern does not exist in T-Rex.
00:21:53.200 On the flip side, not all features of Ruby regex can be supported by T-Rex. For regex patterns incompatible with it, TruffleRuby falls back to using Joni or traditional backtracking. This fallback mechanism also issues warnings to developers so that they are aware of potential performance issues.
00:22:52.160 Additionally, I want to discuss atomic groups, commonly added to address Redos. Their primary use is to reduce excessive backtracking, but they can present challenges. Expressing these in finite state machines is not straightforward.
00:23:10.640 Atomic groups do not cause excessive backtracking in T-Rex, which handles matching linearly; therefore, we can generally ignore them. However, if they are used for semantics, this might be more problematic.
00:23:35.680 In conclusion, we have demonstrated that using finite state machines for regex matching is substantially faster and safer than backtracking. TruffleRuby and T-Rex can effectively just-in-time compile Ruby regex patterns to machine code, optimizing both Ruby logic and the specific state machine for those regex patterns.
00:23:59.680 We have witnessed performance boosts ranging from 24 to 41 times on microbenchmarks and 2.3 to 9 times on larger regex benchmarks. Finally, we confirmed that TruffleRuby mitigates the risk of catastrophic backtracking, enhancing developer safety.
00:24:04.400 I would like to thank Jerko, who supported the integration of many Ruby regex features into T-Rex; Duncan, for his optimizations related to regex matching; and Kevin Maynard, for enhancing the regex logic in TruffleRuby.
00:24:20.560 Thank you for listening to our talk. If you have any questions, feel free to ask us on GitHub discussions, Slack, Twitter, or wherever you prefer. We are happy to answer them.
Explore all talks recorded at RubyConf 2021
+91