RubyKaigi Takeout 2021

Do regex dream of Turing Completeness?

We're used to using Regular Expressions every day for pattern matching and text replacement, but... What can Regexes actually do? How far can we push them? Can we implement actual logic with them?

What if I told you... You can actually implement Conway's Game of Life with just a Regex? What if I told you... You can actually implement ANYTHING with just a Regex?

Join me on a wild ride exploring amazing Game of Life patterns, unusual Regex techniques, Turing Completeness, programatically generating complex Regexes with Ruby, and what all this means for our understanding of what a Regex can do.

RubyKaigi 2021: https://rubykaigi.org/2021-takeout/presentations/dmagliola.html

RubyKaigi Takeout 2021

00:00:06.399 This is Conway's Game of Life. Most of you are probably familiar with it. It's an unusual zero-player game where patterns change over time, a bit like life, evolving with cells living and dying, all following very simple rules.
00:00:14.000 Some of you have probably written the code that implements this for one reason or another. Generally, it looks a bit like this: three nested loops going over the board, looking at one cell at a time, counting the neighbors around them, setting the new state, and repeating this process indefinitely.
00:00:23.519 Now, a few months ago, a weird idea popped into my head, and I just couldn't shake it off. Could you implement it like this instead? Can you implement Conway's Game of Life with regex? And if you can, what does that mean for our understanding of what regexes can do?
00:00:39.680 First of all, though, hi! My name is Daniel, and I work at Indeed Flex. We are a flexible staffing company based in London. As you probably noticed, I’m not originally from London; I come from Argentina. So, in case you were wondering, that’s the accent.
00:00:59.280 Now that we got that out of the way, let's go back to this weird idea that I had: can you implement the Game of Life with just regex? Now, at first, this sounds ridiculous, right? Regex is a pattern matching tool. How would you ever be able to drive behavior with them? You know how we’re going to look through all of these cells and count their neighbors. How could we ever do that?
00:01:12.720 Some of you may also be asking, 'Daniel, but why?' If you're asking that, you might not enjoy the rest of this talk; you may be a bit too sane for this. But for the rest of you wondering whether we can actually do this, the answer is yes, kind of. It's complicated, but I think the solution is quite interesting, and the implications of it are quite fascinating, which is why I’d like to share this with you.
00:01:45.200 In order for the approach to make sense, I can’t just jump straight to the answer. I'm going to have to walk you through more or less the same steps that I took, and that includes some dead ends and some intermediate steps. One thing I would recommend is that you try to stay ahead of me and think about how you would solve this at each stage. It’s going to make the talk way more fun!
00:02:05.440 Oh, and all of the code that I’m going to show will fly by a little bit fast, but there will be a repository that I’ll share at the end, and you can look at all of it in detail later if you want. So first of all, let’s look at the Game of Life and what the rules are. We have a two-dimensional board of cells; any cell can be alive or dead. Let’s think of this as ones and zeros. This board evolves over generations.
00:02:22.560 In each generation, each cell will be alive or dead based on how many neighbors it has. If there are too few neighbors, it dies; if there are too many, it also dies. If a cell is dead and has exactly three neighbors, it spawns into new life. And that’s what you see happening up there on that board.
00:02:44.480 Now, the rules reduce to this simple expression: you’re going to have to trust me here. In the next generation, you are alive if and only if you have three neighbors or if you have two neighbors and are already alive. Put another way, these are the transitions: a cell dies of loneliness if it has only one neighbor and needs at least two; a cell stays dead; a cell stays alive if it has two neighbors, which is enough; and a cell spawns into new life because it has three neighbors.
00:03:12.800 Throughout this talk, I’m going to represent cells like this: a '1' means alive, and a '0' means dead. A cell highlighting in orange will be the cell that we’re currently evaluating to see if it should be alive or dead in the next generation. So how do we approach this implementation? Well, we have a board with a bunch of cells in a certain state, and we need to change some of those to make the next generation.
00:03:35.200 Think of this as two separate products. First, we need to find the cells that need to change—that is, cells that are alive that need to die and cells that are dead that need to spawn into new life. Then we need to change them into that new state or, you could say, replace them for a cell in any state. Right? So, we find the cells that need to change and then replace them. Find and replace.
00:03:49.760 Now, this sounds a bit more like regex, doesn’t it? So hopefully, this idea doesn’t seem as ridiculous anymore. I'm going to add a seed here: we need to first represent this data in a way that will actually let us do this with regex. Let’s look at these three steps one by one, and you can follow along with that progress bar at the top to see which of the three steps I'm talking about.
00:04:07.920 Now, first of all, how do we represent this data so that we can operate on it? Normally, you’d represent this as a 2D array of zeros and ones, matching the rows and the columns on the board. When I’m looking at a cell, I can use plus and minus one offsets in both directions to find the eight neighbors. But regexes don’t work on arrays; they work on strings.
00:04:25.600 So, we need to format this data as a linear string. Imagine if you will, a long string of zeros and ones, where the string sort of wraps every N characters. That would give us the same board. To find the neighbors for a cell, we need to look at what's before and after the character that we’re evaluating.
00:04:43.200 Now, keep an eye on the colors. I’ll always show the neighbors before and after in green and blue. That’s how we started today. But how do we find the cells that need to change? Well, in regex, we can use zero-width look-behinds and look-ahead assertions to look before and after the character that interests us. What these do is say, 'Match this thing, but only if before or after the thing, there’s this other stuff.' But the important part is the zero width in the name; that other stuff doesn’t become part of the match—it’s an assertion, but it’s not in the result.
00:05:09.440 So for any given cell, I can see it with look-behind the previous four characters, and I can see with look-ahead the next four characters. And those are my eight neighbors. So, if I want to match these exact cells with these exact sets of neighbors, I can do that, right? I match the cell that I want, this orange one, but only if what’s before and after meets some criteria.
00:05:47.680 If I can do that, then I can replace these orange cells with their new values. I have a Game of Life! Now that’s worth saying again: the reason this works is that these regexes will only match one character—the one I have to change. They match based on its surroundings, but they don’t match the surroundings. So, a regex replace will only replace the one cell that I’m looking at.
00:06:05.680 Because it's a regex applying over a whole string, it applies over the entire string at the same time. So I just need to run this replacement once, and that’ll do a whole generation for the entire board. Now, it should be done, right? I mean, this should do it. But there's one small issue: how do you write a regex that matches a character based on the count of characters around it independently of the order?
00:06:25.120 I couldn’t figure that out. But at this point, I thought I had a fairly reasonable chance of getting this to work, but I had no idea how to do it. So my next thought was, surely somebody has thought of this and done it before, right? Apparently, though, I'm the first person dumb enough to actually try this or at least the only one that lived to tell the tale.
00:06:51.280 So, I can't really google my way out of this. I'm going to have to figure it out myself. Again, the immediate problem I have is how do I find the cells that need to change, and that’s based on the number of neighbors? Now how do I find characters that among the eight characters before and after them have either two or three ones?
00:07:10.160 In regex, we can use curly braces to specify 'I need X many of this.' I can say, for example, 'Find me two or three ones together immediately before this thing.' But what we need is to determine between these four characters before me and these four after me, I need between two and three ones for this character to match, and I don’t care what order they’re in.
00:07:30.560 As far as I can tell, you can’t do this with the regex constructs available. So this was not looking very promising at this point, and I got quite stuck for a while. But then I realized this problem falls into a specific category of problem that’s one of my favorites, and that is problems that are hard to solve. But instead of solving them, I can just cheat.
00:08:05.440 You see, when I’m looking at a cell, I’m only looking at nine binary digits, and nine binary digits are not a lot of combinations. I can just brute force this; I can write out all of the 512 combinations for these and spell out what would happen in each case. This is not exactly pretty, but I said Game of Life with regex, not with elegant regex.
00:08:27.680 Importantly, we only care about cells that change, so I can ignore all of the combinations where the cell stays the same. So, of the 512 possible combinations that we have, it takes us down to only 228 that change. Obviously, writing 200 cases of this by hand is not fun, so instead, I quickly wrote some Ruby code to do it for me.
00:09:00.000 The easiest way to brute force binary numbers in Ruby is to just count, and then you can convert them into base 2. Here, I'm counting how many ones there are in the string to see how many neighbors are alive, and then I decide whether the cell should change or not.
00:09:03.360 Now, this bit looks suspiciously like Game of Life code, but this code is never actually used to run the game. I only run it once to generate the regex, and then the regex runs the game, so that's okay. This code here will then print out the 228 combinations of cell states that represent either a cell state change or a death.
00:09:20.720 So for example, I have this combination which represents a death. I want to find the middle one to turn it into a zero, so I need to search for that pattern. But I need to do the look-ahead and look-behind thing so I only match the one. Now that I have that, all I would need to do is repeat it 228 times.
00:09:44.080 But there’s one more problem: my board doesn’t look like this string because I don’t just have those nine cells that I’m evaluating; I actually have a board that is much bigger than those nine cells.
00:10:04.960 So the nine characters that I want are not all together in the string—they have garbage in the middle. This stuff in red and pink that I don’t care about, so I need to ignore it. Now you'll notice that this board has a width of 10, and I always have seven of these 'cartridge' characters. I can just ignore seven characters.
00:10:23.000 Now that seven is hard-coded; I’m going to have to change it with the board size, but that’s the easy part, and we’ll do that later. Now I can take this last rule, duplicate it about 200 times, concatenate them all with one big 'or,' and replace that seven with the appropriate number for the board size. That should match all of these cells I need to change.
00:10:39.840 If you put that last expression in the generator code that I showed earlier, you end up generating this ridiculous file. Here, you can see all of the combinations of zeros and ones that we’re looking for for our three rows and our three columns. We have this code to read it. We read the lines, we put a pipe in the middle to do the 'or,' and then we replace the board size with the correct number for our board.
00:11:04.960 Boom! Now we have a regex that’ll match every single cell that we need to change. This solves the find part, but the only problem is I don’t know what to replace them with. This regex is going to find all the things that need to change, but that includes both spawns and deaths: zeros and ones. I need to know which one is which.
00:11:26.560 The good thing is I know that the cell needs to change, and there are only two possible values. If it’s currently a one, it needs to change to a zero, and vice versa. But the question is, what do I do here? What do I pass to the regex substitution to flip those cells?
00:11:49.600 Here I need to take an intermediate step for clarity. It turns out that the regex substitution accepts a string to replace with, or you can give it a block. In this case, the regex is doing all the heavy lifting finding all of the cells that I have to change. It's doing the change; it’s doing all the interesting work, and all I’m doing in that Ruby block is flipping the cells.
00:12:08.560 To my absolute surprise, that worked! This pattern you see here is called the glider gun, and it’s one of the things that makes the Game of Life magical. This is what it should look like, and that is my garbage implementation—it actually works! Oh, and that speed you're seeing, by the way, that’s real time; that’s how fast it runs. It's not the fastest thing I've ever seen, but considering the horror of what I'm doing, I am actually pretty impressed with Ruby.
00:12:29.440 And so, this is one way to do this thing. However, there’s one more problem: this is not a regex! This is a partial solution. I promised you that we would achieve the Game of Life with regex, not mostly regex, but there's some sneaky Ruby in the middle. I need to do this part also with regex.
00:12:43.360 Alright! So far we’ve been doing fairly standard regex work, you know, the kind of thing that you do every day. Hold on to your socks because the next bit is where it gets wild! Remember that the problem I have is that given a matched cell, I don’t know whether to replace it with a one or a zero. We don’t have an obvious way of doing that with regex so far.
00:13:05.760 Now, you might think that you can do this in two stages: have one regex with all the combinations that have a one in the middle (all the deaths), and another one with all the conditions that have a zero in the middle (all the spawns). You can then match all of the like ones that need to change, replace them with zeros, match all the others, replace them with ones, and you’re done.
00:13:29.280 But you can’t do that! You need to do all this in one single pass atomically. And that’s partly why the regex approach is magic because it gives me that for free! Let me explain why the standard implementation of the Game of Life starts by making a new empty board in each generation.
00:13:51.440 Then, it fills that with the data from the previous generation. That’s because if you modify the board that you have as you go, you will modify the neighbors of cells that you haven’t evaluated yet, and that's going to modify the outcome.
00:14:13.520 As an example, we have these three cells highlighted in purple that are going to die of loneliness, but the yellow one has three neighbors, so it's going to spawn new life. This board over there on the right is what would happen in the next generation. But if you're doing two stages—first die, then spawn—you first kill the three lonely cells, and then you have no neighbors left, and nothing gets spawned. This doesn’t work, and obviously the same thing happens if you first spawn and then do the deaths. You can't do this.
00:14:46.480 So, again, so far we’ve matched all of this. I need to change, but I need to pass to the regex substitution a parameter saying what to change them to. That value needs to resolve to a zero for all the ones and to one for all the zeros. To explain how we do this, we need to take a little detour and talk about named captures.
00:15:01.760 Most of you probably know that when you're matching in a regex, you can capture a chunk of the match and then use it in the replacement as a back reference. Let’s say, for example, I have this American phone number, and I want to format it nicely. I can match the three different parts, capture them with parentheses, and then replace them by making references to those three captures by number.
00:15:27.920 If your regex is super simple, this is fine! But it’s very common to have tons of parentheses in your regex, which makes it really hard to keep track of which one is which. When you have that problem, you can name your captures, and then you can call them by name when you’re replacing, using this backslash K syntax, which makes things much easier to follow. You know what's going on!
00:15:46.960 This is what we call named captures. Most of you probably knew about these, but here's the bonkers part that I didn’t think was going to work. You can do captures inside your look-ahead and look-behind structures, so you’re not matching that part; you’re just asserting that it’s there. But if it is there, it gets captured as a named reference that you can use later.
00:16:12.320 The real bonkers bit is that you can declare the same name for the capture multiple times in your regex. If you have an 'or,' whichever side of the 'or' matches kind of wins, and that's the value of the capture! Going back to something closer to our regex, if you remember, we were doing something like this: match a one but only if it’s preceded by three zeros. I’m not actually matching the zeros, but I can still capture one of them and name it. For example, let’s name it 'replace'.
00:16:43.920 Now, if that expression matches, I match the one, but I have a capture named 'replace' with a zero in it. I can also have two expressions with the same capture name separated by an 'or.' Only one of these two will match, and whichever one matches—that’s the value of the 'replace' capture.
00:17:10.560 This is nuts! I was 95% sure that wasn’t going to work. That regex will match one of two things: the one at the end of '0001' or the zero at the end of '1010.' It matches just that final one or zero and only if they have this prefix. If the one matches, the 'replace' capture holds a zero, and if the zero matches, the capture holds a one.
00:17:35.200 Do you see what just happened? If I’m at a one, my capture has a zero; if I match a zero, my capture has a one. All I need to do is replace the match with the capture, and we flip the number with no code! So, what we need to do is modify these 228 cases and add these captures. For any cell that is alive and needs to die, I pick a neighbor that is dead and capture that character.
00:18:01.760 For cells that are dead that need to spawn, I capture an alive neighbor. Then we change the generator code to pick an appropriate neighbor and surround it with the 'name capture' that we chose. The runner code now looks like this: I can tell each matched cell to replace itself with the capture that I chose, which happens to have the opposite digit. Now the dream is complete! I have a pure regex replacement that runs Conway’s Game of Life, which is exactly what I set out to do from the beginning.
00:18:27.680 Here’s our glider gun again—it still works! Now, there’s a little catch. I promise it doesn't matter for where I'm ultimately going with this, but I do have to come clean that the emulation isn’t perfect. Of those 228 cases that my regex needs to match, there is one and only one that it cannot deal with.
00:18:49.760 Did you catch it already? This guy should die of overcrowding; he is surrounded in all directions. But I need to capture a zero to be able to kill him, and there are no zeros around him, so I can't. This one guy won't die in my implementation.
00:19:15.520 I've tried everything—negative look-ahead and look-behind, negative character classes that say 'not zero' instead of one; capturing zeros in that interleaved garbage—nothing works! I need to grab onto a zero, and the zero just isn't there. I don't have anything I can grab.
00:19:38.560 As an aside, for the love of God, if you can think of any way to solve this, please tell me, because it's killing me inside! But as I said, I think this doesn't evaluate what I want to get at. So, I want to get back to an earlier question: why, why, why am I even talking about any of this?
00:19:57.120 What I said earlier is true—this is a question that popped up in my head, and I just couldn't get rid of it until I had solved it. But there's another side to this story, because it turns out Conway's Game of Life is Turing complete, and that's not what you’d expect for a system which is entirely defined by this one simple rule.
00:20:17.520 Now, let’s recall what that means: oversimplifying a bit, it means that any program you can write in any language, you could also implement as a Game of Life pattern if you wanted to. You can see a beautiful example of that right here. That’s a Turing machine running in Conway’s Game of Life.
00:20:37.920 This is something that I love! I love finding things that you wouldn’t expect to be Turing complete, but they are! I could have an entire talk just on this; it’s one of my favorite things ever. So, when I was thinking about whether you can run the Game of Life with regex, I was thinking of this.
00:20:58.800 If the Game of Life is Turing complete and I can run it with a regex, does that mean that regexes are Turing complete? Because that would be pretty nuts!
00:21:16.560 So I'm sort of using the Game of Life here as a proxy for Turing Completeness, because I thought I could emulate it with regex. As we saw earlier, that doesn’t quite work. But that leads to a more direct way: what if I just implement a Turing machine directly with regex?
00:21:32.960 Now as a quick reminder—an intro to what Turing machines are—they are theoretical machines that can do any computation that any other computer can do. They have an infinite tape of data, a head that reads and writes the data, as well as moves the tape around, and a state machine that, given the data on the tape and the state it’s in, decides what to write in the tape and whether to move left or right.
00:21:51.680 This simple thing can run any program that you can write on any computer. Here’s how we can do this with just regex. We're going to have to fly through this part a little bit; there’s more detail like encoding the tape with each part having to represent the data so that we can work on it with regex.
00:22:20.480 This string represents the state of our little Turing machine. This is a tape both to the left and to the right of the current position. That bit over there that goes from 'B' to 'B' is the initial data of the tape; the rest is the head of the machine, which moves left and right in the tape based on what the state machine tells it to do.
00:22:45.360 This particular machine can have eight different states and process six different valid input characters. If we have more states or other characters, we will add that in this initial string. We have these markers here so we can capture them with named captures and replace them into the other parts.
00:23:07.600 Just like before, we were capturing neighbors; now this is the current position of the tape—the position we're operating on—and this is the character we’re going to match to determine what state to go to and which way to move the tape.
00:23:26.480 This line, for example, says: if you’re in state 1 and the tape has a 'B,' write a 'C,' move left, and change to state 2. We can write a little bit of Ruby code to send this machine’s definition into a mega regex like the ones I had before.
00:23:49.360 What we do is match the head in that state with that character in the tape, and it will capture all of the right parts of the head and the tape for the replacement to work. Each of these lines will essentially create a clause in a giant set of ores in our regex.
00:24:14.500 This is the same program in regex form. This whole thing looks for all the possible contents that the head could have based on its possible states and all of the possible characters in the tape. It will also match the little bit of tape just to the left and to the right of the head and put all of the relevant pieces into named captures, just like we saw before.
00:24:37.920 This program takes a string of zeros and ones and repeats it. The code to actually execute this Turing machine is what I keep promising: nothing but the regex, and it works! You can see the head moving back and forth with the tape changing the values, and once it finishes, the tape at the end has the same string of zeros and ones as the input, but repeated twice!
00:25:06.320 So, we can implement a Turing machine with just regex! Now going back to my original weird question: does that mean regexes are Turing complete? Technically? Formally? Academically, very, very no!
00:25:25.920 And I want to make that clear; I'm not claiming that they are. First of all, in all that code we've seen, we have a loop around the regex, and that loop is pulling a lot of weight. Regexes cannot loop at all, and you really need to loop if you want to do any actual computation.
00:25:54.320 Also, it depends on what you mean by regex. In the formal definition of regular languages, regexes can't do all the beautiful things that you normally do in day-to-day programming with the looking back and forth and the back references. All that is technically not a regular expression.
00:26:22.080 So officially, no, they are not too incomplete. But, given that with just a simple loop around them, they can almost run a Game of Life, and they can absolutely run a Turing machine, they kind of are, for practical purposes. The regular expressions that you use day-to-day can absolutely emulate computation.
00:26:54.240 Now, I don’t think there’s any actual practical purpose for this, but I also think it's mind-blowing! I did not expect to get to this answer when I started! You know, like most weird things that I find, I just wanted to share that with you.
00:27:17.120 So I hope you enjoyed that little crazy tour, and thank you!