Pivorak Conf 2.0

Summarized using AI

Domain Modeling With Datalog

Norbert Wójtowicz • February 01, 2019 • Lviv, Ukraine

The video titled "Domain Modeling With Datalog" features a talk by Norbert Wójtowicz at the Pivorak Conf 2.0, centered around the Datalog language. Datalog is established as a declarative logic programming language, gaining traction in recent years as a tool for graph queries in server and client applications. The session aims to enhance the understanding of Datalog, addressing its application in various contexts, including as a graph database and a communication protocol akin to GraphQL.

Key points discussed throughout the video include:

- Introduction to Datalog: The speaker begins with a playful interaction to gauge the audience's familiarity with Datalog and related technologies like SQL.

- Basic Concepts of Domain Modeling: Wójtowicz explains that modeling real-world domains can be effectively accomplished using three data structures: streams, trees, and meshes. The focus is on graph structures that naturally represent relationships in business domains.

- Entity-Attribute-Value (EAV) Model: The talk illustrates how Datalog can simplify complex queries by using an EAV model, which allows for a structure that can evolve without requiring a fixed schema. This is demonstrated using GitHub as an analogy, where users and repositories are modeled as entities in a graph.

- Querying using Datalog: The audience is shown how to construct queries, emphasizing pattern matching and the flexibility of variable bindings within queries. The distinction between finding and retrieving values is highlighted.

- Complex Domain Relationships: Wójtowicz explores how to manage relationships between entities, such as users and repositories, indicating that schema changes do not disrupt the data model. He discusses polymorphism, where entities can act as different types based on the query context.

- Event Sourcing and Append-Only Storage: The talk addresses how Datalog can be used effectively for event sourcing, maintaining an append-only data structure that avoids the pitfalls of traditional database migrations when schemas change.

- Performance and Indexing: There are discussions on how the simplicity of the EAV model allows for efficient querying and performance, including the potential to create various database implementations by changing index structures.

- Datalog Implementation Examples: Wójtowicz shares insights on how Datalog implementations work in real-world applications, particularly in relationships and aggregations, illustrating its flexibility and robustness.

- Insights on Future Development: The concluding remarks emphasize the potential of Datalog as a go-to tool for domain modeling, arguing its effectiveness across diverse applications.

In summary, the session successfully demonstrates the advantages of using Datalog in domain modeling, focusing on its simplicity, flexibility, and effectiveness in handling complex relationships with minimal overhead, making it a compelling choice for modern applications.

Domain Modeling With Datalog
Norbert Wójtowicz • February 01, 2019 • Lviv, Ukraine

Datalog is a declarative logic programming language, which has in recent years found new uses as a graph query language in server and client applications. This talk introduces Datalog from its primitives and builds a mental model of how complicated queries can be resolved using simple data structures. This, in turn, will help you understand when and how you should apply Datalog in practice: on the client, on the server, as a graph database, as a communication protocol à la GraphQL/Falcor, or as an event-sourcing storage mechanism.

Pivorak Conf 2.0

00:00:08.840 Okay, if you can't read that... is that too loud? If you can't read that, come closer, because there's going to be a lot of code here. The beers are for me, the teas are for the voice. This is a talk about Datalog.
00:00:15.389 Who here has ever heard about Datalog? Okay, perfect, nobody. Who here has done SQL before? Okay, some. Kind of like document databases, things like that? Anything? Yeah, you guys don't actually code, do you?
00:00:24.180 Okay, so this is going to be a weird talk because it was originally for like a two-hour workshop. What we're going to do because I think you only gave me about 40 minutes, forty-ish, is we're just going to stop at some point, right? Because there's not much more we can do, but we can always talk at the after-party.
00:00:36.870 Usually, this talk can go both ways. One way is I talk to you for 40 minutes and convince you to try Datalog and sort of tell you why it's cool and why this new technology is something you should be interested in. But this is a different kind of talk, so this is not that talk. This talk is actually going to just assume that you really want to learn Datalog.
00:00:58.170 What we're going to do is we're going to start with very little LEGO pieces, and we're going to quickly build up to a big thing. The idea is that we're going to try to give you an intuition for why this is awesome without ever actually saying 'this is awesome'. Okay? So I need you to concentrate for like 40 minutes, and then we can go drinking, except for me.
00:01:14.799 Let's get started, because we're going to go very fast down the rabbit hole. I can't give the usual introduction I give about why and how I build systems, so I'll just do this one slide. If you do anything with modeling actual real-world businesses and domains, it turns out there are only three things you need to build a system.
00:01:24.990 You need streams, trees, and meshes. These are the only data structures that ever really matter. Streams give you semantics of order: these are the events coming into your system, the queues backing up on your queue system when things are not working correctly. Trees give you hierarchy: your entire Ruby stack is basically one big hierarchy of objects in memory, and your UIs are just trees that describe the hierarchy of your DOM and so forth. And then there's the graph; this is what your business domain is.
00:02:07.560 Unfortunately, we don't like to admit it, and this is why we have problems. This is actually a very specific kind of graph that I like to call a mesh. To explain this by analogy, imagine that we're Spotify. In Spotify, there are only three actual kinds of nodes that are interesting to us: an artist, a listener, and a song. Anything you build as a Spotify developer is actually just a new relationship between things that already exist.
00:02:25.910 So, when you want to create the idea of an album, you just create a new relationship between an artist and some songs and give it some metadata, like the title of the album and when it was released. Similarly, when you create a playlist, it's a new relationship between a user and a set of songs that we've decided are important for that user. If we come back with a feature that allows users to subscribe to playlists, we're creating a new kind of relationship between that user and some playlist that was created.
00:03:04.320 With every single feature that we add to the system, we're just creating more and more relationships among things that already exist. Sooner or later, our business domain starts to look like this, and this, by the way, is why a lot of projects fail. If your main database for your domain is a relational database, then with every single feature you add, you are essentially doing new implicit joins between existing things.
00:03:43.470 It turns out that a relational database is terrible when it comes to doing these kinds of implicit joins that link everything together, and you end up moving it all to a NoSQL database because you've given up, or to Elasticsearch because you don't care about your data, or a bunch of other things. So, assuming we agree that graphs are the way to build domains, Datalog is one way, not the only way, but a really good way, of actually building this kind of stuff in production.
00:05:14.460 Datalog has this nice, interesting idea: it doesn't matter how complicated your domain is; you only need this to describe it—an entity-attribute-value triple store, also known as RDF if you're into the Semantic Web kind of stuff. I don't expect you to believe me because I know you've already built very complex systems, so something as simple as this cannot possibly work. The point of this talk is to actually show you that it does, and we're going to do this by analogy.
00:06:58.900 We're going to use GitHub because I assume that you're all domain experts, so I don't have to actually explain what GitHub is, and we can get straight to the code. This is essentially the rest of the talk: on the right-hand side, what's going to be happening here is this is your data structure that represents your database. You can imagine this as some vector in memory or a file on disk that we're just going to keep appending to.
00:07:32.700 The stuff on the left is the code, right? It's like your queries; it explains all the data we're interested in, and the one thing I want you to focus on is that no matter how complicated the stuff on the left gets, the right-hand side is just an EAV structure. We're just going to keep adding to the bottom of it. It's going to be like building an append-only database that represents all of GitHub.
00:08:05.140 So, the first thing we have in GitHub is our users, so let's add some users. We have some JSON, you know, something that comes over the wire, and so we add three users into our system. What happens here? We're just specifying a username, because that's the attribute of the user, and a value. Notice that we have three unique IDs because there are three users in our system, so the fact that the IDs, the integers, are different means that these are different things. Officially, this thing is called an entity, and each of these rows is called a datum.
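To make the shape of this concrete, here is a minimal sketch of the append-only EAV store in Python; the entity IDs and usernames are illustrative, not taken from the slides.

```python
# A minimal sketch of the append-only EAV store described in the talk.
# Each row (a datum) is an (entity, attribute, value) tuple.
db = []

def add(entity, attribute, value):
    """Append a datum; existing rows are never mutated or deleted."""
    db.append((entity, attribute, value))

# Three users, three distinct entity IDs (illustrative).
add(11, "user/name", "richhickey")
add(22, "user/name", "tonsky")
add(33, "user/name", "pithyless")
```

The only operation the store supports is appending to the end; everything else in the talk is built on top of reading this list.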
00:08:51.700 So, now we have some data in our system and we want to query it. I know it's not a very complicated database, but we'll get to that. To build a query, we first need to talk about pattern matching, because this is the DSL. A pattern is essentially just a vector of three elements that mirrors the EAV structure in our database. So, wherever we see 11, it's going to find all the datums such that the entity is 11, and an underscore just means 'I don't care.' It's a wildcard.
00:10:21.590 This will find this last datum; this one finds all three because all three are usernames. This will find this single datum because all of the values match, and this finds nothing because there is no such datum where all these things are all true, right? So, that's nice, but that's not interesting. Now we have variables; variables start with a question mark and some name. This finds all entities, this finds all attributes, this finds all values, and this finds entities and attributes and values.
00:11:40.500 This is essentially a table scan, right? This says just give me everything. So also, not very interesting, but we can combine things. We can say, 'find the datum such that the username is pithyless and bind it to E.' Similarly, we can do something like this, or we can, for example, say find all usernames and then bind each of the IDs to E and each of the values to V.
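The single-clause pattern matching described above can be sketched in a few lines of Python; the convention of `_` for wildcards and `?`-prefixed strings for variables follows the talk, while the datums themselves are illustrative.

```python
# A sketch of single-clause pattern matching over EAV datums.
# '_' is a wildcard; strings starting with '?' are variables that bind.
db = [
    (11, "user/name", "richhickey"),
    (22, "user/name", "tonsky"),
    (33, "user/name", "pithyless"),
]

def match(pattern, datum):
    """Return a dict of variable bindings if the pattern matches, else None."""
    bindings = {}
    for term, value in zip(pattern, datum):
        if term == "_":
            continue                      # wildcard: anything matches
        elif isinstance(term, str) and term.startswith("?"):
            bindings[term] = value        # variable: bind it to this value
        elif term != value:
            return None                   # constant: must be equal
    return bindings

# Bind each entity ID to ?e and each value to ?v.
matches = [b for d in db if (b := match(("?e", "user/name", "?v"), d)) is not None]
```

A pattern of three wildcards matches everything (the table scan), and a pattern of three constants either matches one datum or nothing.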
00:12:50.490 Now we're ready to do a query. A query is much like in SQL, except instead of 'select' it's 'find': 'find' tells us what we're interested in, and 'where' tells us how to find it. So 'where' is nothing more than a set of the clauses we just saw. In this case, we're going to match entity 11's username, bind the variable 'name' to Rich Hickey, and that's what 'find' is interested in. The result is a set of tuples; in this case, one-element tuples, because there was only one thing we were interested in, which was 'name'.
00:13:52.150 Now, the important thing is that we're always going to be returning sets of tuples, okay? I never said, but Datalog is actually a declarative logic programming language, so actually, it turns out that order doesn't matter. Anything that is unknown, it'll figure out, and anything that is known, it'll just match. So we can flip this, and we can say this time I know what the username is but I don't know the ID.
00:15:31.679 It’ll find the actual ID. This time I don’t care what the ID is; I want you to find all usernames, and we’re trying to return the values. So again, a set of tuples, a single value. I need you guys to not hold back as we go, perfect, or even better, if something doesn’t make sense, yell. I really don’t mind; I love hecklers.
00:16:40.690 Okay, so speaking of sets, this returns all of the attributes in our system, but there’s only one attribute, right? This is how sets work; it removes duplicates. Coming back to our previous query, this returns all usernames. There’s a little sugar syntax we can use in 'find', so by default, it returns a tuple of a set of tuples. If you add this little ellipsis, it’ll just return the vector of values, and if you use a period, it’ll just return one value.
00:17:16.360 So, for a million points, why did it return tonsky? Someone brave enough to speak up? It matched all three, but it returns a set, which has no concept of order, so it just picks one value, right? So you really don't have control over that. Okay, so let's talk about multiple bindings. Here we're going to find the usernames, bind the IDs, and then bind the names, and what you see is that now we have tuples of two values. Now we can actually return multiple things.
00:18:45.430 We can return relationships by pairing things up. So, let's complicate our business domain, because it's very simple for now. Let's add an email address. Notice what is happening here; let's add two more users to make it obvious. We're adding brand-new information to the bottom of the file. We never changed anything we've already added to the system, right? We're appending to the end of the file, and we're using the fact that the ID is the same to tell the system that these things are related to each other; they're part of the same entity.
00:19:59.270 So, now we're actually able to store multiple attributes about an entity. Once we have that, we can actually query it. So again, let's look for a username, like richhickey. And now here's the magic: here's the magic of a constraint solver. What it does is it says: if this is the ID, then anywhere I use the exact same name, all the values must match. So if it found 11 here, it's going to look for all of the other 11s, and then it'll find the 11 such that there is a user email attribute and then return [email protected], because that's the email, right? But there's no concept of order, so you can flip it.
00:21:30.805 You can say, 'I know the email address; who was the name?' Right? Like, who was the person? And also, if something doesn't match, nothing gets returned because all of the clauses must be true for anything to be true. There can be multiple unknowns, so just by playing with this data structure, you can get all kinds of interesting information out of your system.
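The constraint-solving behavior just described, where a variable shared between clauses forces values to agree, can be sketched as a small query engine; the datums and the placeholder email address are illustrative assumptions, not the real data.

```python
# A sketch of multi-clause 'where' resolution: each clause filters and
# extends the set of candidate bindings, and a variable appearing in two
# clauses (here ?e) forces the matched values to agree.
db = [
    (11, "user/name", "richhickey"),
    (11, "user/email", "[email protected]"),   # hypothetical placeholder address
    (22, "user/name", "tonsky"),
]

def unify(pattern, datum, bindings):
    """Extend bindings to match pattern against datum, or return None."""
    out = dict(bindings)
    for term, value in zip(pattern, datum):
        if isinstance(term, str) and term.startswith("?"):
            if term in out and out[term] != value:
                return None               # conflicting binding: reject
            out[term] = value
        elif term != "_" and term != value:
            return None
    return out

def query(find, where):
    """Run every clause against every datum, returning a set of tuples."""
    candidates = [{}]
    for pattern in where:
        candidates = [b2 for b in candidates for d in db
                      if (b2 := unify(pattern, d, b)) is not None]
    return {tuple(b[v] for v in find) for b in candidates}

result = query(["?email"],
               [("?e", "user/name", "richhickey"),
                ("?e", "user/email", "?email")])
```

Because clause order carries no meaning, flipping the query (known email, unknown name) works with the same engine, and a clause that matches nothing makes the whole result empty.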
00:22:34.920 So, let's make our domain even more complicated. GitHub has this concept of repositories, and to make it even more complicated, repositories are owned by organizations or users. Now, I want you to think about this for a second. How would you design a SQL database such that you have users, organizations, and repositories, but a repository is sometimes owned by a user and sometimes by an organization? Think about it.
00:23:36.820 So, usually, you would end up with one of a couple of solutions, all of them really suboptimal, where you either have multiple columns where some things might be null, or you have additional tables for many-to-many relationships, and you have a bunch of code, or you ignore the data in the database completely. In Ruby, you're basically just running around with a bunch of 'ifs' everywhere.
00:24:15.609 Let's see how this works in Datalog. First of all, we create a new entity for our organization. This is 44, a brand new ID. Next, we create a repository, 55. Notice what happens here: 55 repo owner 44. This value is actually a reference to a different entity in our system. Similarly, we can create another repository, DataScript, whose repo owner is 22, which is tonsky. I want to point a couple of things out here.
00:25:38.070 First of all, this 44 and 22, while it makes sense from a database perspective, is really crude from a user experience, a developer experience. There's some nice sugar syntax where you can replace this reference with a two-element vector. This says: find me the entity with this attribute and value, and it'll replace one with the other. Hopefully, now it's a little more obvious what kind of polymorphism is going on.
00:26:59.260 Because first of all, we have two repositories; one is owned by an organization, and one is owned by a user. They are both called 'repo owner' because that's exactly what they are from a domain perspective. We don't care about the difference. The next thing is, notice that we've never said anything about schemas. Datalog has no concept of schemas. Most databases force you to encode the schema during writes. Datalog reverses this: you apply the schema during reads, not writes.
00:28:44.240 So it's actually very nice because, it turns out, a username is always going to be a string no matter what, but the context of when you actually apply this information changes over time. It allows you to create these kinds of arbitrary polymorphisms. It also allows you to create different kinds of polymorphism. Even though there is no example of this here, imagine if there was such an entity that sometimes needed to behave as if it was an organization, but sometimes as if it was a user.
00:29:51.063 That one entity could have both attributes, and depending on whether you're querying by the user or by the organization, it would behave as that entity. So, there are very different kinds of polymorphism that you can achieve because of this kind of flexibility. So, let's do a query. We start off with the username, 'tonsky', and then we're going to find the repositories that are owned by this user.
00:30:44.600 Notice again, we’re using that magical 'P' because it’s the same name; it’ll find the same values in our system, and then it’ll bind the Rs for the repository, and it’ll give us back the result. This says, given a username, what are all the repositories that are owned by this user? As I've been repeating over and over again, there’s no concept of order.
00:31:14.740 So anything that is unknown is interesting. You can flip it; you can say, 'I know what the repository is; what is the username?' Or I can say, 'I don't know either of those things; give me all of the repositories.' So now the question is: why did it only return DataScript? What happened to Clojure?
00:32:29.250 Don't be shy; come on. It's this username thing. We've created complexity in our system because of the way our domain is. Our domain is dirty and complex, and this is the way the real world works. But since we've created this complexity, we now have to deal with it, right? This query actually returns all repositories such that they're owned by users.
00:33:11.580 If we're interested in organizations, we just switch 'username' with 'org name', and now we get Clojure. Sometimes we don't care, so we have to be explicit about it, but we can do that polymorphism, right? We can say, 'I don't care if it was an organization or a user; just tell me what all the repositories in the system are.' Now, this 'or' thing is actually a little sugar syntax for something called a rule. A rule is essentially a separate data structure somewhere else in the system where you give a name to a certain set of behaviors.
00:34:41.990 So in this case, we're saying that a repo owner is something such that it can have either an org name or a username. We can sort of use a function call inside our logic. The nice thing about rules is that they can actually invoke each other and call themselves recursively, which means it turns out that Datalog is a wonderful weapon of choice when you have graph traversal problems where you're not sure exactly how many hops you have to do to get to the thing you're interested in.
00:35:50.070 Classic examples include solving things like the Kevin Bacon problem: how many jumps does it take to get from an arbitrary actor to Kevin Bacon, right? These are the kinds of things that Datalog does naturally and very quickly.
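The recursive-rule idea, a relation defined in terms of itself so you can ask how many hops connect two nodes without knowing the distance in advance, can be sketched iteratively; the edge data here is invented purely for illustration.

```python
# A sketch of the graph-traversal problems recursive rules solve.
# Breadth-first search is the iterative equivalent of a rule that
# invokes itself; the co-starring edges below are illustrative.
from collections import deque

edges = {
    "actor-a": ["actor-b"],
    "actor-b": ["actor-c", "kevin-bacon"],
    "actor-c": ["kevin-bacon"],
}

def hops(start, goal):
    """Return the minimum number of hops from start to goal, or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

In a Datalog engine you would express only the base case and the recursive case of the rule; the engine does the equivalent of this search for you.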
00:37:00.870 So, our product owner comes back and tells us that now we have to talk about forking in GitHub. So, again, how do we model a fork? From a domain perspective, it's just a repository like any other repository. It has a slug and an owner, but it also has a reference to its origin.
00:38:16.050 So what we did was we created a 'repo fork' attribute where we’re talking about the original repository that it was forked from. Notice again that we’re just adding new data to the end of our file. We haven’t changed anything up to this point. Notice also that we didn’t go back and add 'repo fork nil' to all the other repositories in our system; this actually has wonderful consequences.
00:39:12.170 The reason why we don’t do this by default is if you live in a rectangular world of relational databases where all columns must exist for all rows, then you have things like null values—because physically, there needs to be something there due to padding and performance and such. Once we leave the rectangular world of relational databases, we don’t need things like nulls. The simple fact of not having attributes actually signifies a lot of important information.
00:40:03.430 It allows you to do things like say, 'Give me all of the repositories that are forks.' Notice that I don't even care what the original repo is; I just care about the fact that this repository has such an attribute. Similarly, I can say that if it's missing that attribute, I know it's one of the originals. I can now easily find all the original repositories in our system and talk about relationships between things.
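That presence-or-absence trick, where the mere existence of a `repo/fork` datum distinguishes forks from originals and no nulls are ever stored, can be sketched with set operations; entity IDs and slugs are illustrative.

```python
# A sketch of presence/absence queries over EAV datums: a repo is a fork
# exactly when it has a repo/fork datum; originals simply lack one.
db = [
    (55, "repo/slug", "clojure"),
    (66, "repo/slug", "datascript"),
    (77, "repo/slug", "datascript-fork"),
    (77, "repo/fork", 66),               # 77 was forked from 66
]

repos = {e for (e, a, v) in db if a == "repo/slug"}
forks = {e for (e, a, v) in db if a == "repo/fork"}
originals = repos - forks                # no nulls needed anywhere
```

Note that we never had to backfill a `repo/fork nil` datum onto the existing repositories; missing data is itself the information.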
00:41:05.310 Here, I'm going to find all the repos, but I'll bind the variable as the original ID, and I'll find their repo owners; and then I'll do the same exact query, but this time I'm going to give it a different name. This third clause does all the magic, because it specifies a specific relationship between the fork ID and the original ID. The logic solver realizes that not all the combinations are valid anymore, has to backtrack, and has to figure out which combinations exist such that all of these clauses are still true.
00:42:06.570 Now that we have an easy way of talking about how we find relationships between things, that’s all nice and good. What about multiple values? We have this very simple data structure here: EAV. What if there are multiple values for an attribute? One such implementation feature in GitHub is the idea of languages. You go to a repository, and it tells you that this repository was written in Ruby and JavaScript.
00:43:00.700 How would we model this in Datalog? We would simply repeat ourselves. I never said anything about a single entity having a single attribute only once. If there are multiple values that are true for an entity, just repeat yourself. So DataScript is written in Clojure and JavaScript, and my fork is also written in Clojure and JavaScript. Now that we have this, we can actually talk about it.
00:44:23.560 We can make queries about this; we can, for example, just say 'give me all the languages,' right? Tell me what languages are on GitHub right now. So, again, six values, but only three results, because of sets. We can also say, 'I'm actually interested in which repositories are written in which languages.' That is essentially a table scan of repo to language, right? All the different relationships that we're interested in.
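The multi-valued-attribute idea, repeating an (entity, attribute) pair with different values and letting set semantics deduplicate the answers, looks like this in miniature; the datums are illustrative.

```python
# A sketch of multi-valued attributes: repeating (entity, attribute)
# with different values models "this repo is written in several
# languages"; querying with set semantics removes duplicates.
db = [
    (66, "repo/lang", "clojure"),
    (66, "repo/lang", "javascript"),
    (77, "repo/lang", "clojure"),
    (77, "repo/lang", "javascript"),
    (55, "repo/lang", "java"),
]

# "Give me all the languages": five datums, three distinct results.
languages = {v for (e, a, v) in db if a == "repo/lang"}

# "Which repo is written in which language": the repo-to-language scan.
repo_langs = {(e, v) for (e, a, v) in db if a == "repo/lang"}
```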
00:45:40.559 Just because you can have multiple values means you can also have multiple references, because references are different from values. So the most important feature in GitHub is stars. It doesn’t matter how good your code base is; it doesn’t matter how maintainable it is or that the library has been running in production for twenty years. The only thing that ever matters in GitHub is how many stars you have.
00:46:53.790 So the question is: how do you model this relationship? We talk about how many stars a repository has, but that's not actually what we're modeling. What you model is that a specific user starred a specific repository, a favorited repo, right? So Rich Hickey starred a repo, tonsky favorited some too, and I also starred some repos.
00:48:03.040 So again, multiple values and multiple references just mean multiple entries in our system. But this is the way we model the system; this is not the way we’re interested in interacting with the system. What we’re interested in is how many stars our repository has. So, for this, you need aggregations. How would this work? Well, you would find all the repos; you would again use the magical binding to find all the Rs such that at least one star exists for them.
00:49:22.020 Then I would take the people who starred that repo, and I would aggregate them, right? And I get the result: three for Clojure, two for DataScript. Now, as you can see, 'find' can use parentheses to do aggregations; here's an example of 'count'. But as you would expect, Datomic, DataScript, and all these Datalog implementations have support for basically all the kinds of aggregations you would expect; all the things a decent database should do.
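The star-count aggregation can be sketched by grouping the star datums by the repository they reference; entity IDs are illustrative (55 standing in for the Clojure repo, 66 for DataScript).

```python
# A sketch of the count aggregation: each user/stars datum references a
# repository entity, and counting datums per referenced repo gives the
# star count. IDs are illustrative.
from collections import Counter

db = [
    (11, "user/stars", 55),
    (22, "user/stars", 55),
    (33, "user/stars", 55),
    (11, "user/stars", 66),
    (33, "user/stars", 66),
]

star_counts = Counter(v for (e, a, v) in db if a == "user/stars")
```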
00:50:54.740 The other thing a database should do is give you the ability to do custom predicates because, no matter how cool your data query language is, at some point you’re going to want to ask your database something that it doesn’t know how to do. Here, we’re gonna find repos that start with the letter C, and the magic is in the parentheses. The syntax with these parentheses lets our Datalog engine talk to the host platform.
00:52:10.360 The implementations I'm most familiar with are Datomic, which runs on the JVM, and DataScript, which runs on JavaScript. So in both of these cases, these parentheses allow me to run arbitrary code, a predicate, on my host platform: the 'startsWith' method on the String class in Java, for example. It's not so interesting that Java's String class has a 'startsWith' method. What's interesting is that you have a syntax that allows you to talk to your host platform and intersperse it with all this logic programming we've been doing.
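The custom-predicate idea can be sketched by letting the query hand each candidate value to an arbitrary host-language function; here Python's built-in `str.startswith` stands in for Java's `String.startsWith`, and the slugs are illustrative.

```python
# A sketch of host-platform predicates: the engine filters candidate
# bindings through an arbitrary function from the host language.
db = [
    (55, "repo/slug", "clojure"),
    (66, "repo/slug", "datascript"),
    (77, "repo/slug", "core.async"),
]

def repos_where(predicate):
    """Return slugs for which the host-platform predicate holds."""
    return {v for (e, a, v) in db if a == "repo/slug" and predicate(v)}

# "Find repos that start with the letter c."
starts_with_c = repos_where(lambda slug: slug.startswith("c"))
```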
00:53:21.360 You can mix and match in any way you feel. Another wonderful example of this is that Datomic, because it runs on the JVM, has access to Lucene. Lucene is a full-text search engine. So, one of these things that is nice to show is you have full-text searching built right into the database—not because the database does it, but because it allows you to interact with the host platforms, and it gives you a way of getting those results back to the system.
00:54:41.930 So you can keep doing your logic programming as if everything was the same as it ever was. Here's an example of returning the links of the repos that match a full-text search for 'Clojure', and that 0.99 is the rank the Lucene search engine gives you for how good this specific result is.
00:55:50.570 Okay, so we're going to switch gears a little bit. Are there any questions, other than 'What the hell is going on?' Yes? No? Okay, that's fine.
00:56:39.750 Okay, we have time. We have all the time in the world. I told you we're not going to get through this whole talk, so it doesn't really matter. So, entity-attribute-value: hopefully I've given you this little seed of doubt that, in fact, this is all you ever need. Seriously. It really is. We do this in production; it is all you ever need.
00:57:28.880 Now, the question that might come up is: what about performance? Because this sounds crazy. It turns out it's really simple. Because it is such a simple data structure, all it has is an EAV, nothing else, it turns out that if you just switch the order of those three columns, you can emulate basically any kind of database you're interested in.
00:58:09.160 If you want a Redis key value storage, you just make sure you have an index built by value-entity, and then these kinds of queries are basically all one operation. If you’re interested in a classic SQL database, a row database, then you just want to have the index EAV again—it’s like looking it up in a map. Just look up the first thing, look up the second thing; that’s the value.
00:59:15.160 Now, usually where SQL falls down is aggregations, because to do aggregations you usually needed, in the old days, a data warehouse. But what you really need is a database whose index is first by the attribute, then by the entity, then by the value. Then it's really easy to do something like a count, because all that data sits in one specific place on disk. It's all there.
01:00:30.360 Similarly, the cool thing is that you have these nested queries where you can dig deep: as soon as you get to '11 user stars' and reach the value 55, you just jump back into that same index and keep going, right? So it's really simple to do deep traversals this way, because you're always using this one simple index.
01:01:17.790 Now, what's more interesting is how you do it in reverse. To be a graph database, like Neo4j, the way a graph works is that you have the answer, but you don't know what the question is, essentially. So you actually need an index that starts with the value, then the attribute, and then the entity.
01:02:13.590 Once you have that, you can start from the answer. So take 'Clojure', look for the attribute repo slug, and now you have the answer, 55; then you plug that answer back in, 55 with user stars, and you automatically know who it is. So as long as you build an index this way, a graph database is really easy to build performantly.
01:03:12.700 That’s actually what happens in these real implementations of Datalog. Somewhere in the background, these indexes are just kept up. If you have really crazy custom things like full-text search, this is outside the domain of Datalog, but there are good ways, like using inverted indexes, as you saw. There’s a way to use these different systems using the same syntax.
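The "same datums, different sort orders" idea can be sketched by re-sorting a single list of triples; the orderings loosely mirror what Datomic calls its EAVT, AEVT, and VAET indexes, and the datums are illustrative.

```python
# A sketch of index orderings over one set of EAV datums. Sorting by
# entity-attribute-value behaves like a row store; attribute-first groups
# an attribute's data together for aggregation; value-first supports
# reverse, graph-style lookups. (Values are stringified only so mixed
# int/str values sort without error in this toy.)
db = [
    (11, "user/name", "richhickey"),
    (22, "user/name", "tonsky"),
    (11, "user/stars", 55),
    (55, "repo/slug", "clojure"),
]

eav = sorted(db, key=lambda d: (d[0], d[1], str(d[2])))   # row-store order
aev = sorted(db, key=lambda d: (d[1], d[0], str(d[2])))   # aggregation order
vae = sorted(db, key=lambda d: (str(d[2]), d[1], d[0]))   # reverse/graph order

# Reverse lookup via the value-first ordering: who starred entity 55?
stargazers = [e for (e, a, v) in vae if v == 55 and a == "user/stars"]
```

A real implementation keeps these orderings maintained incrementally rather than re-sorting, but the access patterns are the same.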
01:03:37.910 So that's usually question number one. Cool thing number two: we've been talking about this 'username' attribute as if it were some kind of string or keyword. I do Clojure mostly these days, so I know this as a keyword; in Ruby this is a symbol, and the naming is the exact opposite in Clojure and other Lisps.
01:04:09.720 That's okay. Never mind, this is Ruby; I don't understand why Matz borrowed from Lisp but flipped everything. Sorry for that tangent. This username is not a keyword or a symbol. What it is, actually, is an entity in our system, which is amazing! Because it turns out that since it's a first-class entity, you can give it arbitrary attributes just like everything else.
01:04:51.750 You can give it a value type; you can say that it's always a string, so your system can check for this kind of thing. You can give it documentation describing what this thing stored in our system is. Because we now have this flat structure, it's usually very good to have documentation.
01:05:43.960 Also, notice we kept using namespaces to keep our sanity when building these kinds of systems. You can also specify uniqueness constraints: if you tell the system that something is unique, it will actually verify that during transactions. You can even give it hints.
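The idea that attributes are entities carrying their own attributes can be sketched like this. Everything below is a toy model: the key names mirror Datomic’s `:db/valueType`, `:db/unique` and `:db/doc`, but the validation logic is invented for illustration.

```python
# Illustrative only: since attributes are first-class entities, their
# "schema" is just more attributes attached to them. The checking
# function below is a toy, not any real transactor.
schema = {
    "user/name": {
        "db/valueType": str,
        "db/unique": True,
        "db/doc": "Login name; unique across the whole system.",
    },
}

def validate(new_datoms, existing_datoms, schema):
    """Check a transaction's (entity, attr, value) triples against schema."""
    for e, a, v in new_datoms:
        spec = schema.get(a, {})
        vtype = spec.get("db/valueType")
        if vtype is not None and not isinstance(v, vtype):
            raise TypeError(f"{a} expects a {vtype.__name__}, got {v!r}")
        if spec.get("db/unique"):
            for e2, a2, v2 in existing_datoms:
                if (a2, v2) == (a, v) and e2 != e:
                    raise ValueError(f"{a} = {v!r} already taken by entity {e2}")

existing = [(55, "user/name", "pithyless")]
validate([(70, "user/name", "newcomer")], existing, schema)      # fine
# validate([(70, "user/name", "pithyless")], existing, schema)   # ValueError
```

Because the schema is just data on entities, adding a new constraint later is an ordinary write, not a migration.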
01:06:49.100 For example, in Datomic, if you put a ':db/fulltext true' hint on an attribute, it knows that in the future you’re going to want full-text search on that specific attribute. Now, the next question that always comes up is: how do you do updates in an append-only database, where you never delete anything? Any ideas, suggestions?
01:07:59.180 You could introduce nulls; that’s one way people try to help themselves. Another obvious idea is to go back and overwrite the old value, but that defeats the purpose of an append-only database. Another thing you can do is just append a new value at the end. But, if you remember, all that does is create multiple cardinality for a single attribute.
01:08:41.680 If I asserted 33, username with a different name, what I’d actually be modeling is that I have nicknames: there are two ways you can find me on the Internet, via 'pithyless' and via some other username. It’s a really nice property that I can model aliases this way, but it doesn’t help us do updates.
01:09:47.880 So this is sort of a trick question, because it turns out that EAV is enough to model your domain, but your domain is information at rest. When you’re talking about updating things in your system, you’re not talking about your domain anymore; you’re talking about communicating change over time, and that’s the key. EAV is enough if you have a static system and just want to talk about relationships.
01:11:07.600 But as soon as you want to talk about updating and changing things over time, we actually need five columns; this is what real implementations of Datalog look like. The fourth column is a transaction column: if multiple datoms share the same transaction number, they happened in the same transaction, which gives you atomic writes and ACID compliance.
01:11:57.390 The last column states whether I’ve learned something or must now forget it: true or false. When I add information, I’m asserting a fact. I create a new entity at a new time and say: from this point forward, I know about this username.
01:13:14.210 If I want to delete information, I’m retracting a fact. What I’m saying is: this username 'tonsky' that existed at time 3000, this fact is no longer true. It’s important that all the values must be the same, except for the transaction and the operation, because remember multiple cardinality. This gives you a way, even when several things might be true at once, to say: forget about this one specific instance of the truth.
01:14:25.720 Hopefully that gives you a hint about how to do an update in place, because an update in place is nothing more than saying 'forget this fact, remember this one' in the same transaction, right? So that’s updates. It turns out it’s not as complicated as you would think.
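As a sketch of this five-column model, here is an update-in-place done as a retraction plus an assertion sharing one transaction. Only the (entity, attribute, value, transaction, added?) shape comes from the talk; the function names and replay logic are hypothetical.

```python
# Illustrative append-only log in the five-column shape from the talk:
# (entity, attribute, value, transaction, added?). Nothing is mutated.
log = [
    (33, "user/name", "tonsky", 3000, True),  # assertion at tx 3000
]

def update(log, e, a, old, new, tx):
    """An in-place update is a retraction plus an assertion in one tx."""
    log.append((e, a, old, tx, False))  # forget this specific fact...
    log.append((e, a, new, tx, True))   # ...and learn the new one, atomically

def current_values(log, e, a):
    """Replay the log in transaction order: assertions minus retractions."""
    vals = set()
    for e2, a2, v, tx, added in sorted(log, key=lambda d: d[3]):
        if (e2, a2) == (e, a):
            if added:
                vals.add(v)
            else:
                vals.discard(v)
    return vals

update(log, 33, "user/name", "tonsky", "prokopov", 3001)
current_values(log, 33, "user/name")  # {"prokopov"}
```

Note the history is still all there: the log now holds three datoms, and the "current" view is just one possible reading of it.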
01:15:43.490 Now, the next nice thing: just like attributes, transactions are also entities in our system. It’s meta, turtles all the way down. And again, that has really nice consequences, because you can add arbitrary attributes to a transaction. The transaction number itself is essentially a logical clock, just a monotonically rising number, but here I can also attach a timestamp.
01:16:49.200 So I know at what wall-clock time this happened. I can also do arbitrary things: for auditing purposes, for example, we often record who is doing this transaction, or why they’re doing it. I can’t tell you how many times it’s been wonderful, while tracking down a bug, to go back and realize that three months ago this piece of information entered our system through some JSON import. It’s amazing.
01:18:05.520 Once you have this capability, you wonder how you ever lived without it. I don’t have much time to go over this, but basically there are time-traveling APIs that give you a hint of why this is cool. For example, you can go back in time and ask what the database contained at a specific moment, or do the exact opposite: forget everything from this point forward and consider only these datoms.
01:19:27.610 Or just give me an audit: tell me everything that’s happened in my system, filtering out things I don’t have access to. This is wonderful for permissions. We don’t do whitelisting or blacklisting of queries. What we do is filter out any datoms that you are not allowed to see, and then we say, 'Go ahead; the sky’s the limit.'
01:20:55.990 Similarly, there’s the hypothetical database: given the database plus these additional facts, if those were true, what would the result of this query be? It’s like doing a transaction, a query, and a rollback without ever actually touching the database. All of this is possible just because we have a very nice, simple, primitive data model onto which we can apply these magical attributes.
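Both time-travel ideas, viewing the database as of a past transaction and querying a hypothetical database, fall out of filtering or extending the same append-only log. A toy sketch with invented function names:

```python
# Illustrative time travel over the five-column shape:
# (entity, attribute, value, transaction, added?).
log = [
    (33, "user/name", "tonsky", 3000, True),
    (33, "user/name", "tonsky", 3001, False),
    (33, "user/name", "prokopov", 3001, True),
]

def as_of(log, tx):
    """The database as it looked at (and including) transaction tx."""
    return [d for d in log if d[3] <= tx]

def with_datoms(log, extra):
    """A speculative database: query it as if the extra facts were true."""
    return log + extra  # the real log is never touched

as_of(log, 3000)  # only the first datom: the old name still held
hypo = with_datoms(log, [(34, "user/name", "newcomer", 3002, True)])
len(hypo), len(log)  # (4, 3): transaction, query, rollback, for free
```

The rollback is free because the speculative database is just a new value; the original log is untouched.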
01:22:15.320 This isn’t something we’re going to talk about today, but it would also help with cascading deletes. One of the last things I want to talk about, the cherry on top of everything you’ve seen so far in this logic programming language, is the pull API syntax. It’s recently… well, it’s been around for a while, but it’s picked up a lot of steam, because it’s essentially GraphQL on steroids.
01:23:36.430 What happens is you write the query as you normally would, and then, instead of picking out individual values, you say: assuming this entity is the one I’m interested in, pull this pattern. The pattern is a data structure, just like in GraphQL or Falcor, that describes the information we’re interested in.
01:24:51.030 Now, I’m gonna show you a bunch of stuff; I don’t expect you to remember all of it. I just want to give you a hint of the power that’s nested here, and I highly encourage you to go and read more about it. The most basic thing you can do is give it a vector of attributes and say: these are the attributes I’m interested in. It’ll figure out things like cardinality for you.
01:25:39.860 Using map notation, you can do joins. You can say, for example: 'I was on a repo, but now I want to go to the owner, and from the owner, give me their username or name.' You can also go backwards: the underscore is a reverse reference, saying 'I’m the Clojure organization; give me all the repos that it is the owner of.'
01:26:47.070 Then again, you can pull whatever information you like. As you can see, there’s parameterization and filtering, all kinds of interesting stuff. You can do another join: find me the commits, and then pull the authors, right? Our entire backend/frontend API looks like these big queries, with the backend just filtering out things you’re not allowed to see.
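A minimal sketch of how such a pull pattern can be walked: a plain attribute means 'return this value', and map notation means 'join and recurse'. Real pull patterns are EDN and support much more (reverse references, parameters, recursion limits); this Python stand-in, with invented example data, only shows the basic recursion.

```python
# Toy entity store: references are just entity ids.
entities = {
    1: {"repo/name": "datascript", "repo/owner": 2},
    2: {"user/name": "tonsky"},
}

def pull(eid, pattern):
    """Walk a pull-style pattern (list of attrs and {attr: subpattern} joins)."""
    ent, out = entities[eid], {}
    for item in pattern:
        if isinstance(item, dict):  # map notation = a join
            for attr, subpattern in item.items():
                out[attr] = pull(ent[attr], subpattern)
        else:                       # plain attribute
            out[item] = ent.get(item)
    return out

pull(1, ["repo/name", {"repo/owner": ["user/name"]}])
# {"repo/name": "datascript", "repo/owner": {"user/name": "tonsky"}}
```

The client describes the shape of the data it wants as plain data, which is exactly what makes the GraphQL comparison apt.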
01:28:07.760 We don’t care what the client is asking for; that’s the idea behind this GraphQL and Falcor style. So this is, I think, the end of the talk, because this is something like slide 150 of 250; the rest is at this link.
01:29:32.880 This gist contains the rest of the talk, a lot of the interesting stuff I’d want to cover. I don’t know how much time we have; maybe I’ll show you one more thing.
01:29:41.840 Applications: one is event sourcing. Anyone who has played with event sourcing knows the main problem: how do you migrate data when your schema changes? This EAV structure basically says that you don’t need to. All you do is store your events in this very primitive, basic structure, and you’re never going to have to do migrations.
01:30:18.980 Similarly, this is Datomic, a very simplified overview of Datomic. They separate out the transactor, which is responsible for taking new writes into your system. They don’t do storage; they say storage is not our problem: run it on PostgreSQL, run it on DynamoDB, run it on Riak, run it on the file system, run it in memory during tests, we don’t care. Your peers query the actual data.
01:31:18.020 The way we do this on the JVM is that the peer lives in the same JVM process as our application, which means that when the transactor is writing and the peer is asking for information, thanks to the transaction ID you never have the eventual-consistency problem: you always know whether you have all the data you need to run the query.
01:32:09.250 If not, we just grab it from storage. But it turns out that, the way this system is designed, once you have a system running for a while with similar kinds of queries happening all the time, almost all the data you need is almost always already in memory.
01:33:10.520 It’s a wonderful thing when you’re doing database queries and pulling straight from memory; you’re not even doing over-the-wire transfers. A more realistic Datomic deployment adds things like memcached and similar infrastructure. DataScript is another wonderful tool: it’s everything in this talk,
01:34:01.430 but designed for web applications, for JavaScript. It runs on the JavaScript engine, and you can basically replace all of your React state management with one simple database that lets every React component run arbitrary queries against it.
01:34:47.470 As I mentioned earlier, the pull API is how we communicate between front end and back end. There’s also an extension of the Datomic pull syntax called EQL, which is in the gist slides I mentioned; it adds a syntax for doing mutations.
01:35:32.370 So, just like GraphQL lets you do both mutations and reads, there’s a very nice syntax for all of this. Again, we use the same thing for passing data between React components. So we’re just gonna stop there; maybe we have time for one question?
01:36:38.480 If not, feel free to find me at the after-party; I can talk about this for hours. If you’re interested, I highly recommend you check out the gist where you can find a lot more stuff. Okay, thank you.
01:37:41.840 Microphone? No? Technology’s hard. Okay.
01:37:44.920 Questions? Oh, it doesn’t work. We have one. Hi.
01:38:01.680 Great talk! Really mind-shaking. The question is: what kind of data are you storing in your Datalog, and is it your main data storage or just a supporting one?
01:38:24.630 Yeah, it’s our main data storage; basically, we store everything in Datalog. The only thing you wouldn’t want to do, and this is not a problem with Datalog itself but with implementations like Datomic on the JVM, is something like recording mouse movements, where you’re basically just streaming events.
01:38:58.550 That’s something you probably don’t care about the audit history of, but it turns out that’s the only exception to the rule. And I guess if you have enough memory, even that isn’t a problem, right? It’s just probably not something you’d necessarily want to do. But that’s a very specific case. Basically, this should be your default.
01:39:55.840 And by the way, I should also mention: even when we’ve built systems that ended up, for organizational or other reasons, being ported to Postgres or whatever, I do all my prototyping in Datalog. It’s my go-to database, and I don’t leave it until there’s a reason to leave it, right? This should be your default.
01:40:47.800 Because if nothing else, the way it helps you think about your domain and focus on the problems that you’re actually trying to solve— that by itself is worth a pile of gold.
01:41:31.900 All right, and could you just briefly describe what your domain is, what you are mostly working on?
01:41:42.220 So, right now I work at a company that does conversational AI. So basically, we build things that people talk to, so a lot of NLP and AI and stuff like this, and we use Datalog all over the stack for all kinds of things.
01:42:06.160 But I’ve used this approach at previous companies in various kinds of domains, so it really isn’t domain-specific at all. Thank you so much. As for all the speakers with their unique presentations: it’s a Pivorak t-shirt.
Nice! It’s your turn.
01:42:50.890 Oh, thank you. Thank you so much.