How NEO4J Saved my Relationship

Ruby

Coraline Ada Ehmke

This video, titled "How NEO4J Saved my Relationship" and presented by Coraline Ada Ehmke at BathRuby 2016, discusses the practical uses of Neo4j, a graph database designed for complex data relationships that traditional relational databases struggle to handle. Ehmke introduces graph databases as a superior alternative for managing highly connected data, emphasizing their performance and flexibility compared to relational databases.

Overall, the presentation emphasizes the growing relevance of graph databases in modern application development, advocating for an exploration of Neo4j due to its performance, flexibility, and ease of use for complex data relations.

00:00:16.730 I was a little nervous when I found out there were going to be 500 people here, but then I learned that one person couldn't make it, so it was 499, which is kind of reasonable. So, I am Coraline Ada Ehmke, and I'm very happy to be in the UK.

00:00:24.869 You may not know this about me, but I'm actually an amateur UKologist. I've been tweeting all week since I got here with the #UKology hashtag for the benefit of my colleagues back in America to share some facts about your country.

00:00:38.430 That's absolutely true! I learned to sew, and I'm a Rubyist. I've been doing Ruby since 2007 and am the author of about two dozen RubyGems. You can learn all about my work at my website, coraline.codes, which I think is the most awesome vanity URL ever. I'm on Twitter as @CoralineAda.

00:00:54.360 As mentioned, I'm the author of the Contributor Covenant, which is the most popular open source code of conduct in the world, with over 14,000 adoptions, including Rails, JRuby, and Elixir. You should take everything I say today with a grain of salt since I'm speaking from the perspective of a liberal, progressive, transgender feminist.

00:01:20.549 I hold the controversial opinion that people should not be unkind to one another online. So you've been warned! Today, we're going to talk about databases. We have many options when it comes to selecting a database, but in the Rails world, we always seem to reach for PostgreSQL, or Redis if we're feeling especially experimental.

00:01:39.990 So why is that? Traditional relational databases were designed to model paper forms as tabular data. You can easily imagine a table in a relational database populated by a form in a GUI. But who says that tables are the best metaphor for how we store our data? There are many different kinds of databases at our disposal.

00:02:18.639 The one I want to talk about today is graph databases. Let's start by answering the question: What is a graph database? Graph theory was pioneered by Leonhard Euler in the 18th century and has been actively researched and improved upon by mathematicians, sociologists, and other scientists in the intervening centuries.

00:02:40.440 It's only in the past few years that graph theory and graph thinking have been applied to information management. Part of this is driven by the massive commercial success of companies like Google, Facebook, and Twitter, which use proprietary graph technologies to drive their solutions.

00:03:04.320 In recent years, we've also seen the introduction of general-purpose graph databases into the technology landscape. So, what kind of data is suited for a graph database? Data with context, where in order to understand the context, we need to know and qualify the connections between things.

00:03:30.450 In a graph database, there are two kinds of things we're dealing with: nodes and edges. Nodes are entities, and if you insist on thinking in relational terms, you can think of a node as a table, except that it's a table that can hold absolutely anything. It is object-agnostic. Edges, or relations, are named and directed, meaning they always have a start and an end node.

00:03:53.520 Nodes contain key-value pairs, and relations can also contain key-value pairs. You can attach metadata to relations. In a graph database, relations are first-class citizens, which means that we can actually query them, unlike join tables, which are not easy to query.
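The node-and-edge model described above can be sketched in a few lines of plain Ruby. This is only an in-memory illustration, not Neo4j's API: the `Node` and `Edge` structs, the `since` property, and its date are assumptions made up for the example.

```ruby
# Minimal in-memory sketch of a property graph (illustrative only).
Node = Struct.new(:id, :properties)
Edge = Struct.new(:name, :from, :to, :properties)

coraline = Node.new(1, { name: 'Coraline', handle: 'CoralineAda' })
bathruby = Node.new(2, { name: 'Bath Ruby', handle: 'BathRuby' })

# An edge is named and directed, and it carries its own key-value pairs.
# The :since value is a hypothetical piece of metadata.
follows = Edge.new(:follows, coraline, bathruby, { since: '2016-03-11' })
edges = [follows]

# Because edges are first-class, they can be queried directly.
# ISO 8601 date strings compare correctly as plain strings.
recent = edges.select do |edge|
  edge.name == :follows && edge.properties[:since] >= '2016-01-01'
end

recent.each do |edge|
  puts "#{edge.from.properties[:handle]} -[#{edge.name}]-> #{edge.to.properties[:handle]}"
end
```

The point of the sketch is that the metadata lives on the edge itself, so filtering relationships does not require a separate join structure.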

00:04:18.730 So why would you choose a graph database for relationship-heavy data? Graph databases have superior performance over relational databases for this kind of work. With a traditional database, query performance for connected data worsens as the number of records and the depth of the relationships increase. With a graph database, the cost of a query depends mainly on the part of the graph it actually traverses, so performance stays close to constant even as the total data grows.

00:04:37.440 The structure of a graph model flexes as our applications change and grow. Graph databases are schema-less, meaning you don't have to mess around with a lot of migrations. They are also object-agnostic and don't care what kind of data is stored.

00:05:02.220 Graph databases are well-suited for agile methodologies because we don't have to model the domain completely ahead of time, and we can change our minds. We can change the existing graph structure without modifying the data we already have or the functionality that we already have.

00:05:36.300 There are some typical use cases, which I'm not going to get into right now. It wouldn't be a graph database talk without modeling a social network; in fact, it's actually a requirement, and I don't want to lose my license.

00:05:54.440 So if you bear with me, in a relational database, we might model Twitter accounts this way, where there's a name, a handle, and of course, an ID associated with them. We might have a join table for followers that contains the follower ID and person ID, and maybe some metadata, like when they followed.

00:06:10.540 In a graph database, we use nodes and edges. Nodes would contain users, and edges would represent the relationship between users. We might have a relationship called 'follows,' and we could attach metadata about the followed relationship to that edge.

00:06:39.860 But when two or more domain entities interact, like Coraline Ada interacting with someone else in this example, facts emerge. Relations and metadata are facts. We can represent these facts as separate nodes with connections to the entities engaged in the fact.

00:07:01.420 So we might model it like this, with a 'follow' node: one person has a relation to the 'follow' node, which in turn has a relation to the other person. Once facts are represented as first-class citizens, we can ask questions like how many people someone followed on December 31st or who someone has been following for less than a month.
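To make the "facts as nodes" idea concrete, here is a small plain-Ruby sketch, assuming each follow is stored as its own record linked to both people. The `Person` and `Follow` structs, the `rubyconf` account, and all dates are hypothetical; a real graph would answer these questions with Cypher instead.

```ruby
require 'date'

Person = Struct.new(:handle)
# A "follow" fact is its own node, linked to both people involved.
Follow = Struct.new(:follower, :followed, :followed_on)

coraline = Person.new('CoralineAda')
bathruby = Person.new('BathRuby')
rubyconf = Person.new('rubyconf') # hypothetical account

follows = [
  Follow.new(coraline, bathruby, Date.new(2015, 12, 31)),
  Follow.new(coraline, rubyconf, Date.new(2016, 3, 1))
]

# How many people did this person follow on a given date?
def followed_on(follows, person, date)
  follows.count { |f| f.follower == person && f.followed_on == date }
end

# Who has this person been following for less than a month?
def following_less_than_a_month(follows, person, today)
  cutoff = today << 1 # Date#<< steps back one month
  follows.select { |f| f.follower == person && f.followed_on > cutoff }
         .map(&:followed)
end

today = Date.new(2016, 3, 11)
puts followed_on(follows, coraline, Date.new(2015, 12, 31))
puts following_less_than_a_month(follows, coraline, today).map(&:handle).inspect
```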

00:07:20.870 Notice that we are not interested in the kind of thing that's stored in a node; the graph database doesn't care. We have things that can be described as users, we have things that can be described as follows, and we have relations that connect these different kinds of things.

00:07:52.030 This is different from a relational database, where we'd have to store each of these things separately. I want to talk today about Neo4j, which is a specific graph database created by a company called Neo Technology. Neo4j is open source, and it was written in Java.

00:08:08.900 Interestingly, the native interface for Neo4j is over HTTP, and we are going to see an interesting side effect of that fact a little bit later. Neo4j was released in 2010 and comes in three editions: the community edition, which is free to use but limited to a single node; the enterprise edition, which gives you clustering, hot backups, and monitoring; and there's a government edition which is like the enterprise edition but comes with some additional certifications.

00:08:25.290 Neo4j is ACID compliant, which means it provides us with atomicity, consistency, isolation, and durability—guaranteeing reliable database transactions. Unlike some alternative databases, Neo4j stores graph data in a number of different store files. Each store file contains data for a specific part of the graph, and the division of storage responsibilities between these different files facilitates very high-performance graph traversals.

00:08:48.240 The node store is a fixed-size record store where each record is exactly nine bytes long. These fixed-size records enable fast lookups. For example, if we have a node with an ID of 100, we know that its record begins 900 bytes into the file based on this format, allowing the database to compute a record's location in constant time rather than having to search it.

00:09:16.280 The first byte of a node is the in-use flag, which indicates whether there is data associated with this node or if it can be reused. The next four bytes are the ID of the first relationship connected to the node. The last four bytes represent the ID of the first property for the node. The record is quite lightweight, containing just a couple of pointers.
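The fixed-size record arithmetic can be worked through in Ruby. This is a sketch of the layout as described in the talk (1 flag byte, 4 bytes for the first relationship ID, 4 bytes for the first property ID), not Neo4j's actual store code; the big-endian byte order and the helper names are assumptions.

```ruby
NODE_RECORD_SIZE = 9 # bytes per node record, as described in the talk

# Constant-time lookup: the record for a node starts at id * record size.
def node_record_offset(node_id)
  node_id * NODE_RECORD_SIZE
end

# 'C' = 8-bit unsigned flag, 'N' = 32-bit unsigned big-endian integer.
# The byte order is an assumption made for this sketch.
NODE_RECORD_FORMAT = 'CNN'

def pack_node_record(in_use, first_rel_id, first_prop_id)
  [in_use ? 1 : 0, first_rel_id, first_prop_id].pack(NODE_RECORD_FORMAT)
end

def read_node_record(store, node_id)
  bytes = store.byteslice(node_record_offset(node_id), NODE_RECORD_SIZE)
  flag, first_rel_id, first_prop_id = bytes.unpack(NODE_RECORD_FORMAT)
  { in_use: flag == 1, first_rel_id: first_rel_id, first_prop_id: first_prop_id }
end

# Build a fake in-memory "store file" holding records for node IDs 0..100.
store = (0..100).map { |id| pack_node_record(true, id * 2, id * 3) }.join

puts node_record_offset(100)            # 100 * 9
puts read_node_record(store, 100).inspect
```

No search or index is involved: the node ID alone determines where the record lives, which is why the lookup cost does not depend on how many nodes the store holds.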

00:09:37.010 The relationship store consists of fixed-size records, where each record is 33 bytes. Each relationship record contains the IDs of the nodes at the start and the end of the relation, a pointer to the relationship type, and pointers for next and previous relationship records. These last pointers are part of what's called the relationship chain.
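A relationship chain can be pictured as a linked list threaded through the relationship records. The sketch below is a deliberate simplification and an assumption on my part: it keeps only "next" pointers (one for the start node's chain and one for the end node's chain), uses a Ruby hash instead of a 33-byte record file, and all IDs are made up.

```ruby
# Simplified relationship record: endpoints, type, and "next" pointers only.
RelRecord = Struct.new(:start_node, :end_node, :type, :start_next, :end_next)

# Hypothetical store: relationship ID => record. nil marks the end of a chain.
RELS = {
  1 => RelRecord.new(10, 20, :follows, 2, nil),
  2 => RelRecord.new(10, 30, :follows, nil, 3),
  3 => RelRecord.new(40, 30, :follows, nil, nil)
}.freeze

# What the node record's "first relationship ID" pointer would hold.
FIRST_REL = { 10 => 1, 20 => 1, 30 => 2, 40 => 3 }.freeze

# Walk the chain for one node by following pointers from record to record.
def relationships_of(node_id)
  result = []
  rel_id = FIRST_REL[node_id]
  while rel_id
    record = RELS.fetch(rel_id)
    result << rel_id
    # Follow the pointer that belongs to this node's side of the relationship.
    rel_id = record.start_node == node_id ? record.start_next : record.end_next
  end
  result
end

puts relationships_of(10).inspect
puts relationships_of(30).inspect
```

Traversal only ever touches the records on the chain of the node in question, which is the property the talk credits for fast graph traversals.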

00:09:57.570 It's very efficient in its storage and is designed for high-performance operations and fast queries. So how do we query in Neo4j? We don't use SQL; we use a language called Cypher. I'm going to walk through the CRUD (Create, Read, Update, Delete) verbs to show you what Cypher looks like.

00:10:02.700 Our first operation is create. We're creating a GraphPerson node, A, with the name 'Coraline' and the handle 'CoralineAda.' We'll also create a node, B, which is a GraphPerson with the name 'Bath Ruby' and the handle 'BathRuby' with no spaces. Then, to create a link between the two, we match A and B where A's name is Coraline and B's name is Bath Ruby. We create a 'follows' relationship and, for the purposes of illustrating what we created, we return both nodes with the 'follows' relationship in between.

00:10:30.350 The thing I want to point out with that particular query is that we are actually using ASCII art to draw a relation between A and B as an arrow because they are directional. We can visually represent which direction the connection goes, which is pretty cool. Now for a read operation, we're going to match node A that follows node B, where node A's name is Coraline, and will return node A, the follow relationship, and all instances of node B.

00:11:05.200 For an update operation, we're going to match node A with the name 'Coraline,' we're going to set A's name to 'Code Witch,' and we're going to return node A. Delete is a little bit more complicated; we'll match node A where A's name is Coraline, and we will detach and delete A.

00:11:35.590 This is important because an edge has to have both a start and an end: it must connect two nodes. We can't just delete nodes; we must detach them from all existing connections, or edges, as part of that delete operation.
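The slides are not part of this transcript, so the following is a reconstruction of what the four CRUD statements might look like, held as plain Ruby strings so they can be printed or handed to a driver. The `GraphPerson` label, the `follows` relationship type, and the property names are assumptions based on the spoken description.

```ruby
# Cypher statements for the CRUD walkthrough, kept as plain Ruby strings.
# Labels, relationship type, and property names are assumptions.
CRUD_QUERIES = {
  create_nodes: <<~QUERY,
    CREATE (a:GraphPerson {name: 'Coraline', handle: 'CoralineAda'})
    CREATE (b:GraphPerson {name: 'Bath Ruby', handle: 'BathRuby'})
  QUERY
  create_follow: <<~QUERY,
    MATCH (a:GraphPerson), (b:GraphPerson)
    WHERE a.name = 'Coraline' AND b.name = 'Bath Ruby'
    CREATE (a)-[r:follows]->(b)
    RETURN a, r, b
  QUERY
  read: <<~QUERY,
    MATCH (a:GraphPerson)-[r:follows]->(b)
    WHERE a.name = 'Coraline'
    RETURN a, r, b
  QUERY
  update: <<~QUERY,
    MATCH (a:GraphPerson)
    WHERE a.name = 'Coraline'
    SET a.name = 'Code Witch'
    RETURN a
  QUERY
  delete: <<~QUERY
    MATCH (a:GraphPerson)
    WHERE a.name = 'Coraline'
    DETACH DELETE a
  QUERY
}.freeze

CRUD_QUERIES.each do |name, cypher|
  puts "-- #{name}"
  puts cypher
end
```

The `(a)-[r:follows]->(b)` pattern is the ASCII-art arrow mentioned in the talk, and `DETACH DELETE` removes a node together with its relationships in one step.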

00:12:01.170 You are probably used to a database query interface that looks something like this for PostgreSQL. What kind of console does Neo4j provide? This is pretty cool. This is where the native HTTP interface comes in handy. Neo4j's console is actually a single-page web app. Let's try a friends of friends query against our sample Twitter dataset and see what that looks like.

00:12:15.900 We're going to match a follower, which is of type GraphPerson, with a connection that we will call 'follows' to a depth of two, connected to nodes where the name of the user is the lure. And because we don't want to return a ton of records, we'll limit it to one hundred records.

00:12:48.060 Here's the cool part; check that out! Isn't that awesome? We get a graph because our data is graphical, and why not get a graph back? We can see the relation between the things: the node we queried at the beginning is at the center of the graph, then the immediate followers, and then the followers of those followers, which is pretty cool. You can drag them around and double-click on them to see all the metadata associated with each node.
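The friends-of-friends query described above relies on Cypher's variable-length path syntax. Below is a hedged sketch of a helper that builds such a query as a string; the helper name, the `GraphPerson` label, the `follows` type, and the choice of a `1..depth` range (so that both immediate followers and followers of followers are returned) are assumptions, since the query on the slide is not in the transcript. `$name` is a Cypher query parameter to be supplied when the query is run.

```ruby
# Builds a followers-to-depth Cypher query (hypothetical helper).
def followers_to_depth_cypher(depth, limit: 100)
  unless depth.is_a?(Integer) && depth.positive?
    raise ArgumentError, 'depth must be a positive integer'
  end
  unless limit.is_a?(Integer) && limit.positive?
    raise ArgumentError, 'limit must be a positive integer'
  end

  # [:follows*1..N] matches paths of one up to N 'follows' hops.
  <<~QUERY
    MATCH (follower:GraphPerson)-[:follows*1..#{depth}]->(user:GraphPerson)
    WHERE user.name = $name
    RETURN follower
    LIMIT #{limit}
  QUERY
end

puts followers_to_depth_cypher(2)
puts followers_to_depth_cypher(3, limit: 25)
```

Only integers are interpolated into the string; the user's name stays a parameter so that it is never spliced into the query text.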

00:13:01.780 Now let's try it again with friends at a depth of three and check out the performance. So I just tapped up and entered the same query from history, changing the depth to three. We run it again, and it's still super fast! Awesome! So, we're Ruby developers, how do we do Neo4j in Ruby? We use a gem, of course: neo4jrb, which is an ActiveModel-compliant wrapper for the Neo4j graph database.

00:13:44.100 Because it's ActiveModel compliant, if you know Active Record, you can navigate pretty easily in neo4jrb. Let's take a look at some code from our Twitter example: we have a GraphPerson. Notice that we're including Neo4j::ActiveNode, so unlike Active Record, this is not an inheritance situation, which I actually prefer.

00:14:08.650 We're adding behavior to our class; we can define our class however we want, and we don't have to inherit from a base class the way we do with Active Record. We also declare a couple of properties. This is a declarative schema where we define a name property, which we index because we plan to search on it, as well as a handle property.
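The "include a module instead of inheriting" pattern with declarative properties can be illustrated without the gem. The `TinyNode` module below is a hypothetical stand-in written for this sketch; it is not neo4jrb's implementation, and it only records the declared properties rather than talking to a database.

```ruby
# Hypothetical stand-in for a node mixin; NOT the neo4jrb implementation.
module TinyNode
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Declared properties, e.g. { name: { index: true } }.
    def properties
      @properties ||= {}
    end

    # Class-level macro: records the declaration and adds accessors.
    def property(name, index: false)
      properties[name] = { index: index }
      attr_accessor name
    end
  end

  def initialize(attrs = {})
    attrs.each { |key, value| public_send("#{key}=", value) }
  end
end

class GraphPerson
  include TinyNode # behavior is mixed in; the superclass stays free

  property :name, index: true # indexed because we plan to search on it
  property :handle
end

person = GraphPerson.new(name: 'Coraline', handle: 'CoralineAda')
puts person.name
puts GraphPerson.properties.inspect
puts GraphPerson.superclass
```

Because the behavior arrives through `include`, the class can still inherit from whatever it likes, which is the design preference expressed in the talk.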

00:14:29.340 We also define our relationships. We have many outbound edges which we will call 'followers,' and we are going to relate them to the class 'Follower.' In terms of our edge, we include Neo4j::ActiveRel. I hate the fact that it's called ActiveRel instead of ActiveRelation; I think that abbreviations are indeed the devil.

00:14:57.050 Then we declare the 'from' class as GraphPerson and the 'to' class as GraphPerson, and the type of the relationship is 'follows.' You could add metadata properties here too if you wanted to. Back in GraphPerson, we have a method called friends_to_depth. This is what it looks like when you build a Cypher query in your Ruby code.

00:15:29.010 We're going to query as W, which is just a handle to the query. We're going to match a follower with a followers relationship to a depth specified for a user with the handle of our choosing. We will skip the direct relations and just go to the nested relations, returning distinct followers by handle.
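The logic of that method (walk followers to a given depth, skip the direct relations, return distinct handles) can be sketched against an in-memory follower map. This is an assumption-laden stand-in: the real method builds a Cypher query through neo4jrb, while this version uses a plain hash and a breadth-first walk, and all handles other than `CoralineAda` are invented.

```ruby
require 'set'

# Hypothetical data: handle => handles of that account's direct followers.
FOLLOWERS = {
  'CoralineAda' => %w[alice bob],
  'alice' => %w[carol bob],
  'bob' => %w[dave],
  'carol' => %w[erin]
}.freeze

# Distinct followers reachable within `depth` hops, excluding direct followers.
def friends_to_depth(handle, depth, followers = FOLLOWERS)
  direct = followers.fetch(handle, [])
  seen = Set.new([handle])
  frontier = [handle]
  found = []

  depth.times do
    # Expand one hop, dropping anyone already visited.
    frontier = frontier.flat_map { |h| followers.fetch(h, []) }
                       .reject { |h| seen.include?(h) }
                       .uniq
    frontier.each { |h| seen << h }
    found.concat(frontier)
  end

  (found - direct).sort
end

puts friends_to_depth('CoralineAda', 2).inspect
puts friends_to_depth('CoralineAda', 3).inspect
```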

00:15:58.640 One reason we map to followers is that the queries are lazily evaluated. So when you call that, you actually get a query object back, and when you evaluate it, you receive a struct with follower objects inside it. To access the records, you need to use a mapping method.
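Lazy evaluation of a query object can be illustrated with a small Enumerable wrapper. `LazyQuery` and `Row` are hypothetical names for this sketch and the "database call" is just a block returning an array; neo4jrb's own query objects are more involved.

```ruby
# Hypothetical lazily evaluated query: nothing runs until it is enumerated.
class LazyQuery
  include Enumerable

  attr_reader :executions

  def initialize(&source)
    @source = source # stands in for "send the query to the database"
    @executions = 0
  end

  def each(&block)
    @executions += 1
    @source.call.each(&block)
  end
end

# Each result row is a struct wrapping the follower object.
Row = Struct.new(:follower)

query = LazyQuery.new { [Row.new('carol'), Row.new('dave')] }
executions_before = query.executions # building the query ran nothing

# Mapping over the query triggers execution and unwraps the rows.
handles = query.map(&:follower)

puts executions_before
puts query.executions
puts handles.inspect
```

This is why the method in the talk ends with a mapping step: the map both forces the query to run and pulls the follower out of each returned struct.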

00:16:29.730 Let's see an example from the REPL. We'll take the first GraphPerson and assign it to gp. Then we call gp.friends_to_depth with a depth of 2, which returns our graph people. We can even increase the depth to 3, which is still fast, and even depth 4, which is incredibly fast.

00:16:54.890 Speaking of performance, how does the performance of Neo4j compare to relational databases? There was some research conducted that compared Neo4j to a relational database in responding to queries, with the time in seconds shown for each. At a depth of two, returning 2,500 records, the performance of the relational database and the graph database is roughly equivalent.

00:17:14.220 However, you can see that the relational database performance worsens as the depth increases. To be fair, PostgreSQL has added recursive queries which enhance performance for friend-of-friend queries, but Neo4j is still significantly faster.

00:17:40.970 I have shown you the requisite Twitter graph as an example of how to use Neo4j, but Neo4j can actually be used for some pretty advanced data modeling as well. Now, let me introduce you to Sophia, which is my artificial intelligence side project.

00:18:06.320 Sophia features natural language processing with semantics and grammatical mapping. My goal in designing Sophia is that she will be able to comprehend and even create metaphors. In short, I want her to be able to dream!

00:18:38.480 So how did I model data in Sophia? Core abstract concepts are called contexts. In my management application, here are some contexts like 'animal,' 'beauty,' 'color,' and 'difficulty.' If we look at specific contexts like beauty, you can see we have these roots.

00:18:58.820 Roots are basically expressions of a context, and each has a base form and grammatical forms. If we look at one of the roots, like 'beautiful,' it has a positivity ranking, meaning that expressions related to a context can be positive or negative.

00:19:13.730 For instance, 'hot' indicates a positive expression, while 'cold' would be considered a negative expression. Because of these positivity rankings, it’s easy to derive synonyms and antonyms just by comparing those rankings.
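Deriving synonyms and antonyms from positivity rankings can be sketched as follows. The numeric scale, the sample words and their scores, and the rule that an antonym has the exactly opposite ranking are all assumptions made for illustration; the talk only says the rankings are compared.

```ruby
Root = Struct.new(:base_form, :positivity)

# Hypothetical roots within a single context ("beauty"), with made-up rankings.
BEAUTY_ROOTS = [
  Root.new('beautiful', 2),
  Root.new('gorgeous', 2),
  Root.new('pretty', 1),
  Root.new('ugly', -2),
  Root.new('hideous', -2)
].freeze

# Synonyms: other roots in the context with the same positivity ranking.
def synonyms(root, roots)
  roots.select { |r| r != root && r.positivity == root.positivity }
       .map(&:base_form)
end

# Antonyms: roots with the opposite ranking (an assumed rule).
def antonyms(root, roots)
  roots.select { |r| r.positivity == -root.positivity }.map(&:base_form)
end

beautiful = BEAUTY_ROOTS.first
puts synonyms(beautiful, BEAUTY_ROOTS).inspect
puts antonyms(beautiful, BEAUTY_ROOTS).inspect
```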

00:19:38.760 So we have a list of synonyms here, and some antonyms. If a word matches exactly the positivity ranking of this word, then it shows up as related. At the bottom, we have parts of speech. The concept of beautiful can be expressed as an adverb 'beautifully' or as an adjective 'beautiful.'

00:19:55.680 We also have some metadata associated with those parts of speech. For example, 'beautifully' modifies 'manner' and is a verbal adverb. As an adjective, it has a comparative form 'more beautiful' and a superlative form 'most beautiful.' So how did I model this in Neo4j?

00:20:25.440 Let's do a query where we look at a context and its roots. We're going to match node A, which is a Gramercy meta context with the name 'beauty,' with a directional relation to node B, which is a Gramercy meta root, and return the context, the relation, and the root.

00:20:48.640 We will limit it to 100 records. You can see we have 'beauty' in the center, and each of those things radiating off of it is a root. Now, let's go from context and roots all the way down to parts of speech.

00:21:11.400 For this, we will match node A, which is a Gramercy meta context with the name 'beauty,' with a directional relation to a Gramercy meta root and then, one layer deeper, a directional relation to node C, which is a Gramercy part of speech, and return all of the nodes and relations we referenced.

00:21:34.350 This is a simpler graph where we're simply looking at roots for the word 'beauty.' We have a couple of roots hanging off of that with grammatical expressions as the green nodes; the red nodes are the roots, and the orange nodes represent context.

00:21:57.940 Sophia also understands is-a relationships, which are a way of modeling facts. I have this fact explorer, which is a list of facts that she knows, such as: a cat is a mammal; a cat has fangs and a tail; 'Elfie' is a cat; and that 'Elfie' is adorable!

00:22:26.250 We can query facts using natural language. The parser takes a question in the context of animals and figures out that the subject is 'cat,' the verb is 'have,' and the predicate is 'fangs.' The answer to 'Does a cat have fangs?' is yes, because a mammal is a living thing.

00:22:59.790 In this case, two concepts arise: living thing and the fact that a cat has a tail. Sophia then knows that some mammals, like cats, do in fact have tails.

00:23:22.600 We can also add new facts by making declarative sentences. For example, 'Elfie is cool,' and the response can be 'I'll remember that.' If we look down at the bottom, Sophia parses the sentence, identifying it as a statement.

00:23:39.610 The subject is 'Elfie,' the verb is 'is,' and the predicate is 'cool.' For the context, Sophia knows this refers to an animal because 'Elfie' has already been expressed as a cat and a cat is a mammal and a mammal is a living thing.

00:24:06.250 However, Sophia is not able to determine what 'cool' means, as the term can express temperature but also a disposition. There's not enough context in the sentence to distinguish between the two.

00:24:25.710 If we had additional words, like 'She's cool and interesting,' then Sophia would understand that we're talking about disposition. And again, we can ask questions that rely on that context. So, is a cat cool? Because we stated that 'Elfie is cool' and 'Elfie' is indeed a cat, Sophia knows to state that some cats are, in fact, cool.

00:24:45.380 So how is this hierarchy structured? Let’s look at another query: we’re going to match node A, which is a category that has relations to multiple nodes called objects, with a relation labeled as f2 and another relation called f-22 which is an is-a component.

00:25:06.240 A component is something that makes up an object, and objects also have characteristics, which are attributes describing them. We can see that a cat is connected to 'mammal' and 'animal,' indicating that cats have tails and fangs, and mammals have fur.
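The is-a hierarchy with components and characteristics can be sketched with three small hashes. This is an illustrative model using the facts mentioned in the talk (Elfie, cat, mammal, living thing, fangs, tail, fur, adorable, cool); the hash layout and method names are assumptions, and Sophia's actual graph and language parsing are far richer.

```ruby
# is-a links: each thing points at the category it belongs to.
IS_A = { 'Elfie' => 'cat', 'cat' => 'mammal', 'mammal' => 'living thing' }.freeze
# Components make up an object; characteristics describe it.
COMPONENTS = { 'cat' => %w[fangs tail], 'mammal' => %w[fur] }.freeze
CHARACTERISTICS = { 'Elfie' => %w[adorable cool] }.freeze

# The thing itself followed by every category it belongs to.
def lineage(thing)
  chain = [thing]
  chain << IS_A[chain.last] while IS_A.key?(chain.last)
  chain
end

# "Does a cat have fangs?": check the thing and all of its ancestors.
def has_component?(thing, component)
  lineage(thing).any? { |t| COMPONENTS.fetch(t, []).include?(component) }
end

# "Is a cat cool?": true if some member of the category has the trait,
# which supports the answer "some cats are cool".
def some_are?(category, characteristic)
  CHARACTERISTICS.any? do |thing, traits|
    traits.include?(characteristic) && lineage(thing).include?(category)
  end
end

puts lineage('Elfie').inspect
puts has_component?('cat', 'fangs')
puts has_component?('Elfie', 'fur')
puts some_are?('cat', 'cool')
```

Components are inherited downward through the is-a chain, while a characteristic stated about one individual only licenses a "some" answer about its category.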

00:25:27.060 So that's my AI project, Sophia. For Sophia, a graph database was an obvious choice due to the complexity of the data I was modeling, and it's paid off quite well. But when should you use a graph database? If your relationships are complex, if the metadata around a relation is as important as the relation itself, if your data is deeply nested, or if performance is critical to you, because graph databases exhibit linear performance characteristics.

00:26:15.540 When you want to move quickly, graph databases allow you to change your mind during development processes. And let's be honest: use a graph database if you like bright, shiny things! As developers, we are attracted to bright, shiny objects, so we should be honest about that.

00:26:26.520 Give it a try! It's a quick download; installation takes just a few minutes, so you might want to grab a cup of coffee or tea. There's great documentation for the neo4jrb gem, and its compatibility with ActiveModel makes it easy to get started.

00:26:26.520 New tools encourage exploration and play, and personally, this is how I learn best: by leaving behind assumptions about how things work and finding new ways to make them work. From my perspective, this is one of the best parts of being a developer: being constantly encouraged to learn and grow.

00:26:26.520 So, give graph databases a try! Even if you don't end up using one in production, you will challenge your assumptions about data modeling and pick up some neat new tricks along the way. Hopefully, I've inspired your curiosity and you will go ahead and try it out. Thank you very much!

BathRuby 2016