Database Design for Beginners

by David Copeland

In the talk "Database Design for Beginners" at RailsConf 2019, David Copeland emphasizes the critical importance of database design in building reliable applications. He starts with a thought experiment comparing the consequences of losing an application's source code versus losing its database. Through this exercise, he posits that the data is often more crucial than the application code itself, underscoring the need for solid database modeling.

Key Points Discussed:

Understanding Databases: Copeland clarifies that the focus is on OLTP (Online Transaction Processing) databases, which serve as the single source of truth for information in applications.
Importance of Normalization: He introduces normalization, which is the process of structuring a database to minimize redundancy and dependency, ensuring that it can be trusted. He mentions different normal forms, culminating in Boyce-Codd Normal Form (BCNF).
Example of a Wrestling Database: To illustrate his points, he uses a relatable example based on professional wrestling. He outlines the facts needed to build a wrestling database, such as wrestlers, shows, and matches, demonstrating how to identify entities and relationships.
Identifying Anomalies: Copeland describes how poorly designed databases can lead to anomalies that compromise data integrity. By progressively refining the data model, he shows how to eliminate these anomalies through a clear understanding of functional dependencies and keys, ensuring that the database is robust.
Primary vs. Synthetic Keys: Discussion on the distinction between business keys (natural identifiers from the data) and synthetic keys (artificial identifiers created to maintain data integrity).
Physical vs. Logical Models: He concludes by distinguishing between the logical model (the design and business rules) and the physical implementation in a Rails application, stressing the importance of enforcing constraints in the actual database.

Conclusions/Takeaways:
- Quality database design is crucial; prioritize understanding business rules before modeling data to ensure reliability.
- Normalization, particularly BCNF, is essential for trustworthy data modeling that keeps anomalies at bay.
- A solid understanding of functional dependencies and establishing primary and synthetic keys safeguards the integrity of the data architecture.
- Regularly revisiting and updating the database design is necessary to adapt to changing business requirements.

Ultimately, the presentation aims to bridge the knowledge gap in database design, equipping developers with the skills to effectively model and manage data, which is foundational in application development.

00:00:20.510 All right, hello! Thank you all for coming. I want to start off this talk with a thought experiment. It's a way to think through something in your head to try to learn about it. You usually play out a hypothetical situation, and that can reveal interesting information.

00:00:34.050 So I want you to think about an application that you work on that has a database—think of one that is very important to whoever is paying you to work on it. For example, I worked on a warehouse management system at Stitch Fix that managed our inventory: how much we paid for items, who has them, where they are, and to whom we sold them. It's super important. Now, consider what would happen if the source code of that application disappeared from existence—gone forever, with no backups or Git repositories to refer to. How could you recover from that? It would be painful and expensive, but with your knowledge of how the app works, you could work through it, and the business people would have a sense of how they want it to operate. Thankfully, the database would still be there to help you piece it together.

00:01:19.170 Now, imagine the other scenario: the application is fine, but the data in the database goes away forever—with no backups or dumps sitting in S3. It's just gone, never to be returned. How could you recover from that? I think that would be an extinction-level event. You would never be able to recover because the whole reason you have a database is to remember things that you can't recall or to share information with people who need it. If that is gone, you can't remember what the data looked like to repopulate it—you would be dead.

00:02:01.830 What this thought experiment tells me is that the data in applications is actually more important than the source code that manages them. That's interesting because I suspect we've all spent a lot more time learning how to write code rather than modeling databases or managing data. I have two degrees in computer science, but I didn't take a single database course. They didn't make me do any of that. So, today, I'm going to try to help close that gap.

00:02:19.709 My name is David Copeland, or @davidcopeland on Twitter. How I know anything about database design, despite my lack of courses in college, is that I was fortunate enough to work with someone who understood database design and the theoretical computer science that makes it a real thing. He was very particular about having our databases modeled properly. He explained why that was important by showing me a database that was poorly modeled, giving me the language to express why it was insufficient.

00:02:36.689 Since then, I've read books and papers to understand the theoretical computer science underlying what we're going to discuss. If I do a good job here, you won't have to worry about the complex math because it can be challenging to understand. Hopefully, we can all start to feel more confident about database design. As a side note, while I have your attention, I'm the co-author of 'Agile Web Development with Rails 6,' which just came out in beta today. If you're new to Rails and want a book to learn Rails, this is a very good one. I'll tweet a discount code later, so don't worry about that.

00:03:12.540 Now, let's start with what a database is. Many things in life can be considered databases—Redis is a database, your file system is a database, and Excel can be a database. But the kind of database we're talking about is the type your application probably uses, called OLTP, which stands for Online Transaction Processing.

00:03:37.109 Instead of delving into several slides to explain what that means, I'll cut to the chase. Essentially, OLTP is a source of truth. It is the one place where the facts about the world that you care about are stored. For instance, in that warehouse management system I mentioned, the source of truth is where we store information about how much we sold an item for. This data gets copied everywhere—it is cached for performance, and it gets aggregated for business intelligence.

00:04:11.220 Thus, it stands to reason that database design is how we ensure our database can be trusted as the source of truth. Think about Excel: if we just throw everything into it, we can place whatever we want there. How can we trust it as a source of truth if literally anything can go in there? Therefore, the design of the database in Excel does not lend itself to reliable data storage. This leads us to the first consideration in our database design: what facts about the world do we want to record?

00:04:43.139 Often, we're given a generic business problem to solve or a wireframe to implement, and that is perfectly fine. However, implied within those requests is the need to capture information that will be useful to support answering questions. This is what we want to put into our database. These are the facts that we want to record, and we need to understand what those facts are. Active database design focuses on ensuring you clearly comprehend the requirements.

00:05:24.810 So the first step is to document what those facts are. We're going to work through an example database that will initially be terrible and improve over time. We'll see why it is getting better, as understanding these concepts can be quite challenging without a running example. For this, I’ve chosen a domain that many of you might be familiar with: professional wrestling. We're going to make an application related to professional wrestling that stores data about this domain.

00:05:54.150 Let’s consider the facts we want to know. A wrestler may have a finishing move, which they use to win matches. We want to keep a database of who uses which finishing move. Wrestlers can compete in particular shows too: for example, The Rock might wrestle on SmackDown, while Stone Cold Steve Austin fights on RAW. We need to record who wrestles on what show, as not every wrestler competes on the same show.

00:06:35.909 A wrestler might also be cleared to wrestle in a specific match on a show. We should track when The Rock participates in the main event because that will boost ratings, but perhaps Zack Ryder competes in an opening match. Additionally, in professional wrestling, each show typically involves a title fight. A show on RAW, for example, will feature the Universal Championship battle.

00:07:00.840 You don’t have to know anything about professional wrestling to begin thinking about how to solve this problem. As a seasoned Rails and Ruby developer, you’ll be looking at all these nouns: they are key to understanding what you should do next. We will have a Wrestler table, a Shows table, and we need to determine how matches are linked to these. We can start to see an object model begin to take shape: our Wrestler table will store information on wrestlers, while the Shows table contains the relevant shows.

00:07:37.140 It’s important to understand that if you haven’t done any data modeling before, the process we are going through might seem fast and confusing. For those who have experience, it may feel intuitive to build a relationship based on your understanding, which isn’t necessarily wrong. However, such a gut feeling approach can lead to issues.

00:08:47.430 Let's say Zack Ryder ends up wrestling in an opening match on a show called 'Superstars,' but suppose that 'Superstars' does not actually have an opening match. Our data model may let us record these two pieces of conflicting information, which is problematic. This ambiguity cannot coexist; we don’t want our design to generate contradictions, and a model that allows for inconsistencies cannot be trusted to serve as the source of truth.

00:09:34.309 What we are encountering is known as an anomaly, which occurs when the data model either prevents certain facts from being stored, allows contradictory information to exist, or forces us to delete information to maintain consistency. We’ll see more specific examples of anomalies shortly, but the key takeaway is that a data model with anomalies cannot reliably serve as the source of truth, allowing incorrect information or failing to store correct information.

00:10:10.440 Normalization is the process through which we eliminate anomalies from our design. This might sound formal and somewhat abstract, but it is incredibly helpful. A level of formality will enable us to make strong statements about the changes we’re going to implement in our data model. Consequently, we can gain confidence that we are improving the quality of our design. I won’t delve into a lot of complex mathematics, but I will use a few terms from this area of research so if you want to know more, you can look these terms up.

00:10:58.140 Normalization is what we're going to learn about, and normal forms are established truths about the data model that tell us which anomalies have been removed. Normal forms range from lower to higher—for example, a lower normal form is subject to more anomalies while a higher, more normalized data model has fewer anomalies, and thus can be more reliably trusted as a source of truth.

00:11:39.490 Let’s return to our wrestling domain and instead of theorizing, let’s visualize it. Here’s our initial data model in an Excel spreadsheet format: we have a column for every fact we want to store along with some rows as data. For instance, we might record The Rock's finishing move as 'The Rock Bottom,' and note that he wrestles on RAW and SmackDown—where the Universal Championship is defended. We are starting to collect facts about the landscape we consider.

00:12:29.310 People may look at this structure and see it's not a good design. We can say it looks messy, contains duplication, and a lack of rules, which clarifies why it fails as a data model. Our first step is to modify our table into a relation—it’s the minimum design requirement for any legitimate database. Moving forward, we will make improvements that lead us to what is identified as a normalized schema, specifically Boyce-Codd Normal Form.

00:13:32.700 For perspective, a relation is a table that lacks duplicate rows, contains each field's value only once, and upholds that all fields in each column are of the same type. Although it may seem simplistic, this is the bare minimum. Remembering the initial Excel format brings light to how it does not meet these criteria. Before we assess our database as a relation, I want to discuss data types. It's important to note that all fields in a table need to share the same type.

00:14:27.420 To illustrate, consider the names of wrestlers—there are many strings available representing names, but the total number of strings far exceeds the number of wrestler names. Thus, the data types must be defined. Another important aspect to keep in mind is that a NULL is not a value; hence, NULL cannot be allowed in our database at all. We'll explain what that means when we write code to interact with our data model, but for now, NULL should not be a concern.

00:15:35.380 Returning to our Excel spreadsheet, we can see multiple issues. For instance, the Universal title should just be referred to as 'Universal Championship,' which is inconsistent and a duplicate entry. To resolve this, we can correct the title and duplicate every row to eliminate multiple values—now we have a relation. Every row adheres to single values and types, but we also acknowledge anomalies may still exist here.

00:16:21.270 For example, Kenny Omega is a wrestler who does not work for WWE but possesses a finishing move that we would like to include in our database. However, if he isn't on a show right now—recording his finishing move becomes impossible. If we delete The Rock and Kevin Owens, we will also lose sight of which show the Universal Championship is defended, which can't happen.

00:17:36.680 These examples represent the anomalies that we need to eliminate from our data model, and by transitioning from whatever we have into a relation, we can utilize a clearly defined method to do just that. This involves documenting the business rules in terms of functional dependencies and keys. By defining business rules using functional dependencies, we will confidently clarify which elements can be relied upon to maintain accuracy in our design.

00:18:48.780 Functional dependency, simply put, occurs when the value of one column implies the value of another column based on the business rule. For instance, if we know a wrestler's name, we can confidently determine their finishing move. The same goes for a show title. However, a show can feature multiple matches, so there isn’t a valid functional dependency arising there.

00:19:08.070 Moreover, functional dependencies may not apply in reverse situations. For example, multiple wrestlers may use the same finishing move, meaning a finishing move does not uniquely identify a wrestler. This entire classification depends on the business rules established for our application. If we had a rule that no two wrestlers could use the same finishing move, then it would introduce a functional dependency.

00:20:04.560 It is essential to differentiate between the data fulfilling functional dependencies and the data model itself executing that requirement correctly. Presently, our Excel spreadsheet does satisfy these functional dependencies because it's based on a singular row. However, the crucial point is that while the data might seem to validate the dependencies, the data model must also reflect this to avoid ambiguous outputs.

00:20:49.480 It's important to highlight that unambiguous results must lead to querying 0 rows or 1 row. Thus, keys maintain this standard. Our functional dependency between wrestler and their finishing move makes it clear we should retrieve this unambiguous outcome. However, just the wrestler's name isn't inherently a key, though it is part of a greater key.

00:21:07.130 Thus, if a query only pulls data based on a wrestler's name, it cannot guarantee a non-ambiguous response. For example, if I query for The Rock's finishing move, the result is ambiguous based on existing rows. Our database is allowing the storage of data that violates the defined business rules, as its design lets us return uncertain results.

00:21:46.420 So how do we fix this? The challenging part of design is understanding and capturing requirements clearly. To get to our solution, we might consider a table. If the wrestler serves as the key of this table and we also include their finishing move within this table, then we should be able to query it accurately to yield results that maintain reliable standards.

00:22:37.470 Now we will create a table where the wrestler becomes the key, while populating the other fields using the finishing move. Thus, we no longer require a finishing move column in our original table. By redoing the tables and assessing relevant connections, we will observe that data representation starts to take shape. The Wrestler's name can link directly to matches, creating relational integrity.

00:23:29.680 In continuing this method, we perceive the Show and Title functional dependencies expressed similarly. Functioning through consequential chains, we can soon observe a data model emerging that can produce logical deductions, with no duplication or ambiguity. That clarity leads us down a track to conclude that unconfused information should be attainable by querying.

00:24:34.850 The design maintains consistency across functional dependencies associated with each part of our key identified earlier. By applying this framework, we can ensure that this pattern will yield accurate information when it comes to conducting queries based on our modelling.

00:25:01.730 What happens next is grounded in theoretical computer science and logic. This process isn’t merely theoretical; it involves applying a methodology backed by fundamental mathematics, which guarantees the enhancements made to our database design. Essentially, we have transformed our original relation into Boyce-Codd Normal Form.

00:25:51.600 The result implies we now have a database model free of anomalies stemming from functional dependencies, meaning our understanding of the defined business rules enables our model to secure data integrity without violating constraints. Thus, operating under these principles affirms our approach, creating a configuration that holds no space for misunderstandings.

00:26:49.040 After following these distinctive steps, we made significant progress. We can now insert or modify records with the security of knowing the integrity of our relations is upheld. Now we realize that we can easily add Kenny Omega to our finishing moves table even though he competes outside WWE; we can change The Rock's finisher to the People's Elbow with minimal fuss.

00:27:35.080 But as with all things, our model does not exist in isolation. It’s vital we also move through this process correctly while we plan the physical implications in our Rails application. Adjusting to utilize Rails correctly incorporates understanding that the physical model you build is simply a translation of your logical model into a built database.

00:28:35.730 As you define models and migration processes, you should ensure that all defined keys work to enforce associations between tables and regulate data types. Implementing unique indexes guarantees that there won't be duplicate rows obstructing our tables. Additionally, foreign key constraints can manage connections across relationships in your models.

00:29:34.120 The unique index can ensure no rows have the same overall value for their specific key values. This still applies beyond the primary key that Rails automatically creates for you, allowing greater integrity in your designs.

00:30:00.130 In deploying foreign key constraints, Rails is generally lacking to employ these strategically. These sorts of constraints are pivotal when considering relational integrity, disallowing data that doesn’t exist from making its way into your tables. Thus, retaining valid references helps prevent any discrepancies.

00:30:56.779 Establishing data types requires decided considerations regarding allowable values. For instance, restricting strings to valid wrestler names while avoiding NULL as a value is essential. Default to NOT NULL to sidestep complications, while implementing checks that prevent invalid overlaps within the dataset.

00:31:43.030 Furthermore, you may want to enforce case-insensitive indexing for names, along with trimming whitespace spaces in discernible methods. These advanced checks can eliminate variations and erroneous entries, reinforcing data cleanliness as a priority.

00:32:56.520 In summary, we have reviewed a thoughtful process for establishing our database design in leagues with our business rules. Implement unique indexes for all business keys, maintain foreign key enforcement principles, and prioritize NOT NULL by default for clarity in structure.

00:33:29.870 In conclusion, I hope everything I presented today made sense. Creating this session took extensive research and a considerable amount of reading to clarify these concepts, so I hope parts of it are resonating with you. Remember, Boyce-Codd Normal Form is not infallible; your database model may also have other anomalies that arise from complex nuances, yet consistently applying these principles will set you on a solid path.

00:34:38.780 Understanding the importance of your data management over just the applications is essential. That’s the key takeaway; prioritize your database's integrity. Thank you.

00:35:43.090 We now have a few minutes left to address any questions. Are there situations where you would not want to implement Boyce-Codd Normal Form?

00:36:06.290 One potential scenario could involve working with a separate database that you've designed to be denormalized for rapid access. You might store data in a format that prioritizes performance, however, it is always prudent to retain a version of the data that is fully normalized. For my part, I believe there is a distinction: Rails validations focus on user-friendly input, while database constraints maintain data integrity. The two can exist independently; using one over the other is a matter of style.

00:37:06.180 For resources to learn more about these concepts, the relevant Wikipedia articles present decent information. I also read a book that outlined these theories comprehensively; I'll gather that name for you later. The guides available on Wikipedia can be helpful to delve further into details. You might find more links to literature connected to deeper studies into these principles.