Data Integrity
Pokedex Chronicles: Part 1 - Database Normalisation, Part 2 - Indexes Strike Back

Summarized using AI

Pokedex Chronicles: Part 1 - Database Normalisation, Part 2 - Indexes Strike Back

Mateusz Woźniczka • July 22, 2024 • Malmö, Sweden

In the presentation titled "Pokedex Chronicles: Part 1 - Database Normalisation, Part 2 - Indexes Strike Back", speaker Mateusz Woźniczka takes the audience on a journey through database concepts using a relatable theme drawn from the Pokémon universe. The video discusses the process of database normalization and indexing, highlighting their importance for maintaining data integrity, efficiency, and performance.

Key points covered include:
- Introduction to Database Normalization: Using the character Red from Pokémon as a protagonist, the speaker unfolds the storyline of creating a Pokedex app, which eventually leads to challenges in data organization.
- Emergence of Business Requirements: As new features like Pokémon evolution and moves get introduced, the initial data structure becomes chaotic. This represents the common issue of handling evolving business requirements in data management.
- Data Structure Challenges: The presentation illustrates how non-atomic values, duplicates, and empty cells lead to inefficiencies and inconsistencies in the database.
- Normal Forms: The process of normalizing the data is explained, showcasing how separating out moves and types can improve data integrity while increasing complexity in retrieval processes.
- Introduction to Indexes: The second part of the presentation shifts focus to indexing. As more Pokémon entries get added, the inefficiencies in searching by name are uncovered, leading to a discussion on the organization of data in B-trees and the importance of indexing for performance.
- Balancing Complexity and Speed: The use of indexes provides faster searching capabilities but requires careful management to maintain overall data integrity and structural organization. The trade-offs made in database design choices are discussed.
- Conclusion and Insights: The video concludes with a reflection on how the principles of normalization and indexing can be applied effectively to enhance database management related to Pokémon.

Overall, the presentation serves as both an informative introduction to database principles and an engaging narrative that blends technical concepts with a beloved game franchise, providing a memorable learning experience for the audience.

Pokedex Chronicles: Part 1 - Database Normalisation, Part 2 - Indexes Strike Back
Mateusz Woźniczka • July 22, 2024 • Malmö, Sweden

Step into the world of database mastery with our Pokedex adventure! Embark on a journey to unravel the secrets of normalization and indexing, enhancing data integrity, efficiency and performance along the way!

Baltic Ruby 2024

00:00:07.680 I will start my presentation with an introduction, but not my introduction—rather the introduction of the protagonist of our talk. This is Red. I have a question for the community: does anyone know who he is? No? No one? Well, someone try again!
00:00:20.080 No, that's not it. Yes, if you can guess who he is, I’ll buy you a beer at the party! So, as I said, this guy is Red, the protagonist of our talk. He comes from the Pokémon game, which, if you haven’t played, I highly recommend. It's a fantastic game!
00:00:28.000 So, basically, like any 10-year-old in the Pokémon world, he wants to be a Pokémon trainer. He approached the owner to get his Pokémon and start his journey. However, the Product Owner had a business idea: Every 10-year-old wants to be a Pokémon trainer, so they would create an app.
00:00:46.800 They planned to add some games and multiply those games by 17. Kids would input their parents' credit card details, and they'd become rich! The app would gather all the information about Pokémon, and whoever uses this app would become better at battling or whatever. They would call this app the Pokedex.
00:01:05.840 This is part one, focusing on data normalization. Now, back to our protagonist, who had some information in mind. He had seen a design for a game that looked great, but who cares about design? He decided to gather information about Pokémon. The app would have names, statistics, types, strengths, weaknesses, and attack types.
00:01:30.360 He filled out the details for the first Pokémon—nothing fancy. Then, the first battle happened, and he had information on another Pokémon, and another one after that, until something unexpected happened: new business requirements arrived! The professor told him that Pokémon can evolve. Basically, if they fight for long enough, they can change their form.
00:01:59.320 He thought, 'No problem! I’ll just add another column titled 'next form' to the table, fill it out, and we will be done for the day!' Great success! But then he noticed another Pokémon that apparently does not have any forms. No worries, let’s just keep filling in the data.
00:02:16.400 Then, yet another business requirement surfaced. It turns out Pokémon can have multiple moves. They can attack in various ways. Our protagonist looked at his table and realized it was a no-brainer. He thought, 'Let’s change the name of this column to 'moves' and just add whatever moves Pokémon can have.' This time he closed the ticket. Great success!
00:02:41.880 However, he started to notice some problems. 'Okay, I’m not very good at this job—I'm just 10 years old. But I can see that there’s a mess in the statistics. Considering there are 155 Pokémon, the professor will surely hire more people to contribute to this table, leading to even more chaos!' He discovered that he started adding attack, defense, attack, defense but then made mistakes.
00:02:58.080 If it continues to grow, it will get messier. So, he thought they needed to fix this issue now when it's still easy. Let's write down this problem in their backlog: non-atomic values can lead to inconsistencies. Let's split the data because it's quite straightforward.
00:03:21.080 We’ll keep numbers, which is super simple, and we don’t really have any ambiguous fields here—at least for now. We fixed one issue; however, moves are problematic too. There are non-atomic values here, so we aren’t completely done with the problem yet. But we’ll see—perhaps it will resolve itself.
00:03:35.440 Editing isn’t easy either. Let’s say it turns out that this grass type of Charmander or whatever he’s strong against needs to be changed—does it mean that this grass also has to be changed, or not? The same goes for weaknesses; it’s a tricky situation. So, they’ll write another problem in their backlog: some values have to be updated in multiple places.
00:03:54.040 Let’s return to our journey. They found another Pokémon, filled out their database, and started noticing a pattern: there were empty spaces. Is this good or bad? They didn't know. But considering they’re working with an overwhelming amount of only 4 megabytes of memory, it could be a problem. Just to be safe, they decided to add it to their backlog.
00:04:08.680 Empty cells are unnecessary because they occupy space in our table. Now, our protagonist realizes that this rock thing is not actually a Pokémon—it's just a rock! But who would know? He wants to remove it because it’s unnecessary information, but if removed, they would lose the information that rock types are strong against fire and weak against grass. So, they find that deletion could lead to unwanted data loss.
00:04:22.280 They want to delete one piece of information, but this deletion causes a loss of valuable information that they want to keep. They could store data about types, so let’s make sure to return to this and handle it later. New business requirements have arrived! Their professor says that Pokémon should have IDs. They should be identified not just by name but also by unique ID.
00:04:45.200 This requirement was easy to implement, as they just had to add one column at the very front, assigning IDs. Finally! No problems! Ticket closed! Now, they can use these IDs to address the Pokémon in their next form. What if they have a Pokémon like Charmeleon? If they make a typo, it will lead nowhere, so they change it to an ID to prevent mistakes.
00:05:01.840 This is a much safer option compared to somebody typing 'Charmeleon' incorrectly. However, just as in real life, new business requirements have emerged. It turns out that moves also have types. For instance, the move 'Ember' is a fire type, which means something to their end users, while the move 'Bubble' has a water type.
00:05:27.920 This information is essential, so it must be retained in the database. Fortunately, their table has just evolved! From its previous version, it has now changed to include separate rows for Pokémon-move combinations. They no longer have non-atomic values, and they can easily add a column titled 'move type' and fill it out with all the necessary data.
00:05:42.280 And, you guessed it—ticket closed! They are safe. Their table has evolved further, and they can address the problems concerning atomic values. They now concluded that they are no longer suffering from the problem of having non-atomic values in their table. This is great news! They have fixed some issues, but, of course, it comes with its own problems.
00:06:01.920 They have a lot of duplicated data, because now, as I said, every Pokémon isn't just a table about Pokémon; it's a table about Pokémon linked with moves. For instance, Charmander now has two records. Therefore, the Squirtle has two, and another Pokémon has three records per one Pokémon. But fortunately, their table will need to evolve yet again!
00:06:24.960 They took a look at it and realized they can remove duplicates. In this table, they didn't need to know twice that 'Ember' is a fire type. They can filter out the duplicates and add IDs. They know IDs are beneficial, and they will rename this table because currently, they have two tables: Pokémon and moves.
00:06:46.320 This new table will just be moves. Thus, they are left with something much cleaner and more efficient. They have ensured the Pokémon table reflects this in a way that mirrors the move table. They can see they have ended up with quite a few duplicates, which can safely be deleted since they add no real value to the table.
00:07:08.920 Let’s clean it up now, removing all the unnecessary duplicates. Their Pokémon table looks much better! However, they need one more element linking the Pokémon table with their moves table. Essentially, they now have a Pokémon table on the left and a move table on the right.
00:07:32.000 Instead of using move names, they're using move IDs. It will work like this: there will be a linking table showing which Pokémon corresponds to which move. They want to delete the empty rows because they contain no information.
00:07:50.480 From this point onward, they have evolved quite nicely. Their new table now tracks information about Pokémon, moves, and provides a complete view of which Pokémon has which move. They returned to their problems to see which of them have been resolved. They are pleased to find that they no longer have empty cells in their tables, and duplicated data has also been eliminated.
00:08:09.520 And yes, you guessed it, their table is evolving once more! They now realize that some data points do not rely on the Pokémon themselves but on the types. For example, the attribute indicating the strength of a Pokémon against certain types does not pertain to the Pokémon but rather to the type itself. Hence, they’ll need to adjust their data structures again.
00:08:30.799 They will extract those data entries, create a new table for the types, and add the IDs here. This way, they can ensure that changes to the strengths and weaknesses of types only occur in one location. Their Pokémon table has now evolved into something like this: Pokémon, moves, types, and the linking table of Pokémon moves.
00:09:00.000 Now, they look back to their problems. They no longer have to edit values in multiple places. They’re able to retain the information about types effectively. Most importantly, deletions are now safe—their structure allows them to remove unnecessary data without complicating matters.
00:09:20.720 Let’s get rid of all unwanted data, which he can do now without losing other important details. They feel incredibly achieved. With their new table, they also want to quickly learn more about the Pokémon, let’s say Butterfree. They can see that in the main table, the type ID is four.
00:09:43.920 Now, they will need to check the type that corresponds to that ID, which they know is the bug type. They want to identify strengths and weaknesses and determine which moves this Pokémon has access to based on its ID and the information present within their database.
00:10:10.560 However, the downside of this data structure is that reading it has become more complex! Let’s transition to the summary of part one. In case we describe problems encountered instead of naming them inconsistencies, we could call them anomalies. Instead of saying evolution, they'll refer to it as normal forms.
00:10:32.080 It will indicate they’ve done a lecture on database normalization. The key takeaway? The higher the form, the lower the redundancy; less redundant data means a higher data integrity. With greater integrity, it is easier to perform modifications.
00:10:49.680 With higher forms, however, comes longer read times—it’s the cost of maintaining better data integrity. Now that we’ve concluded part one, it’s time to jump into part two. Why not? Episode two begins with a cold open highlighting the first business requirement.
00:11:06.080 They have a data structure ready, but their business owner says, 'That’s great, it works, but we actually need more data. Otherwise, it’s useless for users.' So they get to work adding more Pokémon, filling up their database with countless entries.
00:11:39.560 To showcase the capability of their table, they decide to implement more Pokémon—ending up inserting 32 entries into their database. It may not be excessive, but it's enough to highlight certain records and observe how their table adapts.
00:11:59.040 However, they received yet another requirement, because the functionality for finding Pokémon by name is now slow. The business owner complains: 'You have added so many records that now it takes a while to retrieve them—fix it!' So they set off to debug the issue.
00:12:23.680 They search for Charmander, and thankfully, it isn’t too bad. But the engineers, knowing better, anticipate potential issues. They decide to look for another Pokémon randomly, and it turns out it’s at the very end of the table.
00:12:39.960 Finding this Pokémon takes quite too long, because they’re searching line by line until eventually locating it. This leads to the realization that the root cause is that their data isn’t sorted, leading to inefficiencies.
00:12:58.560 Data would be easier to find if they sorted it, so they go about organizing their table. 'We have some magic control over our table,' they confidently state. Now, the table is sorted alphabetically by name.
00:13:22.280 As a result, they search for Charmander again. Since it's sorted, they can easily navigate. They perform binary search, narrowing down possibilities quickly. 'We found it!' They rejoice because they’ve solved the issue by sorting the table.
00:13:43.760 However, this success doesn’t last long because they soon learn they've broken something; the search function for finding Pokémon by ID is now malfunctioning. So, they find themselves in episode two, titled 'Indexes Strike Back.' The cold opening may be long, but it sets the stage for a deeper dive.
00:14:10.320 Realizing that names are correctly ordered, but IDs are not, they must revert their changes and gain a deeper understanding of the issue. They find that their table is essentially an index—not in the structure of a normal table, but rather, it's a B-tree.
00:14:32.640 The B-tree enables efficient searches but requires a rethinking of their organization. It’s slightly more optimized than the standard binary tree because it streamlines writing and reading processes effectively.
00:14:49.360 The data structure allows for branching, where they learn that searching for Pokémon with ID 7 starts at the root and helps navigate through branches quickly, speeding up search times. This understanding leads them to a better grasp of how their tables work but introduces new intricacies.
00:15:05.120 As they dissect their current table and its indexes, they discover that, while navigating records becomes faster, date retrieval is tricky, especially with pagination. If they retrieved four Pokémon, they'd need to read all over the disk—a slower process than expected.
00:15:31.200 Fortunately, they realize that an index-organized table can optimize how data is stored, where records that are close in value are also close in terms of physical placement in disk space. This means when retrieving four Pokémon, they can likely access them with just one read.
00:15:58.720 They recognize that another development team probably made these updates while working on features for the index. To resolve their current dilemma, they opt not to modify the existing indexed table. Instead, they determine the requirements for a secondary index to ensure swift retrieval.
00:16:23.040 This new index will provide a structure tying Pokémon names to their IDs, allowing for faster searches. The business owner feels relieved knowing there would be an extra layer for efficient searches. However, the saga continues because the business owner returns, expressing concerns about speed.
00:16:43.960 They gather together to identify what’s slowing down their queries. As they test, they need to search for Charmillon again, which requires them to navigate through both the primary and the secondary index to retrieve it. Once they find it, they realize they are conducting two searches—one for the primary index and another for the Pokémon data.
00:17:05.120 To solve this, they strategize, proposing that instead of retrieving an ID from the first index, they’ll store a physical address in it. So instead of pointing to a number, the index will simply indicate where the Pokémon resides physically on disk.
00:17:27.920 This significantly speeds up the lookup process because their queries now require just one search instead of two, making the business owner extremely happy. Finally, they close this long-anticipated ticket!
00:17:41.960 Just when they believe they are done, another business requirement surfaces. It turns out their team discovered new Pokémon, which now need to be added to the table. Timeliness is essential, so they proceed to add the new entries at the bottom of the currently-existing data.
00:18:03.600 Although adding Pokémon to the table is straightforward, they must remember that indexes aren’t as simple. They have to update the indexes as well, which may entail a bit more work.
00:18:26.320 As they venture into updates, they realize it isn’t just as easy as adding items; it entails a larger effort to manage the integrity of both their main tables and indexes to avoid chaos. It’s manageable, but it’s a more tedious process.
00:18:48.000 Soon after, another business requirement comes in: it’s time to remove expired Pokémon, because they’ve become extinct! They acknowledge this, but it raises more concerns.
00:19:10.760 Removing Pokémon leads to complications. If they delete certain entries, they'll cause a shift down the line, and everything beneath that deleted Pokémon has to move up. This entails additional rewrites to maintain structural integrity.
00:19:28.320 After removing Pokémon, they discover it's involving itself in data structure havoc. They need to transform their solution, shifting from an index-organized table to a secondary index for the IDs along with names.
00:19:52.080 This new structure means they’re no longer concerned about the ordering of data; data won't have to be adjacent or maintained in a specific order. They adopt a hash table approach—where the exact location of the Pokémon isn't important.
00:20:13.360 No longer fearing complexities tied to moving around entries, they can simply remove unwanted Pokémon without worrying about data from below being affected. The new insertion strategy doesn’t require worrying about the order of rows.
00:20:35.440 Afterward, they revisit the table, realizing some of their data was inaccurate—the Pokémon were not extinct! This time, they simply perform an addition to the table without any ordeals. It’s simple to throw them back in without any concerns about positioning.
00:20:53.960 Peering back into their previous work leads to the exploration of indexes, which allow for speedier searches. They reflect on how indexes can enable quicker lookups but are also an additional structure to maintain. The cost of having performance improvements entails further maintenance.
00:21:14.960 Determining that, generally, the primary structure behaves like a heap while managing indexes neatly organizes data retrieval. The presentation concludes with an opportunity for questions. They express immense appreciation towards the audience for the moments shared.
00:21:38.880 They note their GitHub account, created in 2016, is still a work in progress but compelled by passion. Thank you so much for the opportunity to speak about database normalization and indexing techniques in the world of Pokémon!
Explore all talks recorded at Baltic Ruby 2024
+8