Chris Hasinski

Fantastic Databases and Where to Find Them

wroc_love.rb 2023

00:00:10.200 Um, Andre introduced me as Kristoff, but I know that for English-speaking people, that’s pretty unpronounceable and difficult to spell. So, I usually go by Chris. Today, I want to talk to you about some mythical creatures, specifically, what I like to call 'hippie dragons.'
00:00:21.480 The term 'hippie dragons' is used when you have a map and don’t know what the lines mean, so you label them as hippie dragons. This informs people that you’re unsure and encourages exploration. This phrase is a modern adaptation; the original phrase was 'Here be lions.' However, ancient maps did not actually feature dragons. Interestingly, there’s a globe known as the Lennox globe that has this phrase inscribed on it in Latin, 'draconess,' which translates to 'Here be dragons.'
00:00:34.380 I want you to remember this reference as I shift to discussing something a bit scarier: layoffs. Layoffs have been the main topic in our industry for the last two years. However, with every end comes new beginnings, and that usually means recruitment. Consequently, we are gearing up for recruitment interviews.
00:00:52.079 During these interviews, candidates are often advised to ask questions about the project to show that they are genuinely interested. After countless interviews that left us indifferent, it’s important to engage with the specifics. A pertinent question could be, 'Which version of Rails do you use?' If they say Rails 7, that’s cool. Rails 6 is acceptable, but if they are still on Rails 5, there may be some lingering unease. If they admit to using Rails 4, you might consider charging them extra!
00:01:12.740 Other good questions delve into whether they practice Domain-Driven Design (DDD), as that’s a trendy topic. If you’re a front-end developer, you might want to inquire how they manage state on the front end or, conversely, if you’re interested in back-end processes, you could ask about Continuous Integration (CI) or Continuous Deployment (CD).
00:01:18.299 These are all valid questions during a job interview. Some candidates ask about testing practices, which can be fruitful to discuss, especially mutation testing, which may not have been addressed adequately in previous workshops. Generally, these interviews tend to lean towards specific topics.
00:01:39.420 However, most interviewers tend to answer questions positively until you ask them about their database. This may elicit responses like, 'Oh, it’s Postgres. Which version? I don't remember, but I’ll get back to you on that.' Often, the real conversations we have in hallway meetings focus on storing and manipulating data, which is precisely what a database accomplishes. If you need proof, let’s examine a typical Rails application.
00:02:07.500 This is an extremely simplified and, frankly, oversimplified diagram showing a Rails application. It can be divided into three sections. The first is the front end, which is the part the user interacts with, because that’s how you receive customer support tickets whenever something doesn’t work. So naturally, they will focus on the front end.
00:02:29.220 Next, we have the big part of the application. As Ruby developers, we often can’t look away from this part—it’s where our interests lie. However, there’s another component that actually does most of the work: the database. If you still doubt that, let's consider something known as the test pyramid.
00:02:45.480 When you Google 'test pyramid,' you’ll find an idealized version representing how tests should be structured. The majority should be unit tests, with a layer of integration tests, and a smaller layer of end-to-end tests. The end-to-end tests, while crucial, are also expensive and challenging to maintain.
00:03:03.360 On the contrary, if you look for the opposite structure (the 'ice cream cone' model), you will find that it comprises a few unit tests, more integration tests, and a multitude of end-to-end tests, including manual tests, which tend to be the most labor-intensive. For a typical Rails app, this model doesn’t necessarily hold true, as we often find we can only effectively test a few components at the unit testing level.
00:03:19.620 Meanwhile, we frequently end up relying heavily on integration tests where we interact with the database to verify that our logic functions correctly. Most Rails applications are essentially worthless without a database because they are meant for interacting with and manipulating data stored in a database. Yet, we still treat our databases as mere storage boxes—objects we don’t need to examine too closely.
00:03:51.240 They are exotic and mysterious—the dragons on our maps—and I would like to talk to you about how to manage these mythical creatures. First, let's admit that databases are indeed peculiar. They defy normalization principles; they neither adhere to object-oriented nor functional paradigms, being instead declarative—a harshly foreign approach for many of us.
00:04:05.820 They are also disconnected from most of your business logic. For instance, every connection is represented as a single user; the database doesn’t recognize who's currently logged in, although Rails does. However, databases possess one saving grace: they excel at processing data and do so much faster than Ruby could.
00:04:22.440 Let’s start by addressing what I consider the smallest creature: SQLite. The accompanying images were generated by Binjourney, which I asked to create representations of various database systems.
00:04:41.340 SQLite is typically used for projects I engage in during my free time—ones that I’m eager to showcase but don’t want the hassle of setting up a full-fledged database. The SQLite documentation does mention its use on websites and how it can handle substantial traffic. It’s efficient—able to manage around 20 million SQL queries in a day, even on a comparatively outdated VM from seven years ago, shared with 23 other applications.
00:05:04.440 The clock speed of SQLite operates remarkably well; essentially, it's the Superman of databases. SQLite is immensely fast and operates under a public domain license, making it easy to back up. This is not merely marketing fluff—consider the fact that from a single set of fixtures or schema, you can create a multitude of databases for your tests, running them in parallel.
00:05:19.140 Nevertheless, SQLite does have its challenges. In typical implementations, every query behaves like Ruby, suggesting you should use processes rather than threads. There’s only one writer, meaning you may encounter database unavailability; thus, implementing retries is recommended when inserting or updating data.
00:05:42.720 Additionally, it lacks full-text search capabilities, which can be a dealbreaker for small websites, since search functionality is in high demand. Despite these limitations, if your use case is primarily read-heavy, it might be worth considering SQLite even for larger products.
00:06:01.260 Now let’s address the 'elephant in the room'—PostgreSQL. This is arguably the best open-source project out there, in my opinion. It is incredibly versatile and can seamlessly integrate within your entire stack—whether as a GraphQL interface, cache, queue, or just about anything else.
00:06:26.519 An intriguing post by CTO Stefan Schmidt outlines various instances where Postgres can replace other components within your stack. I extracted some examples because I find them fascinating. One particular example is the use of unlocked tables, which means that when you do not write a headlock to disk, you can speed up your database operations significantly.
00:06:46.200 Although this may sound trivial, it is particularly impressive in practical use. For example, if you employ this technique in your specs, it could lead to substantial reductions in test times—up to three and a half minutes on poor test suites and around 20 seconds on well-optimized ones.
00:07:05.820 However, keep in mind that if you restart your database, those unlocked tables will disappear, requiring migrations to be run again. While great for continuous integration, it might be less beneficial for local development due to the recurring issues of managing migrations and states.
00:07:20.520 In addition to unlocked tables, Postgres comes with an extensive array of functionalities. The documentation is vast and full of interesting features, particularly in the advanced functions section, which deserves a special mention as well.
00:07:32.820 One highlight is window functions. Has anyone used a window function? Unfortunately, they often remain underestimated and misunderstood, often regarded as exotic features meant solely for data engineers or scientists working on reports.
00:07:49.560 However, they are exceptionally powerful! For example, using a window function can enable you to calculate averages over various partitions of data, providing more granular insights in reports. Let’s look at how this average salary can be determined by department using SQL.
00:08:05.760 While it may seem complex at first, translating this SQL to Rails is permissible. You can still access this functionality through Rails. Another example revolves around identifying the cheapest product per category, which can play a crucial role in e-commerce strategies.
00:08:21.360 Here, assigning a row number to each category based on the price and ordering them by name ensures your categories consistently show the cheapest items on top. All these operations can be effectively translated into Ruby using pattern matching and functional programming.
00:08:32.880 The final notable window function example I want to mention is the cumulative sum. This query can sum values in a growing manner, allowing you to track performance over time more effectively.
00:08:41.880 You can refactor this into efficient compositions within SQL, rather than forcing an excessively large single query. Though it may appear convoluted at first glance, if you break it down effectively, it allows for seamless integration into Rails and delivers aggressive performance.
00:09:00.180 The speed of these queries surpasses Ruby implementations by a long shot. One reason lies in the cost estimation feature within Postgres, which can perplex many users.
00:09:12.780 Many people cringe at the thought of interpreting costs; let’s raise our hands if we truly know what the initial figure represents in cost estimation. It's commonly unclear how to interpret these costs, likening it to story points in agile development.
00:09:29.460 You can actually look up cost parameters in the Postgres configuration summary, and these costs are predetermined. Yes, they can be adjusted, should you find that your environment's conditions warrant different assessments.
00:09:44.520 For instance, sequential scans can cost more if used on slower hardware. The system will add those costs up to provide overall estimations for your query.
00:10:04.320 Moving on to unlocked tables: while they speed up operations, it’s important to note that if you restart your database, you lose those tables. For CI processes, this is excellent, as you can regenerate a test setup from scratch.
00:10:19.440 Local environments can get complicated, as unwanted data states arise, often leading to the frustration of unnecessary migrations or strange database issues.
00:10:34.740 Now, let’s switch gears again and highlight the versatility of PostgreSQL through its various extensions! Among these, PostGIS stands out for geographical data handling.
00:10:47.680 There’s also Timescale for time series data, and while I hear a lot about these extensions, it’s hard to overlook the amazing capabilities brought in by the various other tools available in Postgres.
00:11:01.080 The flexibility of including options like columnar storage provides significant performance improvements as a result of the different types of queries you can execute.
00:11:15.020 The ability to store data in columns rather than rows allows compression, resulting in a smaller storage footprint while significantly increasing query speed for aggregations—all while managing multiples of these setups in a single query.
00:11:27.120 Distributed tables allow you to leverage multiple servers at scale, which is especially advantageous in cloud environments where horizontal scaling offers cost benefits. By partitioning your data based on user location or other metrics, performance vastly improves.
00:11:39.600 While distributed tables may sound complex, they simply require implementing one additional complexity to improve data management, with only a single function to execute.
00:11:54.000 Moving into the world of exotic databases, I’ll mention CockroachDB—which emphasizes automatic sharding and geographical awareness to improve availability.
00:12:09.600 Next, we have MySQL, which has become a strong contender again due to its improvements in speed and functionality, benefiting many users and industry projects.
00:12:21.120 It’s intriguing how MySQL has managed to grow and adapt, making it a valid choice for various database needs. Performance-wise, it differs from Postgres primarily due to its transactional assumptions.
00:12:34.440 For instance, in transactions, while Postgres might leave a versioning system behind, MySQL will layer updates directly onto existing rows. Thus, it can perform faster while handling user data effectively.
00:12:44.520 We even see topics arising like columnar databases gathering interest. ClickHouse, coming from Russia and once utilized by Yandex, is a fantastic system for handling marketing data thanks to its columnar storage.
00:12:56.040 It promises remarkable ingestion rates, for example, Cloudflare uses ClickHouse for six million inserts per second. By adapting and transforming this technology, many companies have successfully migrated their workloads from more traditional Postgres setups.
00:13:08.640 Although I see we are running low on time, I must emphasize that the world of databases is vast and intriguing, as highlighted by this discussion. If the interest persists, I'd love to delve further into subjects like NoSQL or document-oriented and graph databases.
00:13:21.840 In the meantime, I encourage you all to explore the fantastic world of databases on your own! Thank you!
00:14:00.480 Thank you for the presentation! It was great to hear all of this. I have a question regarding how to interpret the cost values you mentioned. Is a cost of 100 a high value? How do we evaluate it?
00:14:17.880 It is indeed a bit like story points. You can check the Postgres configuration for the specific operations and see their assigned costs, which are predetermined; they can also be modified. If you feel, for instance, that sequential scans are more expensive due to your hardware, you can assign them a higher cost value.
00:14:36.600 And so effectively, the cost is just summed together for a specific part of a query.
00:14:51.960 Moving back to unlocked tables, you mentioned they vanish when the database restarts, but does this happen if only the connection is reset?
00:15:10.920 I don’t believe a simple connection reset causes this, but restarting Docker will. If the database crashes or similarly resets, those tables will go as well. It can cause issues for local development when you might need to run migrations all over again.
00:15:27.720 Thank you again, Chris! Your insights are greatly appreciated.