Talks

Postgres at any Scale

Postgres at any Scale

by Craig Kerstiens

The video "Postgres at any Scale" features Craig Kerstiens presenting at the Birmingham on Rails 2020 conference. The discussion centers on PostgreSQL's capabilities and strategies for effectively utilizing it across various scales of data handling. Kerstiens begins by addressing the audience's familiarity with PostgreSQL, highlighting its growing popularity as an open-source relational database. He shares his background, mentioning his experience with Heroku Postgres and how it contributes to the insight he shares. Key points of the presentation include:

  • Popularity and Features: PostgreSQL is recognized for its rich feature set, including support for a wide range of data types (around 180), transaction handling, foreign data wrappers, and the pub/sub model with listen/notify features.
  • Community and Development: The PostgreSQL community is robust, with a core team and various contributors leading to continuous improvements and regular releases. Kerstiens emphasizes the evolutionary nature of PostgreSQL and its liberal licensing model that allows modification and redistribution.
  • Data Scaling: Kerstiens categorizes database scaling into three types: small (up to 10GB), medium (10GB to 1TB), and large (over 1TB). For small databases, he suggests using built-in features and testing backups regularly. In medium databases, monitoring cache hit ratios and indexing strategies are crucial, while larger databases may require employing read replicas and sharding for efficient data distribution.
  • Sharding and Scalability: He discusses sharding as a method to divide database up into smaller portions to improve performance, proposing the use of the Cytus extension that simplifies the process. Key strategies for maintaining reliability and performance include managing connections effectively through tools like PG bouncer, ensuring backups are reliable, and optimizing the database structure for transactional consistency.

In conclusion, Kerstiens encourages developers to engage with PostgreSQL, test their implementations, and stay informed on its evolving nature to maximize its effectiveness for their applications. The talk serves as a comprehensive introduction for those looking to leverage PostgreSQL to manage data at various scales, highlighting its versatility and community support.

00:00:17.330 Hi everyone, today we're going to talk a little bit about Postgres.
00:00:23.220 Quick show of hands: who already uses Postgres? Awesome! And who does not? Alright, if you're too shy, we can chat afterwards. I'm happy to convince you.
00:00:37.260 I take the exact opposite of the philosophy that says the database is a dumb hashing tool. We will get into why I think it’s important to know a little about your database to work better with it and to handle it at pretty much any scale without having to go back to school and take a bunch of database courses.
00:01:00.250 This is really a 101 session with a lot of links and references to help you treat it a little better. First, a little bit about me: I was one of the first product managers at Roku and spent about five and a half years there, primarily building and running Heroku Postgres.
00:01:13.470 We ran around 1.5 million Postgres databases across a team of about eight of us, which I think is a pretty good ratio of databases to engineers for such a large site.
00:01:22.110 I will talk a little bit more about the site later, but the short story is that we turned Postgres into a distributed database, and it is transparent to your application.
00:01:33.079 We were at Clara right about a year ago, and now I'm at Microsoft. If you like Postgres and want to learn a bit more about it, I create a newsletter called Postgres Weekly, which is highly curated.
00:01:46.500 It sounds like Ruby Weekly and others, but it doesn't include a hundred items; it’s more about the top 10, 15, or 20 items that are really targeted toward application developers, not just the DBA crowd.
00:02:12.210 Based on the show of hands in this room, I think everyone's hand went up. Postgres is more popular than ever; it’s kind of the largest open-source database and is growing in popularity.
00:02:28.620 If you look at sites like Hacker News and see who’s hiring, I don't know if you count that as a barometer of success, but Postgres ranks number one among databases.
00:02:43.470 It's really well-loved and well-liked as a relational database. First, this is a really fun excerpt from the Postgres mailing list from 2006 about Tom Lane.
00:03:02.370 Anyone know who he is? For those who don’t, Tom Lane helped create TIF, was on the board for creating the spec for JPEG, co-authored the JPEG spec, and then co-authored the PNG spec.
00:03:27.840 He got tired of working with images and decided to work with databases instead. He has been working with Postgres for about the last 15 to 20 years and is the primary contributor to much of Postgres.
00:03:39.720 He is among the top five or so open-source contributors. This was him, 14 years ago on the mailing list, coming back from a conference, stating that the biggest mistake Postgres ever made was its name.
00:04:02.610 He argued that it is not Postgres SQL; it’s okay to just call it Postgres. It’s painful to this day, and I’m actually thinking of starting a campaign with all the core developers to call it Postgres to annoy them until they change it.
00:04:34.140 If you're new to Postgres or someone asks you why you like it, I think many of us engineers often jump onto a bandwagon and say, \'Oh, it's cool, of course I’m using it.\' But when asked why, we may not provide a good answer.
00:04:48.820 It’s a really feature-rich database. For instance, data types are something to consider. I’m genuinely excited about this. You wouldn’t think that it stands out in terms of a database, but it’s quite useful.
00:05:10.660 Postgres supports powerful transactional DDL. If you’re running a migration and it fails, you can actually roll it back and not end up in a mid-migration mess.
00:05:32.570 Foreign data wrappers allow you to connect from inside Postgres to another database and query directly. So, if you have stuff in Redis and want to interact with it in Postgres, you can query directly from those other databases.
00:05:54.020 The Listen/Notify feature provides pub/sub directly inside your database. Some of these features will not be covered in detail in this talk, but if you’re curious about them, I’m happy to chat afterwards or you can look them up.
00:06:06.870 Now, let’s delve into some background on Postgres. It's unique as an open-source database. Within Rails, there is no single owner; it’s based on a very liberal license which is rare for databases.
00:06:29.370 You can take it, modify it, and resell it. This happens frequently; for example, Greenplum, Redshift, Aster Data, and others were all originally based on Postgres. They took it, modified it, and now don’t look much like Postgres anymore.
00:06:46.670 Postgres has gained popularity in academia because you can create PhD projects on it, ship them, change them, and do whatever you want with it.
00:07:00.670 The community is interesting; there’s a core team that acts like a steering committee with a committer base of about 40 people who are committed to Postgres. This number has actually doubled over the last three years.
00:07:20.520 There are major contributors and minor contributors, so you can contribute significantly to Postgres without necessarily becoming a committer.
00:07:28.520 Discussions and patches get handled through mailing lists. There’s no central Git repository or pull request model; it has functioned like this for the past 20 years.
00:07:41.520 If you want to learn more about Postgres, the PG User mailing list is a good place to start. If you want to deep dive into Postgres, the core hackers list is where they discuss deep kernel issues.
00:07:51.230 It is fascinating, though I don’t understand half of what they discuss there. If you want to level up, it's a great place to go and subscribe.
00:08:01.250 Postgres regularly releases updates, usually every fall. Previously, major releases had decimal point versions—9.2, 9.4, etc. Now, major versions are numbered without decimals, such as 10, 11, 12.
00:08:13.750 Minor versions are released with security fixes or patches, typically about once a month. They used to have marquee features, like JSON support, but it’s evolved.
00:08:34.750 In version 9.4, JSONB (Binary JSON) was introduced, allowing better storage and indexing, which is extremely powerful.
00:08:50.260 Let’s discuss more features and why Postgres stands out. One of its strengths is its liberal approach to new data types. There are around 180 types supported.
00:09:03.350 For instance, if your application is using arrays, you can compactly store data in arrays in the database instead of joining them to another table.
00:09:17.480 Data types such as JSON and JSONB are highly useful. JSON is better for preserving whitespace in API data logs, while JSONB is better for querying.
00:09:38.180 Working with time zones is made easier with Postgres, as it can automatically shift time zones.
00:09:48.890 Timestamp is an incredibly useful data type in Postgres. UUIDs are handy if you’re using them in your app and can be directly stored in the database.
00:10:04.300 Range types are beneficial if you’re developing applications that need to manage range constraints, like calendar applications where overlapping classes shouldn't occur.
00:10:18.940 SQL support in Postgres is feature-rich, including common table expressions (CTEs) known as \'with\' clauses. Who enjoys writing SQL?
00:10:35.380 Hands? More hands for reading SQL. SQL is a powerful language, but with CTEs, your queries can be more readable and maintainable.
00:10:50.300 Window functions are very useful for tasks like ranking based on categories. Postgres has many built-in functions for data manipulation.
00:11:07.430 Regarding indexing, if you're like me and don’t have a CS degree, navigating through different index types can be daunting.
00:11:22.250 In general, B-tree is a commonly used index type, and with each major Postgres release, new index types are introduced.
00:11:37.390 There’s a group in the community that works on about index types, finding uses for specific data structures. For example, GIN (Generalized Inverted Index) is great for JSONB.
00:11:50.180 Full-text search and geospatial indexing are also prominent features of Postgres, allowing complex data interactions like overlapping text or polygons.
00:12:03.440 When dealing with spatial data, I often hear that Postgres is great for working with large datasets of information that cluster together.
00:12:18.920 When creating procedural SQL, there is PL/pgSQL, and the option to use PL/v8, which incorporates the full v8 JavaScript engine directly into Postgres.
00:12:35.430 This means you can write JavaScript within the database, opening up new capabilities and making complex logic easier to implement in your application.
00:12:51.430 Postgres is also becoming a data platform, moving away from being just a strict relational database. It is slow and stable but unique with its extension framework.
00:13:02.670 You can add low-level hooks to change Postgres behavior, create new data types, or even new indexes. One notable extension is PostGIS.
00:13:17.780 PostGIS is widely regarded as the advanced open-source geospatial database, which hooks into Postgres, extending it with new data types and functions.
00:13:31.330 If you are looking for time series capabilities, there are several extensions that support this as well. If you need to migrate from Oracle to Postgres, there’s an extension to help with compatibility.
00:13:49.920 The ability to extend Postgres allows you to make extensive changes without waiting for a new core release. There are about 100 extensions available now.
00:14:05.319 For pro tips, if you use terminal interfaces, PSQL is really good. How many of you have configured your .bashrc?
00:14:18.360 Most hands. How many of you have a PSQL config? Anyone?
00:14:23.419 PSQL is powerful enough to do things like auto-format your text, keeping it wrapped within your screen’s width, and address query timing for performance tracking.
00:14:40.458 If you're ever asked to generate a report and need to retrieve the query from history, PSQL saves the history of every query run against your database.
00:14:55.073 If the query is complex, you can save it and execute it later. This feature is especially handy when re-running past reports.
00:15:10.726 If there are common queries you often write, you can name them so that executing them is as easy as just typing their name.
00:15:24.369 If you’re not a CI person, another data tool to consider is pgAdmin, which provides a GUI interface for Postgres.
00:15:38.370 Hopefully, by now, you’re convinced about using Postgres. If not, I encourage you to check out some articles I’ve shared.
00:15:53.374 This talk has largely been about managing Postgres at various scales. To start, small databases—less than 10 gigabytes—are easy to handle with Postgres.
00:16:11.488 You can leverage its data types, use arrays, timestamp data types, and JSONB effectively. Start putting indexes in place early.
00:16:28.642 Make sure to have a backup strategy—if you haven’t tested your backups in six months, it’s wise to do so.
00:16:43.615 Spend some time setting up your environment with PSQL and other tools. This will alleviate crisis situations when you are triaging unexpected problems.
00:17:01.900 When you reach a medium scale—beyond 200 gigs to a terabyte scale—you’ll want to track performance closely. Key metrics to monitor include your cache hit ratio.
00:17:20.460 You should aim for a cache hit ratio of 99% or higher. If it drops below 99%, you may need to analyze your indexing.
00:17:35.500 Also, track your index hit ratio. When you issue a query, you don’t want Postgres scanning every single row.
00:17:51.740 Leverage indexes for your most popular queries. This leads to better performance when you analyze the number of rows in each table.
00:18:05.520 Sometimes you may need to drop unused indexes. It helps to periodically review the indexes and drop those that haven’t been used recently.
00:18:20.840 Index management is crucial to optimizing your database’s performance and keeping it robust.
00:18:34.280 Connections in Postgres can be a sore spot. Each new connection incurs some time overhead due to TLS negotiation; I estimate this takes about 50 milliseconds.
00:18:47.350 Creating a pool with Active Record can lead to resource wastage if many idle connections accumulate, especially with high connection volumes.
00:19:02.070 PG Bouncer is a great solution you can implement to optimize this situation. It helps manage connections more effectively by closing idle connections.
00:19:20.370 If you are running a sizable workload, I would highly recommend using PG Bouncer.
00:19:31.500 Now, as we delve deeper into the medium scale, focus on practical database needs—adding indexes, monitoring your cache ratios, and adjusting memory as needed.
00:19:46.330 If you want to optimize specific queries, use tools like PG Stat Statements to review every query that has been run against your database.
00:19:58.090 PG Stat Statements shows how many times each query was executed and the average and total time taken, allowing you to identify slower queries.
00:20:13.523 If you’re above 50 to 100 gigs, consider shifting your backup strategy to encompass physical backups.
00:20:30.810 Most people do a PG dump once a day; this is effective for logical backups.
00:20:43.540 However, as you grow, you may want to adopt a physical backup strategy, such as using PG Backrest for efficient growth.
00:21:00.060 As your database grows larger, think about sharding strategies to manage data effectively. Sharding allows splitting a large database into manageable bits.
00:21:17.230 Citus is an extension that can help you turn multiple Postgres databases into a distributed setup.
00:21:38.030 Ensure to hash your data upfront so you correctly distribute workloads across your shards based on customer IDs or other hashable identifiers.
00:21:54.870 This approach will help keep data organized and ensure that your queries remain efficient as your application scales.
00:22:10.770 Managing relational databases can be challenging due to multiple relationships and models, but there's a way to simplify the architecture.
00:22:28.990 Grouping related data is essential; consider colocating your tables to help ensure transactional consistency without excessive overhead.
00:22:43.610 You wouldn’t want to rely on a reference table with vast amounts of records, as it complicates the writing process across multiple nodes.
00:23:01.050 In terms of collocation, keep your tables tightly linked through keys, which can yield better performance and less friction when querying the data.
00:23:16.510 Monitor for N+1 queries that can deplete your cache hit ratio, leading to performance issues.
00:23:32.900 Look for opportunities to offload workloads to read replicas for reporting and analytics that can tolerate slight lags.
00:23:49.320 Ultimately, if sharding is necessary, look into modern practices as the landscape of distributed databases has evolved.
00:24:01.910 Lessons can be learned from successful implementations, like Instagram, to manage scaling issues effectively.
00:24:16.050 In summary, for small- to medium-sized databases, keep a close eye on your cache hit ratios, use proper indexing, and test your backups regularly.
00:24:34.640 Monitoring connection limits with PG Bouncer can alleviate many performance issues.
00:24:48.160 For larger scales, refine your backup strategy, select the right sharding model proactively, and ensure column colocations are effectively laid out.
00:25:04.060 This proactive management approach can ensure that your applications remain robust and responsive as they scale.
00:25:16.370 Thank you for your time!