Craig Kerstiens
Postgres Performance for Humans
Summarized using AI

Postgres Performance for Humans

by Craig Kerstiens

In the talk titled "Postgres Performance for Humans," Craig Kerstiens, a product lead on the data team at Heroku, provides an insightful overview of Postgres database performance optimized for developers who may find the database's inner workings opaque. The focus is on practical strategies to enhance performance and troubleshoot issues while better understanding the system's capacity. Key points discussed include:

  • Database Workloads: Differentiate between OLTP (Online Transaction Processing) for web applications and OLAP (Online Analytical Processing) for data warehousing.
  • Postgres Configuration: Initial setup considerations, particularly in cloud environments, and best practices for optimizing performance via recommended configuration settings.
  • Cache Management: Emphasizes the importance of RAM for caching frequently accessed data, aiming for a cache hit ratio of 99% or higher.
  • Effective Indexing: Discusses methods for assessing and implementing indices, stressing a target index hit rate of 95% or higher and listing steps for creating or dropping indices as needed.
  • Data Retrieval Methods: Explains the difference between sequential scans and index scans, highlighting how index-only scans introduced in version 9.2 enhance efficiency.
  • Query Analysis: Introducing the EXPLAIN command and its variants (EXPLAIN ANALYZE) to analyze query execution plans for better performance optimization, alongside tips on managing and optimizing slow queries.
  • Types of Indexes: Outlines different types of indices such as GIN, GiST, and the advantages of B-tree indexes as a default choice unless unique requirements dictate otherwise.
  • Connection Pooling: Suggests using connection poolers like PG Bouncer to manage database connections effectively under load.
  • Scaling and Backups: Provides insight on scaling strategies like sharding and replication for high availability, emphasizing the critical importance of regular and tested backups, differentiating between logical and physical backup methods.
  • Monitoring Tools: Advocates for utilizing PGSTATSTATEMENTS for query performance tracking.

Overall, Kerstiens underscores that mastering these fundamental practices—particularly cache management and index optimization—greatly improves Postgres performance, providing a reliable starting point for managing workloads efficiently. He concludes that enhancing memory resources can be a significant performance booster for any application utilizing Postgres.

00:00:00.320 All right. Hi everyone, I'm Craig Kerstiens. I work at Heroku with Terrence and Richard.
00:00:05.520 I run product for our data team there and our languages.
00:00:11.519 I've done a variety of things there, so if anyone wants to talk about a product at Heroku, which I basically describe as doing whatever the engineers don't want to do, I'd be happy to chat about that afterwards.
00:00:17.920 Um, really quickly, a few shameless plugs to start with: I curate something called Postgres Weekly, a weekly newsletter full of end-user tips for Postgres. It's not targeted at DBAs.
00:00:24.640 I also blog a good bit about Postgres and Python, which most of you probably don't care about.
00:00:30.560 For those with Macs in the audience, if you have MacPorts or Homebrew for Postgres and are having problems with it, you should switch to Postgres.app. It's a nice native Mac installer.
00:00:35.840 As I mentioned, I work at Heroku, where we run a few Postgres databases. We've said publicly that we are the largest fleet of Postgres in the world, and no one's disputed it, so we are going to keep saying it.
00:01:11.119 I had to drop this in after a conversation at dinner last night. How many people here know who Tom Lane is? I see about three hands. Tom Lane has written a lot of software. There was a survey back in 2000, and he was responsible for about one percent of open source software in the world.
00:01:22.560 He was on the committee for TIFF, co-authored the specification for JPEG, wrote libjpeg, and co-authored the specification for PNG. For the last 15 years, he has been responsible for probably 50% of every Postgres release. I also have an email from about ten years ago of him on the mailing list; the gist of it is that it's Postgres, not Postgresql. He essentially said that the Postgresql rename back in '96 was probably the biggest mistake the project has ever made, which probably still holds true. So just go with Postgres; it makes life easier.
00:01:51.440 At the start of my talks on Postgres, I usually put up a slide asking how many people here don't use Postgres. A few hands go up. Okay, I probably won't convince you in this talk, but if you need convincing, this is the reason why. This is probably missing quite a few things now. If you want a quote for Twitter, a colleague of mine said, 'I'm not an Emacs guy, but it's really flexible. It does a lot more than just being a relational database; it's becoming more of a data platform than just a relational database.'
00:02:58.959 Moving on to the topic of this talk, which is Postgres performance. I assume everyone else here is either a Postgres user or a fan, despite not raising their hands. So Postgres databases and workloads are generally grouped into two forms: OLTP and OLAP—OLTP being primarily for web apps and transactional workloads.
00:03:24.640 I assume that's mainly what most people are here doing. How many people here are doing significant BI reporting or data warehousing stuff? A few hands. Sorry, I don’t have a ton for you today; I’m catering to everyone else.
00:03:42.640 Let's discuss setup and configuration quickly. This is usually where people start, and I'm going to breeze through this because you basically do it once and then you never have to care about it again. If you're running on Amazon, the short of it is to use Heroku or take a look at the talk on Postgresql when it's not your day job. Amazon is kind of virtualized on virtualized infrastructure, so you treat it a little differently than you would for other clouds.
00:04:17.359 If you're on other clouds, look into Postgresql when it’s not your day job. There's a great talk by Christoph Pettis, who basically says to go through and set these 50 configurations and then you're set. Don’t worry too much about what they mean; they’re perfectly safe to set. That’s great, as Postgres isn't fun to set up initially. If you're on real hardware, there's a book called 'High Performance Postgresql' by a former colleague, Greg Smith. It's a bit old, about three or four years, but it's still very applicable. He's working on updating it for a newer version.
00:05:01.360 If you have things like SSDs or Fusion I/O, he really dives into how to tune and optimize. There's a link to the talk as well, but if you search for it as 'Postgresql when it’s not your day job,' that’s where to start. The first thing I look at when examining any database, whether it's with customers or one of my own databases, is how it's performing. I begin with the 30,000-foot view.
00:05:58.240 The number one rule of Postgres performance is managing your cache and memory. Postgres is going to be better at keeping things in cache than you are. I know you might think you’ll know better, but you won’t. Postgres will keep frequently accessed data in memory and serve it from there. I've found that most of the time data will follow the 80-20 rule: you'll have about 20% of your data—maybe even just 10%—that gets accessed very frequently, and Postgres will keep this data in memory.
00:06:42.880 Therefore, you don't need enough RAM to match the total size of your database; you just need enough RAM for the frequently accessed data, which varies by application. The easiest way to assess this is to run a simple query which will return a cache hit ratio.
00:07:05.520 What you're aiming for is a cache hit ratio of 99% or higher. As soon as you drop below that, performance really starts to tank. At that point, the easiest fix is to provide more RAM to the system.
00:07:30.800 Another significant factor in improving performance is checking if you're using indexes effectively. Here's a simpler query to give you your index hit rate. I wouldn't necessarily recommend that you aim for a 99% hit rate everywhere for indexes, but as a rough guideline, if your cache hit ratio is always over 99%, that's excellent. For index hit rates, a target of about 95% or higher is a good benchmark.
00:08:01.680 If you see that there are tables with a large number of rows that are never indexed, it's prudent to review and create the needed indexes. You can also query and find unused indexes, allowing you to drop them if needed. The rough rule of thumb is: if there are no indexes being used in large tables, it’s a good idea to add one.
00:08:37.199 In the interest of time, I won’t go through all the specific queries, but you can find them on my blog. If you're a Heroku user, there's also an extension called pg-extras that provides a suite of additional commands, essentially running these queries for you, making it easier to access the data you need without needing to know them all.
00:09:05.639 If you're not on Heroku, does anyone here have a PSQL RC configuration? A few hands. It’s incredibly handy for running queries and accessing insights from your database. For instance, you can use the 'x auto' setting to automatically format output based on screen size for enhanced readability. Additionally, you can set slow query reports from within your session to keep track of performance.
00:09:43.440 Quickly, let's cover some basics of how data retrieval works in Postgres. There are two main ways it accesses data: sequential scans and index scans. A sequential scan begins searching the data on disk in order until it finds your requested data. However, if you're searching for a specific record (e.g., 'where id = 3'), it searches all records unless a limit is applied. On the other hand, index scans utilize various types of indexes to access data and significantly optimize the search process. Under the hood, the way this works is more intricate than implied, but essentially it allows Postgres to narrow down access to specific records more efficiently.
00:10:40.399 Before version 9.2, a limitation in Postgres was that it always had to hit disk to retrieve data even if it was present in the index. With version 9.2, index-only scans were introduced, allowing for better performance under certain conditions. However, they can still be tricky to master. The general rule of thumb for database optimization is that sequential scans are well-suited for larger data ports, while index scans are ideal for most web applications where you want to avoid fetching a large number of records unnecessarily.
00:11:06.399 When examining specific queries for performance, place the word 'EXPLAIN' before your query, which will provide an execution plan. Understanding the output from that can seem confusing at first, but the values to focus on are the startup cost, the maximum cost, and the number of rows returned. The maximum cost indicates the time it is projected to take to return the query, which is crucial to your analysis.
00:11:39.750 If you prepend 'EXPLAIN ANALYZE' to the query, it reports the actual execution time and the plan that gets executed instead of merely estimating it. This provides a clearer view of its actual performance, allowing you to assess the time taken and optimize processes accordingly. For most applications, page response times should ideally be less than 100 milliseconds, while common queries should ideally take less than 10 milliseconds.
00:12:10.680 For rare queries, the target is under 100 milliseconds. In looking at query performance, if you discover that a problematic page is taking longer—let’s say an average of 295 milliseconds—you can often speed things up significantly by adding the right indexes. For example, if your query is filtering on 'salary', adding an index on that column can reduce the execution time significantly.
00:12:46.399 While you might want to review every single page if a user complains about it—great, look into those queries—you shouldn't wait for complaints to determine what's slow. That’s where PG_STAT_STATEMENT becomes invaluable. If you aren’t utilizing this feature, you should be. It provides incredible visibility within Postgres.
00:13:15.839 In its earlier versions, PG_STAT_STATEMENT wasn’t as useful, but from version 9.2 onward, it became an essential tool. It normalizes every query executed against your database, allowing you to analyze how long queries take on average. This tool gives insights into which queries consume the most time, aiding in optimization.
00:13:49.839 From there, you can extract the query text that is underperforming and optimize it. When it comes to determining which queries to optimize based on duration, it often makes sense to not focus on the one taking 10 milliseconds, as it may only improve slightly. However, a query taking 80 milliseconds would yield a greater return on performance if optimized.
00:14:22.360 This leads us to discussing indexes. Postgres supports several types of indexes. As you dig deeper into its documentation, you may feel overwhelmed. A rough rule of thumb to keep in mind is that when you create an index, you're typically creating a B-tree index, which is generally the right choice unless you're doing something more unique that requires additional features.
00:14:58.640 If you’re using HStore or array data types, the GIN index is what you want to use, while those doing full-text searches or dealing with polygon shapes should consider the GiST index. There’s also K-nearest neighbor indexes for similarity match queries, though I must admit I’m not as well-versed in the latter. Most of the time, again, a B-tree index meets your expectations.
00:15:46.000 I want to mention some developments within the Postgres community. There’s a notable group known as the 'Russians' who contribute a great deal to Postgres. They’re incredible and are often researching and implementing new index types and optimizations. Some of their works include creating GIN and GIST indexes.
00:16:18.560 For instance, there's a new index type called 'Vodka' that they’re currently working on, which is an interesting addition to the family of index types. It’s amusing—hence the lighthearted name— and showcases how innovative their approaches can be when developing new database functionalities.
00:17:02.160 Now, there are a few other types of indexes like conditional and functional indexes. Conditional indexes work great if you need to index only a subset of data. For instance, if you're tracking historical addresses and only want to index current addresses, you can create indexes based on a condition like where 'current_address' is true.
00:17:29.760 Functional indexes, alternatively, allow you to create indexes around calculations or transformations done on data. If data always returns the same value when passed through a function, then it’s imperative to declare the index as immutable during creation, allowing Postgres to optimize the process effectively.
00:18:00.360 Combining conditional and functional indexes is also possible, offering great flexibility when setting up your database. It's also recommended to use the 'CREATE INDEX CONCURRENTLY' option; though it may be slightly slower, it won’t hold locks on your database tables for extended periods, allowing the application to run smoothly during index creation.
00:18:38.559 Now let’s touch on HStore and JSON. HStore is an incredibly useful data type in Postgres that acts as a key-value store. It’s included in Postgres but needs to be enabled with the command 'CREATE EXTENSION hstore' to use it in your database applications. It's advantageous for instances where unpredictable or dynamic schema data is stored.
00:19:30.920 In version 9.2, Postgres introduced JSON support, which was quite limited in functionality initially but improved with version 9.3. With the release of JSONB in version 9.4, the JSON support became much more mature, allowing for better indexing and faster querying. You can actually create GIN indexes on JSONB, making it efficient when filtering and searching through data.
00:20:08.840 Moving onto connection pooling, who here runs a connection pooler? It’s likely that many of you do but may not even realize it. Active Record does it by default, as does SQL. Once your application begins making hundreds of connections, consider implementing a standalone connection pooler.
00:20:46.560 PG Bouncer and PG Pool are two options you might explore, and I’d recommend PG Bouncer once you reach around 200 connections to your database. PG Pool has several functions, but it does many things only moderately well. The main advantage of using a connection pooler is scaling your application while enhancing the performance and efficiency of your database interactions.
00:21:46.240 An easy way to scale up your database is to throw more memory at it. While memory is critical for database performance, there are times when you may reach a limitation where scaling up is not feasible. At that point, you might consider scaling out. There are several options to achieve this, including sharding, which can be complicated, or simply utilizing replication and offloading read queries to a replica.
00:22:15.840 There are a few well-established tools for replication, such as Slony, Londiste, and Bucardo, as well as more recent developments like Wall-E and Barman, with the latter being a Heroku project. With these tools, you can effectively archive your PostgreSQL Write-Ahead Log and set up replicas, allowing you to manage read queries in a more efficient manner.
00:23:03.280 The key benefit of having replicas and conducting read queries there is that each Postgres database will maintain its own cache. This allows you to distribute read workloads according to specific needs, optimizing the efficiency for those specific models without overwhelming a single database node.
00:24:01.600 Now let’s speak about backups—does everyone here take backups? Do you test them? If you don’t test your backups, I’d wager you probably don't have reliable backups at all. It's crucial to regularly test backups to avoid any horrifying situations should you ever need to restore them, so be proactive with your backup strategy.
00:24:47.360 There are two main types of backups: logical and physical. Logical backups, often taken with the 'pg_dump' command, maintain human-readability and portability. However, they can start to fail under load, especially as your database size exceeds about 50GB. For larger databases under load, community experts often recommend physical backups, such as those taken with Wally or Barman, since they are easier to scale and impose fewer performance hits.
00:25:32.560 Logical backups are great for exporting and compressing data but can slow down the system during a 'pg_dump.' On the other hand, physical backups copy byte data on disk with limited architecture compatibility.
00:26:13.920 In summary, OLAP and BI data warehousing are distinct from the OLTP applications primarily discussed here. As the discussion progressed, we ventured into advanced topics like disk I/O importance and the relevance of queries on datasets in your applications.
00:27:00.399 When conducting queries, it's beneficial to leverage PG_STAT_STATEMENTS to understand load across applications. Ideally, your goal should be to keep your data in cache with high cache hit ratios while diligently monitoring and optimizing your indexes.
00:28:02.560 Above all, if you do the fundamentals correctly, scaling up memory is an incredibly effective way to tackle performance issues. Thank you, and I'm happy to take any questions.
00:28:56.160 You!
Explore all talks recorded at Ancient City Ruby 2014