When it all goes Wrong (with Postgres)

by Will Leinweber

In the talk titled "When it all goes Wrong (with Postgres)" presented by Will Leinweber at RailsConf 2019, the speaker addresses common issues developers encounter with PostgreSQL databases. Leinweber, a professional with extensive experience in PostgreSQL management, emphasizes the difficulty in diagnosing problems when Postgres operates smoothly most of the time. This talk aims to guide developers in identifying and resolving common database issues, particularly when users suddenly face performance problems despite no apparent changes.

Key Points Discussed:

- Common Issues with Postgres:

- Many problems stem from applications overusing the database rather than Postgres itself.

- Familiarity with common failure cases will facilitate quicker problem-solving.

Identifying Problems:
- The importance of understanding the system as a combination of resources: CPU, memory, disk I/O, and parallelism.
- A framework for diagnosing issues based on the relationship between these resources.
Key Assumptions:
- Hardware quality is fundamental.
- Postgres should not crash; if it does, further expert analysis may be necessary.
Diagnostic Techniques:
- Utilizing tools like perf, gdb, iostat, and others to analyze running queries and system performance.
- Recognizing common patterns in database operation to identify the root of issues.
Real-world Analogies and Examples:
- An analogy comparing Postgres issues to wartime aircraft studies illustrates the importance of understanding failure patterns to improve system resilience.
- Highlighting the significance of ‘backlog’ work and how certain users can drastically affect performance.

Conclusions and Takeaways:

- Strengthening application architecture and maintaining proper database resource allocation can prevent performance issues.

- Proactive measures like implementing rate limiting and scaling hardware can avert many common disasters.

- Developers are encouraged to familiarize themselves with diagnostic tools while keeping an eye on common usage patterns to maintain healthy database operations.

- Understanding the cyclical nature of system stress can help prepare developers to prevent outages.

00:00:20.960 Okay, hello everyone. Welcome to my talk about "When It All Goes Wrong (with Postgres)."

00:00:28.050 Just briefly, a little bit about me: my name is Will Leinweber. I'm a part of Citus Data, which recently became part of Microsoft. My website is bitfishing.com.

00:00:35.910 If you happen to go there right now, please put your computer on mute because it does autoplay music. Some parts of the slides here will be a little dense, so I've already posted the slides online. Don't worry about taking pictures of the more complicated parts; I'll tweet a link to them afterwards.

00:00:50.550 First, though, let me share a bit of my background. Currently, I work on Citus Cloud, which is a managed version of Citus. For context, Citus is an open-source extension to Postgres that turns it into a multi-node distributed database.

00:01:02.100 Before working at Citus, I was at Heroku focusing on Postgres. At both companies, I have been helping people with their applications that run on Postgres, providing support and guidance on running Postgres effectively.

00:01:21.869 Even though I’ve been working with Postgres for a long time, I still like to think of myself as an application developer first rather than just a DBA. Much of the help, knowledge, and insight I'll share today comes from my current colleague at Citus, Dan Farina, and one of my previous colleagues at Heroku.

00:01:34.950 In this talk, I want to address the common failure cases with Postgres and how to fix them, but also how to quickly figure out what went wrong.

00:01:40.619 The audience today is intended for people who use Postgres but aren’t necessarily Postgres experts. I understand that not everyone here is a full-time DBA.

00:01:54.989 The problem with Postgres is that it generally works well. When I say that it's "pretty good," it’s akin to when someone asks about a weakness in a job interview, and the answer is, "I just try too hard."

00:02:06.959 The issue is that if you’re just running one or two Postgres instances for your applications, you may not get a lot of exposure to the various ways things can break. Every time you encounter a problem, it can feel like it’s the first time.

00:02:19.080 Hopefully, in this talk, I can share with you different ways I’ve seen things go wrong and give you a starting point. You may not memorize exactly what to do, but you’ll know where to look for solutions.

00:02:34.560 So, what do you do if you encounter a problem with Postgres? Do you type it into your favorite search engine and follow the first Stack Overflow post you find? Not exactly! That type of advice can be misleading.

00:02:53.100 Instead, you might come across a mailing list post from Tom Lane and have to sift through a lot of information. But that isn’t what I mean either.

00:03:09.870 The challenge with Postgres is that it’s a complicated system. It includes not just Postgres itself but also the operating system, the hardware it's running on, and the network reliability.

00:03:22.049 One of the most common problems I encounter is cases where the database is simply overloaded. You might think that everything was fine and wonder why there's a sudden issue.

00:03:35.430 Statistically, I’d estimate that around 95% of problems arise from the application being the root cause, while about 4% can be attributed to a Postgres feature called autovacuum, something I can discuss in more detail later. The remaining 1% encompasses a variety of other issues.

00:04:02.400 The irony is that while the autovacuum feature is interesting, most discussions focus on that elusive 1% of rare issues instead of tackling the everyday problems that arise from overloaded applications.

00:04:14.220 A common scenario I witness is clients reaching out with claims that all their performance metrics indicate a database problem. They state that they haven't made any changes to their applications, so naturally, it must be the database that's to blame.

00:04:32.100 This brings me to an analogy that might resonate. During World War II, when fighter planes returned from missions, the military would analyze the locations of bullet holes to determine where to add armor. Logically, one would think bullet holes are the areas to reinforce.

00:04:54.420 However, the reality was that the planes that came back were the ones that survived. The ones that didn't come back often sustained damage in areas that looked relatively unscathed. Therefore, focusing on the visible damage would lead to a misguided approach.

00:05:10.800 In terms of application deployments, just because you didn’t change anything doesn’t mean that external factors haven’t affected your application. Increased traffic or unexpected usage patterns can still lead to problems, often without any notifications.

00:05:28.170 A typical challenge arises when specific resources are stretched thin on your database: memory, CPU, or otherwise. If you are already operating near your performance limits, even a minor uptick in resource demand can snowball and cause significant issues.

00:05:49.869 For example, a query that typically takes a manageable amount of time could suddenly experience delays because the resources are constrained. Similarly, small but frequent queries might compound on this and create headaches.

00:06:02.070 In this talk, I will guide you through diagnosing the actual cause of performance degradation. But first, let’s address some essential assumptions that must be true for the solutions I present to be effective.

00:06:25.169 Firstly, the hardware that your database is running on must be sufficient; low-quality components can lead to problems beyond the scope of this discussion.

00:06:40.530 Next, maintaining good backups and ensuring that all packages are up to date is crucial. Finally, your application and its state should also be sound.

00:06:54.000 Another point to highlight is that your Postgres instance should not suffer from unexpected crashes. If it frequently crashes and creates core dumps, that is a serious issue requiring expert intervention.

00:07:10.290 An important consideration is the configuration of Postgres's memory settings. The default setting in most operating systems allows applications to allocate more memory than they truly need.

00:07:25.920 If Postgres faces memory allocation issues under typical workloads, it could trigger the Out of Memory (OOM) killer, resulting in crashes. Ensuring that you have overcommit settings configured correctly is necessary.

00:07:39.570 During the session, I will also discuss containers. They often introduce additional complexities, especially when running Postgres because of the memory management nuances with container technologies.

00:07:55.529 Larger extensions in Postgres can also cause crashes, so it’s something to keep an eye on.

00:08:09.780 So, if Postgres isn’t the problem, how do you refine your question? We can reframe it to how Postgres and Linux can help you diagnose the actual issues.

00:08:25.110 To do this, we need to think about the server's resources such as CPU load, memory usage, disk I/O, and the degree of parallelism. Whenever you connect to Postgres, it forks a new process for that connection.

00:08:39.720 We can estimate the number of backends that are running based on the connections being made, which can give us insight into the level of parallelism.

00:08:55.620 Another factor to consider is the locks that Postgres is holding. We can examine how these interact when issues arise.

00:09:09.690 If there isn’t excessive CPU, memory, or I/O load, but you still experience problems, this might indicate poor credentials or interrupted networking.

00:09:25.740 If there are many locks, one lock could be blocking everything else, preventing any progress. Often, this can happen due to uncommitted transactions.

00:09:38.940 If you see idle transactions in Postgres, it suggests that someone opened a transaction but didn't commit it, and that can lead to problems.

00:09:53.520 When there are many backends operational, what's likely occurring is that the application is submitting excess work that the system cannot process.

00:10:09.960 If your application opens up connections without closing them, or employs a web server that fails to properly close connections, you could be experiencing a connection leak.

00:10:26.340 In situations where you see a lot of I/O activity but not many backends, it might indicate that your application workload is reading extensive old data from disk.

00:10:45.840 Data that previously resided in memory might now be stored in disk cache and will need to be reloaded, resulting in performance drops.

00:11:00.540 Recent events like a database failover could lead to significant slowdowns as everything has to be reloaded into memory.

00:11:17.790 It's also possible that I/O is high and there are many backends because the application is overwhelming the database with requests.

00:11:32.520 In those cases, if your application employs auto-scaling, adding web servers won’t solve the problem if the root cause is a workload design issue.

00:11:46.260 Next, let’s examine the scenario where memory usage is high without significant disk usage and few backends. This could be related to queries performing large group-by operations.

00:12:03.030 If your queries aren’t compacting the data effectively, they may consume an inordinate amount of memory while trying to pull in all the data required for processing.

00:12:19.440 Now, assuming high memory usage correlates with application workload, you may be dealing with additional application backlogs.

00:12:36.450 This is compounded when locks are holding datasets longer than thought, preventing the release of memory for other operations.

00:12:52.230 Another situation arises when high disk activity occurs alongside abundant memory usage and few connections. This often suggests that a disk-intensive operation is in progress.

00:13:06.960 In this case, seek to determine if queries are executing sequential scans or if a certain user action is causing excessive data reads.

00:13:22.740 If your system has a relatively small dataset being accessed repetitively, it’s possible everything is fitting comfortably into memory.

00:13:37.320 From a performance standpoint, it’s important to monitor for instances where users may send excessive numbers of queries unexpectedly.

00:13:51.630 In cases where CPU, memory, and I/O demand are all soaring, with disk activity being limited, further investigation is warranted.

00:14:06.120 It's crucial to analyze whether a database is performing substantial writes due to either data tasks or sequential scans.

00:14:22.350 On the converse, high disk activity coupled with high parallelism may signal that reporting queries are monopolizing database resources.

00:14:38.760 In such scenarios, consider implementing a follower database where these heavy queries can execute without impeding transactional performance.

00:14:54.690 If case everything is high, it suggests that the application itself is generating more work than what your Postgres instance can handle.

00:15:08.100 I realize this is a lot of information, and I am committed to providing a comprehensive overview.

00:15:20.340 Having discussed what can go wrong, let's explore the tools available to identify and diagnose these issues.

00:15:40.849 As a primarily Ruby developer, I initially avoided tools that provide detailed symbol reports. My inexperience with C made these tools feel intimidating.

00:15:55.279 However, I'm here to reassure you that they’re not as daunting as they appear, and using them can offer invaluable insights into what Postgres is doing.

00:16:10.830 One such tool is `perf`, which allows you to analyze the performance of your system. You can record what Postgres is doing and generate reports that help visualize CPU loads.

00:16:25.770 `perf top` provides real-time CPU data, much like `top`, and can help identify which areas are consuming CPU cycles.

00:16:39.690 `GDB` is another great tool, providing a one-shot ability to trace what Postgres is executing at that moment.

00:16:51.180 If you type `GDB` without any parameters, you can receive an immediate backtrace of active processes, allowing you to understand what Postgres is doing.

00:17:06.840 Tools like `IO stat` can help you monitor disk I/O, where you can identify patterns of too much reading or writing.

00:17:20.760 `htop` provides a colorful live-data representation of processes, showing which resources are being utilized and where.

00:17:33.840 For monitoring network performance, you can use `bwm-ng` or similar tools to analyze network interface activity.

00:17:49.170 If you want to check the backend statuses in Postgres, you can run the `pg_stat_activity` command to see which queries are currently active.

00:18:03.180 The `pg_stat_statements` extension provides a summary of all the queries executed, allowing for performance analysis.

00:18:18.390 So, to recap, knowing what may go wrong and understanding the array of tools available can help you effectively troubleshoot.

00:18:31.620 What do you do once you identify the issues? You may need to adjust configurations, implementing rate limiting if a particular customer is overwhelming the API. It’s ideal to have such limits established proactively.

00:19:07.680 Scaling your database could mean moving to larger hardware solutions, including getting more RAM or faster disks. Identifying whether you over-provisioned is tricky and requires careful consideration.

00:19:21.120 Lastly, code changes may be required. If you’re producing numerous unoptimized queries, exploring ways to optimize those or batch requests can significantly improve performance.

00:19:39.840 I want to conclude with an interesting perspective from a talk I watched years ago, which delves into the perpetual cycle of outages within systems and applications.

00:20:03.090 It highlighted three critical boundaries your system faces: economic thresholds, workloads, and the performance boundary, which is uniquely defined without natural deterrents.

00:20:18.511 The performance boundary lacks a safety buffer. Without one, applications will inevitably drift toward it until a major incident occurs.

00:20:33.659 As you tighten your response thresholds—like ensuring that 95% of requests don’t exceed certain latencies—you risk getting closer to that boundary until a failure occurs.

00:20:49.170 After emergencies, there’s often a temporary push for better practices, but over time the urgency fades, and systems may end up pushing boundaries once again.

00:21:09.330 This cycle keeps repeating, affecting not just systems but everyone involved—employees, customers, and stakeholders.

00:21:25.800 Thank you so much for your attention, and if you have any questions or points to discuss further, I'm here.