00:00:20.960
Okay, hello everyone. Welcome to my talk about "When It All Goes Wrong (with Postgres)."
00:00:28.050
Just briefly, a little bit about me: my name is Will Leinweber. I'm a part of Citus Data, which recently became part of Microsoft. My website is bitfishing.com.
00:00:35.910
If you happen to go there right now, please put your computer on mute because it does autoplay music. Some parts of the slides here will be a little dense, so I've already posted the slides online. Don't worry about taking pictures of the more complicated parts; I'll tweet a link to them afterwards.
00:00:50.550
First, though, let me share a bit of my background. Currently, I work on Citus Cloud, which is a managed version of Citus. For context, Citus is an open-source extension to Postgres that turns it into a multi-node distributed database.
00:01:02.100
Before working at Citus, I was at Heroku focusing on Postgres. At both companies, I have been helping people with their applications that run on Postgres, providing support and guidance on running Postgres effectively.
00:01:21.869
Even though I’ve been working with Postgres for a long time, I still like to think of myself as an application developer first rather than just a DBA. Much of the help, knowledge, and insight I'll share today comes from Dan Farina, a colleague of mine now at Citus and previously at Heroku.
00:01:34.950
In this talk, I want to address the common failure cases with Postgres and how to fix them, but also how to quickly figure out what went wrong.
00:01:40.619
This talk is intended for people who use Postgres but aren’t necessarily Postgres experts. I understand that not everyone here is a full-time DBA.
00:01:54.989
The problem with Postgres is that it generally works well. When I say that it's "pretty good," it’s akin to when someone asks about a weakness in a job interview, and the answer is, "I just try too hard."
00:02:06.959
The issue is that if you’re just running one or two Postgres instances for your applications, you may not get a lot of exposure to the various ways things can break. Every time you encounter a problem, it can feel like it’s the first time.
00:02:19.080
Hopefully, in this talk, I can share with you different ways I’ve seen things go wrong and give you a starting point. You may not memorize exactly what to do, but you’ll know where to look for solutions.
00:02:34.560
So, what do you do if you encounter a problem with Postgres? Do you type it into your favorite search engine and follow the first Stack Overflow post you find? Not exactly! That type of advice can be misleading.
00:02:53.100
Instead, you might come across a mailing list post from Tom Lane and have to sift through a lot of information. But that isn’t what I mean either.
00:03:09.870
The challenge with Postgres is that it’s a complicated system. It includes not just Postgres itself but also the operating system, the hardware it's running on, and the network reliability.
00:03:22.049
One of the most common problems I encounter is cases where the database is simply overloaded. You might think that everything was fine and wonder why there's a sudden issue.
00:03:35.430
Statistically, I’d estimate that around 95% of problems arise from the application being the root cause, while about 4% can be attributed to a Postgres feature called autovacuum, something I can discuss in more detail later. The remaining 1% encompasses a variety of other issues.
00:04:02.400
The irony is that while the autovacuum feature is interesting, most discussions focus on that elusive 1% of rare issues instead of tackling the everyday problems that arise from overloaded applications.
00:04:14.220
A common scenario I witness is clients reaching out with claims that all their performance metrics indicate a database problem. They state that they haven't made any changes to their applications, so naturally, it must be the database that's to blame.
00:04:32.100
This brings me to an analogy that might resonate. During World War II, when fighter planes returned from missions, the military would analyze the locations of bullet holes to determine where to add armor. Logically, one would think bullet holes are the areas to reinforce.
00:04:54.420
However, the reality was that the planes that came back were the ones that survived. The ones that didn't come back often sustained damage in areas that looked relatively unscathed. Therefore, focusing on the visible damage would lead to a misguided approach.
00:05:10.800
In terms of application deployments, just because you didn’t change anything doesn’t mean that external factors haven’t affected your application. Increased traffic or unexpected usage patterns can still lead to problems, often without any notifications.
00:05:28.170
A typical challenge arises when specific resources are stretched thin on your database: memory, CPU, or otherwise. If you are already operating near your performance limits, even a minor uptick in resource demand can snowball and cause significant issues.
00:05:49.869
For example, a query that typically takes a manageable amount of time could suddenly experience delays because the resources are constrained. Similarly, small but frequent queries might compound on this and create headaches.
00:06:02.070
In this talk, I will guide you through diagnosing the actual cause of performance degradation. But first, let’s address some essential assumptions that must be true for the solutions I present to be effective.
00:06:25.169
Firstly, the hardware that your database is running on must be sufficient; low-quality components can lead to problems beyond the scope of this discussion.
00:06:40.530
Next, maintaining good backups and ensuring that all packages are up to date is crucial. Finally, your application and its state should also be sound.
00:06:54.000
Another point to highlight is that your Postgres instance should not suffer from unexpected crashes. If it frequently crashes and creates core dumps, that is a serious issue requiring expert intervention.
00:07:10.290
An important consideration is how memory overcommit is configured on the host. By default, Linux lets applications allocate more memory than is physically available, on the assumption that they won't all use what they asked for.
00:07:25.920
If Postgres then actually uses memory the kernel can't back under a typical workload, the Out of Memory (OOM) killer can terminate Postgres processes, resulting in crashes. Making sure the overcommit settings are configured correctly is necessary.
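On Linux, the knobs in question are the vm.overcommit sysctls. As a hedged sketch (exact values depend on your machine, though the Postgres documentation recommends strict accounting on dedicated database hosts):

```sh
# Check the current overcommit policy (0 = heuristic, 1 = always, 2 = strict)
cat /proc/sys/vm/overcommit_memory

# On a dedicated Postgres host, strict accounting keeps the OOM killer
# from picking off backends; the ratio here is illustrative, not prescriptive.
sudo sysctl -w vm.overcommit_memory=2
sudo sysctl -w vm.overcommit_ratio=80
```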
00:07:39.570
During the session, I will also discuss containers. They often introduce additional complexities, especially when running Postgres because of the memory management nuances with container technologies.
00:07:55.529
Buggy extensions can also crash Postgres, since extensions run native code inside the server process, so that’s something to keep an eye on.
00:08:09.780
So, if Postgres isn’t the problem, how do you refine your question? We can reframe it to how Postgres and Linux can help you diagnose the actual issues.
00:08:25.110
To do this, we need to think about the server's resources such as CPU load, memory usage, disk I/O, and the degree of parallelism. Whenever you connect to Postgres, it forks a new process for that connection.
00:08:39.720
We can estimate the number of backends that are running based on the connections being made, which can give us insight into the level of parallelism.
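For instance, here is a quick sketch of counting backends by state using the standard `pg_stat_activity` view:

```sql
-- One row per backend; "active" means a query is executing right now
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;
```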
00:08:55.620
Another factor to consider is the locks that Postgres is holding. We can examine how these interact when issues arise.
00:09:09.690
If there isn’t excessive CPU, memory, or I/O load but you still experience problems, the cause may be outside the database entirely, such as bad credentials or flaky networking.
00:09:25.740
If there are many locks, one lock could be blocking everything else, preventing any progress. Often, this can happen due to uncommitted transactions.
00:09:38.940
If you see sessions that are "idle in transaction" in Postgres, it means someone opened a transaction but never committed it, and the locks that transaction holds can lead to problems.
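On Postgres 9.6 and later, `pg_blocking_pids()` makes the blocked/blocker relationships visible. A minimal sketch:

```sql
-- Show waiting backends alongside the PIDs that are blocking them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       now() - xact_start    AS xact_age,
       state,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```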
00:09:53.520
When there are many backends operational, what's likely occurring is that the application is submitting excess work that the system cannot process.
00:10:09.960
If your application opens up connections without closing them, or employs a web server that fails to properly close connections, you could be experiencing a connection leak.
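One way to spot a leak, as a sketch, is to group the open connections by where they come from; an application that keeps accumulating idle sessions stands out quickly:

```sql
-- Count connections per application and client address, busiest first
SELECT application_name,
       client_addr,
       state,
       count(*),
       min(backend_start) AS oldest_connection
FROM pg_stat_activity
GROUP BY 1, 2, 3
ORDER BY count(*) DESC;
```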
00:10:26.340
In situations where you see a lot of I/O activity but not many backends, it might indicate that your application workload is reading extensive old data from disk.
00:10:45.840
Data that previously resided in memory may have been evicted from the cache and now has to be read back from disk, resulting in performance drops.
00:11:00.540
Recent events like a database failover could lead to significant slowdowns as everything has to be reloaded into memory.
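One mitigation worth knowing here is the `pg_prewarm` contrib module, which loads a relation back into cache up front instead of waiting for queries to fault it in. A sketch, with a hypothetical table name:

```sql
-- Requires the pg_prewarm contrib module to be available
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- "orders" is a hypothetical hot table; returns the number of blocks loaded
SELECT pg_prewarm('orders');
```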
00:11:17.790
It's also possible that I/O is high and there are many backends because the application is overwhelming the database with requests.
00:11:32.520
In those cases, if your application auto-scales, adding web servers won’t solve the problem: each new server just opens more connections and sends more queries to a database that is already the bottleneck.
00:11:46.260
Next, let’s examine the scenario where memory usage is high without significant disk usage and few backends. This could be related to queries performing large group-by operations.
00:12:03.030
If your queries aggregate or sort large amounts of data, they may consume an inordinate amount of memory while pulling in everything required for processing.
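You can usually confirm this with EXPLAIN: a sort or hash aggregate that exceeds `work_mem` spills to disk and says so in the plan. A sketch against a hypothetical `events` table:

```sql
-- work_mem is the per-sort/per-hash budget, and every backend can use it,
-- so large values multiply across connections; 64MB is illustrative.
SET work_mem = '64MB';

-- Look for "Sort Method: external merge" or multiple hash batches in the
-- output; both mean the operation spilled to disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id, count(*)
FROM events
GROUP BY user_id;
```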
00:12:19.440
Now, if high memory usage does track the application workload, you may simply be dealing with a backlog of application work building up.
00:12:36.450
This is compounded when long-running transactions and locks hold onto data longer than expected, preventing memory from being released for other operations.
00:12:52.230
Another situation arises when high disk activity occurs alongside abundant memory usage and few connections. This often suggests that a disk-intensive operation is in progress.
00:13:06.960
In this case, try to determine whether queries are executing sequential scans or whether a particular user action is causing excessive data reads.
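Postgres keeps per-table counters that make the sequential-scan question easy to answer. A quick sketch:

```sql
-- Tables whose rows are mostly read via sequential scans, heaviest first
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;
```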
00:13:22.740
If your system has a relatively small dataset being accessed repetitively, it’s possible everything is fitting comfortably into memory.
00:13:37.320
From a performance standpoint, it’s important to monitor for instances where users may send excessive numbers of queries unexpectedly.
00:13:51.630
In cases where CPU and memory demand are soaring but disk activity stays low, further investigation is warranted.
00:14:06.120
It's crucial to analyze whether the database is performing substantial writes from data-loading tasks, or heavy reads from sequential scans.
00:14:22.350
Conversely, high disk activity coupled with high parallelism may signal that reporting queries are monopolizing database resources.
00:14:38.760
In such scenarios, consider implementing a follower database where these heavy queries can execute without impeding transactional performance.
00:14:54.690
If everything is high, it suggests that the application itself is generating more work than your Postgres instance can handle.
00:15:08.100
I realize this is a lot of information; the aim is to give you a comprehensive overview, not something to memorize.
00:15:20.340
Having discussed what can go wrong, let's explore the tools available to identify and diagnose these issues.
00:15:40.849
As a primarily Ruby developer, I initially avoided low-level tools that report on symbols and stack traces. My inexperience with C made these tools feel intimidating.
00:15:55.279
However, I'm here to reassure you that they’re not as daunting as they appear, and using them can offer invaluable insights into what Postgres is doing.
00:16:10.830
One such tool is `perf`, which allows you to analyze the performance of your system. You can record what Postgres is doing and generate reports that help visualize CPU loads.
00:16:25.770
`perf top` provides real-time CPU data, much like `top`, and can help identify which areas are consuming CPU cycles.
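As a rough sketch of the workflow (these are standard perf invocations, nothing Postgres-specific):

```sh
# Sample all CPUs with call graphs for 30 seconds, then browse the report
sudo perf record -a -g -- sleep 30
sudo perf report

# Or watch the hottest functions live, like top but per-symbol
sudo perf top
```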
00:16:39.690
`gdb` is another great tool, giving you a one-shot way to see exactly what Postgres is executing at a given moment.
00:16:51.180
By attaching `gdb` to a backend process, you can get an immediate backtrace, which tells you where that backend is spending its time.
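A minimal sketch of that one-shot backtrace, assuming you've already picked a suspicious backend PID out of `pg_stat_activity` (12345 here is a placeholder):

```sh
# Attach, print a backtrace, and detach without stopping the process for long
sudo gdb --batch -ex 'bt' -p 12345
```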
00:17:06.840
Tools like `iostat` can help you monitor disk I/O and identify patterns of too much reading or writing.
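A typical invocation, as a sketch; `-x` adds extended per-device statistics:

```sh
# Extended device stats refreshed every second; watch %util and await climb
iostat -x 1
```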
00:17:20.760
`htop` provides a colorful live-data representation of processes, showing which resources are being utilized and where.
00:17:33.840
For monitoring network performance, you can use `bwm-ng` or similar tools to analyze network interface activity.
00:17:49.170
If you want to check the backend statuses in Postgres, you can query the `pg_stat_activity` view to see which queries are currently active.
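A sketch of the kind of query to reach for first, surfacing the longest-running active statements:

```sql
-- Longest-running active statements first, with what they're waiting on
SELECT pid,
       now() - query_start AS runtime,
       wait_event_type,
       wait_event,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC;
```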
00:18:03.180
The `pg_stat_statements` extension provides a summary of all the queries executed, allowing for performance analysis.
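A sketch of getting it running and finding the most expensive statements; note that it must be preloaded, and the column names below are the Postgres 13+ ones (older versions call them total_time and mean_time):

```sql
-- Requires shared_preload_libraries = 'pg_stat_statements' and a restart
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Which normalized queries consume the most total execution time?
SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       left(query, 60) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```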
00:18:18.390
So, to recap, knowing what may go wrong and understanding the array of tools available can help you effectively troubleshoot.
00:18:31.620
What do you do once you identify the issues? You may need to adjust configurations, implementing rate limiting if a particular customer is overwhelming the API. It’s ideal to have such limits established proactively.
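On the Postgres side, a couple of per-role settings can act as crude guardrails. A hedged sketch, with hypothetical role names:

```sql
-- Cap how many connections the reporting role can hold open at once
ALTER ROLE reporting CONNECTION LIMIT 5;

-- Cancel any statement from this role that runs longer than 30 seconds
ALTER ROLE app_user SET statement_timeout = '30s';
```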
00:19:07.680
Scaling your database could mean moving to larger hardware, such as more RAM or faster disks. Knowing whether you're over- or under-provisioned is tricky and requires careful consideration.
00:19:21.120
Lastly, code changes may be required. If you’re producing numerous unoptimized queries, exploring ways to optimize those or batch requests can significantly improve performance.
00:19:39.840
I want to conclude with an interesting perspective from a talk I watched years ago, which delves into the perpetual cycle of outages within systems and applications.
00:20:03.090
It highlighted three boundaries every system operates within: an economic boundary, a workload boundary, and a performance boundary, and unlike the other two, nothing naturally pushes you back from the performance boundary.
00:20:18.511
The performance boundary lacks a safety buffer. Without one, applications will inevitably drift toward it until a major incident occurs.
00:20:33.659
As you tighten your response thresholds—like ensuring that 95% of requests don’t exceed certain latencies—you risk getting closer to that boundary until a failure occurs.
00:20:49.170
After emergencies, there’s often a temporary push for better practices, but over time the urgency fades, and systems may end up pushing boundaries once again.
00:21:09.330
This cycle keeps repeating, affecting not just systems but everyone involved—employees, customers, and stakeholders.
00:21:25.800
Thank you so much for your attention, and if you have any questions or points to discuss further, I'm here.