00:00:20.900
This talk is about optimizing your app by understanding your PostgreSQL database.
00:00:26.490
From the title, what I want to communicate is that this talk covers the problems you face when application users complain about performance: they say their queries are slow, or, as an application developer, you're trying to understand what is making the application slow and tie that back to the PostgreSQL database.
00:00:39.570
So, what kind of statistics does PostgreSQL give you to figure out what exactly is going on, what's wrong, and then how to fix it? We will discuss some PostgreSQL tools and tips, including pg_stat_statements, useful catalog tables, and recently added PostgreSQL features such as CREATE STATISTICS.
00:01:06.539
A bit about myself: I'm Samay Sharma, and I work as an engineer at Citus. Citus is an open-source extension that scales out PostgreSQL across multiple machines. When your single-node PostgreSQL database starts facing performance issues or can no longer keep up with your workload, you can use Citus to scale it out across multiple servers.
00:01:18.300
I manage a team of solutions engineers. We work with clients' application developers who come to us with performance and scalability problems: they want to grow their databases to several terabytes, tens of terabytes, or even hundreds of terabytes while keeping the benefits of PostgreSQL. I help them model their applications and recommend data-modeling approaches that let their databases scale.
00:01:34.890
I like to include a fun fact about myself on all my slides; I recently got married about four months ago. Some people say that's a fun fact, but it might not remain so after a while!
00:01:40.950
As I mentioned, Citus is open source: you can check out Citus on GitHub and, if you like it, feel free to star us. If you want to learn more, I'll be around during the conference for discussions. However, this talk is not about Citus but about PostgreSQL, and everything I discuss applies to a standard PostgreSQL database.
00:01:54.420
The challenge is often relating application-level issues back to what's going on inside the database. From the application's perspective, users might say a page isn't loading quickly enough. You may figure out that the database is causing the slowdown, but how do you discover what the problem is, tie it to a root cause, and then solve it?
00:02:13.370
The simplest way to address this is by using the statistics PostgreSQL exposes. PostgreSQL captures a lot of stats about your data. When you run ANALYZE, or when PostgreSQL runs it automatically, it collects statistics about your data, such as how selective a filter on a particular column will be.
00:02:27.569
It tracks data about all of your columns and about your database as a whole. It also tracks and exposes activity data, which I’ve categorized as monitoring statistics. This includes what queries are currently running on the system, how many connections are coming into the database, which indexes are being used, which indexes are not being used, and how much I/O is being utilized. All that information is captured in different tables within PostgreSQL, allowing you to query them as regular tables to obtain the information you need.
00:03:13.310
If you are using a managed service, you might care less about administrative statistics, since your provider takes care of server administration for you. Even so, statistics such as replication performance, database size, and the size of individual tables remain useful to watch.
00:03:37.080
What I will do is start from a very generic problem: my application is slow. Everybody can relate to this feeling, as I don’t think anyone claims their application is fantastic and always fast.
00:03:50.430
So, let's begin with the problem. The symptom is application users complaining about slow performance, and whatever monitoring tool you're using shows that the majority of time is spent in database calls. It might be a specific set of database calls, or every page might be slow because it makes hundreds of database calls. How do you identify which query is slow and figure out what exactly is happening?
00:04:18.850
A good starting point is your database's cache hit ratio: the fraction of block reads served from PostgreSQL's buffer cache versus blocks that have to be read from disk. It's a good measure of the overall health of your database.
00:04:45.510
For transactional applications, you ideally want a cache hit ratio greater than 95%, with around 99% being optimal. This ensures most common operations don’t have to hit disk. Depending on the application, such as an analytics application storing 10 terabytes of data, you might accept a lower cache hit ratio since buying 10 terabytes of memory can be expensive.
00:05:01.500
The question is how to measure this and monitor it effectively. Throughout this talk, I’ll cover a few PostgreSQL tables and discuss the different information they can provide, as well as how you can leverage this information to improve application performance.
00:05:11.700
An example query you can run on pg_statio_user_tables counts how many heap blocks were read. From a terminology perspective, the heap is where PostgreSQL stores your table data; indexes are not part of the heap. So, when I refer to heap blocks read, I mean blocks read from the table itself.
00:05:32.790
To calculate the hit ratio, you divide the number of blocks hit by the total of hits plus reads. pg_statio_user_tables tells you, per table, how many blocks were found in PostgreSQL's buffer cache (hits) versus how many could not be found there and had to be read from disk (reads).
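As a minimal sketch, a query along these lines computes the heap hit ratio per table from pg_statio_user_tables:

```sql
-- Heap cache hit ratio per table: blocks found in the buffer cache
-- (hits) divided by total blocks requested (hits + disk reads).
SELECT relname,
       heap_blks_hit,
       heap_blks_read,
       round(heap_blks_hit::numeric /
             nullif(heap_blks_hit + heap_blks_read, 0), 4) AS heap_hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC;
```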
00:06:14.610
In addition to the table hit ratio, you can also measure the index hit ratio. If your indexes don't fit in memory, that's an area worth optimizing. The same view also captures statistics on TOAST (The Oversized-Attribute Storage Technique): Postgres compresses very large column values and stores them separately, and you can check whether reading those from disk is a bottleneck.
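The index hit ratio can be sketched the same way; the idx_blks_* columns aggregate across all indexes on each table:

```sql
-- Index cache hit ratio per table. TOAST activity shows up in the
-- toast_blks_* and tidx_blks_* columns of the same view.
SELECT relname,
       round(idx_blks_hit::numeric /
             nullif(idx_blks_hit + idx_blks_read, 0), 4) AS idx_hit_ratio
FROM pg_statio_user_tables
ORDER BY idx_blks_read DESC;
```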
00:06:39.270
Assuming you discover your cache hit ratio is low, this could imply various underlying issues. For example, insufficient auto-vacuuming can let dead tuples accumulate in your database, causing significant bloat. Bloat is space occupied by dead rows that hasn't been reclaimed.
00:07:04.920
Auto-vacuum runs automatically to clean up the database by removing dead rows. When you update or delete a row in PostgreSQL, the old row version isn't removed immediately; it becomes dead once no transaction can see it, and auto-vacuum later reclaims that space. You can tune auto-vacuum to be more or less aggressive according to your requirements.
00:07:23.640
If auto-vacuum is not tuned properly, it won't clean up old rows effectively, which hurts your cache hit ratios. Bloat means a large fraction of your stored data is dead rows rather than live data. For instance, you could have a one-terabyte database in which only 200 gigabytes are live rows, which is clearly problematic.
00:07:47.280
Additionally, inefficient queries might be using sequential scans instead of index scans. PostgreSQL falls back to sequential scans when no appropriate index exists, so if queries that fetch only a few rows keep scanning entire tables, performance will certainly degrade.
00:08:16.070
Moreover, unused indexes carry a cost: every index must be maintained on writes, so indexes that are never used for reads hurt your write throughput for no benefit. After optimizing a single-node PostgreSQL instance as far as possible, if performance issues persist, you can consider solutions like scaling out with Citus.
00:08:36.130
Moving forward, let's look at a few tables that help you identify these problems in your database. We'll start with the standard database-level monitoring table: pg_stat_database. It gives you a broad picture of database activity.
00:08:55.900
For example, it indicates the number of backends connected, the database ID, and, importantly, details about transaction commits and rollbacks, as well as the number of blocks read and hit. By reviewing this information, you can pinpoint issues with your database throughput.
00:09:13.950
Furthermore, it tracks how many tuples were inserted, updated, and deleted. This data allows you to analyze the workload and helps identify if your database is insert-heavy or if you are experiencing a lot of updates or deletes.
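For reference, here's a sketch of a query pulling those numbers for the current database:

```sql
-- Per-database activity: connection count, commit/rollback totals,
-- cache hit ratio, and the insert/update/delete workload mix.
SELECT datname,
       numbackends,
       xact_commit,
       xact_rollback,
       round(blks_hit::numeric /
             nullif(blks_hit + blks_read, 0), 4) AS hit_ratio,
       tup_inserted,
       tup_updated,
       tup_deleted
FROM pg_stat_database
WHERE datname = current_database();
```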
00:09:30.180
However, these figures alone aren't enough; they only provide an overview of transaction activity. To isolate which part of the workload could be causing issues, you will want to drill down to per-table statistics using pg_stat_user_tables.
00:10:05.360
This table contains information such as how many times sequential scans were initiated versus index scans. You can track how many tuples each of these scans has read, which can help identify whether your queries need more indexing.
00:10:25.290
Another statistic to measure is the number of live versus dead tuples, which can also point to bloat problems related to your auto-vacuum settings. You might discover significant bloat if vacuuming hasn't been keeping up.
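Here's a sketch of a query that pulls the scan mix and the live/dead tuple counts together, along with the vacuum timestamps I'll come back to in a moment:

```sql
-- Many sequential scans on a large table suggest a missing index;
-- a high dead-to-live tuple ratio suggests autovacuum is falling behind.
SELECT relname,
       seq_scan, seq_tup_read,
       idx_scan, idx_tup_fetch,
       n_live_tup, n_dead_tup,
       last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```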
00:10:52.650
As I noted previously, you can run one of the widely available bloat-estimation queries to get a detailed breakdown of how much space dead tuples consume and what needs cleaning. This often reveals substantial opportunities to reclaim storage.
00:11:15.180
Next, the last-vacuum and last-analyze timestamps help you evaluate whether your tables are being cleaned up often enough. If it has been days or longer since your last vacuum, significant bloat has likely accumulated and could be hurting performance.
00:11:41.860
To better manage bloat, it's a good practice to set auto-vacuum to kick in after a smaller fraction of the table changes, particularly for large tables, as shown in the sketch below. This keeps the database clean more continuously. Addressing sequential versus index scan issues alongside this can significantly enhance overall application performance.
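As an illustration, you can override the autovacuum thresholds per table; the table name and values here are hypothetical and should be tuned to your workload:

```sql
-- Trigger autovacuum after ~1% of the table changes instead of the
-- default 20%, so this large table is vacuumed far more often.
ALTER TABLE big_table SET (autovacuum_vacuum_scale_factor = 0.01);
ALTER TABLE big_table SET (autovacuum_analyze_scale_factor = 0.005);
```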
00:12:05.850
Using information from pg_stat_user_tables, you can approximate dead tuples, ascertain which tables are I/O bound, and whether you’re running too many sequential scans. These pointers can help inform optimizations.
00:12:20.680
After you implement these basic tuning measures and still experience slow performance, you must dive deeper into diagnosing slow queries. pg_stat_statements provides essential insights into execution statistics of all SQL statements executed by the server.
00:12:48.230
It normalizes statements by query shape, so executions of the same query with different constants are grouped together. That lets you analyze performance across similar queries rather than treating each execution as unique, which would obscure performance trends.
00:13:05.180
For example, pg_stat_statements provides metrics such as execution time, the number of calls, and I/O statistics for your queries. This information serves as a crucial tool for recognizing which queries demand urgent optimization.
00:13:24.210
Use pg_stat_statements to identify the ten most time-consuming queries by total execution time. Rather than relying only on average execution time, look at total time to surface the queries that cumulatively consume the most resources.
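Here's a sketch of that query; note that on PostgreSQL 13 and later the column is total_exec_time rather than total_time:

```sql
-- Top ten queries by cumulative execution time. A moderately slow
-- query called millions of times often matters more than one slow outlier.
SELECT query,
       calls,
       total_time,
       total_time / calls AS avg_time,
       shared_blks_read   -- order by this instead when I/O is the concern
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```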
00:13:45.550
By focusing your optimization efforts on frequently invoked queries that consume substantial execution time, you can achieve significant performance improvements. In situations where I/O is an issue, search for queries that spend the most time performing disk reads to further optimize your database.
00:14:12.780
Moreover, queries that frequently run, even if they’re fast, can be worth analyzing. Batch processing or aggregating results can prevent extensive resource consumption.
00:14:33.650
Running ANALYZE can often help, as can creating appropriate indexes based on the patterns you observe in pg_stat_statements. Once you've identified slow queries, EXPLAIN ANALYZE will assist further in optimizing them.
00:14:51.290
The essential point is that EXPLAIN shows the plan the planner chose, with estimated costs and row counts, without running the query. EXPLAIN ANALYZE actually executes the query and reports the real execution time and row counts at each step, which is invaluable feedback.
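As a minimal illustration with a hypothetical orders table:

```sql
-- EXPLAIN shows the chosen plan with estimated costs and row counts,
-- without running the query.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- EXPLAIN ANALYZE executes the query and adds actual timings and row
-- counts for each plan node.
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
```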
00:15:17.670
Once you understand where the most time is spent in the query, you can direct your efforts toward tuning that specific section. Pay close attention to discrepancies between estimated results and actual performance.
00:15:48.074
If there’s a vast difference, it’s possible that PostgreSQL has not collected enough data for accurate statistics. In these cases, consider reanalyzing your database using a higher statistics target, which tells PostgreSQL to analyze a broader sample of data, ultimately improving the accuracy of query plans.
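A sketch of doing that for one column; the table, column, and target value are hypothetical:

```sql
-- Raise this column's statistics target from the default of 100, then
-- re-analyze so the planner works from a larger sample.
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders;
```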
00:16:17.010
For instance, when you filter on two correlated columns, PostgreSQL assumes they are independent unless you explicitly declare multi-column statistics with CREATE STATISTICS. With those statistics in place, PostgreSQL can produce much better row estimates and therefore more efficient execution plans.
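As a sketch with hypothetical column names, CREATE STATISTICS (added in PostgreSQL 10) is how you declare that relationship:

```sql
-- Tell the planner that city and zip_code are functionally dependent,
-- so a filter on both isn't treated as two independent conditions.
CREATE STATISTICS city_zip_stats (dependencies) ON city, zip_code
  FROM addresses;
ANALYZE addresses;
```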
00:16:46.850
To summarize, utilize pg_stat_statements to identify critical queries that warrant optimization. When you have slow queries, make sure to rely on ‘EXPLAIN ANALYZE’ to achieve a better understanding of your query performance.
00:17:15.200
Pay attention to discrepancies between estimated and actual results, as they often indicate inaccurate statistics. Fixing those inaccuracies can yield significant performance improvements.
00:17:41.720
This concludes the main content of my talk. As a bonus, I want to mention quick estimates leveraging PostgreSQL statistics. This isn’t tightly related to performance tuning but can help achieve good estimates without executing costly queries.
00:18:02.990
You can use stats to understand null fractions, distinct values, and most common values. These insights have practical uses, such as approximating query results and summarizing data distributions without seeking exact numbers.
00:18:25.370
For example, counting actual rows in a billion-row table can be inefficient, so using PostgreSQL's statistics to gauge a rough estimate can save time and resources.
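For instance, here's a sketch of both kinds of estimates, with a hypothetical events table:

```sql
-- Planner's row-count estimate, maintained by ANALYZE; far cheaper
-- than COUNT(*) on a billion-row table.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'events';

-- Per-column distribution stats: null fraction, distinct values,
-- and most common values.
SELECT null_frac, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'events' AND attname = 'event_type';
```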
00:18:44.090
Ultimately, PostgreSQL exposes a variety of statistics that help identify performance-related issues, while the specific pg_stat tables guide you toward areas for improvement. Focus on top queries, understand their I/O characteristics and the caching behavior, and use the statistical insights to enhance your overall database performance.
00:19:09.040
Finally, I’d like to share that Citus Data has been acquired by Microsoft. We are working on launching Citus on Azure, so if anyone is interested in keeping updated on availability, just visit the link to express interest.