RailsConf 2019

Optimizing your app by understanding your PostgreSQL database

by Samay Sharma

In this RailsConf 2019 talk titled "Optimizing your app by understanding your PostgreSQL database," Samay Sharma emphasizes the importance of comprehending PostgreSQL statistics to enhance application performance. The focus is on identifying issues causing slow application performance through database understanding. Key points discussed include:

  • Performance Issues: Common complaints from users often relate to slow queries, prompting developers to dig deeper into PostgreSQL performance metrics.
  • Understanding PostgreSQL Statistics: PostgreSQL provides various statistics that can aid in diagnosing performance issues, such as cache hit ratios and monitoring statistics.
  • Key Tables: Vital PostgreSQL tables like pg_stat_database, pg_stat_user_tables, and pg_stat_statements are highlighted for their roles in providing insights about database performance, query execution, and user activity.
  • Cache Hit Ratio: A crucial measure indicating how efficiently the database serves queries from memory rather than disk. Ideally, transactional applications should aim for a cache hit ratio greater than 95%.
  • Bloat Management: Auto vacuum processes need to be properly tuned to manage unused data and optimize cache utilization to maintain performance.
  • Index Usage: Developers should monitor both used and unused indexes to understand their impact on performance. Misconfigured indexes may slow down queries due to additional maintenance overhead.
  • Tuning Queries: The use of the EXPLAIN ANALYZE command helps identify the slowest parts of queries and offers insights for optimization. Query plans should be closely examined, especially for discrepancies between estimated and actual row returns.
  • Statistics Correlation: The presentation emphasizes leveraging statistics for better query optimization, including the ability to declare multi-column statistics for improved accuracy in query execution plans.

In conclusion, understanding and utilizing PostgreSQL's rich set of tools and statistical data can significantly improve application performance by effectively diagnosing and resolving database-related bottlenecks and inefficiencies.

Samay also shares that Citus Data has been acquired by Microsoft, indicating ongoing advancements in PostgreSQL technologies and support. The session wraps with an invitation for further discussion and a demonstration of PostgreSQL's capabilities in optimization and maintenance.

00:00:20.900 This talk is about optimizing your app by understanding your PostgreSQL database.
00:00:26.490 What I want to cover in this talk is the problem you face when application users complain about performance: they say their queries are slow, or, as an application developer, you're trying to understand what is making the application slow and to tie that back to the PostgreSQL database.
00:00:39.570 So, what kind of statistics does PostgreSQL give you to figure out what exactly is going on, what's wrong, and then how to fix it? We will discuss some PostgreSQL tools and tips, including pg_stat_statements, useful catalog tables, and recently added PostgreSQL features such as CREATE STATISTICS.
00:01:06.539 A bit about myself: I'm Samay Sharma, and I work as an engineer at Citus. Citus is basically an open-source extension designed to scale out PostgreSQL across multiple machines. When you have a single-node PostgreSQL database and start facing performance issues while not being able to keep up with the workload, you can use Citus to scale it out across multiple servers.
00:01:18.300 I manage a team of solutions engineers. We collaborate with clients' application developers who come to us with performance and scalability problems. They often want to grow their databases to several terabytes, tens of terabytes, or even hundreds of terabytes while maintaining the benefits of PostgreSQL. I help clients model their applications and offer recommendations on data modeling to assist them in scaling their databases.
00:01:34.890 I like to include a fun fact about myself on all my slides; I recently got married about four months ago. Some people say that's a fun fact, but it might not remain so after a while!
00:01:40.950 As I mentioned, Citus is open source. You can actually visit GitHub to check out Citus and if you like it, feel free to star us. If you want to learn more about Citus, I’ll be around during the conference for discussions. However, this talk is not about Citus, but rather about PostgreSQL, and everything I discuss applies to your standard PostgreSQL database.
00:01:54.420 The challenge is often relating issues back to what's going on within the database. From an application perspective, users might say their app page is not loading quickly enough. You may figure out that the problem lies within the database. Now, how do you discover what that problem is, tie it to a root cause, and then solve it?
00:02:13.370 The simplest way to address this is by using the statistics exposed by PostgreSQL. PostgreSQL captures a lot of stats about your data. When you run ANALYZE, or when PostgreSQL runs it automatically, it collects various statistics about your data, such as how selective a filter on a given column is likely to be.
00:02:27.569 It tracks data about all of your columns and about your database as a whole. It also tracks and exposes activity data, which I’ve categorized as monitoring statistics. This includes what queries are currently running on the system, how many connections are coming into the database, which indexes are being used, which indexes are not being used, and how much I/O is being utilized. All that information is captured in different tables within PostgreSQL, allowing you to query them as regular tables to obtain the information you need.
00:03:13.310 If you are using a managed service, you might care less about some administrative statistics, since your provider takes care of server administration. Still, statistics such as replication performance, database size, and the size of individual tables remain useful to watch.
00:03:37.080 What I will do is start from a very generic problem: my application is slow. Everybody can relate to this feeling, as I don’t think anyone claims their application is fantastic and always fast.
00:03:50.430 So, let's begin with the problem. Symptoms include application users complaining about slow performance, and whatever monitoring tool you're using points out that the majority of time is spent in database calls. It might be a specific set of database calls, or every page load might be slow because the page is making hundreds of database calls. How do you identify which query is slow and figure out what exactly is happening?
00:04:18.850 A good starting point is to examine your database's cache hit ratio. The cache hit ratio measures how many of your queries are being served from the cache versus how many require reading from disk. It serves as a good measure to evaluate the health of your database.
00:04:45.510 For transactional applications, you ideally want a cache hit ratio greater than 95%, with around 99% being optimal. This ensures most common operations don’t have to hit disk. Depending on the application, such as an analytics application storing 10 terabytes of data, you might accept a lower cache hit ratio since buying 10 terabytes of memory can be expensive.
00:05:01.500 The question is how to measure this and monitor it effectively. Throughout this talk, I’ll cover a few PostgreSQL tables and discuss the different information they can provide, as well as how you can leverage this information to improve application performance.
00:05:11.700 An example query you can run on pg_statio_user_tables counts how many heap blocks were read. From a terminology perspective, the heap is where PostgreSQL stores your table data; indexes are not part of the heap. So when I refer to heap blocks read, I mean how many blocks were read from the heap.
00:05:32.790 To calculate the hit ratio, you divide the number of blocks hit by the total of hits plus reads. pg_statio_user_tables tells you how many times your heap blocks were found in the PostgreSQL buffer cache versus how many times they could not be found there and had to be read from disk. Examining this ratio gives you insight into how efficiently your data is being served from memory.
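A minimal sketch of that calculation, summed across all user tables in pg_statio_user_tables (the counters reflect activity since the last statistics reset):

```sql
-- Heap (table) cache hit ratio: blocks found in shared buffers
-- divided by total blocks requested (hits + disk reads)
SELECT sum(heap_blks_hit)  AS heap_hit,
       sum(heap_blks_read) AS heap_read,
       round(sum(heap_blks_hit)::numeric
             / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0), 4) AS hit_ratio
FROM pg_statio_user_tables;
```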
00:06:14.610 In addition to obtaining the table hit ratio, you can also measure the index hit ratio. If your indexes are not fitting in memory, it could indicate an area that requires optimization. The table also captures other information, such as statistics on TOAST (The Oversized-Attribute Storage Technique). Postgres compresses very large columns and stores them separately, which you can analyze to identify if reading from disk is a bottleneck.
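The index-side ratio can be computed the same way from pg_statio_user_indexes; a value well below your table hit ratio suggests indexes are being pushed out of memory:

```sql
-- Index cache hit ratio across all user indexes
SELECT sum(idx_blks_hit)  AS idx_hit,
       sum(idx_blks_read) AS idx_read,
       round(sum(idx_blks_hit)::numeric
             / nullif(sum(idx_blks_hit) + sum(idx_blks_read), 0), 4) AS idx_hit_ratio
FROM pg_statio_user_indexes;
```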
00:06:39.270 Assuming you discover your cache hit ratio is low, this could imply various underlying issues. For example, insufficient auto-vacuuming can lead to dead tuples accumulating in your database, causing significant bloat. Bloat is, essentially, space taken up by dead, unused data.
00:07:04.920 Auto-vacuum runs automatically to clean up the database by removing dead rows. When you update or delete a row in PostgreSQL, it marks it as not visible to other transactions. Later, auto-vacuum cleans up these rows. You can tune auto-vacuum to be either more or less aggressive according to your requirements.
00:07:23.640 If auto-vacuum is not tuned properly, it won't effectively clean up old rows, negatively impacting your cache hit ratios. Bloat typically occurs when you have a large amount of data and little of it is actual usable data due to poor vacuuming. For instance, you could have a terabyte database where only 200 gigabytes are usable, which is problematic.
00:07:47.280 Additionally, inefficient queries might be using sequential scans instead of index scans. PostgreSQL opts for a sequential scan when a query has no appropriate index to use. So if queries that should pinpoint a few rows are instead continuously scanning the whole table, your performance will certainly degrade.
00:08:16.070 Moreover, if you have many unused indexes on your table and aren't using them appropriately, maintaining these can hinder your write throughput. After optimizing the single-node PostgreSQL instance to the best extent possible, if performance issues persist, you can consider solutions like scaling out with Citus.
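One way to spot candidates for removal is to look for indexes that have never been scanned; a sketch using pg_stat_user_indexes (idx_scan counts only since the last statistics reset, so check over a representative period before dropping anything):

```sql
-- Indexes never used since the last stats reset, largest first
SELECT schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```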
00:08:36.130 Moving forward, let’s look into a few tables that will help you identify these problems in your database. We'll start with the standard database monitoring table: pg_stat_database. This table provides comprehensive pieces of information to understand the database activity.
00:08:55.900 For example, it indicates the number of backends connected, the database ID, and, importantly, details about transaction commits and rollbacks, as well as the number of blocks read and hit. By reviewing this information, you can pinpoint issues with your database throughput.
00:09:13.950 Furthermore, it tracks how many tuples were inserted, updated, and deleted. This data allows you to analyze the workload and helps identify if your database is insert-heavy or if you are experiencing a lot of updates or deletes.
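A sketch of pulling those counters for the current database:

```sql
-- Database-wide activity: connections, transaction outcomes,
-- block I/O, and tuple-level write workload
SELECT datname, numbackends,
       xact_commit, xact_rollback,
       blks_read, blks_hit,
       tup_inserted, tup_updated, tup_deleted
FROM pg_stat_database
WHERE datname = current_database();
```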
00:09:30.180 However, these figures alone aren't enough; they only provide an overview of transaction activity. To isolate which part of the workload could be causing issues, you will want to drill down to per-table statistics using pg_stat_user_tables.
00:10:05.360 This table contains information such as how many times sequential scans were initiated versus index scans. You can track how many tuples each of these scans has read, which can help identify whether your queries need more indexing.
00:10:25.290 Another statistical point to measure is the number of live versus dead tuples, which can also point to bloat problems related to your auto-vacuum settings. You might discover significant bloat if you haven’t been maintaining your application correctly.
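A sketch of that per-table drill-down, ordered by dead tuples so bloat candidates surface first:

```sql
SELECT relname,
       seq_scan, seq_tup_read,           -- sequential scan activity
       idx_scan, idx_tup_fetch,          -- index scan activity
       n_live_tup, n_dead_tup,           -- live vs. dead tuples (bloat hint)
       last_autovacuum, last_autoanalyze -- when cleanup last happened
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```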
00:10:52.650 As I noted previously, you can run one of the community's bloat-estimation queries against PostgreSQL to get a detailed analysis of how much space is consumed by dead tuples and what needs cleaning. This process can reveal substantial opportunities for optimizing storage.
00:11:15.180 Next, the last vacuum and data analysis timestamps can also help evaluate how often your database is being cleaned up adequately. If it has been days or longer since your last vacuum, it’s likely that significant bloat has accumulated and could negatively impact performance.
00:11:41.860 To better manage bloat, it's a good practice to set auto-vacuum to kick in after smaller changes, particularly for large tables. This enables the database to maintain cleanliness more effectively. Addressing any concerns about sequential and index scans can significantly enhance your overall application performance.
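That per-table tuning can be sketched as follows; the table name and values here are illustrative, not a recommendation:

```sql
-- Default scale factor is 0.2 (20% of rows must change before
-- autovacuum runs); lower it so large tables are vacuumed sooner.
-- "events" is a hypothetical large table.
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_threshold    = 1000
);
```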
00:12:05.850 Using information from pg_stat_user_tables, you can approximate dead tuples, ascertain which tables are I/O bound, and whether you’re running too many sequential scans. These pointers can help inform optimizations.
00:12:20.680 After you implement these basic tuning measures and still experience slow performance, you must dive deeper into diagnosing slow queries. pg_stat_statements provides essential insights into execution statistics of all SQL statements executed by the server.
00:12:48.230 It categorizes these statements by query format, allowing you to analyze performance across similar queries instead of each execution of a unique query, which might obscure performance trends.
00:13:05.180 For example, pg_stat_statements provides metrics such as execution time, the number of calls, and I/O statistics for your queries. This information serves as a crucial tool for recognizing which queries demand urgent optimization.
00:13:24.210 Utilize queries from pg_stat_statements to identify the top ten most time-consuming queries in terms of total execution time. Rather than just relying on average execution time, analyze total time to highlight the queries that cumulatively consume extensive resources.
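A sketch of that top-ten query (on PostgreSQL 12 and earlier the column is total_time, in milliseconds; from PostgreSQL 13 onward it is total_exec_time):

```sql
-- Queries that consumed the most cumulative execution time
SELECT query, calls, total_time,
       total_time / calls AS avg_time_ms,
       rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```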
00:13:45.550 By focusing your optimization efforts on frequently invoked queries that consume substantial execution time, you can achieve significant performance improvements. In situations where I/O is an issue, search for queries that spend the most time performing disk reads to further optimize your database.
00:14:12.780 Moreover, queries that frequently run, even if they’re fast, can be worth analyzing. Batch processing or aggregating results can prevent extensive resource consumption.
00:14:33.650 Running ANALYZE can often help optimize performance, as can creating appropriate indexes based on patterns observed in pg_stat_statements. Once you've identified slow queries, EXPLAIN ANALYZE will further assist in optimizing them.
00:14:51.290 The essential distinction is that EXPLAIN shows the plan the planner has chosen, along with its cost and row estimates, without running the query. EXPLAIN ANALYZE actually executes the query and reports the real execution time and row counts for each plan node, providing valuable feedback.
00:15:17.670 Once you understand where the most time is spent in the query, you can direct your efforts toward tuning that specific section. Pay close attention to discrepancies between estimated results and actual performance.
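As a sketch, with a hypothetical users table, the values to compare in the output are the planner's estimated rows against the actual rows reported for each plan node:

```sql
-- EXPLAIN alone shows the chosen plan and estimates;
-- adding ANALYZE runs the query and adds real timings and row counts
EXPLAIN ANALYZE
SELECT *
FROM users
WHERE email = 'sam@example.com';
```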
00:15:48.074 If there’s a vast difference, it’s possible that PostgreSQL has not collected enough data for accurate statistics. In these cases, consider reanalyzing your database using a higher statistics target, which tells PostgreSQL to analyze a broader sample of data, ultimately improving the accuracy of query plans.
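A sketch of raising the target for a single column (table and column names are hypothetical; the default per-column target is 100):

```sql
-- Sample more rows for this column's statistics, then re-analyze
ALTER TABLE users ALTER COLUMN email SET STATISTICS 1000;
ANALYZE users;
```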
00:16:17.010 For instance, when querying with two correlated columns, PostgreSQL may not consider their relationship unless you explicitly declare multi-column statistics. When proper statistics are in place, PostgreSQL can leverage that data to create more efficient execution plans.
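A sketch of declaring such multi-column statistics with CREATE STATISTICS (available since PostgreSQL 10; the table and columns here are hypothetical examples of correlated values):

```sql
-- Tell the planner that city and zip_code are functionally related,
-- so combined filters aren't treated as independent
CREATE STATISTICS city_zip_stats (dependencies)
  ON city, zip_code FROM addresses;
ANALYZE addresses;  -- populate the new extended statistics
```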
00:16:46.850 To summarize, utilize pg_stat_statements to identify critical queries that warrant optimization. When you have slow queries, make sure to rely on ‘EXPLAIN ANALYZE’ to achieve a better understanding of your query performance.
00:17:15.200 Pay attention to discrepancies between estimated and actual results, as it often illustrates statistical inaccuracies. Addressing these inaccuracies can lead to performance improvements that yield significant results.
00:17:41.720 This concludes the main content of my talk. As a bonus, I want to mention quick estimates leveraging PostgreSQL statistics. This isn’t tightly related to performance tuning but can help achieve good estimates without executing costly queries.
00:18:02.990 You can use stats to understand null fractions, distinct values, and most common values. These insights have practical uses, such as approximating query results and summarizing data distributions without seeking exact numbers.
00:18:25.370 For example, counting actual rows in a billion-row table can be inefficient, so using PostgreSQL's statistics to gauge a rough estimate can save time and resources.
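A sketch of that estimate, reading the planner's row count from pg_class (the table name is hypothetical, and the figure is only as fresh as the last ANALYZE):

```sql
-- Approximate row count without scanning the table
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'events';
```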
00:18:44.090 Ultimately, PostgreSQL exposes a variety of statistics that help identify performance-related issues, while the specific pg_stat tables guide you toward areas for improvement. Focus on top queries, understand their I/O characteristics and the caching behavior, and use the statistical insights to enhance your overall database performance.
00:19:09.040 Finally, I’d like to share that Citus Data has been acquired by Microsoft. We are working on launching Citus on Azure, so if anyone is interested in keeping updated on availability, just visit the link to express interest.