00:00:17.330
Hi everyone, today we're going to talk a little bit about Postgres.
00:00:23.220
Quick show of hands: who already uses Postgres? Awesome! And who does not? Alright, if you're too shy, we can chat afterwards. I'm happy to convince you.
00:00:37.260
I take the exact opposite of the philosophy that says the database is a dumb hashing tool. We will get into why I think it’s important to know a little about your database to work better with it and to handle it at pretty much any scale without having to go back to school and take a bunch of database courses.
00:01:00.250
This is really a 101 session with a lot of links and references to help you treat it a little better. First, a little bit about me: I was one of the first product managers at Roku and spent about five and a half years there, primarily building and running Heroku Postgres.
00:01:13.470
We ran around 1.5 million Postgres databases across a team of about eight of us, which I think is a pretty good ratio of databases to engineers for such a large site.
00:01:22.110
I will talk a little bit more about the site later, but the short story is that we turned Postgres into a distributed database, and it is transparent to your application.
00:01:33.079
We were at Clara right about a year ago, and now I'm at Microsoft. If you like Postgres and want to learn a bit more about it, I create a newsletter called Postgres Weekly, which is highly curated.
00:01:46.500
It sounds like Ruby Weekly and others, but it doesn't include a hundred items; it’s more about the top 10, 15, or 20 items that are really targeted toward application developers, not just the DBA crowd.
00:02:12.210
Based on the show of hands in this room, I think everyone's hand went up. Postgres is more popular than ever; it’s kind of the largest open-source database and is growing in popularity.
00:02:28.620
If you look at sites like Hacker News and see who’s hiring, I don't know if you count that as a barometer of success, but Postgres ranks number one among databases.
00:02:43.470
It's really well-loved and well-liked as a relational database. First, this is a really fun excerpt from the Postgres mailing list from 2006 about Tom Lane.
00:03:02.370
Anyone know who he is? For those who don’t, Tom Lane helped create TIF, was on the board for creating the spec for JPEG, co-authored the JPEG spec, and then co-authored the PNG spec.
00:03:27.840
He got tired of working with images and decided to work with databases instead. He has been working with Postgres for about the last 15 to 20 years and is the primary contributor to much of Postgres.
00:03:39.720
He is among the top five or so open-source contributors. This was him, 14 years ago on the mailing list, coming back from a conference, stating that the biggest mistake Postgres ever made was its name.
00:04:02.610
He argued that it is not Postgres SQL; it’s okay to just call it Postgres. It’s painful to this day, and I’m actually thinking of starting a campaign with all the core developers to call it Postgres to annoy them until they change it.
00:04:34.140
If you're new to Postgres or someone asks you why you like it, I think many of us engineers often jump onto a bandwagon and say, \'Oh, it's cool, of course I’m using it.\' But when asked why, we may not provide a good answer.
00:04:48.820
It’s a really feature-rich database. For instance, data types are something to consider. I’m genuinely excited about this. You wouldn’t think that it stands out in terms of a database, but it’s quite useful.
00:05:10.660
Postgres supports powerful transactional DDL. If you’re running a migration and it fails, you can actually roll it back and not end up in a mid-migration mess.
00:05:32.570
Foreign data wrappers allow you to connect from inside Postgres to another database and query directly. So, if you have stuff in Redis and want to interact with it in Postgres, you can query directly from those other databases.
00:05:54.020
The Listen/Notify feature provides pub/sub directly inside your database. Some of these features will not be covered in detail in this talk, but if you’re curious about them, I’m happy to chat afterwards or you can look them up.
00:06:06.870
Now, let’s delve into some background on Postgres. It's unique as an open-source database. Within Rails, there is no single owner; it’s based on a very liberal license which is rare for databases.
00:06:29.370
You can take it, modify it, and resell it. This happens frequently; for example, Greenplum, Redshift, Aster Data, and others were all originally based on Postgres. They took it, modified it, and now don’t look much like Postgres anymore.
00:06:46.670
Postgres has gained popularity in academia because you can create PhD projects on it, ship them, change them, and do whatever you want with it.
00:07:00.670
The community is interesting; there’s a core team that acts like a steering committee with a committer base of about 40 people who are committed to Postgres. This number has actually doubled over the last three years.
00:07:20.520
There are major contributors and minor contributors, so you can contribute significantly to Postgres without necessarily becoming a committer.
00:07:28.520
Discussions and patches get handled through mailing lists. There’s no central Git repository or pull request model; it has functioned like this for the past 20 years.
00:07:41.520
If you want to learn more about Postgres, the PG User mailing list is a good place to start. If you want to deep dive into Postgres, the core hackers list is where they discuss deep kernel issues.
00:07:51.230
It is fascinating, though I don’t understand half of what they discuss there. If you want to level up, it's a great place to go and subscribe.
00:08:01.250
Postgres regularly releases updates, usually every fall. Previously, major releases had decimal point versions—9.2, 9.4, etc. Now, major versions are numbered without decimals, such as 10, 11, 12.
00:08:13.750
Minor versions are released with security fixes or patches, typically about once a month. They used to have marquee features, like JSON support, but it’s evolved.
00:08:34.750
In version 9.4, JSONB (Binary JSON) was introduced, allowing better storage and indexing, which is extremely powerful.
00:08:50.260
Let’s discuss more features and why Postgres stands out. One of its strengths is its liberal approach to new data types. There are around 180 types supported.
00:09:03.350
For instance, if your application is using arrays, you can compactly store data in arrays in the database instead of joining them to another table.
00:09:17.480
Data types such as JSON and JSONB are highly useful. JSON is better for preserving whitespace in API data logs, while JSONB is better for querying.
00:09:38.180
Working with time zones is made easier with Postgres, as it can automatically shift time zones.
00:09:48.890
Timestamp is an incredibly useful data type in Postgres. UUIDs are handy if you’re using them in your app and can be directly stored in the database.
00:10:04.300
Range types are beneficial if you’re developing applications that need to manage range constraints, like calendar applications where overlapping classes shouldn't occur.
00:10:18.940
SQL support in Postgres is feature-rich, including common table expressions (CTEs) known as \'with\' clauses. Who enjoys writing SQL?
00:10:35.380
Hands? More hands for reading SQL. SQL is a powerful language, but with CTEs, your queries can be more readable and maintainable.
00:10:50.300
Window functions are very useful for tasks like ranking based on categories. Postgres has many built-in functions for data manipulation.
00:11:07.430
Regarding indexing, if you're like me and don’t have a CS degree, navigating through different index types can be daunting.
00:11:22.250
In general, B-tree is a commonly used index type, and with each major Postgres release, new index types are introduced.
00:11:37.390
There’s a group in the community that works on about index types, finding uses for specific data structures. For example, GIN (Generalized Inverted Index) is great for JSONB.
00:11:50.180
Full-text search and geospatial indexing are also prominent features of Postgres, allowing complex data interactions like overlapping text or polygons.
00:12:03.440
When dealing with spatial data, I often hear that Postgres is great for working with large datasets of information that cluster together.
00:12:18.920
When creating procedural SQL, there is PL/pgSQL, and the option to use PL/v8, which incorporates the full v8 JavaScript engine directly into Postgres.
00:12:35.430
This means you can write JavaScript within the database, opening up new capabilities and making complex logic easier to implement in your application.
00:12:51.430
Postgres is also becoming a data platform, moving away from being just a strict relational database. It is slow and stable but unique with its extension framework.
00:13:02.670
You can add low-level hooks to change Postgres behavior, create new data types, or even new indexes. One notable extension is PostGIS.
00:13:17.780
PostGIS is widely regarded as the advanced open-source geospatial database, which hooks into Postgres, extending it with new data types and functions.
00:13:31.330
If you are looking for time series capabilities, there are several extensions that support this as well. If you need to migrate from Oracle to Postgres, there’s an extension to help with compatibility.
00:13:49.920
The ability to extend Postgres allows you to make extensive changes without waiting for a new core release. There are about 100 extensions available now.
00:14:05.319
For pro tips, if you use terminal interfaces, PSQL is really good. How many of you have configured your .bashrc?
00:14:18.360
Most hands. How many of you have a PSQL config? Anyone?
00:14:23.419
PSQL is powerful enough to do things like auto-format your text, keeping it wrapped within your screen’s width, and address query timing for performance tracking.
00:14:40.458
If you're ever asked to generate a report and need to retrieve the query from history, PSQL saves the history of every query run against your database.
00:14:55.073
If the query is complex, you can save it and execute it later. This feature is especially handy when re-running past reports.
00:15:10.726
If there are common queries you often write, you can name them so that executing them is as easy as just typing their name.
00:15:24.369
If you’re not a CI person, another data tool to consider is pgAdmin, which provides a GUI interface for Postgres.
00:15:38.370
Hopefully, by now, you’re convinced about using Postgres. If not, I encourage you to check out some articles I’ve shared.
00:15:53.374
This talk has largely been about managing Postgres at various scales. To start, small databases—less than 10 gigabytes—are easy to handle with Postgres.
00:16:11.488
You can leverage its data types, use arrays, timestamp data types, and JSONB effectively. Start putting indexes in place early.
00:16:28.642
Make sure to have a backup strategy—if you haven’t tested your backups in six months, it’s wise to do so.
00:16:43.615
Spend some time setting up your environment with PSQL and other tools. This will alleviate crisis situations when you are triaging unexpected problems.
00:17:01.900
When you reach a medium scale—beyond 200 gigs to a terabyte scale—you’ll want to track performance closely. Key metrics to monitor include your cache hit ratio.
00:17:20.460
You should aim for a cache hit ratio of 99% or higher. If it drops below 99%, you may need to analyze your indexing.
00:17:35.500
Also, track your index hit ratio. When you issue a query, you don’t want Postgres scanning every single row.
00:17:51.740
Leverage indexes for your most popular queries. This leads to better performance when you analyze the number of rows in each table.
00:18:05.520
Sometimes you may need to drop unused indexes. It helps to periodically review the indexes and drop those that haven’t been used recently.
00:18:20.840
Index management is crucial to optimizing your database’s performance and keeping it robust.
00:18:34.280
Connections in Postgres can be a sore spot. Each new connection incurs some time overhead due to TLS negotiation; I estimate this takes about 50 milliseconds.
00:18:47.350
Creating a pool with Active Record can lead to resource wastage if many idle connections accumulate, especially with high connection volumes.
00:19:02.070
PG Bouncer is a great solution you can implement to optimize this situation. It helps manage connections more effectively by closing idle connections.
00:19:20.370
If you are running a sizable workload, I would highly recommend using PG Bouncer.
00:19:31.500
Now, as we delve deeper into the medium scale, focus on practical database needs—adding indexes, monitoring your cache ratios, and adjusting memory as needed.
00:19:46.330
If you want to optimize specific queries, use tools like PG Stat Statements to review every query that has been run against your database.
00:19:58.090
PG Stat Statements shows how many times each query was executed and the average and total time taken, allowing you to identify slower queries.
00:20:13.523
If you’re above 50 to 100 gigs, consider shifting your backup strategy to encompass physical backups.
00:20:30.810
Most people do a PG dump once a day; this is effective for logical backups.
00:20:43.540
However, as you grow, you may want to adopt a physical backup strategy, such as using PG Backrest for efficient growth.
00:21:00.060
As your database grows larger, think about sharding strategies to manage data effectively. Sharding allows splitting a large database into manageable bits.
00:21:17.230
Citus is an extension that can help you turn multiple Postgres databases into a distributed setup.
00:21:38.030
Ensure to hash your data upfront so you correctly distribute workloads across your shards based on customer IDs or other hashable identifiers.
00:21:54.870
This approach will help keep data organized and ensure that your queries remain efficient as your application scales.
00:22:10.770
Managing relational databases can be challenging due to multiple relationships and models, but there's a way to simplify the architecture.
00:22:28.990
Grouping related data is essential; consider colocating your tables to help ensure transactional consistency without excessive overhead.
00:22:43.610
You wouldn’t want to rely on a reference table with vast amounts of records, as it complicates the writing process across multiple nodes.
00:23:01.050
In terms of collocation, keep your tables tightly linked through keys, which can yield better performance and less friction when querying the data.
00:23:16.510
Monitor for N+1 queries that can deplete your cache hit ratio, leading to performance issues.
00:23:32.900
Look for opportunities to offload workloads to read replicas for reporting and analytics that can tolerate slight lags.
00:23:49.320
Ultimately, if sharding is necessary, look into modern practices as the landscape of distributed databases has evolved.
00:24:01.910
Lessons can be learned from successful implementations, like Instagram, to manage scaling issues effectively.
00:24:16.050
In summary, for small- to medium-sized databases, keep a close eye on your cache hit ratios, use proper indexing, and test your backups regularly.
00:24:34.640
Monitoring connection limits with PG Bouncer can alleviate many performance issues.
00:24:48.160
For larger scales, refine your backup strategy, select the right sharding model proactively, and ensure column colocations are effectively laid out.
00:25:04.060
This proactive management approach can ensure that your applications remain robust and responsive as they scale.
00:25:16.370
Thank you for your time!