Five Sharding Data Models and Which is Right

00:00:12.210 It's time. I think we'll go ahead and get rolling. Hi everyone, I'm Craig. I work at Citus Data, and if you're not familiar, we turn Postgres into a sharded and distributed database.

00:00:23.460 I won't talk too much about Citus in this talk, but we have a booth in the expo hall. So if you're curious, I'm happy to answer questions about Citus afterwards.

00:00:30.210 Generally, I'm going to talk about sharding and data modeling. There are a lot of mixed opinions on sharding, but it's pretty well proven that it scales. How you model your data upfront is probably the single biggest factor impacting your success or failure with it.

00:00:42.450 First, a little bit of background on who I am. I curate Postgres Weekly, so if you're a Postgres fan, is anyone here not using Postgres or not a fan? A couple of hands? Okay, I'll convince you afterwards.

00:00:56.370 But I curate Postgres Weekly, which is a newsletter similar to Ruby Weekly and other weekly newsletters, targeted more towards active developers than DBAs.

00:01:09.659 I run our database as a service at Citus, and previously, I was at Heroku, running Heroku Postgres for about five years, so I'm happy to answer questions on that as well.

00:01:15.299 So what is sharding? There's been quite a bit of talk this week, I think. Like DHH was kind of proud of how they didn't shard, yet they split things into smaller pieces by using different databases.

00:01:26.850 Rails 6 is getting some support for additional databases, but it's not necessarily sharding; it's just multiple databases that Rails is aware of, and you get to figure out what to do with them.

00:01:39.240 Sharding, by definition according to Wikipedia, is the practice of breaking your database into smaller parts, usually for performance. This could be for write performance or read performance, your pick. But it’s usually about performance. The nice thing about sharding is that breaking your database up from one huge database into a bunch of smaller parts makes everything smaller and faster.

00:01:56.610 With sharding, things like vacuuming and indexing are all faster, so everything is small and easy to work with again.

00:02:07.560 It's very well proven at scale; Google, Instagram, Salesforce, take your pick. Any large company of scale has gone down this path. You don't necessarily have to be at their scale to find the benefits.

00:02:20.910 I've seen sharding start to make sense at as early as 10 gigs of data, though more commonly, it's at 100 gigs or a terabyte.

00:02:32.880 First, what is a shard? I think this is the most common misconception when talking to people about this.

00:02:38.910 We'll get into data modeling in a little bit, but first, what is actually a shard and how should you approach it? There’s like a physical VM, an instance on Amazon, or a Postgres cluster.

00:02:50.760 A lot of times people think VMs are these shards. They’re actually not. The idea of a shard is that you're going to have multiple shards existing within one VM.

00:03:03.410 There's a separation between nodes and running multiple Postgres instances on it. So if you've got a Postgres database with multiple tables or schemas, all those are technically shards.

00:03:10.230 This is pretty important early on, so that you don't tie yourself to the exact hardware. The idea with sharding is that you want to create a bunch of smaller chunks that you can easily move around.

00:03:20.880 You can move them between different nodes, scale the hardware up, scale it down, add nodes, and move the hardware and tables across. This makes your life a lot easier.

00:03:32.730 Starting with defining the right shard count upfront is pretty key. To visualize it, if I've got two different physical instances and I want to create 32 shards, underneath, I might create a table like events.

00:03:40.860 I'm going to call them events_1, events_2, events_3, up to 32. I'm going to create a bunch of shards on one server first, and if I'm using tables, create odd numbered events on one node.

00:03:50.760 On the other node, I'll place the even-numbered events. It’s really simple, but what's nice is that when I transition from 2 nodes to 4 nodes, I can just take half of those shards and move them.

00:04:10.230 You can do this using logical replication, create a read replica, copy them over, and then update the metadata. Overall, it’s more manageable than if you’ve got sharding per node.

00:04:28.169 Splitting that up inside the same table is a lot harder, so upfront, you're going to want to create a larger number of shards.

00:04:41.400 Onto the five data models we're going to walk through.

00:04:47.370 Here's the things to think about, tips, ways to rationalize through it, and ways to rule it out. The number one thing I try to do is rule out a data model by saying, 'This won't work,' and find reasons why instead of forcing it in.

00:05:05.849 When you force it in, you've likely had a bad sharding experience. If you've ever tried sharding and it just failed miserably, you probably tried to force a square block into a round peg.

00:05:13.080 So a couple of bits on the approaches... Most people, when they think of sharding, think of two different approaches. One of them is much more common at scale, while the other is one that gets talked about more.

00:05:29.510 The first approach is hash sharding. As soon as something comes in, you hash that ID. In this case, if we have 32 different shards, we split the resulting hash value into 32 different buckets, keeping a metadata table that says this resulting value goes into this shard.

00:05:41.449 Any questions on that? It’s a bit more than that, but at a basic level, you're going to have space for some ID.

00:06:01.589 If you're using Postgres, there's an internal hashing function available for all data types. You can use it for text and integers or even roll your own.

00:06:17.040 The hashing function you choose will not determine the success or failure. What you want to do is define your shards upfront, going larger than you expect.

00:06:32.660 If you think, 'At our peak, when we're at Google scale, we’ll run a thousand servers,' that's probably a good number. A thousand shards or two thousand sounds reasonable.

00:06:48.770 You also probably don't want to stick with two. You'll outgrow two nodes incredibly quickly. Some median-based number, based on how large you think you're going to grow, is a good start.

00:07:06.599 You can split up shards later, but that's a lot more work. Once you go through the sharding process, you want to set it up for two, three, or even five years to avoid having to redo a bunch of work.

00:07:23.789 Again, two is bad, and two million is also bad. With tables too small, they're empty, and running a migration on them now costs you in production.

00:07:40.539 I prefer higher numbers like 128 or 256. You don't have to grow by factors of two, but it usually works pretty well.

00:07:53.750 A common misconception is that people just route the values. You have user ID 1 and say, 'I'm going to take users 1 through 10 and put them in shard number 1.' Then have shard number 2 with users 2 through 20.

00:08:06.330 This is a huge problem since your oldest customers have the most data, and they end up all being saturated together. By hashing things upfront, you distribute your data more evenly.

00:08:17.460 You might still have a hotspot, but it’s a better way of ensuring natural data distribution upfront.

00:08:31.500 Take user ID 1, compute its hash in Postgres to get 46,000. User ID 2 computes to 27,000. I will have a table split for the range of hash values, so between 26,000 and 28,000, I put those in shard number thirteen.

00:08:44.100 I maintain a lookup table in my application or in my database metadata. If you're using Citus, you want to think about this; it's built-in for you.

00:09:00.780 It’s the same idea: all the resulting hash values will live in this shard, evenly distributing data. As things grow, new ones redistribute, so it works out nicely.

00:09:12.930 Now onto range partitioning. I've seen a lot more people implement this than true hash partitioning. It's common in time series.

00:09:29.820 When using range partitioning, ensure you create a new bucket. For instance, if you’re running an ad network and counting impressions by week,

00:09:43.680 you create a new table for every week of data that comes in. Make sure you keep creating those new ranges, and you'll be set.

00:09:55.080 Range partitioning becomes straightforward; you just create new tables for each time period.

00:10:06.330 Questions? There’s quite a bit more to cover; those basic approaches apply pretty broadly to sharding.

00:10:18.840 The more interesting part is the data model. If you don't get the data model right, you're going to have a lot of problems joining data across different nodes.

00:10:27.760 Moving data across the wire is still slow; we haven't fixed that in databases. So how can you model your database on your application?

00:10:41.610 First, let’s discuss geography. The big question here is, is there a really clear geographical boundary? Usually, 99% of the time, you don't want to do this.

00:10:54.509 There's always an exception, like when I was getting picked up at SFO Airport one time. It was the day before Thanksgiving; the passenger said, 'I can't go see those people.' So he drove from San Francisco to LA.

00:11:06.680 Normally, you think of Ubers having confined regional boundaries, so most drivers won’t cross large distances. This usually works well, like with services that have defined geographical boundaries.

00:11:21.340 However, bad examples arise when one side has geography while the other side does not. For example, text messages can go from one area code to another.

00:11:34.010 I grew up in an age where I kept my cell phone number, so I might have an Alabama number in California. I don’t pay attention to area codes anymore, which can span geographic boundaries.

00:11:47.520 The key issue here is that joins across geographic boundaries slow things down, and you should find out how often you join across those boundaries.

00:12:01.260 For something like Instacart, moving users’ grocery stores when they change addresses is generally okay because it's a rare operation.

00:12:14.640 You want to balance granular versus broad approaches. For example, state-level data may not work because states vary in size significantly.

00:12:29.700 Zip codes might be better for data distribution. Common use cases for this include Instacart and Uber.

00:12:39.570 If you're in a market starting in one geography and then expanding, this model could work well for you.

00:12:49.920 However, I see some issues with range sharding. For example, California gets one database, Texas another, which becomes problematic.

00:13:02.940 This leads to challenges with managing 50 different databases; you miss the proper hashing work that should have been done.

00:13:15.750 For instance, you should still hash the IDs. Give states an ID or hash their name to create manageable portions.

00:13:27.750 Multi-tenancy is my favorite model as it's typically the easiest. If you have a SaaS B2B business, each customer's data is distinct; companies like Salesforce make sure customers see only their data.

00:13:41.610 It's crucial to understand your data distribution. If one tenant holds half your data, it's problematic. A 10% share is manageable, but deep evaluations here are key for success.

00:14:05.500 Common use cases include CRM and marketing automation SaaS, where customers' data is distinct.

00:14:24.210 Start early by adding a tenant ID or customer ID to every table. This eases joining tables and sets things up nicely down the line.

00:14:37.740 Ensure your primary and foreign keys are composite. Rails supports this reasonably well, and this approach will show performance gains.

00:14:50.120 By enforcing an ID on all queries—ensuring a WHERE clause checks customer ID—this prevents data leakage. There are gems like Active Record Multi-Tenant that help enforce this.

00:15:03.720 To illustrate, in a Salesforce schema with leads, accounts, and opportunities, you’ll add an org ID to all tables. Ensure your primary key combines both identifiers.

00:15:15.960 Starting this early saves headaches later, preventing complications when backfilling large tables.

00:15:29.620 Also, watch out for anomalies in your data. I've seen situations where duplicates occur across records, leading to confusion on which is correct.

00:15:45.500 A word of caution: if you choose a schema-per-customer approach, take care. It’s great for ten customers, okay for a hundred, but could be brutal at a thousand.

00:16:01.619 You may face long migration times, as operations on thousands of schemas can take hours. Connection limits become a limiting factor too.

00:16:16.830 If you're only planning for ten customers, this approach is valid, but consider future scalability.

00:16:29.240 Now, let’s discuss optimizing your workload. You aim for parallel queries across multiple shards. You should optimize for performance and parallelism, particularly in use cases like ad networks.

00:16:48.139 For an example like web analytics, it’s common to separate by visitor ID or session ID. Most web analytics tools focus on aggregations, which fare well.

00:17:03.929 Some points to consider: SQL capabilities are limited in sharding; window functions are cumbersome as they require aggregation back to a single node.

00:17:20.530 Adopting a Hadoop MapReduce mindset allows for running jobs in parallel within nodes. This has proven to show significant performance gains.

00:17:41.500 Examples include count stars across shards, summing data, and computing averages efficiently. But median calculations may get tricky.

00:18:01.710 Now let’s highlight distinct counts, utilizing algorithms like HyperLogLog. It provides approximate distinct counts efficiently.

00:18:18.630 Ordered lists also have use cases, like finding top K values using tools such as Elasticsearch.

00:18:30.170 Graph databases are suitable for scenarios with numerous connections, such as social networking and fraud detection.

00:18:43.580 In these databases, you'll assess connections between posts and user interactions. The most common approach involves writing data twice.

00:18:55.540 Querying based on associations or object properties depends on your goals. Neo4j is another popular option for managing graph databases.

00:19:10.170 Time series data often revolves around time dimension as well. However, having a timestamp alone isn't enough to designate it as a time series.

00:19:20.110 You need to assess if you’re frequently querying across time ranges or pruning old data. These factors determine if time series is a fit.

00:19:35.470 You want to leverage range partitioning when sharding. Define ranges upfront and keep up with partitions to prevent bottlenecks later.

00:19:48.710 Updating your partitions and ensuring old data is deleted keeps performance optimal. If you relied solely on automation in the past, you’ll need to ensure that it continues.

00:20:06.360 When using native partitioning in PostgreSQL, use extensions like PG Partman to manage this more effectively.

00:20:18.199 Ultimately, sharding has evolved and is more manageable than before. If you identify and implement best practices early on like adding tenant IDs, you save a lot of future work.

00:20:34.170 It’s essential to select the right approaches and be open to ruling out models that won’t work for you.

00:20:45.390 Sharding was challenging in the past, but improved tools and libraries now facilitate this process, making it much easier than it once was.

00:20:58.630 Though it’s never a trivial task, properly implemented sharding scales and mitigates future worries of data handling.

00:21:15.030 Alright, I think that's it, and I might actually have a couple of minutes for questions.

00:21:28.650 You.