00:00:03.439
Hello everyone! Yes, my name is KJ, and as she introduced me, I’m an engineer at Zendesk.
00:00:09.360
I’m a committer on the Ruby core team, and today I'm talking about our fun adventures scaling our Redis writes.
00:00:16.199
I didn’t have time to rename the talk for "Valkey" or the other fork, "Redict", so it’s just "Redis" here.
00:00:24.720
Please feel free to interrupt me if something I say doesn’t make sense, because I'm good at not making sense.
00:00:30.160
Also, I should talk a bit about what we do at Zendesk for those who don’t know. We are a customer service software platform.
00:00:37.719
We deal with everything from email-based ticketing workflows to advanced AI chat features, and all that good stuff.
00:00:51.520
So what is our story about today? Here’s a photo of our office in San Francisco, which is relevant to this talk.
00:01:05.880
I work on a team responsible for the overall health, reliability, and well-being of our big million-line monolith application.
00:01:11.119
We’re a distributed team, with some members in Australia and many others spread far and wide throughout the United States.
00:01:24.079
We usually do all the remote work things, but we had the opportunity last year to come together at our office in San Francisco.
00:01:30.040
This team offsite was great; we went bowling, had some food, and accomplished a lot of productive work, like setting our vision for the future and engaging in pair programming.
00:01:43.240
However, there was one thing we got to do together in San Francisco that I was not expecting: in-person incident response.
00:01:56.799
It felt like the good old days where everyone is in a war room trying to fix an ongoing incident.
00:02:02.920
The incident we gathered around that table to fix was related to one of our many large Redis instances, affectionately known as "MK".
00:02:08.080
MK probably gives you an idea of what goes on there. I need to thank my colleagues for suggesting I needed more memes in my presentation.
00:02:15.040
The problem with MK was that something had gone wrong — it was using too much CPU.
00:02:25.519
The alarm went off, we got paged, and upon investigation, it looked like the issue was simply increased usage.
00:02:33.040
This Redis instance was actually an ElastiCache cluster-mode-enabled Redis, but it only had one shard.
00:02:38.319
So with just one node, it wasn't really functioning as a proper cluster. We considered our options and figured it was time to write a bigger check to Mr. Bezos.
00:02:51.879
We clicked the button there and said we wanted more shards, but this didn’t fix everything; in fact, it made everything much worse.
00:03:03.920
The original problem wasn’t that severe, but when we scaled out the Redis cluster, we started to see these weird and wonderful errors.
00:03:09.959
These were errors we had never seen before, such as the cross-slot pipelining error and the "MOVED 3016" redirection shown in this picture. They were very serious, causing real problems, and we had no idea why they were happening.
00:03:29.519
So, we undid the scaling and reverted to a one-node cluster.
00:03:35.959
After this, we were left questioning why things had gone so wrong.
00:03:41.239
We had to spend some time learning how Redis works and understanding what cluster mode entails and why it didn’t work well for us.
00:03:47.680
I’m going to give you a speed run of the different ways to deploy Redis that we learned about during this fun learning exercise.
00:03:56.239
The simplest version is a single Redis instance.
00:04:02.480
You just have one server with Redis on it. Choose your hosting provider of choice, install Redis, and there you have it.
00:04:14.280
This is simple and cheap; if it's sufficient, you can save yourself thirty minutes.
00:04:19.320
However, sometimes we need more robustness than this setup can provide; we might need high availability.
00:04:32.280
If our single Redis server goes down, we need to ensure we still have access to Redis.
00:04:38.800
Sometimes, there could be a risk of data loss here, and we might need replication as well.
00:04:44.360
This would mean creating a setup where if the Redis server goes down, we don't lose the data it was holding.
00:04:49.960
Most relevant to this story, there's a limit to how much Redis throughput we can get out of one server; if we need more, we'll need a different approach to scale out.
00:05:06.960
This brings us to Redis Sentinel, an early approach for high availability.
00:05:12.919
Redis Sentinel is distributed as part of Redis but operates as a separate process. This means you have some servers with Redis running, and you can install Sentinel alongside them.
00:05:24.080
The Sentinels will collaborate, have an election, and decide who will be the master and who will be the replicas.
00:05:30.840
Sentinel then configures Redis to establish which node is the master and which nodes replicate from it.
00:05:36.080
This setup requires application support as the application must know which of the Redis nodes is the master node.
00:05:41.680
The application must ask Sentinel, which will provide the current master node's address, allowing the application to connect appropriately.
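The flow above is supported directly by the redis gem. As a rough sketch (the hostnames and the "mymaster" group name here are placeholders, and the exact options may vary by gem version):

```ruby
require "redis"

SENTINELS = [
  { host: "sentinel-1.internal", port: 26_379 },
  { host: "sentinel-2.internal", port: 26_379 },
  { host: "sentinel-3.internal", port: 26_379 },
]

# Ask the Sentinels which node is currently the master of the
# "mymaster" group and connect to it; the client re-resolves
# the master address after a failover.
redis = Redis.new(name: "mymaster", sentinels: SENTINELS, role: :master)
```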
00:05:55.000
This solution eliminates a single point of failure; if one of the Redis nodes fails, Sentinel promotes one of the replicas to master.
00:06:01.919
The application learns about this promotion because it understands Sentinel.
00:06:07.039
Additionally, this approach can be used to scale read operations; if you have a read-heavy workload and can tolerate slightly stale data, your application can opt to read from these replica nodes.
00:06:14.759
Unfortunately, this doesn't help with write operations; under Sentinel, every write must still go through the single master instance.
00:06:22.960
This leaves us with a problem given our write-heavy workloads.
00:06:28.000
Fortunately, there's a solution in the Redis gem called Redis::Distributed.
00:06:36.320
Redis Distributed operates on the notion that we can take multiple Redis servers and split the keyspace between them.
00:06:42.479
This is accomplished by using a hashing algorithm to determine which keys go where.
00:06:50.160
The downside, however, is that this approach can lead to a loss of high availability.
00:06:56.760
If one of those Redis servers goes down, we would lose half of our keys, which is not ideal.
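The idea can be illustrated with a deliberately simplified toy router; this is not Redis::Distributed's actual algorithm (the gem uses a consistent-hash ring, which limits key movement when nodes change), just the general shape of splitting a keyspace by hash:

```ruby
require "zlib"

# Toy key router: hash each key and pick a node by modulo.
NODES = ["redis-a:6379", "redis-b:6379"].freeze

def node_for(key, nodes = NODES)
  nodes[Zlib.crc32(key) % nodes.size]
end

# Every key lands deterministically on exactly one node, so if that
# node goes down, the keys it held are simply unavailable:
lost = ("a".."z").count { |k| node_for(k) == NODES[0] }
puts "#{lost}/26 sample keys would be lost with redis-a down"
```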
00:07:04.479
To address the need for both distribution and high availability, we can combine the idea of Redis Sentinel with Redis Distributed.
00:07:13.920
This gives us independent Redis Sentinel clusters, allowing us to maintain multiple shards.
00:07:22.160
With these powers combined, we can have horizontal write scalability; we increase throughput with additional nodes.
00:07:28.479
However, we have to keep in mind how to add new nodes effectively.
00:07:34.080
Redis Distributed has challenges when it comes to scaling; adding a new node requires you to determine how to rebalance the keyspace.
00:07:40.599
The system doesn’t automatically accommodate the redistribution of keys.
00:07:46.320
So while Redis Distributed allows you to start at a certain scale, it doesn’t allow easy changes to that scale.
00:07:52.360
This limitation highlights the need for Redis Cluster, which is designed to solve the scalability issue.
00:08:01.599
In Redis Cluster mode, we divide the keyspace into 16,384 slots determined by a hashing function, specifically a CRC16 hash.
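The slot computation itself is small enough to sketch in pure Ruby, following the Redis Cluster specification: a CRC16 (the XMODEM variant) of the key, modulo 16,384, after applying the hashtag rule:

```ruby
# CRC16-CCITT (XMODEM), the variant named in the Redis Cluster spec.
def crc16(str)
  crc = 0
  str.each_byte do |b|
    crc ^= b << 8
    8.times do
      crc = ((crc & 0x8000).zero? ? crc << 1 : (crc << 1) ^ 0x1021) & 0xFFFF
    end
  end
  crc
end

# HASH_SLOT: if the key contains a non-empty {...} section, only the
# part inside the braces is hashed -- the hashtag rule covered later.
def hash_slot(key)
  s = key.index("{")
  if s && (e = key.index("}", s + 1)) && e > s + 1
    key = key[(s + 1)...e]
  end
  crc16(key) % 16_384
end

hash_slot("{user1000}.following") == hash_slot("{user1000}.followers") # => true
```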
00:08:09.000
The Redis nodes come together and elect a master for each slot, creating a proper cluster.
00:08:15.600
In this arrangement, every slot is served by only one primary node, complemented by one or more replicas.
00:08:21.280
Your application must know which keys are sent to which servers to ensure effective distribution.
00:08:27.679
To support this architecture, we need client library support. The Redis gem provides the capability to understand how to route keys correctly.
00:08:36.240
The gem can ask the cluster for its topology (for example, via the CLUSTER SLOTS command), returning a list of nodes and the slots each is responsible for.
00:08:42.240
This allows the application to send keys to the correct nodes according to the slot they hash to.
00:08:48.720
Moreover, Redis Cluster allows changes to the topology; if we add a node, the system can redistribute the slots.
00:08:57.520
This process involves copying the current values of the keys and using a consensus mechanism to declare new ownership of slots.
00:09:05.879
The application will also be informed about these changes through redirection commands.
00:09:11.760
If a request for a key goes to the wrong node, the node replies with a MOVED redirection naming the node that is responsible, so the client can retry there.
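This redirect-and-retry loop can be simulated entirely in memory; this toy (all names here are made up) shows the mechanism a real cluster client implements, with a stale slot map corrected by a MOVED reply:

```ruby
# Each "node" owns a range of slots and answers :moved for the rest,
# like a real cluster node returning a MOVED redirection.
Node = Struct.new(:name, :slots) do
  def get(slot, data)
    return [:moved, slot] unless slots.include?(slot)
    [:ok, data[slot]]
  end
end

class ToyClusterClient
  def initialize(nodes)
    @nodes = nodes
    # Stale on purpose: route everything to the first node at boot.
    @slot_map = Hash.new(nodes.first)
  end

  def get(slot, data)
    status, payload = @slot_map[slot].get(slot, data)
    if status == :moved
      # A real client parses "MOVED <slot> <host:port>", updates its
      # slot map, and retries against the named node.
      owner = @nodes.find { |n| n.slots.include?(slot) }
      @slot_map[slot] = owner
      _status, payload = owner.get(slot, data)
    end
    payload
  end
end

a = Node.new("a", 0...8192)
b = Node.new("b", 8192...16_384)
client = ToyClusterClient.new([a, b])
client.get(9000, { 9000 => "hello" }) # first try hits a, MOVED, retried on b
```

A real client also has to handle ASK redirections while a slot is mid-migration; this sketch skips that.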
00:09:18.960
This setup leads to horizontal scalability, as we can scale out writes and reads easily as we add more nodes.
00:09:30.000
However, Redis Cluster requires complex client libraries to handle tricky situations.
00:09:37.080
These situations may include what happens if keys are in the process of being moved or if there's a network partition.
00:09:43.840
These issues simply do not arise with a single Redis server; the extra complexity is the price of clustering.
00:09:53.600
At this point, we can understand the taxonomy of Redis and where Redis Cluster fits in.
00:10:00.279
Now, let's return to our incident and analyze what went wrong.
00:10:06.880
Our Redis Cluster was configured with just one shard, which means all slots were served by a single master.
00:10:12.560
In practical terms, this setup didn’t really function as a true cluster.
00:10:18.400
When we tried to scale out by adding more nodes, we consequently faced numerous issues.
00:10:24.400
Now let's examine the cross-slot pipelining error that we encountered.
00:10:30.480
Redis commands can operate on multiple keys at once. For instance, using Sidekiq Pro involves extensive multi-key operations.
00:10:42.920
Nevertheless, in Redis Cluster, a multi-key operation only works if all keys belong to the same slot.
00:10:55.080
For example, if you attempt to set values for two keys that fall into different slots, you'll receive a CROSSSLOT error.
00:11:01.919
To work around this, you can use hashtags within your key names.
00:11:09.760
This means that if part of your key name is wrapped in curly braces, Redis hashes only the part inside the braces.
00:11:16.120
This allows keys with matching hashtags to belong to the same slot, making them eligible for multi-key operations.
00:11:23.680
Pipelining is another efficiency approach in Redis.
00:11:31.239
Typically, you would request multiple keys' values one by one, but you can also send all the requests at once and collect responses later.
00:11:42.240
This minimizes the request-response roundtrip time and can result in a performance boost.
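With the redis gem this looks roughly like the following sketch (assuming an existing redis connection; the key names are made up):

```ruby
# One round trip per command: 100 commands cost 100 network round trips.
100.times { |i| redis.get("stat:#{i}") }

# Pipelined: all 100 commands are flushed together and the replies are
# collected at the end, costing roughly one round trip in total.
values = redis.pipelined do |pipeline|
  100.times { |i| pipeline.get("stat:#{i}") }
end
```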
00:11:48.760
However, the cross-slot pipelining error can arise in situations where multiple keys are involved across different slots.
00:11:56.800
It should work seamlessly in theory, but in practice, we discovered it was not functioning correctly.
00:12:02.760
The good news is that the Redis gem v5 has improved cluster support under a new maintainer, and we learned we needed to upgrade to that version.
00:12:19.440
Thus, after implementing the upgrade in our testing environment, we stopped encountering those cross-slot pipelining errors.
00:12:27.080
However, new errors emerged that were related to transactions.
00:12:32.680
Let’s now discuss transactions in Redis and what issues we faced.
00:12:40.000
Transactions in Redis can be conceptualized as conditional execution.
00:12:48.760
When you execute a transaction, it will only apply changes if the specified keys haven't been modified since the start of the transaction.
00:12:55.440
In scenarios with competing modifications, the transaction will fail, preventing problems like negative ticket sales.
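The semantics of WATCH/MULTI/EXEC can be simulated with a tiny in-memory store; this toy (not Redis's implementation, just the contract) applies queued writes only if no watched key changed in the meantime:

```ruby
# Minimal simulation of optimistic locking a la WATCH/MULTI/EXEC.
class ToyStore
  def initialize
    @data = {}
    @version = Hash.new(0) # bumped on every write, like a dirty flag
  end

  def get(key)
    @data[key]
  end

  def set(key, value)
    @data[key] = value
    @version[key] += 1
  end

  # WATCH the given keys, let the block queue writes, then EXEC:
  # apply them only if no watched key was written in the meantime.
  def transaction(*watched)
    before = watched.map { |k| @version[k] }
    queued = []
    yield ->(key, value) { queued << [key, value] }
    return nil if watched.map { |k| @version[k] } != before # aborted
    queued.each { |k, v| set(k, v) }
    queued
  end
end

store = ToyStore.new
store.set("tickets", 1)

# Sell a ticket only if nobody else touched the stock under us:
store.transaction("tickets") do |queue|
  left = store.get("tickets")
  queue.call("tickets", left - 1) if left > 0
end
store.get("tickets") # => 0
```

If a competing writer bumps "tickets" between the WATCH and the EXEC, the transaction returns nil and no queued write is applied, which is how the negative-ticket-sales problem is avoided.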
00:13:05.680
However, one limitation is that in a Redis cluster, transactions work only when all keys belong to the same slot.
00:13:14.880
While testing our new branch, we encountered ambiguous node errors, which we had experienced previously.
00:13:24.880
In fact, transaction support in the Redis Cluster gem was incomplete.
00:13:31.679
It supported MULTI/EXEC but lacked the WATCH functionality necessary for conditional execution.
00:13:40.239
We were faced with a choice; we could hack around this problem, but we aimed to contribute a proper solution upstream.
00:13:48.720
We saw it as an opportunity not only to pay back to open source but also to ensure our solutions aligned well with the community.
00:14:03.359
The discussions also revolved around creating a stable foundation for future development, avoiding unnecessary workarounds.
00:14:11.159
Our intention was to ensure that we could contribute meaningful improvements to the Redis ecosystem.
00:14:24.639
By pinning down how transactions should execute against a cluster, we sought to build a shared understanding with the maintainer.
00:14:31.040
While we may not have gotten every solution correct initially, each attempt presented vital learning opportunities.
00:14:39.920
I eventually engaged the maintainer in a more productive discussion to clarify our design objectives.
00:14:47.760
We learned the importance of breaking down changes into smaller, manageable pull requests.
00:14:53.200
This ensures that both we and the maintainer can effectively analyze changes.
00:14:58.479
At the end of the day, collaboration and understanding remain essential when working with open source libraries.
00:15:06.399
The lesson learned is that engaging with existing projects and their maintainers is rewarding for all parties.
00:15:12.959
Regardless of your familiarity with the code, testing, reporting issues, or suggesting solutions carries immense value.
00:15:19.999
Through this process, I found that successful open-source engagement produces benefits for both the contributor and the project.
00:15:29.560
At Zendesk, as a company, we understood our responsibility in this ecosystem.
00:15:35.679
We aimed to ensure the solutions we utilized were maintainable, rather than relying on specific hacks.
00:15:46.240
The investment put into enhancing our libraries promotes long-term stability.
00:15:51.680
Contributing to open-source gems is also valuable for personal knowledge growth.
00:15:58.800
With the constant evolution of technology, nurturing expertise empowers engineers to solve problems.
00:16:05.760
Eventually, our collective contributions led to significant improvements in the Redis community.
00:16:12.079
This culminated in finishing our first major upstream contribution.
00:16:18.840
As part of our efforts toward upgrading to the Redis gem v5, we contributed improved transaction support.
00:16:26.159
In our testing environment, we saw efficiency gains, alongside smoother operations.
00:16:32.599
These collective efforts have paid off and shown us where open-source contributions impact our operational success.
00:16:39.440
While we aimed to achieve stability and scaling with Redis, we also experienced positive performance optimizations.
00:16:47.120
In fact, we observed our performance metrics improved — about 3 to 4 milliseconds gained in response times.
00:16:54.040
Although that was not our primary goal, it served as a welcome surprise.
00:17:01.160
However, we couldn’t forget there were still processes in progress for getting our patches fully merged.
00:17:08.120
Initially, I was disappointed when I realized that we had to ship a fork to production.
00:17:14.360
Working upstream can indeed be slower than anticipated.
00:17:22.000
This experience taught me that maintainers often have limited capacity.
00:17:29.520
We may not always receive quick turnaround on our contributions.
00:17:37.000
However, contributing to a project in this manner remains a worthy endeavor.
00:17:45.079
We must consider that maintaining open-source projects can be demanding work.
00:17:52.319
We often miss the behind-the-scenes efforts that contributors invest, balancing their own responsibilities.
00:18:00.720
For this reason, we need to exhibit patience while supporting these projects.
00:18:09.159
As contributors, we must therefore be strategic about the implementation of upstream dependencies.
00:18:17.760
The lesson learned is to engage with upstream contributors, testing and reporting issues.
00:18:26.760
Our experiences not only provide insights for us but also improve the projects we use collectively.
00:18:34.240
Let's summarize some key takeaways about Redis and contributions to open source.
00:18:44.920
On the Redis front, if you're dealing with a write-heavy workload, you might want to consider using Redis Cluster.
00:18:52.680
Start with a single node cluster and scale it out only when absolutely necessary, saving you costs.
00:19:00.760
If you pursue this path, the Redis Cluster client gem provides support for operations like pipelining and transactions.
00:19:07.419
And if you encounter any issues, please report them so we can work together to resolve them.
00:19:16.960
More generally, always keep your gems up to date; you don’t know what advantages you might be losing.
00:19:24.080
Additionally, engage with upstream maintainers, as this can lead to fruitful collaborations.
00:19:30.800
Remember, everyone has something to contribute to open-source projects.
00:19:40.240
Even simple feedback, like sharing your usage experiences, is invaluable.
00:19:46.240
Thank you for your attention, and feel free to find me later to discuss further.
00:19:51.440
You can also follow me on social media, and I hope to connect with you soon!