00:00:03.439
Hello everyone! Yes, my name is KJ, and as she introduced me, I’m an engineer at Zendesk.
00:00:09.360
I’m a committer on the Ruby core team, and today I'm talking about our fun adventures scaling our Redis writes.
00:00:16.199
I didn’t have time to rename the talk for "Valkey" or the other fork, "Redict", so it’s just "Redis" here.
00:00:24.720
Please feel free to interrupt me if something I say doesn’t make sense, because I'm good at not making sense.
00:00:30.160
Also, I should talk a bit about what we do at Zendesk for those who don’t know. We are a customer service software platform.
00:00:37.719
We deal with everything from email-based ticketing workflows to advanced AI chat features, and all that good stuff.
00:00:51.520
So what is our story about today? Here’s a photo of our office in San Francisco, which is relevant to this talk.
00:01:05.880
I work on a team responsible for the overall health, reliability, and well-being of our big million-line monolith application.
00:01:11.119
We’re a distributed team, with some members in Australia and many others spread far and wide throughout the United States.
00:01:24.079
We usually do all the remote work things, but we had the opportunity last year to come together at our office in San Francisco.
00:01:30.040
This team offsite was great; we went bowling, had some food, and accomplished a lot of productive work, like setting our vision for the future and engaging in pair programming.
00:01:43.240
However, there was one thing we got to do together in San Francisco that I was not expecting: in-person incident response.
00:01:56.799
It felt like the good old days where everyone is in a war room trying to fix an ongoing incident.
00:02:02.920
The incident we gathered around that table to fix was related to one of our many large Redis instances, affectionately known as "MK".
00:02:08.080
MK probably gives you an idea of what goes on there. I need to thank my colleagues for suggesting I needed more memes in my presentation.
00:02:15.040
The problem with MK was that something had gone wrong — it was using too much CPU.
00:02:25.519
The alarm went off, we got paged, and upon investigation, it looked like the issue was simply increased usage.
00:02:33.040
This Redis instance was actually an ElastiCache cluster-mode-enabled Redis, but it only had one shard.
00:02:38.319
So with just one node, it wasn't really functioning as a proper cluster. We considered our options and figured it was time to write a bigger check to Mr. Bezos.
00:02:51.879
We clicked the button there and said we wanted more shards, but this didn’t fix everything; in fact, it made everything much worse.
00:03:03.920
The original problem wasn’t that severe, but when we scaled out the Redis cluster, we started to see these weird and wonderful errors.
00:03:09.959
These were errors we had never seen before, such as the cross-slot pipelining error and the "MOVED 3016" redirection shown in this picture. They were very serious, causing real problems, and we had no idea why they were happening.
00:03:29.519
So, we undid the scaling and reverted to a one-node cluster.
00:03:35.959
After this, we were left questioning why things had gone so wrong.
00:03:41.239
We had to spend some time learning how Redis works and understanding what cluster mode entails and why it didn’t work well for us.
00:03:47.680
I’m going to give you a speed run of the different ways to deploy Redis that we learned about during this fun learning exercise.
00:03:56.239
The simplest version is a single Redis instance.
00:04:02.480
You just have one server with Redis on it. Choose your hosting provider of choice, install Redis, and there you have it.
00:04:14.280
This is simple and cheap; if it's sufficient, you can save yourself thirty minutes.
00:04:19.320
However, sometimes we need more robustness than this setup can provide; we might need high availability.
00:04:32.280
If our single Redis server goes down, we need to ensure we still have access to Redis.
00:04:38.800
Sometimes, there could be a risk of data loss here, and we might need replication as well.
00:04:44.360
This would mean creating a setup where if the Redis server goes down, we don't lose the data it was holding.
00:04:49.960
Most relevant to this story, there's a limit to how much Redis throughput we can get out of one server; if we need more, we'll need a different approach to scale out.
00:05:06.960
This brings us to Redis Sentinel, an early approach for high availability.
00:05:12.919
Redis Sentinel is distributed as part of Redis but operates as a separate process. This means you have some servers with Redis running, and you can install Sentinel alongside them.
00:05:24.080
The Sentinels will collaborate, have an election, and decide who will be the master and who will be the replicas.
00:05:30.840
Sentinel then configures Redis to establish which node is the master and which nodes replicate from it.
00:05:36.080
This setup requires application support as the application must know which of the Redis nodes is the master node.
00:05:41.680
The application must ask Sentinel, which will provide the current master node's address, allowing the application to connect appropriately.
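The flow above is supported directly by the redis gem. As a rough sketch (the hostnames and the "mymaster" group name here are placeholders, and the exact options may vary by gem version):

```ruby
require "redis"

SENTINELS = [
  { host: "sentinel-1.internal", port: 26_379 },
  { host: "sentinel-2.internal", port: 26_379 },
  { host: "sentinel-3.internal", port: 26_379 },
]

# Ask the Sentinels which node is currently the master of the
# "mymaster" group and connect to it; the client re-resolves
# the master address after a failover.
redis = Redis.new(name: "mymaster", sentinels: SENTINELS, role: :master)
```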
00:05:55.000
This solution eliminates a single point of failure; if one of the Redis nodes fails, Sentinel promotes one of the replicas to master.
00:06:01.919
The application learns about this promotion because it understands Sentinel.
00:06:07.039
Additionally, this approach can be used to scale read operations; if you have a read-heavy workload and can tolerate slightly stale data, your application can opt to read from these replica nodes.
00:06:14.759
Unfortunately, this doesn't help with write operations; under Sentinel, every write must still go through the single master instance.
00:06:22.960
This leaves us with a problem given our write-heavy workloads.
00:06:28.000
Fortunately, there's a solution in the Redis gem called Redis::Distributed.
00:06:36.320
Redis Distributed operates on the notion that we can take multiple Redis servers and split the keyspace between them.
00:06:42.479
This is accomplished by using a hashing algorithm to determine which keys go where.
00:06:50.160
The downside, however, is that this approach can lead to a loss of high availability.
00:06:56.760
If one of those Redis servers goes down, we would lose half of our keys, which is not ideal.
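The idea can be illustrated with a deliberately simplified toy router; this is not Redis::Distributed's actual algorithm (the gem uses a consistent-hash ring, which limits key movement when nodes change), just the general shape of splitting a keyspace by hash:

```ruby
require "zlib"

# Toy key router: hash each key and pick a node by modulo.
NODES = ["redis-a:6379", "redis-b:6379"].freeze

def node_for(key, nodes = NODES)
  nodes[Zlib.crc32(key) % nodes.size]
end

# Every key lands deterministically on exactly one node, so if that
# node goes down, the keys it held are simply unavailable:
lost = ("a".."z").count { |k| node_for(k) == NODES[0] }
puts "#{lost}/26 sample keys would be lost with redis-a down"
```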
00:07:04.479
To address the need for both distribution and high availability, we can combine the idea of Redis Sentinel with Redis Distributed.
00:07:13.920
This gives us independent Redis Sentinel clusters, allowing us to maintain multiple shards.
00:07:22.160
With these powers combined, we can have horizontal write scalability; we increase throughput with additional nodes.
00:07:28.479
However, we have to keep in mind how to add new nodes effectively.
00:07:34.080
Redis Distributed has challenges when it comes to scaling; adding a new node requires you to determine how to rebalance the keyspace.
00:07:40.599
The system doesn’t automatically accommodate the redistribution of keys.
00:07:46.320
So while Redis Distributed allows you to start at a certain scale, it doesn’t allow easy changes to that scale.
00:07:52.360
This limitation highlights the need for Redis Cluster, which is designed to solve the scalability issue.
00:08:01.599
In Redis Cluster mode, we divide the keyspace into 16,384 slots determined by a hashing function, specifically a CRC16 hash.
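The slot computation itself is small enough to sketch in pure Ruby, following the Redis Cluster specification: a CRC16 (the XMODEM variant) of the key, modulo 16,384, after applying the hashtag rule:

```ruby
# CRC16-CCITT (XMODEM), the variant named in the Redis Cluster spec.
def crc16(str)
  crc = 0
  str.each_byte do |b|
    crc ^= b << 8
    8.times do
      crc = ((crc & 0x8000).zero? ? crc << 1 : (crc << 1) ^ 0x1021) & 0xFFFF
    end
  end
  crc
end

# HASH_SLOT: if the key contains a non-empty {...} section, only the
# part inside the braces is hashed -- the hashtag rule covered later.
def hash_slot(key)
  s = key.index("{")
  if s && (e = key.index("}", s + 1)) && e > s + 1
    key = key[(s + 1)...e]
  end
  crc16(key) % 16_384
end

hash_slot("{user1000}.following") == hash_slot("{user1000}.followers") # => true
```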
00:08:09.000
The Redis nodes come together and elect a master for each slot, creating a proper cluster.
00:08:15.600
In this arrangement, every slot is served by only one primary node, complemented by one or more replicas.
00:08:21.280
Your application must know which keys are sent to which servers to ensure effective distribution.
00:08:27.679
To support this architecture, we need client library support. The Redis gem provides the capability to understand how to route keys correctly.
00:08:36.240
The gem can ask the cluster for its topology (for example, via the CLUSTER SLOTS command), returning a list of nodes and the slots each is responsible for.
00:08:42.240
This allows the application to send keys to the correct nodes according to the slot they hash to.
00:08:48.720
Moreover, Redis Cluster allows changes to the topology; if we add a node, the system can redistribute the slots.
00:08:57.520
This process involves copying the current values of the keys and using a consensus mechanism to declare new ownership of slots.
00:09:05.879
The application will also be informed about these changes through redirection commands.
00:09:11.760
If a request for a key goes to the wrong node, the node replies with a MOVED redirection naming the node that is responsible, so the client can retry there.
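This redirect-and-retry loop can be simulated entirely in memory; this toy (all names here are made up) shows the mechanism a real cluster client implements, with a stale slot map corrected by a MOVED reply:

```ruby
# Each "node" owns a range of slots and answers :moved for the rest,
# like a real cluster node returning a MOVED redirection.
Node = Struct.new(:name, :slots) do
  def get(slot, data)
    return [:moved, slot] unless slots.include?(slot)
    [:ok, data[slot]]
  end
end

class ToyClusterClient
  def initialize(nodes)
    @nodes = nodes
    # Stale on purpose: route everything to the first node at boot.
    @slot_map = Hash.new(nodes.first)
  end

  def get(slot, data)
    status, payload = @slot_map[slot].get(slot, data)
    if status == :moved
      # A real client parses "MOVED <slot> <host:port>", updates its
      # slot map, and retries against the named node.
      owner = @nodes.find { |n| n.slots.include?(slot) }
      @slot_map[slot] = owner
      _status, payload = owner.get(slot, data)
    end
    payload
  end
end

a = Node.new("a", 0...8192)
b = Node.new("b", 8192...16_384)
client = ToyClusterClient.new([a, b])
client.get(9000, { 9000 => "hello" }) # first try hits a, MOVED, retried on b
```

A real client also has to handle ASK redirections while a slot is mid-migration; this sketch skips that.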
00:09:18.960
This setup leads to horizontal scalability, as we can scale out writes and reads easily as we add more nodes.
00:09:30.000
However, Redis Cluster requires complex client libraries to handle tricky situations.
00:09:37.080
These situations may include what happens if keys are in the process of being moved or if there's a network partition.
00:09:43.840
These issues simply do not arise with a single Redis server; the extra complexity is the price of clustering.
00:09:53.600
At this point, we can understand the taxonomy of Redis and where Redis Cluster fits in.
00:10:00.279
Now, let's return to our incident and analyze what went wrong.
00:10:06.880
Our Redis Cluster was configured with just one shard, which means all slots were served by a single master.
00:10:12.560
In practical terms, this setup didn’t really function as a true cluster.
00:10:18.400
When we tried to scale out by adding more nodes, we consequently faced numerous issues.
00:10:24.400
Now let's examine the cross-slot pipelining error that we encountered.
00:10:30.480
Redis commands can operate on multiple keys at once. For instance, using Sidekiq Pro involves extensive multi-key operations.
00:10:42.920
Nevertheless, in Redis Cluster, a multi-key operation only works if all keys belong to the same slot.
00:10:55.080
For example, if you attempt to set values for two keys that fall into different slots, you'll receive a CROSSSLOT error.
00:11:01.919
To work around this, you can use hashtags within your key names.
00:11:09.760
This means that if part of your key name is wrapped in curly braces, Redis hashes only the part inside the braces.
00:11:16.120
This allows keys with matching hashtags to belong to the same slot, making them eligible for multi-key operations.
00:11:23.680
Pipelining is another efficiency approach in Redis.
00:11:31.239
Typically, you would request multiple keys' values one by one, but you can also send all the requests at once and collect responses later.
00:11:42.240
This minimizes the request-response roundtrip time and can result in a performance boost.
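With the redis gem this looks roughly like the following sketch (assuming an existing redis connection; the key names are made up):

```ruby
# One round trip per command: 100 commands cost 100 network round trips.
100.times { |i| redis.get("stat:#{i}") }

# Pipelined: all 100 commands are flushed together and the replies are
# collected at the end, costing roughly one round trip in total.
values = redis.pipelined do |pipeline|
  100.times { |i| pipeline.get("stat:#{i}") }
end
```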
00:11:48.760
However, the cross-slot pipelining error can arise in situations where multiple keys are involved across different slots.
00:11:56.800
It should work seamlessly in theory, but in practice, we discovered it was not functioning correctly.
00:12:02.760
The good news is that the Redis gem v5 has improved cluster support under a new maintainer, and we learned we needed to upgrade to that version.
00:12:19.440
Thus, after implementing the upgrade in our testing environment, we stopped encountering those cross-slot pipelining errors.
00:12:27.080
However, new errors emerged that were related to transactions.
00:12:32.680
Let’s now discuss transactions in Redis and what issues we faced.
00:12:40.000
Transactions in Redis can be conceptualized as conditional execution.
00:12:48.760
When you execute a transaction, it will only apply changes if the specified keys haven't been modified since the start of the transaction.
00:12:55.440
In scenarios with competing modifications, the transaction will fail, preventing problems like negative ticket sales.
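The semantics of WATCH/MULTI/EXEC can be simulated with a tiny in-memory store; this toy (not Redis's implementation, just the contract) applies queued writes only if no watched key changed in the meantime:

```ruby
# Minimal simulation of optimistic locking a la WATCH/MULTI/EXEC.
class ToyStore
  def initialize
    @data = {}
    @version = Hash.new(0) # bumped on every write, like a dirty flag
  end

  def get(key)
    @data[key]
  end

  def set(key, value)
    @data[key] = value
    @version[key] += 1
  end

  # WATCH the given keys, let the block queue writes, then EXEC:
  # apply them only if no watched key was written in the meantime.
  def transaction(*watched)
    before = watched.map { |k| @version[k] }
    queued = []
    yield ->(key, value) { queued << [key, value] }
    return nil if watched.map { |k| @version[k] } != before # aborted
    queued.each { |k, v| set(k, v) }
    queued
  end
end

store = ToyStore.new
store.set("tickets", 1)

# Sell a ticket only if nobody else touched the stock under us:
store.transaction("tickets") do |queue|
  left = store.get("tickets")
  queue.call("tickets", left - 1) if left > 0
end
store.get("tickets") # => 0
```

If a competing writer bumps "tickets" between the WATCH and the EXEC, the transaction returns nil and no queued write is applied, which is how the negative-ticket-sales problem is avoided.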
00:13:05.680
However, one limitation is that in a Redis cluster, transactions work only when all keys belong to the same slot.
00:13:14.880
While testing our new branch, we encountered ambiguous node errors, which we had experienced previously.
00:13:24.880
In fact, transaction support in the Redis Cluster gem was incomplete.
00:13:31.679
It supported MULTI/EXEC but lacked the WATCH functionality necessary for conditional execution.
00:13:40.239
We were faced with a choice; we could hack around this problem, but we aimed to contribute a proper solution upstream.
00:13:48.720
We saw it as an opportunity not only to pay back to open source but also to ensure our solutions aligned well with the community.
00:14:03.359
The discussions also revolved around creating a stable foundation for future development, avoiding unnecessary workarounds.
00:14:11.159
Our intention was to ensure that we could contribute meaningful improvements to the Redis ecosystem.
00:14:24.639
By pinning down how transactions should execute against a cluster, we sought to build a shared understanding with the maintainer.
00:14:31.040
While we may not have gotten every solution correct initially, each attempt presented vital learning opportunities.
00:14:39.920
I eventually engaged the maintainer in a more productive discussion to clarify our design objectives.
00:14:47.760
We learned the importance of breaking down changes into smaller, manageable pull requests.
00:14:53.200
This ensures that both we and the maintainer can effectively analyze changes.
00:14:58.479
At the end of the day, collaboration and understanding remain essential when working with open source libraries.
00:15:06.399
The lesson learned is that engaging with existing projects and their maintainers is rewarding for all parties.
00:15:12.959
Regardless of your familiarity with the code, testing, reporting issues, or suggesting solutions carries immense value.
00:15:19.999
Through this process, I found that successful open-source engagement produces benefits for both the contributor and the project.
00:15:29.560
At Zendesk, as a company, we understood our responsibility in this ecosystem.
00:15:35.679
We aimed to ensure the solutions we utilized were maintainable, rather than relying on specific hacks.
00:15:46.240
The investment put into enhancing our libraries promotes long-term stability.
00:15:51.680
Contributing to open-source gems is also valuable for personal knowledge growth.
00:15:58.800
With the constant evolution of technology, nurturing expertise empowers engineers to solve problems.
00:16:05.760
Eventually, our collective contributions led to significant improvements in the Redis community.
00:16:12.079
This culminated in finishing our first major upstream contribution.
00:16:18.840
As part of our efforts toward upgrading to the Redis gem v5, we contributed improved transaction support.
00:16:26.159
In our testing environment, we saw efficiency gains, alongside smoother operations.
00:16:32.599
These collective efforts have paid off and shown us where open-source contributions impact our operational success.
00:16:39.440
While we aimed to achieve stability and scaling with Redis, we also experienced positive performance optimizations.
00:16:47.120
In fact, we observed our performance metrics improved — about 3 to 4 milliseconds gained in response times.
00:16:54.040
Although that was not our primary goal, it served as a welcome surprise.
00:17:01.160
However, we couldn’t forget there were still processes in progress for getting our patches fully merged.
00:17:08.120
Initially, I was disappointed when I realized that we had to ship a fork to production.
00:17:14.360
Working upstream can indeed be slower than anticipated.
00:17:22.000
This experience taught me that maintainers often have limited capacity.
00:17:29.520
We may not always receive quick turnaround on our contributions.
00:17:37.000
However, contributing to a project in this manner remains a worthy endeavor.
00:17:45.079
We must consider that maintaining open-source projects can be demanding work.
00:17:52.319
We often miss the behind-the-scenes efforts that contributors invest, balancing their own responsibilities.
00:18:00.720
For this reason, we need to exhibit patience while supporting these projects.
00:18:09.159
As contributors, we must therefore be strategic about the implementation of upstream dependencies.
00:18:17.760
The lesson learned is to engage with upstream contributors, testing and reporting issues.
00:18:26.760
Our experiences not only provide insights for us but also improve the projects we use collectively.
00:18:34.240
Let's summarize some key takeaways about Redis and contributions to open source.
00:18:44.920
On the Redis front, if you're dealing with a write-heavy workload, you might want to consider using Redis Cluster.
00:18:52.680
Start with a single node cluster and scale it out only when absolutely necessary, saving you costs.
00:19:00.760
If you pursue this path, the Redis Cluster client gem provides support for operations like pipelining and transactions.
00:19:07.419
And if you encounter any issues, please report them so we can work together to resolve them.
00:19:16.960
More generally, always keep your gems up to date; you don’t know what advantages you might be losing.
00:19:24.080
Additionally, engage with upstream maintainers, as this can lead to fruitful collaborations.
00:19:30.800
Remember, everyone has something to contribute to open-source projects.
00:19:40.240
Even simple feedback, like sharing your usage experiences, is invaluable.
00:19:46.240
Thank you for your attention, and feel free to find me later to discuss further.
00:19:51.440
You can also follow me on social media, and I hope to connect with you soon!