Euruko 2018

Scaling a monolith isn't scaling microservices

by Kerstin Puschke

In the video titled "Scaling a Monolith Isn't Scaling Microservices," speaker Kerstin Puschke discusses the differences between background jobs and message-oriented middleware in the context of workload distribution across different architectures, particularly focusing on monolithic and distributed systems.

The key points discussed in the presentation include:

- Background Jobs: Puschke explains the concept of background jobs in Ruby applications, highlighting their advantages in decoupling user-facing requests from backend tasks, which is crucial for enhancing response times and user experience.

- Task Queue Systems: She details how popular background job backends like Resque and Sidekiq enable asynchronous operations, illustrating how they can be beneficial in high-demand scenarios like e-commerce applications.

- Message-Oriented Middleware: Puschke contrasts background jobs with message-oriented middleware, such as RabbitMQ, which helps facilitate communication between different services in distributed architectures. This middleware allows for enhanced scalability and decoupling of services compared to traditional background jobs.

- Event Streams and Logs: The speaker also covers event logs, exemplified by Kafka, which preserve the sequence of events in an append-only log rather than in per-consumer queues, allowing greater flexibility in data handling and consumer roles.

- Challenges with Asynchrony: The challenges around asynchrony, such as ensuring message delivery and maintaining order in job processing, are addressed, emphasizing the importance of idempotent operations to avoid issues like double charges in financial transactions.

- Real-World Applications: Puschke provides insights into how Shopify successfully employs these techniques, managing vast numbers of background jobs, particularly during peak sales periods like Black Friday, showcasing a Kafka-powered dashboard for real-time visuals of sales data.

In conclusion, Puschke urges developers to carefully choose between these architectural patterns based on their application's needs, stressing that while background jobs are ideal for monolithic systems, message-oriented middleware and event logs offer significant advantages in distributed environments.

This video serves as an insightful guide for developers and engineers looking to optimize their systems and understand the implications of different architectural decisions in software scalability.

00:00:00.030 Okay, I totally killed it. Never mind. I want to remind you that the person I'm talking about is actually in the room. We can do this thing tomorrow at 11 a.m. in front of the Rathaus. This person, come on stage; maybe they'll wait for you for a tour. If you're interested in an unofficial tour of Vienna, apparently I'm very good at tours.
00:00:44.970 When a new user signs up for any kind of web application, the application creates an account for them. However, it usually also sends some kind of welcome email. Sending emails is slow; the user should not have to wait for that email to be sent. They should be able to use the service right away. Luckily, sending welcome emails is not essential for account creation.
00:01:07.290 So, the application can return to the user right after creating the account and send the welcome email later, outside of the main request-response cycle, by offloading it into a background job. Background jobs are great for monolithic applications, but they're not that useful in a distributed architecture. Today, I’m going to talk about how to choose the right tool for workload distribution in different architectures.
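As a rough sketch of that pattern, assuming a Rails application with Active Job (the job and mailer names are hypothetical), offloading the welcome email could look like this:

```ruby
# A minimal sketch using Active Job; Resque, Sidekiq, or Delayed Job can act as
# the backend. WelcomeEmailJob and UserMailer are hypothetical names.
class WelcomeEmailJob < ApplicationJob
  queue_as :default

  def perform(user_id)
    user = User.find(user_id)
    UserMailer.welcome(user).deliver_now # the slow part now runs on a worker
  end
end

# In the signup controller, right after creating the account:
WelcomeEmailJob.perform_later(user.id) # enqueue and return to the user immediately
```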
00:01:30.780 I'm going to start with background jobs: how they work, the features they provide, and some of the challenges they might pose, including when to choose them. I will also share how Shopify uses background jobs to scale one of the largest Rails applications in existence.
00:01:37.540 I'm going to compare background jobs to message-oriented middleware like RabbitMQ and briefly touch on event logs like Kafka, explaining how they differ from more traditional messaging. Finally, I'll summarize the comparisons of these three technologies.
00:01:49.440 Let's start with background jobs in the Ruby world. Resque and Sidekiq are popular background job backends, but there are also Delayed Job and some others. A background job is a unit of work to be done in the future. The application server can fork a worker process to run the job, and it shouldn't have to wait for the job to finish.
00:02:01.050 The worker, on the other hand, should be able to communicate any errors. To achieve this, we need some kind of asynchronous communication between our two processes. We can facilitate that by placing a message queue between them. The sender and the receiver of messages can interact with a queue at different times, with the message being the job — the tasks to be done by the worker.
00:02:12.120 This is why the queue is often also called a task queue. The application server doesn't have to wait for the worker to process a message; it simply drops a message into the queue and can immediately move on to serve user-facing requests.
00:02:26.330 We can set up more than one worker to handle more than one task simultaneously. A background job backend essentially functions as a task queue along with broker code for routing messages and managing the workers. It encapsulates the asynchronous communication between our processes, allowing us to keep our background job code simple.
00:02:46.060 Our background jobs don't even have to be aware that they are being run asynchronously in the future. Now let's look at the features background jobs provide. They allow us to decouple user-facing requests from backend tasks, particularly if those tasks are time-consuming. This helps improve response times and provides a better user experience.
00:03:05.920 If the application experiences a sudden spike, for example in image uploads, it won't hurt performance if the actual image processing is done as a background job. The queue might grow, but it's a buffer between the application server and the worker. The speed of queueing more jobs is not constrained by how quickly they are processed, allowing users to upload images as fast as usual.
00:03:17.680 If we face a very large task, we can split it into multiple subtasks and queue them as individual background jobs. By setting up more workers, they can work on these subtasks in parallel. However, our code remains simple, as our background jobs don't need to know they're being run in parallel.
00:03:34.250 All this code needs to know is how to handle one of these subtasks at a time. If a worker encounters any errors while executing a job, that job can be requeued and retried later. For example, at Shopify, we have background jobs closely related to the buyer’s checkout process. These jobs are certainly more urgent than, say, firing webhooks.
00:03:56.560 Backends like Resque and Sidekiq allow us to prioritize jobs by using multiple named queues. Jobs queued into a high-priority queue are processed first, but if that queue is empty, our workers will not idle or waste resources; they will work on lower-priority jobs.
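With Sidekiq, for example, an urgent job can be routed to a dedicated queue roughly like this (class and queue names are illustrative):

```ruby
# Route checkout-related work to a dedicated high-priority queue.
class CompleteCheckoutJob
  include Sidekiq::Worker
  sidekiq_options queue: "critical"

  def perform(checkout_id)
    # finalize the checkout ...
  end
end
```

Workers are then started with the queues listed in priority order, e.g. `sidekiq -q critical -q default -q low`, so they only pick up lower-priority work when the critical queue is empty.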
00:04:16.300 The main application powering Shopify queues tens of thousands of jobs per second, so we rely heavily on background jobs and the features they provide. Background jobs encapsulate asynchronous communication between two processes, but they cannot change the fact that we're dealing with asynchronous communication, which poses some challenges.
00:04:31.420 If a job succeeds, the worker confirms that it's safe to remove this job from the message queue. But what happens if the job is not confirmed? We can allow another worker to pick up that job, but we may end up running it twice if the first worker was merely slow and didn't actually crash.
00:04:46.220 On the other hand, if we don't allow a second worker to pick up the unconfirmed job, there is a risk that our job won't run at all if the first worker actually did crash. We need to choose between at least once and at most once delivery. At most once delivery seems appealing if running the job twice would be worse than not running it at all, but needing that guarantee usually points to a flaw in the job's code.
00:05:06.300 For example, if a background job charges a customer, we definitely want to avoid charging that customer twice, so we might be tempted to use at most once delivery. A better approach is to carefully keep track of all the charges we make and have the background job check this state before making a charge.
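A minimal sketch of such a check (ChargeCustomerJob, Charge, and PaymentGateway are hypothetical names):

```ruby
# Record every charge and skip the work if it has already happened. In practice,
# a uniqueness constraint on order_id would additionally guard against races.
class ChargeCustomerJob < ApplicationJob
  def perform(order_id)
    return if Charge.exists?(order_id: order_id) # already charged, nothing to do

    PaymentGateway.charge(order_id)
    Charge.create!(order_id: order_id)           # remember the charge we just made
  end
end
```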
00:05:29.670 This way, running it a second time won't create a second charge, allowing us to safely opt for at least once delivery. Our jobs leave the queue in a predefined order, but that doesn't necessarily mean they will be processed in the same order. We never know which worker is faster if we have more than one worker, and if a job encounters an error, it will be requeued and retried later.
00:05:49.650 As a result, our jobs might run out of order. If the order of jobs matters, we shouldn't queue all of them right away, as that risks them running in the wrong order. Instead, the first job, after completing its tasks, should queue the second job as a follow-up, ensuring they run sequentially.
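As a sketch, instead of queuing both steps up front, the first job queues the follow-up only once its own work has succeeded (the job names are hypothetical):

```ruby
class ExportOrdersJob < ApplicationJob
  def perform(shop_id)
    export_orders(shop_id)
    # Queue the follow-up only after this step has succeeded,
    # so the two jobs cannot run out of order.
    SendOrderReportJob.perform_later(shop_id)
  end
end
```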
00:06:07.620 A background job does not have to be as fast as a user-facing request, but long-running jobs can cause problems. For example, Resque prevents a worker from shutting down while a job is still running. That isn't very cloud-friendly, as we can't rely on the same resources being available for many hours.
00:06:26.660 Sidekiq handles this a little better; when a worker shutdown is requested, Sidekiq aborts a running job and requeues it. But that isn't necessarily good enough; the job may be aborted before finishing. So, if you deploy faster than the job can run, that job will never succeed.
00:06:43.350 At Shopify, with the core team deploying about 60 times a day, this isn't a theoretical possibility, but a real issue we need to address. Sidekiq alone won't solve it: many of our long-running jobs need to iterate over large data collections.
00:07:00.300 We split those jobs into smaller parts and set a checkpoint after each iteration so the job can be requeued and resumed. This makes our jobs interruptible without losing progress. It enables us to shut down our workers at any time to deploy, which makes them cloud-friendly, and it also helps with disaster prevention.
00:07:20.370 For instance, if our database is experiencing problems, we can throttle or back off jobs to prevent making ongoing incidents worse or causing new ones. This method also ensures data integrity. If we need to move data around between different database shards, we can interrupt a job to do so safely.
00:07:42.820 You can achieve some of these results by splitting the job into smaller tasks and queuing them independently, but it complicates tracking retries and failures for the overall task. A resume isn’t a retry. By encapsulating these scaling issues into a separate abstraction layer, we can keep our background job classes clean and simple since they remain unaware of the scaling issues.
00:08:03.390 This abstraction layer was recently open-sourced. If you'd like to take a closer look, you can find it on GitHub.
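The underlying pattern looks roughly like this (a simplified sketch with assumed names, not the actual open-sourced API): iterate with a cursor, checkpoint after every step, and requeue from the checkpoint when the worker needs to shut down.

```ruby
# Simplified sketch of an interruptible, checkpointed job. `backfill` and
# `shutdown_requested?` are placeholders for the real work and for whatever
# signal the worker uses to request an interruption.
class BackfillProductsJob < ApplicationJob
  def perform(cursor = 0)
    Product.where("id > ?", cursor).find_each do |product|
      backfill(product)
      cursor = product.id                  # checkpoint after each iteration

      if shutdown_requested?
        self.class.perform_later(cursor)   # resume later from the checkpoint
        return
      end
    end
  end
end
```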
00:08:20.750 So, when should you choose background jobs? The background job backend code is a Ruby gem running on both the application server and the worker. The tasks pushed into our task queue are instances of a job class, which is a Ruby object. Both the application server and the worker need to understand this object; they need to implement that class.
00:08:36.270 This compatibility makes background jobs ideal for monolithic applications where our application server and worker run instances of the same codebase. The background job backend facilitates communication between different processes. If our background job, for instance, is performing image processing and we want to extract that into a separate service, we require communication between services.
00:08:53.640 This is something background jobs aren't particularly good at. To summarize, background jobs excel at decoupling user requests from back-end tasks in monolithic architectures.
00:09:05.680 Message-oriented middleware differs in that it facilitates communication between different services. A very widely used implementation is RabbitMQ, with protocols like AMQP and MQTT. This kind of middleware is also based on task queues, but the logic for message routing and managing workers now lives inside the middleware, outside of our codebase.
00:09:34.280 The producers and consumers of messages only need to speak the middleware's protocol, so there's no need to pass Ruby objects around. Instead, they agree on a purely data-based interface, such as a JSON payload with specific attributes.
00:09:58.150 As the middleware is based on task queues, it effectively brings the features of background jobs to a distributed architecture. By placing a data-based interface between our services, we decouple them effectively, making it easy to replace a service with a different implementation later.
00:10:25.920 If the middleware employs a wire-level protocol like AMQP, we also gain interoperability, allowing us to connect services based on entirely different technologies. Background jobs function primarily as commands—"send this email," "process this image," or "fire this webhook." We can use command messages for similar purposes in a distributed architecture.
00:10:52.630 For example, we can send a message to an image processing service, ordering it to process a specific image. We can also send event messages, where the message producer is indifferent to what the consumer does with the information it receives—common in propagating updates.
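Coming back to the image processing command, a sketch of publishing such a message over RabbitMQ with the bunny gem (connection URL, queue name, and payload fields are assumptions):

```ruby
require "bunny"
require "json"

conn = Bunny.new("amqp://guest:guest@localhost:5672")
conn.start

channel = conn.create_channel
queue   = channel.queue("image_processing", durable: true)

# The payload is a purely data-based interface: plain JSON, no Ruby objects.
queue.publish({ image_id: 42, action: "resize" }.to_json, persistent: true)

conn.close
```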
00:11:12.630 Take this example of a distributed architecture: if my business partner goes out of business, I need to update their records, cancel their pending orders, and halt their support contracts. Client services need to be informed about this update. In a distributed architecture, it's also typical for services to work with data they don’t own.
00:11:36.090 To avoid repeatedly requesting this data, a service may cache a local copy, which necessitates learning about any updates to this data for proper cache invalidation. A common approach to propagate updates is by firing webhooks or making REST requests. However, if a client service is temporarily down, it may miss a request and never become aware of the update.
00:12:01.680 This also leads to separate requests for each client service, as HTTP routing operates one-to-one. If we add a new service, we need an additional request. Therefore, the service firing the webhooks must be aware of all client services.
00:12:23.900 Instead of using HTTP for propagating updates, we can utilize our middleware and send a message about the update. If the message consumer is temporarily down, the message will stay in the queue until it returns, ensuring no information is lost during downtime.
00:12:45.320 Messaging middleware typically provides topics rather than just a single point-to-point queue. Each consumer gets its own message queue, and the middleware duplicates messages into the right queues, removing the one-to-one routing limitation.
00:13:02.470 Topics give us publish/subscribe capabilities: a new service can subscribe to the messages it's interested in by creating its own queue, without the message producers having to know about this consumer. Likewise, new message producers can start sending messages without the consumers knowing about them.
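A sketch of the consumer side with bunny (exchange, queue, and routing key names are assumptions): the new service creates and binds its own queue, and the producers never need to know it exists.

```ruby
require "bunny"
require "json"

conn = Bunny.new("amqp://guest:guest@localhost:5672")
conn.start

channel  = conn.create_channel
exchange = channel.topic("partner_events", durable: true)

# This service owns its queue and binding; producers just publish to the topic.
queue = channel.queue("orders_service.partner_updates", durable: true)
queue.bind(exchange, routing_key: "partner.updated")

queue.subscribe(block: true) do |_delivery_info, _properties, payload|
  event = JSON.parse(payload)
  # cancel pending orders for the affected partner, etc.
end
```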
00:13:18.450 For instance, when every service reports suspicious activity to a centralized fraud score tracker, the tracker itself remains unaware of the other services. Likewise, message producers stay anonymous. Similar to background jobs, challenges arise due to the asynchronicity of communication.
00:13:44.520 Our messages have a data-based interface, which may eventually require breaking changes. Coordinating such changes across multiple producers and multiple consumers is hard, so while N-to-M routing is technically possible, it is often impractical.
00:14:03.080 Moreover, domain and business logic are often better represented by multiple smaller messages that each convey one piece of information rather than one large message that covers everything. If a consumer isn't running, its messages remain in the queue.
00:14:20.110 This is excellent if a consumer is temporarily down, but if it’s permanently gone, messages begin to pile up in the queue, which can lead to issues for the broker. Hence, a well-behaved system should ensure the removal of unnecessary queues.
00:14:36.060 As with background jobs, we typically lack guarantees for exactly once delivery. Writing idempotent message consumers enables us to choose at least once delivery safely.
00:14:57.580 Messages might also get processed out of order; if the order does matter, we should avoid queuing them all at once. Instead, let the first message, after handling its designated task, queue the next message as a follow-up.
00:15:15.470 Regarding updating information, we must be cautious about receiving multiple updates within a short time, as the last message received may not necessarily reflect the most recent update. Two different methods for propagating updates exist.
00:15:30.590 One approach, known as event-carried state transfer, includes the updated data in the payload. The receiving service doesn't need to request the data from the service that owns it. However, we need to include information that helps the consumer determine the correct order of events without relying on the order in which messages arrive.
00:15:50.730 A simple timestamp can suffice, since messages might get processed out of order and delivery order alone isn't reliable. Alternatively, we can send simple event notifications, where the payload includes only the ID of the updated resource. These messages are commutative, meaning their order doesn't matter.
00:16:06.210 However, the message consumer will need to follow up with a request to obtain the most recent changes, which limits their visibility into the intermediate state of the resource.
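For illustration, with made-up field names, the two payload styles differ roughly like this:

```ruby
# Event-carried state transfer: the payload carries the updated data itself,
# plus a timestamp so the consumer can work out the correct order of events.
state_transfer = {
  event: "partner.updated",
  partner_id: 42,
  status: "out_of_business",
  updated_at: "2018-08-24T10:15:00Z"
}

# Event notification: only a pointer to the resource. The messages are
# commutative, but the consumer has to fetch the current state itself.
notification = {
  event: "partner.updated",
  partner_id: 42
}
```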
00:16:30.060 This approach does not work well if a consumer aims to build a history or timeline of events. Once messages are processed, they are removed from the queue, which means I cannot replay the stream of messages.
00:16:46.560 Replayability is a valuable feature when constructing systems based on events. In an event sourcing system, I don't store the current state of a resource; instead, I keep track of a series of events to recalculate the resource's current state.
00:17:06.400 To leverage my messaging system for this, the message consumer must persist the messages being processed, ensuring they can be replayed later. However, this approach means that each consumer builds its own event log, resulting in no single source of truth. Consequently, consumers may end up living in different realities.
00:17:36.730 Message-oriented middleware suits distributed architecture better than background jobs because it's designed for communication between services. It works well in mostly stateful architectures, but isn’t ideal for event sourcing.
00:18:00.600 To summarize, the features of message-oriented middleware hinge on queues and topics. Topics abstract away the complexities of publish/subscribe, while queues mean messages don't have to be processed right away and nothing is lost in the meantime, enabling us to write simple and effective message consumers.
00:18:37.720 Yet, the entire system might grow somewhat more complicated, requiring attention to delivery and order guarantees. Finally, messaging works best in stateful distributed architectures.
00:19:05.370 Event logs, also referred to as commit logs, take a wholly different approach to events. One of the most popular examples currently is Kafka. An event log preserves the sequence of events in an append-only log, allowing consumers to read from that log at will.
00:19:20.890 The broker itself maintains no per-consumer queues; with respect to its consumers, it is essentially stateless. A consumer disappearing therefore doesn't cause messages to pile up on the broker. Kafka also offers very high throughput.
00:19:46.760 Since events are not removed after processing, consumers can replay them as often as they wish. A shared event log also gives us a single source of truth if we want to pursue event sourcing.
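As a sketch using the ruby-kafka gem (broker address, topic, and payload are assumptions), a consumer can read the log from the very beginning, which is what makes replay and event sourcing possible:

```ruby
require "kafka"
require "json"

kafka    = Kafka.new(["localhost:9092"], client_id: "sales-dashboard")
consumer = kafka.consumer(group_id: "dashboard")

# Starting from the beginning of the log replays every retained event.
consumer.subscribe("checkout_events", start_from_beginning: true)

consumer.each_message do |message|
  event = JSON.parse(message.value)
  # aggregate sales totals for the real-time dashboard
end
```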
00:20:10.080 This makes event logs a suitable choice for event sourcing as well as real-time applications. Over the Thanksgiving weekend, commonly called Black Friday and Cyber Monday, the largest shopping weekend in the U.S., Shopify built a Kafka-powered dashboard.
00:20:38.140 This dashboard visualized in real time how Shopify merchants generated over 1 billion USD in sales during that extended weekend. Black Friday and Cyber Monday 2018 are expected to be even bigger.
00:20:54.650 If you'd like to help us take these things to the next level, please come talk to me afterward; we're hiring. Thank you so much!