00:00:16.480
Okay, thanks everyone for coming to the first talk. I'm John, and I work literally right down the block at GroupMe, which is like half a block away on 18th Street. GroupMe is a large and growing mobile chat application, and we want to share a bit about what we have done over the past two to three years to grow our infrastructure along with the number of people using our service. We're going to discuss an experience report of what we've done and what we've learned.
00:00:34.320
I want to start with a problem statement: why are we talking about services? Why does this matter? Why does every conference have at least one talk about services? I think it boils down to this: our Ruby on Rails application is difficult to change. Our app has grown in scope and complexity over the years, and changes have gotten harder and harder as the number of people working on it, and the amount of code involved, have grown. In the beginning, we had this beautifully polished gem—our application was coherent and we had thought through every area of its construction. It was a pleasure to work on.
00:01:08.880
However, as time goes on, it becomes less coherent. This cover art from a game called Katamari Damacy illustrates this perfectly. The goal of the game is to build a ball of unrelated things, and this is akin to what many Rails applications become. They turn into a mass of components that are not very related to each other but are all touching and tightly coupled. We start using the term 'monolith' to describe our application, which represents all the mistakes of the past. When we talk about services, we refer to a bright shiny future where we'll fix all those problems.
00:01:39.680
This scenario is common across tech conferences. In the last ten months, companies like Square, Yammer, and LivingSocial, which employ hundreds of engineers, have discussed their original monolithic applications and how they are slowly trying to extract services from them to increase agility. It's so prevalent in the Rails community that I've seen the term 'monorail' thrown around, the first time on Jeff Hodges's blog—a portmanteau of Ruby on Rails and monolith. It's all too easy to continue throwing everything into one application.
00:02:12.000
Even when we look beyond our community, there are precedents for this. Werner Vogels noted in a 2006 ACM Queue article that Amazon faced a similar situation from 1996 to 2001. When you visited amazon.com back then, you were interacting with a frontend web application connected to a backend database. They stretched this system as far as it could go. By 2001, the famous edict was made: everything would be done as services. It's worth noting that in 2000, their revenue was around 3 billion dollars a year, indicating just how far they could stretch their monolith.
00:03:01.040
Back to our community: many talks tend to focus on synchronous services. When we envision services, we often picture a fleet of small Sinatra applications that each own a vertical slice of our domain and communicate through HTTP. However, I want to focus on a different type of service: asynchronous services that we have extracted from our application to help scale it out. When I mention 'asynchronous,' I mean message passing—services that interact with each other by dropping messages onto a medium.
00:03:44.240
In this discussion, we will talk about some patterns that have emerged from our extractions of these services. Regardless of the tools we use, there are specific interaction patterns between the services we want to cover. We'll highlight some benefits of services in general, especially these message-based asynchronous services, and outline the operational concerns they introduce. Then, we'll share some insights we've gained over the past couple of years regarding these patterns.
00:04:32.000
The first pattern I want to discuss is what I've been calling 'service mailbox.' This is a pattern where we have a kind of medium between our services. Clients write individual messages into a queue, which is monitored by a service that processes the messages as it can. When I refer to a queue, I mean a FIFO (first in, first out) queue in the data-structure sense: items leave the queue in the order they entered it.
00:05:10.639
The queue exposes at least two operations: an enqueue operation, where an item is added to the back of the queue, and a complementary dequeue operation, where the first item scheduled for delivery exits the queue and is returned atomically. So, we end up with our application enqueuing messages into this shared mailbox, and our service dequeuing from the same mailbox. To give you some context about how this works, I'll discuss specific features of GroupMe.
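To make the two operations concrete, here is a minimal in-memory sketch of the mailbox. In production this would be a Redis list (LPUSH to enqueue, BRPOP to dequeue), but the semantics are the same: FIFO ordering with an atomic dequeue. The class name is illustrative, not actual GroupMe code.

```ruby
# Minimal in-memory sketch of a service mailbox: FIFO queue with
# atomic enqueue/dequeue, guarded by a mutex.
class Mailbox
  def initialize
    @items = []
    @lock  = Mutex.new
  end

  # Add an item to the back of the queue.
  def enqueue(item)
    @lock.synchronize { @items.push(item) }
  end

  # Atomically remove and return the oldest item, or nil if empty.
  def dequeue
    @lock.synchronize { @items.shift }
  end
end
```

With Redis, a blocking dequeue (BRPOP) would let the worker sleep until a message arrives instead of polling.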
00:05:55.760
The core of GroupMe is message delivery. It is a group chat application where users join groups, and messages sent to a group are relayed to the other members. For instance, if we have a user, Steve, who sends a message to the Best Friends group, we relay that message to the other three group members. Likewise, if another person replies, that message is relayed to the same three members—it's a straightforward one-to-many broadcast group system.
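The one-to-many broadcast described above can be sketched in a few lines. The `Group` struct and `relay` helper are hypothetical names for illustration; the point is simply that each inbound message fans out to every member except the sender.

```ruby
# Hypothetical sketch of group fan-out: a message sent to a group is
# relayed to every member except the sender.
Group = Struct.new(:name, :members)

def relay(group, sender, text)
  recipients = group.members - [sender]
  # One delivery record per recipient.
  recipients.map { |member| { to: member, from: sender, text: text } }
end
```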
00:06:43.360
We support various delivery mechanisms to get these messages to users. Primarily, we use smartphone push notifications for platforms like iOS, Android, and Windows Phone. Additionally, we support numerous SMS networks and protocols to deliver messages to feature phone users who are not on smartphones. There are many business requirements regarding message delivery because it is our core feature.
00:07:19.520
The first requirement is simple: messages should be delivered exactly once. We don’t want duplicate messages sent to users, and equally important, we don't want messages to be dropped. We aim to ensure that when a message enters our system, it will exit exactly once. Furthermore, we strive to deliver messages as quickly as possible, focusing on minimizing the time it takes from when a message enters our system to when it exits into the recipient's system. Finally, if messages remain queued for more than two hours, we begin to drop them, as they may become irrelevant.
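The two-hour rule can be expressed as a simple predicate, assuming each message records when it entered the system (the `enqueued_at` field here is an assumption for illustration):

```ruby
# Messages queued for longer than the TTL are dropped rather than
# delivered late, since they may no longer be relevant.
TTL_SECONDS = 2 * 60 * 60

def deliverable?(message, now = Time.now)
  (now - message[:enqueued_at]) <= TTL_SECONDS
end
```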
00:08:51.600
In the beginning, GroupMe was a 24-hour hackathon project, and we're still working with the same code base that started there. The prototype of GroupMe was an SMS product where users were assigned a telephone number representing a group, and messages sent to that number would be broadcast to the group. Initially, this process was done synchronously: requests would come in from Twilio and messages would be relayed back within the context of a web request-response cycle. This worked well until traffic increased, prompting us to background that work to improve user response time.
00:09:47.840
We found that backgrounding allowed us to operate independently of the request-response cycle, returning to users more quickly. We used Resque to handle these background tasks while also adding more delivery mechanisms—not just SMS delivery, but push delivery using tools like Apple's APNS or Android's C2DM. However, we faced challenges with volume, resulting in choppy and low throughput. Resque is a fork-based system that forks a child process for each job, and that per-job overhead ultimately hurt our throughput, especially given the variability in job processing times.
00:11:05.439
We realized that the throughput of our messaging system needed improvement: it varied a great deal depending on how long individual jobs took to process. We wanted to get messages out expediently, yet we were limited in how many worker processes we could run, since each full Rails process consumed significant memory and its own database connections, presenting scalability challenges.
00:11:55.680
To solve these problems, we introduced a dedicated service whose only job was message delivery—what we call the transport service. The transport service is essentially a high-throughput, EventMachine-based Resque worker. During this process, we were careful to make migration as simple as possible since it was our first experience with extractions. Using Redis for our job queue was essential, as it was already integrated into our infrastructure and very well understood by our team.
00:12:43.680
We built the transport service as a drop-in replacement for Resque, focused on high-speed operation. It monitors a queue in Redis and delivers each message through the appropriate channel. This setup preserves the exactly-once semantics we need for message delivery. Furthermore, we saw higher and more consistent throughput because we could optimize the delivery path separately, without the overhead of running large Rails processes.
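The core loop of a transport-style worker is simple: dequeue a job, hand it to a delivery handler, repeat. The real service blocks on Redis inside an EventMachine reactor; in this sketch the queue and handler are injected so the loop can be shown in isolation, and the names are illustrative.

```ruby
# Rough shape of a transport worker: pull jobs off a queue and pass
# each one to a delivery handler. A real worker would block waiting
# for new jobs instead of stopping when the queue drains.
class TransportWorker
  def initialize(queue, handler)
    @queue   = queue
    @handler = handler
  end

  # Process jobs until the queue is empty.
  def drain
    while (job = @queue.shift)
      @handler.call(job)
    end
  end
end
```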
00:13:44.720
Another pattern we often use is what we call 'event stream.' This differs from a shared mailbox; it represents a shared data stream among our services. Thinking about an event stream is similar to thinking about a file on disk. Files offer operations like appending data to the end and reading from specific offsets, which is analogous to our event stream.
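The file analogy can be sketched directly: an append-only log that producers append to, and that each consumer reads from at its own offset. This is a toy version of what Kafka provides; the class and method names are illustrative.

```ruby
# Sketch of the event-stream abstraction: an append-only log with
# offset-based reads. Independent consumers each track their own
# offset, so they can proceed at their own pace or replay the stream.
class EventStream
  def initialize
    @log = []
  end

  # Append an event to the end of the stream; returns its offset.
  def append(event)
    @log << event
    @log.size - 1
  end

  # Read up to `limit` events starting at `offset`.
  def read(offset, limit = 10)
    @log[offset, limit] || []
  end
end
```

Because reads are just offsets into the log, adding a second consumer (or replaying from offset zero) requires no coordination with producers.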
00:14:27.200
So, if we visualize this, we imagine a shared stream in the middle of our services, with publishers at the top and consumers at the bottom. The stream could carry various types of data, such as click events on our website or user interactions with our app. For instance, we have a news feed feature akin to any other social application, where the characteristics differ notably from message delivery.
00:15:02.880
Using the best friends group as an example, users can like messages to show approval. When a message is liked, a push notification is sent to the original sender, and we keep a counter of likes. In the original implementation, this was done synchronously with costly database queries that soon became problematic as user numbers grew.
00:15:47.680
Due to performance issues and competition for database time with critical real-time features like message delivery, we sought to optimize. We introduced a separate service, the news feed service, to handle this task more efficiently. Using Apache Kafka for this service allowed us to publish user events as they occurred without straining the overall system.
00:16:54.480
Kafka is a distributed, high-throughput pub-sub messaging system developed by LinkedIn. Its design allows messages to be appended to a stream by producers that consumers can read from at their leisure. This architecture means we can publish user events to this event stream while letting the news feed service take responsibility for generating news feed items effectively.
00:17:45.680
One key advantage of using Kafka is that it supports multiple consumers. Consequently, messages can be delivered multiple times or replayed if needed, allowing for significant flexibility. Performance improves as services can process the data stream in parallel, providing us with the ability to create real-time analytical tools utilizing that data with minimal effort.
00:18:41.920
As we discuss the benefits of these approaches, scalability is paramount. This is a significant consideration when moving from a singular Rails application to smaller, more granular services. Each process consumes a specific amount of resources, and while we talk about infinite horizontal scalability, the reality is that constraints apply.
00:19:20.000
For example, database connections are valuable resources; overextending them can lead to limitations in performance and scalability. These smaller building blocks allow for targeted capacity increases, meaning if we need more message deliverers, we can add them without relying on a single monolithic application, providing more flexibility.
00:19:59.800
Agility is another significant advantage of service-oriented design. Independent deployment of updates across services means we can adjust without extensive coordination. Asynchronous services introduce a buffer in our network, which eases back pressure when necessary, decreasing the risk of cascading failures during service deployments.
00:20:44.100
This buffer allows for loose coupling between the services and makes data management more efficient. Teams can independently update services using shared queues or event streams, facilitating experimentation without complex ETL processes. Additionally, making data accessible across services promotes healthy data ownership that can mitigate silos of knowledge.
00:21:30.919
Back pressure is a crucial concept with third-party interactions. By limiting the rate at which we send messages, we align with partner delivery mandates and govern our rate across channels to prevent unintentional Denial of Service (DoS) situations. It’s essential to soak up bursts of traffic and distribute them evenly across our service network.
00:22:13.840
Implementing pacing in our EventMachine delivery workers allows us to manage outgoing message deliveries while adhering to defined throughput limits. For example, we can cap a delivery channel at a set number of jobs per second and back off automatically when failures occur.
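One common way to enforce a jobs-per-second cap is a token bucket; this is a sketch of that technique, not GroupMe's actual implementation. The clock is passed in explicitly so the behavior is deterministic (an EventMachine worker would do the equivalent with timers).

```ruby
# Token-bucket pacer: allows at most `rate` jobs per second, with
# bursts capped at `rate` tokens. Time is injected for determinism.
class Pacer
  def initialize(rate)
    @rate   = rate
    @tokens = rate
    @last   = nil
  end

  # Returns true if a job may be sent at time `now` (seconds),
  # consuming a token; false if the rate limit is currently exceeded.
  def allow?(now)
    @last ||= now
    # Refill tokens for elapsed time, capped at the bucket size.
    @tokens = [@tokens + (now - @last) * @rate, @rate].min
    @last = now
    return false if @tokens < 1
    @tokens -= 1
    true
  end
end
```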
00:23:04.960
However, we have operational challenges to consider. Our buffers cannot be infinite; memory is limited, so we monitor queue depth closely. We have had multi-hour outages in the past, but we've been fortunate: our two-hour time-to-live (TTL) helps prune stale messages before the buffers become a problem.
00:23:54.360
Interoperability is another benefit of the service-oriented approach. A recent example is our handling of messages received via a protocol called SIP, which includes a small messaging component. We set up a listener that normalizes incoming SIP messages and injects them into our Redis mailbox; all of the business logic stays in the consumer service.
00:24:40.240
Alongside amazing advantages, we face concerns that warrant attention. Monitoring our evolving infrastructure has been a critical task to understand how and when to intervene. Quality of service metrics tell us how long each initiated action takes to complete. For instance, we measure message delivery times from the moment a message enters our network until it's delivered to mobile devices.
00:25:25.040
We closely graph and evaluate these latency metrics with an awareness of how network changes affect performance. Additionally, we look at queue sizes, but particularly for Redis, non-zero metrics don't inherently indicate issues. We use queue size data in conjunction with quality of service metrics to gauge overall system health.
00:26:10.880
Delivery guarantees are particularly important: how many times a given message will be delivered is a property of our system design. The usual options are at-most-once, exactly-once, and at-least-once, and which one we aim for directly shapes how we construct each service, especially where side effects are involved.
00:27:39.520
Interface synchronization also becomes essential as multiple services align around shared messages. We've found that explicit Ruby value objects that encapsulate messages help us control and evolve those interfaces. By sharing them as a library dependency, we ensure producers and consumers are tested and kept in sync as formats change.
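A value object of this kind might look like the following sketch: one class owns the wire format for a message, so serialization lives in a single, testable place shared by producer and consumer. The class and field names are illustrative, not GroupMe's actual schema.

```ruby
require 'json'

# Explicit value object for a delivery message. Both the producing
# application and the consuming service serialize through this one
# class instead of building ad hoc JSON hashes.
class DeliveryMessage
  attr_reader :group_id, :sender_id, :text

  def initialize(group_id:, sender_id:, text:)
    @group_id  = group_id
    @sender_id = sender_id
    @text      = text
  end

  def to_json(*)
    JSON.generate(group_id: group_id, sender_id: sender_id, text: text)
  end

  def self.from_json(payload)
    data = JSON.parse(payload, symbolize_names: true)
    new(**data)
  end
end
```

Shipping this class as a shared gem lets both sides run the same round-trip tests, so a format change that breaks a consumer fails in CI rather than in production.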
00:28:30.080
It's essential to address failure modes of components early on during the design phase. For instance, we think of what happens when critical services like Redis become momentarily unreachable. Do we return errors to users? Do we accept the message and buffer it in-memory? An explicit grasp of these situations allows us to build resilience and robustness into our messaging system.
00:29:43.520
Durability is integral. This means balancing the trade-offs when a message is lost and addressing performance concerns during those periods. By default, systems like Redis can be less durable unless configured carefully, and while Kafka offers an efficient medium for messages, it lacks built-in acknowledgments to confirm receipt. Hence, we're deliberate about reducing the chance of loss through structured, thoughtful service design.
00:30:32.320
As we reflect on our transformation, we've gained many insights over this journey. For example, it's possible to extend the life of a monolithic application well beyond initial expectations. Many well-known businesses succeeded with a monolithic approach and transitioned to services only when necessary, which helps manage complexity without rushing into premature architecture.
00:31:34.000
Messaging is an effective tool in service-oriented designs, providing an alternative means of inter-service communication. Background work is a prime candidate for service extraction: operations that are already asynchronous lend themselves to extraction far more readily than synchronous tasks forced into a service model.
00:32:55.880
Targeting your existing asynchronous tasks for extraction is a sensible place to begin. Making thoughtful extractions promotes scalability, robustness, and overall operational benefits. However, with every transition to services, a new set of challenges arise, including communication across the network, synchronized business logic, and resilience in the face of potential service failures.
00:34:02.360
Be explicit about failure cases in the design process, and remember that the system must bend rather than break. We must account for the inevitability of service failures in chaotic networks, improving our systems incrementally. Measuring performance has illuminated latencies that would otherwise go unnoticed, allowing us to build more resilient systems. Often, lessons learned through failures are irreplaceable drivers of development.
00:35:45.120
We've chosen smaller building blocks, such as Redis and Kafka, over traditional messaging brokers. This choice has led to greater flexibility, as we can leverage each tool's strengths directly without conforming to a more complex messaging broker's architecture.
00:36:34.360
In our experience, we've learned to focus on the interfaces between our systems—how they communicate and how messages should be organized. These shape the core of our service design. This will lead to reduced integration headaches, ensuring stable and predictable service-oriented systems.
00:37:18.960
Ad hoc JSON serialization is a significant source of errors; small, distinct value objects fortify communication between services, and shared Ruby libraries tested against both producers and consumers further strengthen that stability.
00:37:55.840
Ultimately, there are many ways to accomplish these objectives. Our approach with services has adapted to our systems, but it's worth noting that results will vary, and each team must find their own suitable way forward when tackling similar challenges.
00:38:37.600
That's all I have! Thank you so much for your time and attention.