So You’ve Got Yourself a Kafka: Event-Powered Rails Services

00:00:10 Hello, everybody! We're going to get started. I hope that you're here to listen to me talk about Kafka because that's the topic for today.

00:00:19 My name is Stella Cotton, and I’m an engineer at Heroku. As I mentioned, I will be discussing Kafka today.

00:00:27 Many of you might have heard that Heroku offers Kafka as a service. We have various hosted plans, ranging from small to large capacities.

00:00:40 There's a dedicated engineering team working on making Kafka run efficiently on Heroku at super high capacities. However, I am not part of that team.

00:00:54 I have a different background and I'm not an expert in running or tuning Kafka clusters for high loads.

00:01:05 Like many of you, I am a regular Rails engineer. When I joined Heroku, I wasn't familiar with Kafka at all.

00:01:17 It seemed like a mysterious technology that suddenly became ubiquitous, with numerous hosted Kafka solutions appearing, not just on Heroku but also from other providers.

00:01:30 The importance of Kafka became apparent to me upon joining Heroku, where it plays an integral role in our product architecture.

00:01:41 Today, I’d like to cover three main areas, which I hope will help other Rails engineers become more familiar with Kafka.

00:01:55 Firstly, we will explore what Kafka is, followed by how it can enhance your services, and finally, I’ll touch on some practical considerations and challenges that I faced when starting to use event-driven systems.

00:02:07 So, what exactly is Kafka? According to the documentation available at Apache.org, Kafka is described as a distributed streaming platform.

00:02:14 However, this definition may not resonate with web developers. One of the classic use cases for Kafka is data flow management.

00:02:25 For instance, if you're running an e-commerce website, you’ll want insights into user behavior, such as tracking each page visited and each button clicked.

00:02:38 In scenarios with a high user volume, this can generate substantial data. To transmit this data from web applications to a data store utilized by your analytics team, you need an efficient solution.

00:02:50 One effective way to manage this high volume of data is to utilize Kafka, which is powered by the concept of an append-only log.

00:03:03 When we talk about logs, web developers typically think of application logs. These logs record notable events in chronological order, with each record appended to the previous one.

00:03:14 Once a record is written to the log, it remains indefinitely, only being truncated when necessary. Kafka operates similarly as an append-only log.

00:03:25 In Kafka, you have producers, which are applications that generate log events. You can have one producer or multiple producers contributing to the same log.

00:03:40 Unlike application logs, where entries are primarily intended for human consumption, in Kafka, the consumers are applications that read these events.

00:03:55 You can have a single consumer or many, but a critical distinction with Kafka is that the events remain available for all consumers until the specified retention period expires.

00:04:10 Let’s return to our e-commerce app example. Each time a user takes action on the platform, we create an event. We write these events to what Kafka refers to as a user event log, also known as a log topic.

00:04:30 Multiple services can write events to this user event topic. Each Kafka record that is created consists of a key, a value, and a timestamp.

00:04:45 Kafka does not perform any validation on the data; it merely transmits the binary data irrespective of its format. While JSON is a common format, Kafka supports various data structures.

00:04:59 Communication between clients writing to Kafka and those reading from it occurs via a persistent TCP socket connection.

00:05:10 This persistent connection avoids the overhead of a TCP handshake for every single event, which enhances performance.

00:05:21 Now, how do our Rails apps interact with the Kafka cluster? There are a few Ruby libraries available for this purpose. Ruby Kafka is a lower-level library designed for producing and consuming events.

00:05:39 It offers significant flexibility, but it entails more configuration.

00:05:45 For a simpler interface with less boilerplate, options like Delivery Boy, Racecar, and Phobos are available. These are maintained by Zendesk and serve as wrappers around Ruby Kafka.

00:05:57 There is also a different standalone library called Kafka, which similarly has a wrapper named Waterdrop, built upon the same implementation used by Delivery Boy.

00:06:12 My team utilizes a custom gem that predates Ruby Kafka. Hence, I haven’t employed these libraries in production, but Heroku recommends the use of the Kafka gem in our Dev Center documentation.

00:06:27 In this discussion, I’ll provide a brief overview of how to use Delivery Boy.

00:06:40 To publish events to our Kafka topic, we start by installing the gem and running a generator that creates a config file, familiar to those who have worked with databases.

00:06:50 The goal of Delivery Boy is to facilitate swift setup without requiring extensive configuration, except for specifying the brokers' address and port.

00:06:59 Once that configuration is done, we can write an event to our user event topic outside the thread of our web execution.

00:07:08 With this, we can specify the topic name, such as 'user event.'

00:07:15 Each of these topics can consist of one or more partitions. In general, it’s advisable to have at least two partitions to allow for scaling as more events are written.

00:07:31 Multiple services can write to the same partitions, and it is the producer's responsibility to determine which partition will receive a specific event.

00:07:47 Delivery Boy can help distribute events across those partitions randomly or based on a specific key.

00:08:01 Using a partition key ensures that specific types of events are always directed to the same partition, enabling Kafka to maintain order within that partition.

00:08:12 For instance, to keep all user events related to a specific user in the same partition, you would pass the user ID as the key.

00:08:25 Kafka uses a hashing function to map this key to an integer, ensuring that events consistently arrive in the same order as long as the total count of partitions remains unchanged.

00:08:40 However, challenges arise when you increase the number of partitions. Now that we are writing events to our user topic, what else needs to be done?

00:08:57 If you are operating a Kafka cluster, you can also utilize another gem called Racecar. This gem is another wrapper around Ruby Kafka and has minimal upfront configuration.

00:09:09 The configuration file created by Racecar looks quite similar to Delivery Boy's, but it will also generate a folder of consumers within your application.

00:09:20 An event consumer subscribes to the user event topic, using the subscribe method, and prints out all the data returned.

00:09:38 This consumer code will operate in its own process, separate from the web application execution.

00:09:44 Racecar creates a group of consumers, which are a collection of one or more Kafka consumers.

00:09:52 Each consumer in the group reads from the user event topic, with each assigned its own partitions. They maintain their position in those partitions using an offset, similar to a digital bookmark.

00:10:07 The significant advantage here is that if one consumer fails, the topics get reassigned to the other consumers, ensuring high availability as long as one consumer is running.

00:10:16 Having discussed the underlying structure of Kafka, let's delve into the technical features that underscore its value. Kafka is renowned for its ability to manage extremely high throughput.

00:10:38 One notable characteristic is that Kafka doesn't rely on a message broker to track consumers. Unlike traditional enterprise queuing systems like AMQP, Kafka's event infrastructure directly manages consumer tracking.

00:10:57 As the number of consumers grows, this decreases the load on the infrastructure, which can be a performance boost.

00:11:06 Another point of complication in traditional systems is acknowledging message processing, as it can be difficult for the broker to confirm messages received by consumers.

00:11:20 With Kafka, the responsibility of tracking the position lies with the consumers themselves. This results in constant time, O(1), performance when reading and writing data.

00:11:32 Consumers have the flexibility to read from a specified offset, start at the beginning, or read all events from the end. The performance does not degrade with the volume of data.

00:11:45 Kafka boasts horizontal scalability, allowing it to run as a cluster across multiple servers. Its architecture is both reliable and scalable, with data written to disk and replicated across brokers.

00:12:02 For additional details on data distribution and replication, I recommend checking out a comprehensive blog post on the subject.

00:12:10 For real-world context, companies like Netflix, LinkedIn, and Microsoft send over a trillion messages per day through their Kafka clusters.

00:12:23 While Kafka effectively handles data transfer, we're primarily focused on the context of services, and there are aspects that make Kafka beneficial in service-oriented architectures.

00:12:34 Kafka can serve as a fault-tolerant replacement for RPC, or remote procedure calls. In simple terms, RPC involves one service communicating with another using an API call.

00:12:49 Consider the example of an e-commerce site; when a user places an order, the order service's API is triggered, creating an order record, charging the credit card, and sending out a confirmation.

00:13:04 In a monolithic architecture, this typically involves blocking execution during the API call with potential reliance on background jobs such as Sidekiq.

00:13:17 As systems scale in complexity, you might extract functionalities into different services, using RPC for communication. However, challenges arise when these systems become more intricate.

00:13:33 In RPC systems, the upstream service must account for the availability of downstream services. For instance, if the email service fails, the upstream service must either retry the API calls or handle the failures gracefully.

00:13:47 In an event-oriented architecture, the upstream service (the Orders API) publishes an event through Kafka indicating that an order was created.

00:14:05 Thanks to Kafka's 'at least once' delivery guarantee, this event remains available for downstream consumers, allowing them to process it when they are back online.

00:14:19 This separation allows for greater independence in your architecture, albeit at the cost of clarity since the connection between services becomes less direct.

00:14:34 As you grow comfortable with Kafka and its trade-offs, you'll find opportunities to incorporate events into your service architecture.

00:14:49 Martin Fowler emphasizes that event-driven applications can vary significantly in their architecture, which can lead to confusion during discussions about challenges and trade-offs.

00:15:05 He outlines several architectural patterns frequently associated with event-driven systems, which I will summarize quickly, but further information can be found on his website.

00:15:21 The first pattern is known as event notification, the simplest form of event-driven architecture, where one service merely alerts downstream services that an event occurred.

00:15:37 Events contain minimal information, effectively triggering the downstream service to make a separate network call for additional context.

00:15:50 The second pattern is called event-carried state transfer, where the upstream service enriches the event with additional data, allowing the downstream consumer to maintain its own local state.

00:16:02 This works well as long as the downstream service requires no additional context that isn't supplied by the upstream service. If more data is needed, a network request might be necessary.

00:16:16 The third pattern is event-sourced architecture, which involves storing every event and mechanisms to replay them to recreate the application's state without traditional databases.

00:16:29 Fowler highlights the complexity of using audit logs or high-performance trading systems where an application state might demand that every interaction is captured.

00:16:40 There are additional challenges that arise from relying on event sourcing, particularly when dealing with changes in business logic that affect historical calculations.

00:16:54 Fowler's final architectural pattern addresses the principle of command-query responsibility segregation (CQRS), which is often discussed in the context of event-driven architectures.

00:17:10 CQRS involves separating write and read functionality, where services handling data modifications are distinct from those querying it, which can enhance performance in systems with significant read-write disparities.

00:17:22 While this segregation adds complexity, it can optimize performance effectively, especially when there’s a significant difference between reads and writes.

00:17:36 I encourage you to explore Fowler's work further for a deep understanding of the various trade-offs and considerations associated with these architectural patterns.

00:17:52 Having discussed what Kafka is and how it can transform service communication, let’s now explore practical considerations that can arise when integrating events into Rails applications.

00:18:01 This list isn't exhaustive, but I'll focus on two key components that surprised me when I began using Kafka in my projects.

00:18:18 The first consideration is slow consumers. In an event-driven system, it’s crucial for your service to keep pace with the upstream services producing events.

00:18:31 If consumers lag behind, the system can become slow, leading to a cascading effect where timeouts occur and overall system latency worsens.

00:18:47 One potential issue is with socket connections to the Kafka brokers; if events aren't processed promptly and the round trip isn’t completed, your connection may time out.

00:19:03 In turn, re-establishing connections can add further delays to an already sluggish system.

00:19:19 To speed up the consumer, one effective approach is to increase the number of consumers within your consumer group. This enables parallel processing of events.

00:19:33 In Racecar, you can initiate a consumer process by specifying a class name, such as UserEventConsumer. By running multiple instances of this class, you create separate consumers that join the same consumer group.

00:19:49 It's important to have at least two consumers in place, ensuring that if one goes down, another can take over processing.

00:20:07 Moreover, scaling is contingent on your topic partitions; therefore, the number of consumers should correlate with the number of available partitions.

00:20:15 However, it's important to note that you cannot infinitely scale consumers. Shared resources like databases inevitably present constraints.

00:20:30 Metrics and alerts are vital. Tracking how far behind your consumers are is essential for maintaining optimal system performance.

00:20:50 Ruby Kafka is instrumented with Active Support notifications, and it also supports StatsD and Datadog, providing useful reporting on consumer performance.

00:21:03 Another strategy I employ is to run failure scenarios on our staging environment before launching new services. This involves temporarily moving our offset back to simulate downtime.

00:21:20 This exercise helps gauge how quickly we can catch up after an outage, which is crucial for ensuring service reliability during disruptions.

00:21:36 Furthermore, when designing a consumer system, understanding the implications of ‘exactly once’ versus ‘at least once’ processing is vital.

00:21:53 With newer versions of Kafka, it is possible to design systems that guarantee exactly once processing; however, with the current setup in Ruby Kafka, you should assume at least once delivery.

00:22:05 This duality means that consumers must be designed to handle duplicate events gracefully. Implementing idempotency in your database can help manage this.

00:22:20 You can include a unique identifier in each event to prevent processing duplicates, leading to a more resilient application architecture.

00:22:30 The second significant surprise for me was Kafka's permissive approach to data types. You can send any byte format, and Kafka won't perform any verification.

00:22:42 This flexibility can be a double-edged sword; if an upstream service changes the event it produces, it could break downstream consumers unexpectedly.

00:23:03 It’s important to choose a consistent data format before implementing events in your architecture. It helps reduce potential issues in the future.

00:23:15 Schema registries can aid in managing your data formats, providing a centralized overview and facilitating the evolution of schemas over time.

00:23:31 While JSON is widely recognized and human-readable, it can result in larger payloads due to the requirement of sending keys alongside values.

00:23:48 Conversely, Avro, while less human-readable, offers benefits such as robust schema support and the ability to evolve data formats in a backward-compatible manner.

00:24:02 Avro not only supports primitive and complex types but also includes embedded documentation to clarify the purpose of each field, which is advantageous for maintaining clarity.

00:24:12 If you're interested in Avro, there’s an excellent blog post from the Salsify engineering team that provides a deep dive into its usage.

00:24:22 That's it for my talk today! However, I have a couple more slides to discuss, so please bear with me.

00:24:35 If you're eager to learn more about how Heroku manages its hosted Kafka products or its internal usage, we have two talks given by my colleagues Jeff and Paolo.

00:24:48 Additionally, my coworker Gabe will be giving a talk on PostgreSQL-related performance regarding Kafka in the next session, so be sure to check it out.

00:25:03 I’ll be at the Heroku booth after this talk and during lunch. Feel free to come by to ask me questions, grab some swag, and check out available job openings.