I Can't Believe It's Not A Queue: Using Kafka with Rails

by Terence Lee

In the talk "I Can't Believe It's Not A Queue: Using Kafka with Rails" given by Terence Lee at RailsConf 2016, the focus is on integrating Apache Kafka into Ruby on Rails applications, particularly highlighting its advantages over traditional message queues. The foundational concepts of Kafka are discussed, explaining its role as a distributed, partitioned, and replicated commit log service designed for high performance and scalability.

Key points discussed in the video include:

  • The Basics of Kafka: Kafka operates on a pub/sub messaging model, making it a possible substitute for traditional messaging systems such as AMQP brokers or Redis pub/sub. It can process hundreds of thousands to millions of messages per second.

  • Architecture Insights: Kafka is designed for durability, with features like commit logs that allow services to replay data after failures. This resilience is crucial for maintaining data integrity in applications that gather metrics or logs.

  • Kafka Vocabulary and Components: Terms like producers, consumers, and topics are defined. Producers send messages to topics, while consumers read from them, with partitions allowing for load distribution across brokers. Keying messages by identifiers ensures related data stays together, enhancing processing efficiency.

  • Integration with Rails: The presentation outlines the setup of Kafka in a Rails environment, including producer and consumer configurations. Terence demonstrates how to create a simple Rails application that streams order events to Kafka and consumes logs generated from a Heroku application for analysis and metrics extraction.

  • Practical Examples: Notably, the talk references how firms like Skylight and Heroku utilize Kafka for application performance monitoring and event tracking. The versatility of Kafka is further illustrated through its applications in financial messaging and user activity tracking.

  • Heroku Kafka: The talk touches on Heroku’s Kafka offering, detailing how it can be integrated with existing applications easily. It includes instructions on how to create topics and monitor cluster health through the CLI.

In conclusion, Terence Lee emphasizes Kafka's potential as a critical infrastructure tool for Rails applications aiming to scale. Its durability, fault-tolerance, and ability to handle high throughput make it an asset for developers looking to improve their messaging systems. The session encourages the audience to consider Kafka for their applications, highlighting its practical benefits and real-world applications in various industries like finance and e-commerce.

Overall, the talk serves as a comprehensive introduction to using Kafka alongside Rails, equipped with actionable insights and examples to facilitate its integration into real-world applications.

00:00:09.889 Welcome to the Kafka talk. Just for your information, this is a sponsored talk, so if you're looking for talks by other speakers, you might want to check those out.
00:00:15.000 Today is May 4th, so happy Star Wars Day! I didn't bring my Jedi robes, but I do own them, and they're quite awesome.
00:00:22.800 I'm Terence Lee, and if you haven't heard of me, I go by the Twitter handle @hone02. Feel free to tweet at me if you have questions about the talk!
00:00:32.090 I'm known for my blue hat, which I've been wearing for over a decade. It's a bit worn, but I have blue hat stickers that look much more pristine, so if you'd like one, come talk to me!
00:00:45.600 Additionally, I've been helping with a Ruby karaoke movement. PJ and I have organized many events, and for the last three years, I've managed to attend a karaoke event at every Ruby conference.
00:00:56.480 Karaoke is a lot of fun, and if you're too intimidated to sing, you can just enjoy the atmosphere. Many people join karaoke as their first experience!
00:01:04.409 We'll be doing karaoke again tomorrow at 7 p.m. at Hotkey, so I hope to see you there! It will be a deafening experience. Watching Charlie Nutter sing Shakira is something you won't want to miss.
00:01:16.080 I work for Heroku, which is why this is a sponsored talk. I collaborate with Richard Schneeman on the Ruby experience, but let's move away from the introductions and talk about Kafka.
00:01:27.900 I assume most of you are here because you have heard of Kafka. You may not entirely know what it is, but there's definitely some buzz about it, and you might be wondering what you can do with this technology.
00:01:38.189 In this talk, we'll go through a short introduction to Kafka, covering topics and vocabulary related to what Kafka is.
00:01:44.130 We'll discuss how you can use Kafka with Ruby, and I built a demo app specifically for this talk that we'll go through in detail. My coworker and friend Tom will also cover other use cases of Kafka beyond the demo.
00:01:56.200 So, what is Kafka? If you visit the Apache Kafka documentation, the first paragraph states that Kafka is a distributed, partitioned, replicated commit log service.
00:02:01.439 It provides the functionality of a messaging system, but with a unique design. When I first read this as I was learning about Kafka, it left me quite confused, as I didn't fully understand what it meant.
00:02:14.890 At a high level, Kafka is essentially a pub/sub messaging system. If you're familiar with Redis and have used its pub/sub features or AMQP, you can think of Kafka in that context.
00:02:27.370 Kafka is distributed, which we will see is an important aspect as we progress through this talk. It's designed with guarantees of being fast, scalable, and durable.
00:02:32.740 Regarding performance, Kafka can process hundreds of thousands to millions of messages per second on a small cluster.
00:02:40.420 In contrast, if you think about AMQP, you're likely only talking about tens of thousands of messages per second on a single node.
00:02:49.550 It's a significant difference in performance when we consider Kafka as a pub/sub message bus. For instance, you may have used Skylight, which is a performance monitoring product from the Tilde folks.
00:03:03.240 As you're building that type of application, scalability is crucial, and Kafka is central to their architecture. One of their core tenets is durability.
00:03:10.480 It's common to think that you need to restart processes due to crashes, but Kafka allows you to maintain a commit log.
00:03:14.890 This means that if your services go down, you can replay from the last committed state and not lose any data — an important aspect when relying on a service for metrics.
00:03:22.010 I’ve talked with the Skylight team, and they've been very satisfied with Kafka's performance in their application architecture.
00:03:27.380 Now that we've discussed that, let's step back and go over some vocabulary you will encounter while dealing with Kafka, starting with producers and consumers.
00:03:36.560 If you’re familiar with pub/sub systems, you might know that producers are the part of your Rails app that generates the work to be done, while consumers are the workers that process the jobs.
00:03:52.580 In the Kafka context, producers generate messages, which are written to a Kafka topic, and consumers process those messages.
00:03:59.690 Inside a Kafka cluster, you have multiple nodes known as brokers. The core unit of data in Kafka is called a message, which is really just a byte array.
00:04:07.030 This means it can represent any type of data. You might use strings, JSON, or any other structured format as the message value.
00:04:12.370 You can think of topics in Kafka as streams of messages that you can subscribe to. Topics organize messages related to specific data.
00:04:20.240 Additionally, partitions allow high throughput by letting you distribute the load across multiple brokers — a necessary feature when dealing with millions of messages.
00:04:30.320 Each partition guarantees order and immutability of the messages.
00:04:38.030 Once written, you can only add to the end of a partition; you cannot alter or reorder these messages.
00:04:44.240 Kafka allows you to uniquely identify a message inside its partition by an offset number, which indicates its position within the partition.
00:04:52.230 You can seek any message by its offset, which is handy for consumers wanting to replay specific segments of the stream.
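As a rough illustration of the append-only partition and offsets just described, here is a minimal Ruby sketch. It is a toy in-memory model, not how Kafka stores data, and the `PartitionLog` class and its methods are invented for this example.

```ruby
# A toy, in-memory model of one partition: an append-only array where
# a message's offset is simply its index. Not Kafka's on-disk format.
class PartitionLog
  def initialize
    @messages = []
  end

  # Appends to the end of the log and returns the new message's offset.
  def append(value)
    @messages << value.dup.freeze
    @messages.size - 1
  end

  # Returns every message from the given offset onward, in order.
  def read_from(offset)
    @messages.drop(offset)
  end
end

log = PartitionLog.new
log.append("order created") # offset 0
log.append("order paid")    # offset 1
log.append("order shipped") # offset 2

# A consumer that wants to resume at offset 1 replays from there.
replayed = log.read_from(1)
```

Because the log only grows at the end, an offset keeps pointing at the same message, which is what makes replay possible.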
00:05:01.850 You can optionally key messages, which is important because keying ensures that messages with the same key go to the same partition.
00:05:09.560 For example, if you were to key messages by user ID, all messages related to that user would reside in the same partition.
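To make the keying idea concrete, here is a small Ruby sketch that maps a key to a partition by hashing it and taking the result modulo the partition count. The choice of CRC32, the `partition_for` helper, and the sample events are assumptions for illustration; a real client may use a different hash.

```ruby
require "zlib"

# Maps a message key to a partition by hashing the key and taking it
# modulo the partition count. CRC32 is an illustrative choice of hash.
def partition_for(key, partition_count)
  Zlib.crc32(key.to_s) % partition_count
end

events = [
  { user_id: 42, action: "login" },
  { user_id: 7,  action: "login" },
  { user_id: 42, action: "checkout" }
]

# Group the events by the partition their key maps to.
by_partition = events.group_by { |event| partition_for(event[:user_id], 4) }
```

Both events for user 42 land in the same group, so a single consumer sees that user's events in order.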
00:05:15.500 When we talk about consumers, we also refer to consumer groups. Consumer groups allow multiple consumers to work together on processing messages from the same topic.
00:05:27.380 Kafka guarantees that each message will be delivered at least once to a consumer group that is actively processing messages.
00:05:35.780 If some consumers in the group go down, Kafka reassigns their partitions to the consumers that are still active, so you have a flexible and fault-tolerant system.
00:05:49.230 In a typical Kafka cluster, you may have multiple partitions and various consumer groups handling the same streams of data. This means you can split the workload dynamically.
00:05:59.520 Similarly, there's a heartbeat system in Kafka that helps detect when consumers are not available, allowing Kafka to redistribute partitions among the available consumers.
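The rebalancing behavior can be sketched in a few lines of Ruby. This is a simplified round-robin model, not Kafka's actual group coordination protocol, and `assign_partitions` and the worker names are hypothetical.

```ruby
# Spreads partitions over the live consumers round-robin.
# Returns a hash of consumer name => array of partition numbers.
def assign_partitions(partitions, consumers)
  raise ArgumentError, "need at least one consumer" if consumers.empty?

  assignment = consumers.each_with_object({}) { |name, hash| hash[name] = [] }
  partitions.each_with_index do |partition, index|
    assignment[consumers[index % consumers.size]] << partition
  end
  assignment
end

partitions = [0, 1, 2, 3]
before = assign_partitions(partitions, ["worker-1", "worker-2"])
# worker-2 stops sending heartbeats, so the group rebalances.
after = assign_partitions(partitions, ["worker-1"])
```

Each partition is owned by exactly one consumer in the group at a time, so adding or removing consumers only changes who owns which partitions.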
00:06:09.500 We've just skimmed the vocabulary of Kafka, and the concepts will become clearer as we dive into the demo. There are a couple of main Ruby libraries for interacting with Kafka.
00:06:19.740 The first one is JRuby Kafka, which has been around the longest. It serves as a wrapper around official Kafka libraries for the Java API.
00:06:27.990 If you're using JRuby Kafka, you'll need to get comfortable reading Java API documentation, which can be a bit of a learning curve.
00:06:39.460 However, if you're using a standard Ruby application, there's the ruby-kafka library developed by Zendesk.
00:06:46.350 It's a newer project but features a very clean interface, which I find quite impressive. Older libraries like Poseidon don't support consumer groups or the newer Kafka APIs.
00:06:57.990 For example, creating a simple producer in Kafka is quite similar to setting up Redis. However, you need to specify seed brokers.
00:07:03.370 You don't need to know all the brokers in your cluster—just the seeds, and Kafka will figure out the rest of the nodes.
00:07:15.670 Once you've set up the producer, you can produce messages asynchronously, giving you the ability to handle them efficiently.
00:07:22.820 If delivery fails, the producer raises an exception that you can catch and handle.
00:07:29.480 This is an important point to consider when you're trying to ensure messages are delivered at least once, which is crucial during Rails requests.
00:07:38.030 The library also supports a thread-safe async producer, which hands messages to a background thread through a queue.
00:07:44.480 You can set parameters to control how frequently messages are delivered, either by count or by time interval.
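The count-or-interval delivery rule can be modeled with a small buffer. This is a sketch of the batching idea only, written without any Kafka client; the `BatchingBuffer` class, its parameters, and the injectable clock are assumptions made for this example.

```ruby
# A toy buffer that flushes when either a message count or a time
# interval is reached. It models the batching idea only; a real async
# producer does this work on a background thread.
class BatchingBuffer
  attr_reader :flushed_batches

  def initialize(threshold:, interval:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @threshold = threshold
    @interval = interval
    @clock = clock
    @buffer = []
    @flushed_batches = []
    @last_flush = @clock.call
  end

  # Buffers a message and flushes if either limit has been reached.
  def produce(message)
    @buffer << message
    flush if @buffer.size >= @threshold || @clock.call - @last_flush >= @interval
  end

  # Records the buffered messages as one batch and resets the timer.
  def flush
    @flushed_batches << @buffer unless @buffer.empty?
    @buffer = []
    @last_flush = @clock.call
  end
end

buffer = BatchingBuffer.new(threshold: 3, interval: 10)
5.times { |i| buffer.produce("message #{i}") }
buffer.flush # deliver whatever is left, as you would on shutdown
```

A lower threshold or shorter interval reduces latency at the cost of more, smaller deliveries.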
00:07:53.470 As we've discussed, messages can consist of various formats, including JSON. You can convert hashes to JSON for processing.
00:08:01.290 In your Rails application, you set up an initializer in the config directory to configure Kafka.
00:08:07.690 Pass in the Rails logger for proper log distribution and instantiate an async producer.
00:08:13.150 Finally, when you shut down the Rails server, make sure to shut down the Kafka producer as well to avoid resource leakage.
00:08:19.950 Once that's done, your controller can easily create an event stream of all orders, inserting this data efficiently into Kafka.
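A controller-side helper for the order event stream might look roughly like the sketch below. The `OrderEvents` module, the `orders` topic name, and the order fields are hypothetical; the only thing assumed about the producer is that it responds to `produce(value, topic:, key:)`, which is the shape of ruby-kafka's producer call.

```ruby
require "json"

# Hypothetical helper: serializes an order as JSON and hands it to a
# producer. `producer` is assumed to respond to
# produce(value, topic:, key:), as ruby-kafka producers do.
module OrderEvents
  TOPIC = "orders" # assumed topic name

  def self.publish(producer, order)
    payload = JSON.generate(
      id: order.fetch(:id),
      user_id: order.fetch(:user_id),
      total_cents: order.fetch(:total_cents)
    )
    # Keying by user keeps one user's orders in a single partition.
    producer.produce(payload, topic: TOPIC, key: order.fetch(:user_id).to_s)
  end
end
```

In a Rails app the producer would typically be built once in the initializer and passed in from the controller action that creates the order.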
00:08:31.220 Now that we know how to produce messages to Kafka, let's discuss how to consume the data.
00:08:39.900 The consumer API is vital, especially the consumer group API, which allows your application to manage message consumption effectively.
00:08:46.829 You initiate a connection with Kafka in a similar fashion to producers, passing in the seed brokers. Specify the group ID for your consumer group.
00:08:58.110 Subscribe to the relevant topics, like a greetings topic, to start processing messages, which can be done with a simple iteration in Ruby.
00:09:13.000 Iterating over the messages blocks the thread, continuously polling for new messages until instructed otherwise.
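The consumer loop can be sketched as a small helper that takes the consumer as an argument. `process_greetings` is a hypothetical name; the consumer is assumed to respond to `subscribe(topic)` and `each_message`, and each message to expose `value`, `partition`, and `offset`, which matches ruby-kafka's consumer as far as I know.

```ruby
# Hypothetical helper around a consumer-group consumer. `consumer` is
# assumed to respond to subscribe(topic) and each_message; each message
# is assumed to expose value, partition, and offset.
def process_greetings(consumer, topic: "greetings")
  consumer.subscribe(topic)
  # each_message blocks, yielding messages as they arrive.
  consumer.each_message do |message|
    yield "#{message.partition}/#{message.offset}: #{message.value}"
  end
end
```

With a real client, the consumer would come from something like `kafka.consumer(group_id: "my-group")`, and the block would run until the process is told to stop.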
00:09:22.770 Kafka 0.9 introduced SSL support to encrypt messages over transport, which is supported natively by the Ruby Kafka client.
00:09:31.810 To demonstrate how to use Kafka for processing metrics from web traffic, imagine monitoring a Heroku app.
00:09:42.310 Heroku can deliver an app's logs over HTTP, allowing you to create a log drain that forwards data into Kafka.
00:09:54.490 On the opposite end, another Ruby app can serve as a consumer that processes and stores this data efficiently.
00:10:01.600 In this example, we are primarily focusing on how to extract and handle that log data through our consumer.
00:10:09.790 For instance, we use Heroku logs to pull router logs, which reveal essential information like path, connection time, service time, and status code.
00:10:18.990 This information helps in understanding application performance and metrics.
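Extracting those fields from a router log line can be sketched with a regular expression over the `key=value` pairs. The sample line is made up for illustration, and `parse_router_line` and the output field names are assumptions rather than code from the demo.

```ruby
# Parses the key=value pairs of a Heroku router log line into a hash
# of the fields discussed above. Missing numeric fields become 0.
def parse_router_line(line)
  fields = line.scan(/(\w+)=("[^"]*"|\S+)/).to_h
  fields.transform_values! { |v| v.delete_prefix('"').delete_suffix('"') }
  {
    path: fields["path"],
    connect_ms: fields["connect"].to_i, # "1ms".to_i => 1
    service_ms: fields["service"].to_i,
    status: fields["status"].to_i
  }
end

line = 'at=info method=GET path="/orders" host=example.herokuapp.com ' \
       'dyno=web.1 connect=1ms service=18ms status=200 bytes=512'
metrics = parse_router_line(line)
```

The resulting hash is small enough to serialize as JSON and use as a Kafka message value.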
00:10:26.400 By setting up a log drain, we can forward these logs to another app and format requests to capture relevant data.
00:10:39.470 Processing these log messages is essential, and you can leverage the syslog format for consistent parsing.
00:10:48.230 Using a Ruby gem for processing syslogs enables us to efficiently handle incoming data into Kafka.
00:10:55.140 In building our Sinatra app, we respond to incoming requests with a confirmation that we received the log data.
00:11:02.660 Within this app, we need to maintain Kafka connections efficiently. Since Kafka connections are not thread-safe, we should create a connection pool.
00:11:10.020 Then, we can produce log messages to Kafka, ensuring efficient handling and batching based on thresholds.
00:11:19.890 This allows us to minimize resource usage while ensuring the delivery of log data into Kafka.
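The connection pool idea can be illustrated with Ruby's thread-safe `Queue`. This is a minimal sketch, not the pooling code used in the demo; `SimplePool` is a made-up class, and plain arrays stand in for Kafka producers.

```ruby
# A minimal fixed-size pool built on Ruby's thread-safe Queue.
# The block passed to `new` builds each pooled object; for Kafka that
# would be a producer, here any object works.
class SimplePool
  def initialize(size:)
    @available = Queue.new
    size.times { @available << yield }
  end

  # Checks out an object for the duration of the block, then returns it
  # to the pool even if the block raises.
  def with
    resource = @available.pop # blocks until one is free
    yield resource
  ensure
    @available << resource if resource
  end
end

pool = SimplePool.new(size: 2) { [] } # arrays stand in for producers
pool.with { |buffer| buffer << "log line 1" }
```

Each request thread borrows one object at a time, so no two threads ever use the same connection concurrently.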
00:11:27.670 Switching gears to Heroku's Kafka offering, we recently launched a public beta. It's designed to simplify access to the Heroku Kafka cluster.
00:11:39.830 Accessing Kafka on Heroku is just like any other add-on; you can create a Kafka cluster easily.
00:11:48.250 Once your cluster is set up, you'll receive a URL for your Kafka cluster to facilitate easy communication.
00:11:56.280 In the CLI, however, you’ll need to install a Kafka plugin to perform operations like creating topics.
00:12:07.250 Creating topics in a production environment is essential; for instance, you’ll want to create a router topic for your application.
00:12:14.740 You can use the info command to inspect your Kafka cluster's health and traffic details.
00:12:25.170 Additionally, the log tail feature allows you to monitor traffic to and from your topic in real time.
00:12:35.270 Now that we can produce and monitor messages, we can look into how to consume data effectively.
00:12:42.780 This will include setting up a metrics consumer that listens on topics and processes received messages.
00:12:50.220 By setting an appropriate consumer group name in your Kafka configuration, you can streamline data consumption.
00:12:57.820 You can handle offsets to manage where your processing starts in relation to the message stream.
00:13:05.880 Utilizing Redis alongside Kafka allows you to store and analyze data for your application effectively.
00:13:12.930 By iterating over messages received from Kafka, you can insert valuable metrics into your Redis store.
00:13:20.270 This approach aids in calculating live averages, status codes, and various performance metrics over time.
00:13:29.700 For instance, in real-time monitoring, you can assess how many 200 or 500 responses your application generates.
00:13:39.700 These calculations become especially useful when analyzing application performance across time windows.
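The per-window calculations can be sketched as follows. A plain Hash stands in for Redis so the example runs on its own; the `MetricsWindow` class, the one-minute bucket size, and the field names are assumptions for illustration.

```ruby
# Aggregates request metrics into per-minute buckets. A plain Hash
# stands in for Redis here; timestamps are integer epoch seconds.
class MetricsWindow
  def initialize
    @buckets = Hash.new do |hash, minute|
      hash[minute] = { count: 0, service_ms: 0, statuses: Hash.new(0) }
    end
  end

  def record(timestamp:, status:, service_ms:)
    bucket = @buckets[timestamp - (timestamp % 60)] # floor to the minute
    bucket[:count] += 1
    bucket[:service_ms] += service_ms
    bucket[:statuses][status] += 1
  end

  # Average service time for a minute bucket, or nil if it has no data.
  def average_service_ms(minute)
    bucket = @buckets.fetch(minute) { return nil }
    bucket[:service_ms].to_f / bucket[:count]
  end

  # Number of responses with the given status in a minute bucket.
  def status_count(minute, status)
    @buckets.key?(minute) ? @buckets[minute][:statuses][status] : 0
  end
end

window = MetricsWindow.new
window.record(timestamp: 120, status: 200, service_ms: 10)
window.record(timestamp: 150, status: 500, service_ms: 30)
window.record(timestamp: 185, status: 200, service_ms: 20)
```

With Redis, each bucket would map to keys that are incremented per message, but the bucketing logic stays the same.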
00:13:47.690 You may also consider replaying consumer messages from production to your staging environment, allowing you to test how your application performs under real-world loads.
00:13:56.250 This insight can prepare you better for scaling your application and handling high traffic.
00:14:04.480 If you're using Heroku, scaling your consumer process isn't cumbersome. Simply increase the number of dynos to add more consumers to the group.
00:14:12.850 Consumer groups make this possible, as they automatically distribute messages among the available consumers.
00:14:20.330 The demo code demonstrating these principles is available on GitHub, ready for you to explore.
00:14:29.110 Everything is set up with Docker, so it's easier to run all the components locally.
00:14:35.390 JRuby is used within this setup, particularly for replay functionalities. If you have any questions, feel free to visit our booth.
00:14:46.390 At this point, I’d like to pass the presentation to Tom Crayford, who will discuss how we use Kafka at Heroku and share alternative use cases.
00:14:54.040 One of the most significant applications for Kafka is messaging, especially in finance. Financial firms leverage Kafka for messaging around stock trades, quotes, and more.
00:15:02.880 Stock market data needs durability, low latency, and high throughput, all of which Kafka provides effectively.
00:15:12.290 Moreover, Kafka is prominent in activity tracking, initially utilized by LinkedIn for tracking user behaviors.
00:15:19.350 Data from user interactions is vital for numerous downstream teams such as analytics and recruiting, consolidating all data in a single feed.
00:15:27.490 Shopify also uses Kafka internally, demonstrating its versatility across industries by tracking user behaviors and ecommerce actions.
00:15:36.170 Additionally, Heroku's internal metrics dashboard is powered by Kafka, processing massive volumes of data daily.
00:15:46.670 Applications constantly generate HTTP requests and the metrics platform has to handle numerous messages, efficiently calculating performance metrics.
00:15:54.750 Kafka's ability to accept and batch process messages plays a crucial role in maintaining performance in processing spikes.
00:16:02.450 Heroku's metrics system updates in real time based on a live data feed, showcasing the utility of Kafka in application performance monitoring.
00:16:10.200 The internal event bus at Heroku is also built on Kafka technology. It is used for tracking changes and significant events throughout the API.
00:16:16.890 Teams across the organization consume this event bus, which ensures they remain informed about important application changes.
00:16:24.050 This architecture not only facilitates communication but optimizes the processes across many teams.
00:16:31.760 Some 13 internal teams utilize this Kafka-driven event bus architecture, showcasing its broad applicability.
00:16:39.510 Hopefully, after this talk, you find Kafka captivating enough to consider its potential in your application architecture.
00:16:46.540 Rails is in a strong position to adopt Kafka as it scales for more critical infrastructure, while still offering necessary durability.
00:16:55.970 Kafka's frameworks will help manage customers' data traffic effectively, ensuring a seamless user experience.
00:17:02.990 I encourage you to explore Kafka for your application needs.
00:17:09.780 I want to shout out to Joe for his help with the Java components of our demo, and to Godfrey for the lovely logo at the start of the talk.
00:17:18.560 We will have community office hours during the happy hour at our booth, featuring Rails contributors and Heroku representatives.
00:17:30.000 If you have any questions about this talk, feel free to reach out. We’d love to discuss more about Kafka and its usage with Rails.
00:17:38.000 Thank you for your attention!