00:00:09.889
Welcome to the Kafka talk. Just for your information, this is a sponsored talk, so if you're looking for talks by other speakers, you might want to check those out.
00:00:15.000
Today is May 4th, so happy Star Wars Day! I didn't bring my Jedi robes, but I do own them, and they're quite awesome.
00:00:22.800
I'm Terence Lee, and if you haven't heard of me, I go by the Twitter handle hone02. Feel free to tweet at me if you have questions about the talk!
00:00:32.090
I'm known for my blue hat, which I've been wearing for over a decade. It's a bit worn, but I have blue hat stickers that look much more pristine, so if you'd like one, come talk to me!
00:00:45.600
Additionally, I've been helping with a Ruby karaoke movement. PJ and I have organized many events, and for the last three years, I've managed to attend a karaoke event at every Ruby conference.
00:00:56.480
Karaoke is a lot of fun, and if you're too intimidated to sing, you can just enjoy the atmosphere. Many people join karaoke as their first experience!
00:01:04.409
We'll be doing karaoke again tomorrow at 7 p.m. at Hotkey, so I hope to see you there! It will be a deafening experience. Watching Charlie Nutter sing Shakira is something you won't want to miss.
00:01:16.080
I work for Heroku, which is why this is a sponsored talk. I collaborate with Richard Schneeman on the Ruby experience, but let's move away from the introductions and talk about Kafka.
00:01:27.900
I assume most of you are here because you have heard of Kafka. You may not entirely know what it is, but there's definitely some buzz about it, and you might be wondering what you can do with this technology.
00:01:38.189
In this talk, we'll go through a short introduction to Kafka, covering topics and vocabulary related to what Kafka is.
00:01:44.130
We'll discuss how you can use Kafka with Ruby, and I built a demo app specifically for this talk that we'll go through in detail. My coworker and friend Tom will also cover other use cases of Kafka beyond the demo.
00:01:56.200
So, what is Kafka? If you visit the Apache Kafka documentation, the first paragraph states that Kafka is a distributed, partitioned, replicated commit log service.
00:02:01.439
It provides the functionality of a messaging system, but with a unique design. When I first read this as I was learning about Kafka, it left me quite confused, as I didn't fully understand what it meant.
00:02:14.890
At a high level, Kafka is essentially a pub/sub messaging system. If you're familiar with Redis and have used its pub/sub features or AMQP, you can think of Kafka in that context.
00:02:27.370
Kafka is distributed, which we will see is an important aspect as we progress through this talk. It's designed with guarantees of being fast, scalable, and durable.
00:02:32.740
Regarding performance, Kafka can process hundreds of thousands to millions of messages per second on a small cluster.
00:02:40.420
In contrast, if you think about AMQP, you’re likely only talking about tens of thousands of messages on a single node.
00:02:49.550
It's a significant difference in performance when we consider Kafka as a pub/sub message bus. For instance, I don't know if you've used Skylight, which is a performance monitoring product from the Tilde folks.
00:03:03.240
As you're building that type of application, scalability is crucial, and Kafka is central to their architecture. One of their core tenets is durability.
00:03:10.480
It's common to think that you need to restart processes due to crashes, but Kafka allows you to maintain a commit log.
00:03:14.890
This means that if your services go down, you can replay from the last committed state and not lose any data — an important aspect when relying on a service for metrics.
00:03:22.010
I’ve talked with the Skylight team, and they've been very satisfied with Kafka's performance in their application architecture.
00:03:27.380
Now that we've discussed that, let's step back and go over some vocabulary you will encounter while dealing with Kafka. Previously, we talked about producers and consumers.
00:03:36.560
If you’re familiar with pub/sub systems, you might know that producers are the part of your Rails app that generates the work to be done, while consumers are the workers that process the jobs.
00:03:52.580
In the Kafka context, producers generate messages, which are written to a Kafka topic, and consumers process those messages.
00:03:59.690
Inside a Kafka cluster, you have multiple nodes known as brokers. The core unit of data in Kafka is called a message, which is really just a byte array.
00:04:07.030
This means it can represent any type of data. You might use strings, JSON, or any other structured format as the message value.
00:04:12.370
You can think of topics in Kafka as streams of messages that you can subscribe to. Topics organize messages related to specific data.
00:04:20.240
Additionally, partitions allow high throughput by letting you distribute the load across multiple brokers — a necessary feature when dealing with millions of messages.
00:04:30.320
Each partition guarantees order and immutability of the messages.
00:04:38.030
Once written, you can only add to the end of a partition; you cannot alter or reorder these messages.
00:04:44.240
Kafka allows you to uniquely identify a message inside its partition by an offset number, which indicates its position within the partition.
00:04:52.230
You can seek any message by its offset, which is handy for consumers wanting to replay specific segments of the stream.
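The offset mechanics above can be sketched with a toy model, assuming nothing about Kafka's internals: a partition behaves like an append-only array whose indices are the offsets, and consumers can seek to any offset and replay from there.

```ruby
# A toy model of a Kafka partition: an append-only array where each
# message's index is its offset. Illustrative only -- real partitions
# live on disk, replicated across brokers.
class ToyPartition
  def initialize
    @log = []
  end

  # Appending returns the new message's offset.
  def append(message)
    @log << message
    @log.size - 1
  end

  # Consumers can seek to any offset and replay from there.
  def read_from(offset)
    @log[offset..] || []
  end
end

partition = ToyPartition.new
partition.append("order:1")   # offset 0
partition.append("order:2")   # offset 1
partition.append("order:3")   # offset 2

partition.read_from(1)        # => ["order:2", "order:3"]
```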
00:05:01.850
You can optionally key messages, which is important because keying ensures that messages with the same key go to the same partition.
00:05:09.560
For example, if you were to key messages by user ID, all messages related to that user would reside in the same partition.
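The key-to-partition mapping is just hashing modulo the partition count; ruby-kafka's default partitioner does essentially this with a CRC32 hash. A minimal sketch (function name is mine):

```ruby
require "zlib"

# Messages with the same key always hash to the same partition, so all
# of one user's messages stay together and stay ordered.
def partition_for(key, num_partitions)
  Zlib.crc32(key) % num_partitions
end

partition_for("user-42", 8)  # same key -> same partition, every time
```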
00:05:15.500
When we talk about consumers, we also refer to consumer groups. Consumer groups allow multiple consumers to work together on processing messages from the same topic.
00:05:27.380
Kafka guarantees that each message will be delivered at least once to a consumer group that is actively processing messages.
00:05:35.780
If messages are produced while certain consumers are down, Kafka will reassign those messages to active consumers. So you have a flexible and fault-tolerant system.
00:05:49.230
In a typical Kafka cluster, you may have multiple partitions and various consumer groups handling the same streams of data. This means you can efficiently split workload dynamically.
00:05:59.520
Similarly, there's a heartbeat system in Kafka that helps detect when consumers are not available, allowing Kafka to redistribute partitions among the available consumers.
00:06:09.500
We've just skimmed the vocabulary of Kafka, and the concepts will become clearer as we dive into the demo. There are a couple of main Ruby libraries for interacting with Kafka.
00:06:19.740
The first one is JRuby Kafka, which has been around the longest. It serves as a wrapper around official Kafka libraries for the Java API.
00:06:27.990
If you're using JRuby Kafka, you'll need to get comfortable reading Java API documentation, which can be a bit of a learning curve.
00:06:39.460
However, if you're using a standard Ruby application, there's the ruby-kafka library developed by Zendesk.
00:06:46.350
It's a newer project but features a very clean interface, which I find quite impressive. Older libraries like Poseidon don't support consumer groups or the newer Kafka APIs.
00:06:57.990
For example, creating a simple producer in Kafka is quite similar to setting up Redis. However, you need to specify seed brokers.
00:07:03.370
You don't need to know all the brokers in your cluster—just the seeds, and Kafka will figure out the rest of the nodes.
00:07:15.670
Once you've set up the producer, you can produce messages asynchronously, giving you the ability to handle them efficiently.
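A producer set up along these lines, using the ruby-kafka API (broker addresses, client id, and topic name are placeholders), might look like:

```ruby
require "kafka"  # the ruby-kafka gem

# You only list a few seed brokers; the client discovers the rest of
# the cluster from them.
kafka = Kafka.new(
  seed_brokers: ["kafka1:9092", "kafka2:9092"],
  client_id: "my-rails-app"
)

producer = kafka.producer

# Messages are buffered locally; nothing is sent until deliver_messages.
producer.produce("hello, kafka", topic: "greetings")

begin
  producer.deliver_messages
rescue Kafka::DeliveryFailed
  # Delivery failures surface as exceptions you can catch and retry.
end
```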
00:07:22.820
If delivery fails, you'll receive exceptions that you can catch and handle.
00:07:29.480
This is an important point when you're trying to ensure messages are delivered at least once, especially if you're producing them from within a Rails request.
00:07:38.030
ruby-kafka also provides an async producer that is thread-safe, buffering messages through an internal queue before delivery.
00:07:44.480
You can set parameters to control how frequently messages are delivered, either by count or by time interval.
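Those delivery controls map onto ruby-kafka's async producer options; a sketch with placeholder broker and topic names:

```ruby
require "kafka"
require "json"

kafka = Kafka.new(seed_brokers: ["kafka1:9092"])  # placeholder broker

# The async producer buffers through a queue and delivers from a
# background thread: here, after 100 buffered messages or 10 seconds,
# whichever comes first.
producer = kafka.async_producer(
  delivery_threshold: 100,
  delivery_interval: 10
)

# Hashes can be serialized to JSON before producing.
producer.produce({ event: "order.created", id: 123 }.to_json, topic: "orders")

producer.shutdown  # flush the queue and stop the background thread
```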
00:07:53.470
As we've discussed, messages can consist of various formats, including JSON. You can convert hashes to JSON for processing.
00:08:01.290
In your Rails application, you set up an initializer in the config directory to configure Kafka.
00:08:07.690
Pass in the Rails logger so Kafka's log output ends up in your application logs, and instantiate an async producer.
00:08:13.150
Finally, when you shut down the Rails server, make sure to shut down Kafka as well to avoid resource leakage.
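Putting those three steps together, an initializer might be sketched like this (the file path, environment variable, and global name are illustrative, not from the talk's demo code):

```ruby
# config/initializers/kafka.rb
require "kafka"

kafka = Kafka.new(
  seed_brokers: ENV.fetch("KAFKA_URL", "localhost:9092").split(","),
  logger: Rails.logger  # route ruby-kafka's logging through Rails
)

$kafka_producer = kafka.async_producer(delivery_interval: 10)

# Flush any buffered messages when the server process exits.
at_exit { $kafka_producer.shutdown }
```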
00:08:19.950
Once that's done, your controller can easily create an event stream of all orders, inserting this data efficiently into Kafka.
00:08:31.220
Now that we know how to produce messages to Kafka, let's discuss how to consume the data.
00:08:39.900
The consumer API is vital, especially the consumer group API, which allows your application to manage message consumption effectively.
00:08:46.829
You initiate a connection with Kafka in a similar fashion to producers, passing in the seed brokers. Specify the group ID for your consumer group.
00:08:58.110
Subscribe to the relevant topics, like a greetings topic, to start processing messages, which can be done with a simple iteration in Ruby.
00:09:13.000
The message loop blocks the thread, continuously polling for new messages until instructed otherwise.
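The consumer-group flow just described can be sketched with the ruby-kafka API (broker address, group id, and topic are placeholders):

```ruby
require "kafka"

kafka = Kafka.new(seed_brokers: ["kafka1:9092"])  # placeholder broker

# Consumers that share a group_id split the topic's partitions among
# themselves.
consumer = kafka.consumer(group_id: "greetings-consumers")
consumer.subscribe("greetings")

# each_message blocks, polling the cluster and yielding each message
# along with its topic, partition, and offset.
consumer.each_message do |message|
  puts "#{message.topic}/#{message.partition}@#{message.offset}: #{message.value}"
end
```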
00:09:22.770
Kafka 0.9 introduced SSL support to encrypt messages over transport, which is supported natively by the Ruby Kafka client.
00:09:31.810
To demonstrate how to use Kafka for processing metrics from web traffic, imagine monitoring a Heroku app.
00:09:42.310
Heroku apps can forward their logs over HTTPS, so you can create a log drain that sends that data to an app which writes it into Kafka.
00:09:54.490
On the opposite end, another Ruby app can serve as a consumer that processes and stores this data efficiently.
00:10:01.600
In this example, we are primarily focusing on how to extract and handle that log data through our consumer.
00:10:09.790
For instance, we use Heroku logs to pull router logs, which reveal essential information like path, connection time, service time, and status code.
00:10:18.990
This information helps in understanding application performance and metrics.
00:10:26.400
By setting up a log drain, we can forward these logs to another app and format requests to capture relevant data.
00:10:39.470
Processing these log messages is essential, and you can leverage the syslog format for consistent parsing.
00:10:48.230
Using a Ruby gem for processing syslogs enables us to efficiently handle incoming data into Kafka.
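The router lines themselves are key=value pairs, so extracting the fields mentioned above can be sketched with a small parser (the function name is mine; the demo uses a syslog gem instead):

```ruby
# A minimal parser for Heroku router log lines, which are key=value
# pairs; the field names below match the router's output.
def parse_router_line(line)
  line.scan(/(\w+)=("[^"]*"|\S+)/).to_h do |key, value|
    [key, value.delete('"')]
  end
end

line = 'at=info method=GET path="/orders" host=example.herokuapp.com ' \
       'connect=1ms service=18ms status=200 bytes=1234'
parse_router_line(line)
# => {"at"=>"info", "method"=>"GET", "path"=>"/orders", ...}
```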
00:10:55.140
In building our Sinatra app, we respond to incoming requests with a confirmation that we received the log data.
00:11:02.660
Within this app, we need to maintain Kafka connections efficiently. Since Kafka connections are not thread-safe, we should create a connection pool.
00:11:10.020
Then, we can produce log messages to Kafka, ensuring efficient handling and batching based on thresholds.
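The demo uses a connection-pool gem for this, but the idea can be sketched with Ruby's thread-safe Queue from the standard library (the class name is mine):

```ruby
# A minimal connection pool built on Queue, which is thread-safe: each
# thread checks a connection out, uses it, and always puts it back.
class TinyPool
  def initialize(size)
    @pool = Queue.new
    size.times { @pool << yield }
  end

  def with
    conn = @pool.pop
    begin
      yield conn
    ensure
      @pool << conn
    end
  end
end

# In the Sinatra app each entry would be a Kafka producer; plain
# objects keep this sketch self-contained.
pool = TinyPool.new(5) { Object.new }
pool.with { |conn| conn }  # borrow a connection for one block of work
```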
00:11:19.890
This allows us to minimize resource usage while ensuring the delivery of log data into Kafka.
00:11:27.670
Switching gears to Heroku's Kafka offering, we recently launched a public beta. It's designed to simplify access to the Heroku Kafka cluster.
00:11:39.830
Accessing Kafka on Heroku is just like any other add-on; you can create a Kafka cluster easily.
00:11:48.250
Once your cluster is set up, you'll receive a URL for your Kafka cluster to facilitate easy communication.
00:11:56.280
In the CLI, however, you’ll need to install a Kafka plugin to perform operations like creating topics.
00:12:07.250
Creating topics in a production environment is essential; for instance, you’ll want to create a router topic for your application.
00:12:14.740
You can use the info command to inspect your Kafka cluster's health and traffic details.
00:12:25.170
Additionally, the log tail feature allows you to monitor traffic to and from your topic in real time.
00:12:35.270
Now that we can produce and monitor messages, we can look into how to consume data effectively.
00:12:42.780
This will include setting up a metrics consumer that listens on topics and processes received messages.
00:12:50.220
By setting an appropriate consumer group name in your Kafka configuration, you can streamline data consumption.
00:12:57.820
You can handle offsets to manage where your processing starts in relation to the message stream.
00:13:05.880
Utilizing Redis alongside Kafka allows you to store and analyze data for your application effectively.
00:13:12.930
By iterating over messages received from Kafka, you can insert valuable metrics into your Redis store.
00:13:20.270
This approach aids in calculating live averages, status codes, and various performance metrics over time.
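One way to sketch those per-window tallies (a simplification of what the demo stores in Redis; the method names are mine):

```ruby
require "time"

# One counter hash per minute bucket -- Redis hashes play this role in
# the demo, but a plain Ruby hash shows the shape of the aggregation.
def bucket_for(timestamp)
  timestamp.strftime("%Y-%m-%dT%H:%M")
end

def record(counts, timestamp, status)
  counts[bucket_for(timestamp)][status] += 1
end

counts = Hash.new { |h, k| h[k] = Hash.new(0) }
record(counts, Time.now, 200)
record(counts, Time.now, 500)
# counts now maps each minute to a tally of status codes, ready for
# computing rates of 200s versus 500s over time windows.
```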
00:13:29.700
For instance, in real-time monitoring, you can assess how many 200 or 500 responses your application generates.
00:13:39.700
These calculations become especially useful when analyzing application performance across time windows.
00:13:47.690
You may also consider replaying consumer messages from production to your staging environment, allowing you to test how your application performs under real-world loads.
00:13:56.250
This insight can prepare you better for scaling your application and handling high traffic.
00:14:04.480
If you're using Heroku, scaling your consumer process isn’t cumbersome. Simply adjust the number of dynos to send more consumers to the Kafka cluster.
00:14:12.850
Consumer groups make this possible, as they automatically distribute messages among the available consumers.
00:14:20.330
The demo code demonstrating these principles is available on GitHub, ready for you to explore.
00:14:29.110
Everything is set up with Docker, so it's easy to run all the components locally.
00:14:35.390
JRuby is used within this setup, particularly for replay functionalities. If you have any questions, feel free to visit our booth.
00:14:46.390
At this point, I’d like to pass the presentation to Tom Crayford, who will discuss how we use Kafka at Heroku and share alternative use cases.
00:14:54.040
One of the most significant applications for Kafka is messaging, especially in finance. Financial firms leverage Kafka for messaging around stock trades, quotes, and more.
00:15:02.880
Stock market data needs durability, low latency, and high throughput, all of which Kafka provides effectively.
00:15:12.290
Moreover, Kafka is prominent in activity tracking, initially utilized by LinkedIn for tracking user behaviors.
00:15:19.350
Data from user interactions is vital for numerous downstream teams such as analytics and recruiting, consolidating all data in a single feed.
00:15:27.490
Shopify also uses Kafka internally, demonstrating its versatility across industries by tracking user behaviors and ecommerce actions.
00:15:36.170
Additionally, Heroku's internal metrics dashboard is powered by Kafka, processing massive volumes of data daily.
00:15:46.670
Applications constantly generate HTTP requests and the metrics platform has to handle numerous messages, efficiently calculating performance metrics.
00:15:54.750
Kafka's ability to accept and batch process messages plays a crucial role in maintaining performance in processing spikes.
00:16:02.450
Heroku's metrics system updates in real time based on live data feed, showcasing the utility of Kafka in application performance.
00:16:10.200
The internal event bus at Heroku is also built on Kafka technology. It is used for tracking changes and significant events throughout the API.
00:16:16.890
Teams across the organization consume this event bus, which ensures they remain informed about important application changes.
00:16:24.050
This architecture not only facilitates communication but optimizes the processes across many teams.
00:16:31.760
Some 13 internal teams utilize this Kafka-driven event bus architecture, showcasing its broad applicability.
00:16:39.510
Hopefully, after this talk, you find Kafka captivating enough to consider its potential in your application architecture.
00:16:46.540
Rails apps are in a strong position to adopt Kafka as they scale into more critical infrastructure, since it provides the durability they need.
00:16:55.970
Kafka can help you handle your customers' data traffic effectively while keeping the user experience seamless.
00:17:02.990
I encourage you to explore Kafka for your application needs.
00:17:09.780
I want to shout out to Joe for his help with the Java components of our demo, and to Godfrey for the lovely logo at the talk's front.
00:17:18.560
We will have community office hours during the happy hour at our booth, featuring Rails contributors and Heroku representatives.
00:17:30.000
If you have any questions about this talk, feel free to reach out. We’d love to discuss more about Kafka and its usage with Rails.
00:17:38.000
Thank you for your attention!