Build the Unified Logging Layer with Fluentd and Ruby

by Kiyoto Tamura

In the video titled "Build the Unified Logging Layer with Fluentd and Ruby," Kiyoto Tamura discusses the challenges of log collection in modern, service-oriented software architectures and introduces Fluentd as a solution. Fluentd is a Ruby-based data collector designed to unify log collection from diverse sources to multiple output systems, aiming to replace unreliable and brittle logging scripts with a more stable and extensible alternative. The talk covers Fluentd's architecture, real-world use cases, and its future developments.

Key Points Discussed:
Introduction to Fluentd:
- Fluentd is primarily a data collection tool aimed at simplifying log management.
- It provides an extensible architecture that allows users to customize data parsing, buffering, and output destinations.

Reliability of Log Transfer:
- Fluentd treats log data as 'streams' rather than static files, allowing for more reliable data transfer and handling failures gracefully.
Data Tagging and Routing:
- Fluentd employs data tagging to route logs based on specific tags, facilitating efficient data management and storage.
Real-World Applications:
- Simple Log Forwarding: A configuration example shows how Fluentd can correlate user behavior from an app to logs and streamline data transfer to a backend (e.g., MongoDB).
- Server Management: Discusses case studies where Fluentd effectively manages log data across thousands of servers, improving resource management and data analytics.
- Lambda Architecture: Explains how Fluentd supports architectures that allow for both batch processing and real-time analytics, essential for modern data engineering.
- Docker Integration: Notes the rise of Docker and the difficulty of aggregating container logs, and illustrates how Fluentd can be used with Elasticsearch and Kibana for log management in containerized environments.
Fluentd Architecture:
- Describes the basic architecture, covering input, parsing, buffering, and output modules, and emphasizes the flexibility of these components.
- Discusses various input and output plugins that extend Fluentd's functionality.
Future Developments:
- Highlights upcoming features, including a dedicated filter plugin and other API updates expected in future versions.
- Introduces a new lightweight agent written in Go, aimed at providing a simpler, more efficient way to collect logs across systems.

Conclusions:
Kiyoto concluded by emphasizing the growing use of Fluentd in production environments, inviting contributions and feedback from the audience, and highlighting the importance of community involvement in the tool's development.

00:00:18.260 All right, I see familiar faces, so this is what it must feel like performing in high school theater with parents around. I've been giving a lot of talks this year, and interestingly, I didn't give any talks until this year.

00:00:24.419 My most recent talk was at GoGaRuCo, where two things went badly wrong. The first was my live demo: it almost worked, which was worse than having it totally not work. The second was that my presentation crashed towards the end.

00:00:36.330 I actually have a fix for both issues, so this time there is no live demo. You guys don't have to worry or get nervous.

00:00:41.250 I'm certainly less nervous this time. The other issue I had was using the wrong operating system for my presentation. I was using a Mac with Microsoft PowerPoint, which apparently isn't a good combination. This time I'm using Keynote, but I'm on Yosemite, so that might be an issue.

00:01:07.830 A quick show of hands: how many of you have heard of Fluentd? Yeah, because they just gave away stickers, so some of you may have just got some. If you want stickers, there are more here. How many of you use Fluentd in development or production? Excellent! There are two brave people who admitted using it.

00:01:14.009 I'm going to give a quick overview of what Fluentd is today, along with some use cases, and hopefully discuss its internal architecture as well as where we're headed in the next year or so. So, who am I? I am Kiyoto Tamura, and that's how you can harass me publicly. I work for a company called Treasure Data as a Developer Relations person. This means I get to come to cool events like this and listen to really smart people discuss open-source.

00:01:52.380 I also happen to be a maintainer of Fluentd. Staying true to my Japanese heritage, I'm going to start with a bit of self-deprecation, although it's just true. First of all, I didn’t know that I would be giving a talk on Lambda until this morning. Lots of talks have been about static typing, which I like as an idea, but I think it would be a little too hard for me to program that way. So we'll see how that goes.

00:02:29.850 The other thing is that I'm a Fluentd newbie, as you can see from the contributor graph where I'm the sixth person down the line. The question that naturally arises at this point is: why is the sixth person giving a talk instead of the other five who clearly did more work? I asked them, and they all made excuses. The guy next to me on the graph is my boss, the CEO, and he's busy, unlike me.

00:03:10.000 I come from San Francisco, a very expensive city with a lot of rain, and it's really nice to be here in San Diego. This is my first time here, as well as my first RubyConf. Who here is attending for the first time? All right, I feel less lonely now!

00:03:29.310 I didn't mean to say you should clap literally, but that works too! I have to talk for the next 45 minutes, so you can just clap along if you want. Now, what is Fluentd? This description is probably the most straightforward: Fluentd is primarily a data collection tool.

00:03:48.450 If you have logs or any kind of event data, think of Fluentd. When I refer to event collection or data collection, usually you have a system that involves various scripts — some were written by people who are no longer with you, while others may be brittle and unreliable. This isn’t really anyone's fault; logging is typically not a priority. Eventually, as you ship new versions, it becomes quite painful. Fluentd tries to solve this problem by unifying log collection. Of course, achieving this is highly optimistic; it can take months to get there, but it does help.

00:04:47.629 The second point is that Fluentd is extensible. Sadayuki Furuhashi wrote the first version of Fluentd, and the core philosophy is that the core program should be reasonably small and handle only the most important functions: ones that are often overlooked but matter a great deal, like error handling, message routing, and making sure you utilize all your CPUs.

00:05:06.229 The idea is to delegate everything else to the users for their specific use cases, such as reading data from particular data sources, parsing complex custom formats, and buffering data as needed. Users also control where that data is written, because frankly, logging software is pretty useless if it has no destination in mind. Finally, how the data is formatted is up to the user as well. Fluentd is about handling the common cases in the core while delegating use-case-specific tasks to plugins.

00:06:00.499 It also strives to be reliable in two senses: reliable data transfer, and reliability in terms of how you process your data. Transfers can fail, and this is often particularly painful when processing daily batches. If a load fails one evening due to an encoding issue, you might find yourself needing to handle multiple days' worth of records subsequently, especially if the upstream data format also changes.

00:06:21.740 Fluentd handles this situation by thinking of log data as 'streams' instead of a collection of files. Data movement is done in smaller, more manageable bits and pieces, allowing you to better handle transfer failures. I will discuss this further when I go through the buffer plugin a bit later.

00:06:38.129 The second aspect of reliability is about the overall process and the way data engineering is structured. This slide shows two versions of an architecture diagram; which one is more reliable? It's not obvious. However, in one of these structures, all data sources flow through Fluentd, which is responsible for routing the data.

00:06:56.490 Fluentd uses the concept of data tagging, which allows data to be routed based on its associated tags. Later on, I’ll show a configuration file that demonstrates this. The pseudoscientific explanation I like to use is fairly straightforward: if you have m data sources that need to communicate with n storage backends, you shouldn't create a multiplicative scenario (m times n) but rather an additive one (m plus n). Establishing this leads to better and more efficient data pipelines.
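
The m times n versus m plus n argument can be checked with a few lines of Ruby. This is only an illustrative count; the source and backend names are hypothetical and not taken from the talk.

```ruby
# Hypothetical sources and backends, chosen only to make the counts concrete.
sources  = %w[apache_log app_log mobile_events syslog]  # m = 4
backends = %w[mongodb s3 elasticsearch]                 # n = 3

# Without a hub, every source needs its own integration with every backend.
point_to_point = sources.product(backends)

# With a hub, each source talks to the hub once and the hub talks to each backend once.
via_hub = sources.map { |s| [s, "hub"] } + backends.map { |b| ["hub", b] }

puts "point-to-point integrations: #{point_to_point.size}"  # m * n
puts "integrations through a hub:  #{via_hub.size}"         # m + n
```

Adding a fifth source costs three new integrations in the first layout and only one in the second.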

00:07:17.400 So that's a super quick overview of what Fluentd is. Now, let's talk about real-world applications: how this tool is utilized in practical environments.

00:07:34.920 The simplest use case is simple log forwarding. You may have some log files, along with a new mobile app, and you wish to track user behavior and correlate it with your logs. A sample Fluentd configuration file tails the log file located at /var/log/httpd.log. It also listens on a TCP port and forwards that data to the backend, in this case MongoDB. The MongoDB output is one of the most popular plugins out there, along with others like S3 and Elasticsearch.

00:08:10.800 This is a pretty simple setup where you define the input source in your config file, and it forwards the data based on specified tags. If the tags match what you want, the data is flushed to that output.
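
Tag-based routing of this kind can be sketched in plain Ruby. This is not Fluentd's implementation: the `Router` class and `tag_pattern_to_regexp` helper are hypothetical, the pattern syntax is simplified so that `*` matches exactly one dot-separated tag part, and arrays stand in for real outputs.

```ruby
# Simplified tag matching: "*" stands for exactly one dot-separated tag part.
# (An illustration only; Fluentd's real pattern syntax is richer.)
def tag_pattern_to_regexp(pattern)
  body = pattern.split(".").map { |part| part == "*" ? "[^.]+" : Regexp.escape(part) }.join("\\.")
  Regexp.new("\\A" + body + "\\z")
end

class Router
  def initialize
    @rules = []  # [regexp, output] pairs, checked in the order they were added
  end

  def match(pattern, output)
    @rules << [tag_pattern_to_regexp(pattern), output]
  end

  # Deliver the event to the first output whose pattern matches the tag.
  # Returns false when no rule matches.
  def emit(tag, time, record)
    _, output = @rules.find { |regexp, _| regexp.match?(tag) }
    return false unless output
    output << [tag, time, record]
    true
  end
end

mongo = []  # stand-ins for real output plugins
s3 = []
router = Router.new
router.match("backend.*", mongo)
router.match("mobile.*", s3)

router.emit("backend.apache", Time.now.to_i, { "code" => 200 })
router.emit("mobile.click", Time.now.to_i, { "button" => "buy" })
router.emit("unrouted.tag", Time.now.to_i, {})  # no rule matches, so nothing is stored
```

Inputs only need to pick a tag; which backend receives the event is decided in one place.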

00:08:41.760 The second use case looks similar and is especially common for companies that deploy thousands of servers. The reason is that as your data volume increases, it becomes essential to separate the ingestion of raw data from the parsing and final sending of that data to the backend. Both tasks tend to consume a lot of CPU resources.

00:09:00.000 One user I know runs 2,500 servers using Fluentd, with roughly one aggregator for every 100 servers sending logs. They manage their servers in such a way that they can send data to multiple destinations.

00:09:19.770 My favorite use case is what I call a Lambda architecture, which I find really cool. Let me ask, how many of you have heard of the term Lambda architecture? Great! The idea behind it is to use multiple storage systems for both batch analytics and real-time computation.

00:09:48.790 The term was coined by Nathan Marz, who wrote Storm, a real-time computation engine. For example, you can think of Elasticsearch and Hadoop, which provide different advantages for storing data. The architecture allows you to keep track of something in real time and store the raw data elsewhere for detailed analysis later on.

00:10:05.600 Fluentd can serve as a frontend for both systems, bifurcating the data stream so it can be written to both backend systems effectively. On the left side of this architecture diagram, everything is the same as mentioned before, while the right side shows the use of a copy plugin which copies the data stream simultaneously.
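
The bifurcation described here amounts to fanning each event out to several outputs. A minimal sketch, assuming hypothetical `CopyOutput` and `MemoryOutput` classes rather than Fluentd's actual copy plugin:

```ruby
# Forwards every event to all of the outputs it was given.
class CopyOutput
  def initialize(*outputs)
    @outputs = outputs
  end

  def emit(tag, time, record)
    # Each destination gets its own (shallow) copy of the record.
    @outputs.each { |out| out.emit(tag, time, record.dup) }
  end
end

# Stand-in for a real destination; it just remembers what it received.
class MemoryOutput
  attr_reader :events

  def initialize
    @events = []
  end

  def emit(tag, time, record)
    @events << [tag, time, record]
  end
end

realtime = MemoryOutput.new  # e.g. a search index feeding dashboards
archive  = MemoryOutput.new  # e.g. long-term storage for batch jobs
copy = CopyOutput.new(realtime, archive)
copy.emit("backend.apache", 1_416_000_000, { "path" => "/", "code" => 200 })
```

Both stores see the same stream, so the batch side and the real-time side never disagree about which events arrived.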

00:10:21.060 Next, how many of you know what CEP stands for? It's okay if you don't know: it stands for Complex Event Processing. This refers to a field of real-time computation where you apply a series of computations to a data stream.

00:10:40.800 Returning to the Apache example, you may want to examine the correlation between site response times and data from other sources. Several systems are available to handle this, including Norikra, which an associate of mine over there will discuss further.
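
As a rough illustration of computing over a stream, here is a sliding-window average in Ruby. The `SlidingAverage` class is hypothetical and far simpler than a real stream-processing engine; it assumes events arrive in time order.

```ruby
# Keeps the average of the values seen during the last `window_seconds`.
class SlidingAverage
  def initialize(window_seconds)
    @window = window_seconds
    @samples = []  # [time, value] pairs, oldest first
  end

  # Add one observation and return the average over the current window.
  def push(time, value)
    @samples << [time, value]
    # Drop samples that are a full window (or more) older than the newest one.
    @samples.shift while @samples.first[0] <= time - @window
    @samples.sum { |_, v| v } / @samples.size.to_f
  end
end

avg = SlidingAverage.new(60)
avg.push(0, 100)   # only one sample so far
avg.push(30, 200)  # both samples are inside the 60-second window
avg.push(90, 300)  # the samples at t=0 and t=30 have expired
```

The state is updated per event, so the result is available as data arrives instead of after a nightly batch.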

00:11:04.680 In addition, there are proprietary solutions to consider. Fluentd is flexible enough to enable communication with various backends, allowing you to construct systems tailored to your needs.

00:11:23.400 The last use case I want to mention is relatively recent. How many of you know about Docker? How many have used Docker? Excellent! I have used it extensively to debug issues in Fluentd.

00:11:40.370 One challenge with Docker containers is the lack of a standard solution for aggregating container logs. Over the past few months, there has been an emerging trend involving using Fluentd alongside Elasticsearch and Kibana to manage logs in Kubernetes, which is a container orchestration tool.

00:11:52.660 The next segment is about Fluentd's architecture, which is where my knowledge tends to get a bit shaky. I encourage you to ask questions if you feel like it. At the highest level, this is what Fluentd looks like: as data flows through the system, the input accepts incoming data, which is then buffered, and finally the output handles where to send it.

00:12:15.940 You can configure your own parser to parse incoming data. The output is responsible for sending the data to specific destinations based on assigned tags. A good mental model to have is that the first two components primarily deal with inputting data into Fluentd, while the other three handle the output to external systems.

00:12:31.310 Input plugins are numerous and include the usual suspects: UDP, TCP, HTTP, etc. Another common one is called Tail, which is essentially a smarter version of the tail command for reading logs.

00:12:46.400 When you receive logs, you assign tags to those logs, which help determine where the data should go. An important feature of input plugins is that they are non-blocking: each uses its own event loop, independent of Fluentd's own event loop.

00:13:00.640 One of my favorite input plugins is the Tail input. Initially, I thought tailing a log file would be simple, but there are complexities. The 'new tail input' is a total rewrite, and it is quite detailed: there are about 700 lines of logic to handle inode management, ensuring that when logs are rotated, the old files are read to the end along with the new ones.
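
The rotation problem can be sketched in a few lines. This is not the Tail input's code: `RotationAwareTail` is a hypothetical class that polls instead of using an event loop, ignores partially written lines, and assumes a POSIX filesystem where a rotated file keeps its inode and the new file at the same path gets a different one.

```ruby
require "tmpdir"

# Reads new lines from a path and notices rotation by comparing inode numbers.
class RotationAwareTail
  def initialize(path)
    @path = path
    @io = nil
    @inode = nil
  end

  # Returns the lines appended since the previous call.
  def poll
    lines = []
    lines.concat(drain) if @io  # finish whatever the open handle still has
    inode = File.exist?(@path) ? File.stat(@path).ino : nil
    if inode && inode != @inode  # first open, or a new file replaced the old one
      @io.close if @io
      @io = File.open(@path, "r")
      @inode = inode
      lines.concat(drain)
    end
    lines
  end

  private

  def drain
    @io.seek(0, IO::SEEK_CUR)  # stay at the current offset, clearing any EOF state
    lines = []
    while (line = @io.gets)
      lines << line.chomp
    end
    lines
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "app.log")
  File.write(path, "first\n")
  tail = RotationAwareTail.new(path)
  p tail.poll                      # lines from the original file

  File.open(path, "a") { |f| f.puts "second" }
  File.rename(path, path + ".1")   # simulate log rotation
  File.write(path, "third\n")      # a new file appears at the same path
  p tail.poll                      # rest of the old file, then the new one
end
```

Keeping the old handle open until it is drained is what prevents lines written just before rotation from being lost.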

00:13:32.740 Another input plugin is the TCP input, which is also fairly simple. The main function of this plugin is facilitated by its superclass in the background, which takes care of the logic needed to parse messages from TCP connections. The parser processes incoming messages and generates timestamps.

00:14:14.980 Fluentd supports a variety of configurable parsers, such as JSON, which is the common data format used throughout Fluentd. The TCP plugin allows you to specify a format parameter, which can be a regular expression or the name of a specific parser such as the Apache log format.

00:14:42.870 Many of you might know about the Grok parser, which is a third-party plugin that works with Fluentd. Use it at your own risk, but it demonstrates the flexibility Fluentd has when it comes to parsing data.
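
What a regular-expression parser does can be shown with Ruby's named captures. The pattern and the `parse_line` helper below are hypothetical and only loosely follow the Apache log format; the point is that one line of text becomes a timestamp plus a record of fields.

```ruby
require "time"

# Hypothetical pattern, loosely modeled on the Apache access log format.
APACHE_LIKE = /\A(?<host>\S+) \S+ (?<user>\S+) \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+) \S+" (?<code>\d{3}) (?<size>\d+|-)\z/

# Returns [unix_time, record], or nil when the line does not match.
def parse_line(line)
  m = APACHE_LIKE.match(line.chomp)
  return nil unless m
  time = Time.strptime(m[:time], "%d/%b/%Y:%H:%M:%S %z").to_i
  # Every named capture except the timestamp becomes a field of the record.
  record = m.names.reject { |name| name == "time" }.to_h { |name| [name, m[name]] }
  [time, record]
end

line = '192.168.0.1 - - [28/Feb/2013:12:00:00 +0900] "GET / HTTP/1.1" 200 777'
time, record = parse_line(line)
puts time
puts record
```

Lines that fail to match return nil here, leaving the caller to decide whether to drop or log them.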

00:15:03.590 Buffering is an essential concept because sending bits and bytes over the internet, even in the same data center, can come with a lot of variability. Therefore, it is crucial to have a configurable buffering solution to suit your needs.

00:15:21.770 You can buffer to disk if that suits your needs or choose to buffer in memory if performance is more critical. The configuration involves chunking data, where chunks are adjustable data units.
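
Chunking can be pictured as follows. `ChunkedBuffer` is a hypothetical in-memory sketch with a made-up size limit, not Fluentd's buffer plugin; a disk-backed variant would append to files instead of strings.

```ruby
# Collects data into chunks of at most `chunk_limit_bytes` and queues full chunks.
class ChunkedBuffer
  attr_reader :queue

  def initialize(chunk_limit_bytes)
    @limit = chunk_limit_bytes
    @current = +""  # chunk that is still accepting data
    @queue = []     # finished chunks waiting to be flushed
  end

  def append(data)
    # Start a new chunk rather than let the current one grow past the limit.
    rotate if !@current.empty? && @current.bytesize + data.bytesize > @limit
    @current << data
  end

  # Move the current chunk to the flush queue.
  def rotate
    return if @current.empty?
    @queue << @current
    @current = +""
  end
end

buffer = ChunkedBuffer.new(32)
5.times { |i| buffer.append("event-#{i}\n") }  # 8 bytes per event
buffer.rotate                                  # queue whatever is left
p buffer.queue.map(&:bytesize)
```

Because each chunk is small, a failed transfer costs one chunk's worth of retrying rather than a whole day's file.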

00:15:39.890 When deploying Fluentd on multiple machines, one of the early challenges often involves tuning the buffer parameters. The goal is to make it reasonable based on your network conditions and architecture.

00:15:57.200 Output plugins work closely with buffering; when an output plugin tries to send data, that is when the data buffer is engaged. For instance, when sending data to AWS S3, the system uses buffers to ensure data consistency even if the network goes down.
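
The interplay between a buffered chunk and an unreliable destination can be sketched as a retry loop. Everything here is hypothetical: `FlakyDestination` simulates a network outage, and `flush_with_retry` with its `max_retries`, `base_wait`, and `sleeper` parameters is an illustration, not Fluentd's retry logic.

```ruby
# Simulated destination that fails a fixed number of times before accepting writes.
class FlakyDestination
  attr_reader :received

  def initialize(failures)
    @failures = failures
    @received = []
  end

  def write(chunk)
    if @failures > 0
      @failures -= 1
      raise IOError, "network down"
    end
    @received << chunk
  end
end

# Try to flush one chunk. On failure the chunk is kept and the wait doubles before each retry.
def flush_with_retry(chunk, destination, max_retries: 5, base_wait: 0.01, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    destination.write(chunk)
    true
  rescue IOError
    attempts += 1
    return false if attempts > max_retries         # give up; the caller still holds the chunk
    sleeper.call(base_wait * (2**(attempts - 1)))  # exponential backoff
    retry
  end
end

dest = FlakyDestination.new(2)
flush_with_retry("event-0\nevent-1\n", dest)
p dest.received
```

The chunk leaves the buffer only after the destination accepts it, which is why a short outage delays data instead of losing it.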

00:16:15.440 Some plugins are not buffered, especially those dealing with direct writing to external systems, and many have come about because earlier versions of Fluentd lacked dedicated filter functionality.

00:16:30.840 Fluentd's community has contributed around 300 plugins in total, over 200 of them output plugins. This means that if you are thinking about using Fluentd and you have a specific data source or storage service in mind, chances are that it's already supported.

00:16:51.790 If not, it's easy to get started, as there is ample documentation and sample code available. Here is an example of a basic plugin bundled into Fluentd's core; it writes to a local file. When configured, it chunks data into a file handle and nothing more.

00:17:08.260 If you want to write to your new NoSQL database, for example, you would just need to implement a write client for that database and proceed from there.

00:17:29.150 Sometimes, you might want to format your data differently from what Fluentd offers by default. Previously, it was primarily the output plugin authors' responsibility to manage the output formatting, but we've realized that many users wanted their output as newline-delimited JSON.

00:17:46.900 In version 0.10.49 and above, you can now use formatter plugins for certain output plugins. At this moment, S3 and file outputs support custom formatting.

00:18:02.800 One interesting formatter is the single value formatter, which allows users to extract a specific field from the JSON data. There are also formatter options for CSV, TSV, and other custom formats you may wish to implement.
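
A formatter is essentially a function from a record to a line of output. The `FORMATTERS` table below is a hypothetical sketch, not Fluentd's formatter API; the option names `:key` and `:fields` are assumptions made for the example, and the CSV variant does no quoting or escaping.

```ruby
require "json"

# Each formatter turns one record (a Hash) into one line of text.
FORMATTERS = {
  # Newline-delimited JSON: the whole record on one line.
  "json" => ->(record, _opts) { JSON.generate(record) + "\n" },
  # Single value: only one field of the record is written out.
  "single_value" => ->(record, opts) { record.fetch(opts.fetch(:key, "message")).to_s + "\n" },
  # Naive CSV: the listed fields in order, with no quoting or escaping.
  "csv" => ->(record, opts) { opts.fetch(:fields).map { |f| record[f].to_s }.join(",") + "\n" }
}.freeze

record = { "message" => "user signed in", "user" => "kt", "code" => 200 }
print FORMATTERS["json"].call(record, {})
print FORMATTERS["single_value"].call(record, {})
print FORMATTERS["csv"].call(record, { fields: %w[user code] })
```

Separating formatting from delivery means the same output can write different line formats without any change to its transfer logic.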

00:18:20.000 To sum up, that was a quick tour of Fluentd and its architecture. We are currently working on a dedicated filter plugin for the next version, which will make executing filtering tasks much easier.

00:18:39.000 These functionalities were previously possible, but they were complex and confusing to understand. In the upcoming major update, filtering will become much simpler. This will enable users to remove unwanted fields or get specific data based simply on hostname or other criteria before sending it to external systems.
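
The two filtering tasks mentioned here, removing fields and selecting by hostname, can be sketched as a chain of small objects. `RemoveFields`, `OnlyHost`, and `apply_filters` are hypothetical names and do not reflect the filter API that was being planned.

```ruby
# Returns a copy of the record without the listed fields.
class RemoveFields
  def initialize(*fields)
    @fields = fields
  end

  def call(_tag, _time, record)
    record.reject { |key, _| @fields.include?(key) }
  end
end

# Keeps only events from one host; returning nil drops the event.
class OnlyHost
  def initialize(host)
    @host = host
  end

  def call(_tag, _time, record)
    record["hostname"] == @host ? record : nil
  end
end

# Run the record through each filter in order, stopping as soon as one drops it.
def apply_filters(filters, tag, time, record)
  filters.each do |filter|
    record = filter.call(tag, time, record)
    return nil if record.nil?
  end
  record
end

filters = [OnlyHost.new("web01"), RemoveFields.new("password")]
kept = apply_filters(filters, "app.login", 0,
                     { "hostname" => "web01", "user" => "kt", "password" => "secret" })
dropped = apply_filters(filters, "app.login", 0, { "hostname" => "db01", "user" => "kt" })
p kept
p dropped
```

Events are filtered before they reach any output, so unwanted fields never leave the machine.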

00:19:06.300 Here is what the ideal roadmap looks like: version 0.12 will include the filtering feature, and some significant API updates are planned for the upcoming version 0.14. The goal is to consolidate many of these changes and release the true version 1.

00:19:31.600 It's a bit amusing to talk about reaching version one, especially since many companies are already using Fluentd in production. We never expected Fluentd to gain this level of use three years back.

00:19:52.540 We're always looking for people to join our team, whether within the community or in a paid role. We highly appreciate contributions, bug reports, and pull requests for the project. Lastly, I want to briefly mention a new UI tool that makes administering and locally testing Fluentd much easier.

00:20:23.050 It's especially useful for those unfamiliar with the command line or just starting with Fluentd. Treasure Data also offers packages for Fluentd across major platforms, along with thorough QA.

00:20:35.400 It's worth noting that the main Fluentd package just released a new version with embedded UI, so check that out. While I know this is a Ruby conference, I still have to highlight a new project I'm excited about.

00:20:52.760 It's a lightweight agent that we’ve started developing using Go, which compiles natively for both Linux and Windows. Although it's less feature-rich than Fluentd, we think it will run better on Windows, while remaining open source under the Fluent project.

00:21:13.640 Contributions, feedback, bug reports, and pull requests for the Go version are happily welcomed. And that’s pretty much it! It seems everyone is ready for a break!

00:21:29.430 If you want, I can give you a head start. Thanks a lot!