Talks

Building a real-time analytics engine in JRuby

Building a real-time analytics engine in JRuby

by David Dahl

The video titled "Building a real-time analytics engine in JRuby" features speaker David Dahl from Burt, a company based in Gothenburg, Sweden, that specializes in online advertising analytics. In this session at wroc_love.rb 2013, Dahl discusses how Burt successfully built a real-time analytics engine using JRuby, highlighting various strategies and tools employed to manage high-volume data processing effectively on AWS.

Key points from the presentation include:

  • Initial Challenges: Burt's journey began with traditional methods of logging data into databases, which failed to meet demands during major campaigns, leading to the realization that pre-calculation and a distributed system were necessary.

  • Transitioning to JRuby: After facing limitations with MRI Ruby, Dahl's team switched to JRuby to leverage Java's capabilities for better threading and performance, while still using Ruby’s developer-friendly syntax.

  • System Architecture: The system architecture is designed around a distributed model where each application or service processes specific tasks, allowing for real-time data tracking and metric generation.

  • Use of RabbitMQ: RabbitMQ emerged as a crucial tool in their architecture for handling inter-process communication and service isolation, significantly aiding in buffering operations and avoiding data loss.

  • Database Strategy: Dahl shares insights on selecting appropriate databases like MongoDB and Cassandra, emphasizing that while MongoDB is effective, it can also be problematic if misapplied.

  • Concurrency Management: The presentation underscores the importance of Java’s concurrency tools, such as blocking queues and atomic variables, to manage multithreading complexities in JRuby applications.

  • Performance Over Optimization: The focus shifted from maximizing performance to scalable solutions, allowing Burt to utilize AWS’s Elastic MapReduce while accepting the inherent trade-offs of using JRuby over Java.

  • Lessons Learned: Dahl highlights the significance of making operations idempotent and utilizing a well-defined acknowledgment strategy for data processing to ensure system reliability and data integrity.

The presentation concludes with a Q&A segment where Dahl addresses specific inquiries about the performance comparisons between JRuby and Java, as well as RabbitMQ's role in their architecture. Overall, the talk is a comprehensive overview of leveraging JRuby in building a robust real-time analytics engine, showcasing how combining Ruby's flexibility with Java’s performance can yield effective results in web-based data analytics.

00:00:18.240 Please welcome our speaker, David Dahl.
00:00:28.800 Thank you! Can you hear me? Good. The guy in the top row can hear me.
00:00:35.120 So then, everyone can hear me, hopefully. All right, I'm going to talk a bit about JRuby—I know it's a controversial topic.
00:00:41.200 To start off, I am a senior developer at a company called Burt in Gothenburg, Sweden. We do analytics for online advertising.
00:00:48.399 We primarily use Ruby, and most specifically, JRuby. We manage all of this on AWS.
00:00:55.120 That's our case, and that's the story I'm going to share today. Just a brief overview of what we do: we take something like a poorly designed website.
00:01:06.960 This particular site is filled with ads, which highlights our goal: to make online advertising a viable option that people don’t hate and that actually works.
00:01:13.840 We want to help publishers make money so they can produce good content.
00:01:20.479 To achieve this, we first ensure that publishers understand what they are selling.
00:01:26.560 We track ads and create insightful graphs. It should be simple, right?
00:01:32.640 When we got started, we did what most people do: we logged everything into a database and created reports.
00:01:38.079 However, this approach broke down during our first major campaign—no surprise there.
00:01:44.159 We realized we needed to pre-calculate everything, which involved a lot of calculations.
00:01:49.759 Initially, we had everything in one major application, which made deployment extremely daunting.
00:01:55.680 At that point, we were also still using MRI Ruby, which has its own bottlenecks.
00:02:03.920 We realized that we needed to change our approach.
00:02:10.080 We conceptualized an idea that looked somewhat like this: imagine each dot here represents one application or service.
00:02:17.920 Every service performs a specific task, and we send data along a pipeline, refining it through each service.
00:02:24.960 Eventually, this process results in useful metrics that we can display in our web applications.
00:02:31.120 One critical lesson in building a distributed system is to always start with at least two of everything.
00:02:36.560 This helps solve the hard part of the problem.
00:02:40.480 Going from one to two services is challenging, but scaling up from two to three or four is much easier.
00:02:46.239 As we started implementing this idea, we got stuck.
00:02:51.920 Still using MRI, we planned to separate and buffer operations with RabbitMQ, which uses EventMachine.
00:02:58.720 We intended to handle storage with MongoDB, but these blocking operations led to issues.
00:03:04.640 Ruby's threading limitations meant we couldn't effectively manage concurrent processes.
00:03:12.000 After a lot of deliberation, we considered switching to Java.
00:03:17.040 Although enterprise solutions often have a bad reputation in our community, they have their merits.
00:03:25.600 Enterprise tools are proven and come with many libraries, especially for high-performance applications.
00:03:31.680 We soon discovered JRuby, which allows us to utilize Java’s capabilities while writing in Ruby.
00:03:41.680 With JRuby, we gain threading, a real garbage collector, and just-in-time compilation.
00:03:46.720 We also have access to a vast array of Java, Scala, and Ruby libraries.
00:03:52.799 As we delved into wrapping Java libraries, we found it quite enjoyable.
00:03:58.159 This experience made coding much more enjoyable.
00:04:03.599 So, we proceeded with this direction, but we faced various challenges.
00:04:09.760 One major challenge was ensuring that we never experienced downtime.
00:04:14.960 We track advertisements, and our processes must remain invisible to visitors.
00:04:20.720 If something fails, users typically reload the page, which should not happen with our service.
00:04:27.680 We need to continuously receive data. While we can afford latency, we cannot afford to stop receiving data.
00:04:35.280 As long as our web server acknowledges receipt with a '200 OK,' we can buffer data and process it later.
00:04:41.440 Given that we operate a major application across over a hundred AWS instances, we must avoid unexpected errors.
00:04:48.640 It's acceptable for our application to crash occasionally, provided we can restart it and fix any issues in a timely manner.
00:04:54.880 However, data can build rapidly in buffers in a short time, so we need to act quickly.
00:05:01.680 Buffering became a critical part of our approach.
00:05:06.560 We came to really appreciate RabbitMQ, as it was the best architectural choice we made.
00:05:13.199 This queuing system facilitated dividing our operations into isolated services.
00:05:17.760 We had to carefully consider the interfaces between services and how we would represent data externally.
00:05:24.399 Furthermore, when adding a buffering layer, it’s important to consider how transactions will work.
00:05:31.519 This involves determining whether to acknowledge messages before or after persisting them.
00:05:39.040 If you receive a message and persist data, if an error occurs while persisting, you must decide how to handle it.
00:05:46.399 You can either get the message one extra time or lose some data, depending on your acknowledgment strategy.
00:05:52.880 Making operations idempotent means that even if you receive duplicates, the operation remains consistent.
00:05:58.560 This topic or item potency deserves a separate discussion.
00:06:03.760 Regarding databases, we learned the hard way how crucial it is to select the right tools for each job.
00:06:10.560 We started using MongoDB everywhere. For those who have heard us discuss it, you know our mixed feelings.
00:06:17.200 While MongoDB can be highly effective in the right context, it can also be extremely frustrating in the wrong context.
00:06:22.640 We spent weeks troubleshooting when it didn’t fit our needs.
00:06:29.360 We also incorporated Cassandra into our database family. It's an impressive yet complex system.
00:06:36.240 Cassandra is very Java-centric, which can be a pro or a con, depending on your perspective.
00:06:42.880 Other databases we recommend include Redis for certain applications.
00:06:48.080 An alternative approach is to pursue a no-database model, keeping everything in-stream without storing it.
00:06:54.000 Only keep data in memory until processed.
00:07:00.000 Now let’s talk about code and Java. When building distributed systems, Java is a key platform.
00:07:05.760 Java's concurrency library provides everything you need. For those looking for a quick guide, I recommend a specific book.
00:07:13.760 Executors in Java allow for easy threading and work nicely within Ruby.
00:07:20.640 You can submit a block and specify a thread size based on your machine’s capabilities.
00:07:28.560 However, be aware that issues can arise, as we experienced with the JRuby interpreter.
00:07:35.760 Occasionally, it misinterprets classes, leading to unexpected behavior.
00:07:42.000 Despite these hiccups, most time things should work as expected.
00:07:48.480 Blocking queues are another great feature in Java's concurrent library.
00:07:54.720 They allow for producer-consumer patterns while considering back pressure, which I'll discuss next.
00:08:01.440 For example, say you have an application that processes data from a queue and persists it to a database.
00:08:09.600 What happens if your persistence thread slows down? If you create an unbounded queue, you could run out of memory.
00:08:15.840 This leads to a crash when the queue fills up.
00:08:20.640 A better approach involves setting a size limit on the queue, ensuring that the input thread pauses until there is room.
00:08:27.280 We've learned this lesson painfully—backpressure can't be ignored.
00:08:34.080 When threading, consider atomic variables and concurrent data structures, as shared mutable state is tricky.
00:08:40.800 Java provides mechanisms like count down latches and semaphores for distributed locks.
00:08:47.600 The Google Guava library is also an excellent resource, addressing gaps in Java’s concurrency offerings.
00:08:55.280 It includes various utilities like caches, multi-maps, and more.
00:09:01.440 LMAX Disruptor is a high-performance inter-thread messaging library, built for real-time applications.
00:09:08.960 It’s an innovative solution to explore for high-performance JRuby applications.
00:09:14.240 In summary, managing multithreading in JRuby requires an understanding that thread safety is complex.
00:09:21.440 Utilizing Java’s utilities can make life easier; avoid shared mutable state.
00:09:28.000 Consider techniques like back pressure in your designs.
00:09:34.880 Let’s discuss actors. How many of you have heard of actors?
00:09:43.040 Akka is a famous concurrency library in Scala, widely recognized for its actor-based model.
00:09:50.560 We've created a small Ruby wrapper called Mika for easier use.
00:09:56.080 The library simplifies implementation, though it’s not actively maintained.
00:10:03.200 Storm is another ontology-based system that handles processing for you. How many of you have heard of Storm?
00:10:10.960 Storm manages the distribution of tasks effectively. It was acquired by Twitter.
00:10:16.960 Regarding RabbitMQ and Storm, I personally favor RabbitMQ.
00:10:23.200 We utilize a library called Redstorm, which abstracts many difficult aspects of using Storm.
00:10:30.960 However, Storm didn’t work for us due to our high branching factor.
00:10:37.920 This means that each item fed into the system generated hundreds of operations.
00:10:43.920 We needed to wait for all these operations to be persisted before acknowledging them.
00:10:49.920 The overhead from this process was inefficient for our needs.
00:10:56.800 On the topic of big data, how many of you are familiar with Hadoop?
00:11:02.720 My colleague, Theo, developed a library called RubyDupe, which provides a nice abstraction for writing Hadoop jobs.
00:11:09.600 You can implement mappers and reducers by defining map or reduce methods.
00:11:15.040 Then you set up a job to specify input and output and run it.
00:11:21.360 However, managing Hadoop's infrastructure and versions poses various challenges.
00:11:27.680 If you're interested in Hadoop, research its staggering versioning issues.
00:11:33.840 And just like that, I’ve finished faster than I expected.
00:11:39.680 Now I have time for some questions.
00:11:46.080 We have also developed small Ruby wrappers for various Java libraries.
00:11:53.760 Hot Bunnies is a wrapper for RabbitMQ, now maintained by the same person who oversees the AMQP gem.
00:12:02.240 Druidice is a wrapper for Cassandra. My colleagues creatively opted for German names.
00:12:10.560 Pillops is another library that wraps 0MQ, and Multimeter is for Yammer Matrix.
00:12:16.880 That concludes my presentation.
00:12:24.720 Are there any questions?
00:12:51.440 Yes, the question is whether JRuby is faster or slower than Java for running Hadoop.
00:12:56.720 Overall, JRuby on the JVM is indeed slower than Java.
00:13:02.159 However, the difference isn’t significant, and implementing solutions faster can be more valuable.
00:13:09.120 We find it more critical to scale horizontally rather than maximize performance.
00:13:14.400 For instance, using AWS’s Elastic MapReduce, you pay by the hour.
00:13:21.000 While performance is relevant, it isn’t a significant enough issue for us.
00:13:28.000 Are there any additional questions?
00:13:49.200 When using RabbitMQ as a buffer, do we maintain persistence?
00:13:55.360 Yes, we persist everything to ensure reliability. Otherwise, we would risk data loss if the system fails.
00:14:02.160 RabbitMQ has improved significantly, addressing prior back pressure issues.
00:14:09.080 In earlier versions, if pushed too fast, you would encounter performance problems.
00:14:15.600 Is RabbitMQ used for inter-process communication?
00:14:24.720 Yes, we utilize it as an alternative to HTTP for communication between nodes.
00:14:30.880 Thank you!
00:14:36.880 Thank you all!