Building a real-time analytics engine in JRuby

This video was recorded on http://wrocloverb.com. You should follow us at https://twitter.com/wrocloverb. See you next year!

When most people hear big data real time analytics, they think huge enterprise systems and tools like Java, Oracle and so on. Thats not the case at Burt. We're building a real time ad analytics engine in Ruby, and we're handling tens of thousands of requests per second so far. Thanks to JRuby we can have the best of two worlds, the ease of developing with Ruby, and the robustness of the Java echosystem. In this talk you'll discover how we built a massive crunching pipeline using only virtual servers on AWS and free tools like JRuby, MongoDB, RabbitMQ, Redis, Cassandra and Nginx.

wroc_love.rb 2013

00:00:18.240 Please welcome our speaker, David Dahl.

00:00:28.800 Thank you! Can you hear me? Good. The guy in the top row can hear me.

00:00:35.120 So then, everyone can hear me, hopefully. All right, I'm going to talk a bit about JRuby—I know it's a controversial topic.

00:00:41.200 To start off, I am a senior developer at a company called Burt in Gothenburg, Sweden. We do analytics for online advertising.

00:00:48.399 We primarily use Ruby, and most specifically, JRuby. We manage all of this on AWS.

00:00:55.120 That's our case, and that's the story I'm going to share today. Just a brief overview of what we do: we take something like a poorly designed website.

00:01:06.960 This particular site is filled with ads, which highlights our goal: to make online advertising a viable option that people don’t hate and that actually works.

00:01:13.840 We want to help publishers make money so they can produce good content.

00:01:20.479 To achieve this, we first ensure that publishers understand what they are selling.

00:01:26.560 We track ads and create insightful graphs. It should be simple, right?

00:01:32.640 When we got started, we did what most people do: we logged everything into a database and created reports.

00:01:38.079 However, this approach broke down during our first major campaign—no surprise there.

00:01:44.159 We realized we needed to pre-calculate everything, which involved a lot of calculations.

00:01:49.759 Initially, we had everything in one major application, which made deployment extremely daunting.

00:01:55.680 At that point, we were also still using MRI Ruby, which has its own bottlenecks.

00:02:03.920 We realized that we needed to change our approach.

00:02:10.080 We conceptualized an idea that looked somewhat like this: imagine each dot here represents one application or service.

00:02:17.920 Every service performs a specific task, and we send data along a pipeline, refining it through each service.

00:02:24.960 Eventually, this process results in useful metrics that we can display in our web applications.

00:02:31.120 One critical lesson in building a distributed system is to always start with at least two of everything.

00:02:36.560 This helps solve the hard part of the problem.

00:02:40.480 Going from one to two services is challenging, but scaling up from two to three or four is much easier.

00:02:46.239 As we started implementing this idea, we got stuck.

00:02:51.920 Still using MRI, we planned to separate and buffer operations with RabbitMQ, which uses EventMachine.

00:02:58.720 We intended to handle storage with MongoDB, but these blocking operations led to issues.

00:03:04.640 Ruby's threading limitations meant we couldn't effectively manage concurrent processes.

00:03:12.000 After a lot of deliberation, we considered switching to Java.

00:03:17.040 Although enterprise solutions often have a bad reputation in our community, they have their merits.

00:03:25.600 Enterprise tools are proven and come with many libraries, especially for high-performance applications.

00:03:31.680 We soon discovered JRuby, which allows us to utilize Java’s capabilities while writing in Ruby.

00:03:41.680 With JRuby, we gain threading, a real garbage collector, and just-in-time compilation.

00:03:46.720 We also have access to a vast array of Java, Scala, and Ruby libraries.

00:03:52.799 As we delved into wrapping Java libraries, we found it quite enjoyable.

00:03:58.159 This experience made coding much more enjoyable.

00:04:03.599 So, we proceeded with this direction, but we faced various challenges.

00:04:09.760 One major challenge was ensuring that we never experienced downtime.

00:04:14.960 We track advertisements, and our processes must remain invisible to visitors.

00:04:20.720 If something fails, users typically reload the page, which should not happen with our service.

00:04:27.680 We need to continuously receive data. While we can afford latency, we cannot afford to stop receiving data.

00:04:35.280 As long as our web server acknowledges receipt with a '200 OK,' we can buffer data and process it later.

00:04:41.440 Given that we operate a major application across over a hundred AWS instances, we must avoid unexpected errors.

00:04:48.640 It's acceptable for our application to crash occasionally, provided we can restart it and fix any issues in a timely manner.

00:04:54.880 However, data can build rapidly in buffers in a short time, so we need to act quickly.

00:05:01.680 Buffering became a critical part of our approach.

00:05:06.560 We came to really appreciate RabbitMQ, as it was the best architectural choice we made.

00:05:13.199 This queuing system facilitated dividing our operations into isolated services.

00:05:17.760 We had to carefully consider the interfaces between services and how we would represent data externally.

00:05:24.399 Furthermore, when adding a buffering layer, it’s important to consider how transactions will work.

00:05:31.519 This involves determining whether to acknowledge messages before or after persisting them.

00:05:39.040 If you receive a message and persist data, if an error occurs while persisting, you must decide how to handle it.

00:05:46.399 You can either get the message one extra time or lose some data, depending on your acknowledgment strategy.

00:05:52.880 Making operations idempotent means that even if you receive duplicates, the operation remains consistent.

00:05:58.560 This topic or item potency deserves a separate discussion.

00:06:03.760 Regarding databases, we learned the hard way how crucial it is to select the right tools for each job.

00:06:10.560 We started using MongoDB everywhere. For those who have heard us discuss it, you know our mixed feelings.

00:06:17.200 While MongoDB can be highly effective in the right context, it can also be extremely frustrating in the wrong context.

00:06:22.640 We spent weeks troubleshooting when it didn’t fit our needs.

00:06:29.360 We also incorporated Cassandra into our database family. It's an impressive yet complex system.

00:06:36.240 Cassandra is very Java-centric, which can be a pro or a con, depending on your perspective.

00:06:42.880 Other databases we recommend include Redis for certain applications.

00:06:48.080 An alternative approach is to pursue a no-database model, keeping everything in-stream without storing it.

00:06:54.000 Only keep data in memory until processed.

00:07:00.000 Now let’s talk about code and Java. When building distributed systems, Java is a key platform.

00:07:05.760 Java's concurrency library provides everything you need. For those looking for a quick guide, I recommend a specific book.

00:07:13.760 Executors in Java allow for easy threading and work nicely within Ruby.

00:07:20.640 You can submit a block and specify a thread size based on your machine’s capabilities.

00:07:28.560 However, be aware that issues can arise, as we experienced with the JRuby interpreter.

00:07:35.760 Occasionally, it misinterprets classes, leading to unexpected behavior.

00:07:42.000 Despite these hiccups, most time things should work as expected.

00:07:48.480 Blocking queues are another great feature in Java's concurrent library.

00:07:54.720 They allow for producer-consumer patterns while considering back pressure, which I'll discuss next.

00:08:01.440 For example, say you have an application that processes data from a queue and persists it to a database.

00:08:09.600 What happens if your persistence thread slows down? If you create an unbounded queue, you could run out of memory.

00:08:15.840 This leads to a crash when the queue fills up.

00:08:20.640 A better approach involves setting a size limit on the queue, ensuring that the input thread pauses until there is room.

00:08:27.280 We've learned this lesson painfully—backpressure can't be ignored.

00:08:34.080 When threading, consider atomic variables and concurrent data structures, as shared mutable state is tricky.

00:08:40.800 Java provides mechanisms like count down latches and semaphores for distributed locks.

00:08:47.600 The Google Guava library is also an excellent resource, addressing gaps in Java’s concurrency offerings.

00:08:55.280 It includes various utilities like caches, multi-maps, and more.

00:09:01.440 LMAX Disruptor is a high-performance inter-thread messaging library, built for real-time applications.

00:09:08.960 It’s an innovative solution to explore for high-performance JRuby applications.

00:09:14.240 In summary, managing multithreading in JRuby requires an understanding that thread safety is complex.

00:09:21.440 Utilizing Java’s utilities can make life easier; avoid shared mutable state.

00:09:28.000 Consider techniques like back pressure in your designs.

00:09:34.880 Let’s discuss actors. How many of you have heard of actors?

00:09:43.040 Akka is a famous concurrency library in Scala, widely recognized for its actor-based model.

00:09:50.560 We've created a small Ruby wrapper called Mika for easier use.

00:09:56.080 The library simplifies implementation, though it’s not actively maintained.

00:10:03.200 Storm is another ontology-based system that handles processing for you. How many of you have heard of Storm?

00:10:10.960 Storm manages the distribution of tasks effectively. It was acquired by Twitter.

00:10:16.960 Regarding RabbitMQ and Storm, I personally favor RabbitMQ.

00:10:23.200 We utilize a library called Redstorm, which abstracts many difficult aspects of using Storm.

00:10:30.960 However, Storm didn’t work for us due to our high branching factor.

00:10:37.920 This means that each item fed into the system generated hundreds of operations.

00:10:43.920 We needed to wait for all these operations to be persisted before acknowledging them.

00:10:49.920 The overhead from this process was inefficient for our needs.

00:10:56.800 On the topic of big data, how many of you are familiar with Hadoop?

00:11:02.720 My colleague, Theo, developed a library called RubyDupe, which provides a nice abstraction for writing Hadoop jobs.

00:11:09.600 You can implement mappers and reducers by defining map or reduce methods.

00:11:15.040 Then you set up a job to specify input and output and run it.

00:11:21.360 However, managing Hadoop's infrastructure and versions poses various challenges.

00:11:27.680 If you're interested in Hadoop, research its staggering versioning issues.

00:11:33.840 And just like that, I’ve finished faster than I expected.

00:11:39.680 Now I have time for some questions.

00:11:46.080 We have also developed small Ruby wrappers for various Java libraries.

00:11:53.760 Hot Bunnies is a wrapper for RabbitMQ, now maintained by the same person who oversees the AMQP gem.

00:12:02.240 Druidice is a wrapper for Cassandra. My colleagues creatively opted for German names.

00:12:10.560 Pillops is another library that wraps 0MQ, and Multimeter is for Yammer Matrix.

00:12:16.880 That concludes my presentation.

00:12:24.720 Are there any questions?

00:12:51.440 Yes, the question is whether JRuby is faster or slower than Java for running Hadoop.

00:12:56.720 Overall, JRuby on the JVM is indeed slower than Java.

00:13:02.159 However, the difference isn’t significant, and implementing solutions faster can be more valuable.

00:13:09.120 We find it more critical to scale horizontally rather than maximize performance.

00:13:14.400 For instance, using AWS’s Elastic MapReduce, you pay by the hour.

00:13:21.000 While performance is relevant, it isn’t a significant enough issue for us.

00:13:28.000 Are there any additional questions?

00:13:49.200 When using RabbitMQ as a buffer, do we maintain persistence?

00:13:55.360 Yes, we persist everything to ensure reliability. Otherwise, we would risk data loss if the system fails.

00:14:02.160 RabbitMQ has improved significantly, addressing prior back pressure issues.

00:14:09.080 In earlier versions, if pushed too fast, you would encounter performance problems.

00:14:15.600 Is RabbitMQ used for inter-process communication?

00:14:24.720 Yes, we utilize it as an alternative to HTTP for communication between nodes.

00:14:30.880 Thank you!

00:14:36.880 Thank you all!