RubyConf TH 2023

Event Streaming Patterns for Ruby Services

A talk from RubyConfTH 2023, held in Bangkok, Thailand on October 6-7, 2023.
Find out more and register for updates for our next conference at https://rubyconfth.com/

00:00:07.200 Hi! How are you doing? Wasn't this a cool conference? I'm so happy to be here; it's my first trip to Thailand. Are you all having fun? Cool! Let's give a big thank you to our organizers. This is such a great event! By the way, the best conference food I've ever had was here. Seriously, the meal yesterday was amazing! I'm really looking forward to lunch.
00:00:20.359 My name is Brad Urani, and I work for Procore. If you need to contact me, you can find me on Twitter. I will not call it X; it's Twitter. It will always be Twitter to me, even though I don’t use it much anymore. Procore makes software for the construction industry. The world's largest construction projects run on Procore. If you need to build a skyscraper, a subway, or a stadium, you use our software. Here's an example of someone out in the field using one of our Procore drawing apps to build something important.
00:01:00.000 We are a Ruby on Rails success story. We had our IPO in 2021, which came after 18 years of developing our solution using Ruby on Rails. We started our Ruby on Rails app on version 0.9, which was around 18 or 19 years ago. Of course, we are not just using Ruby on Rails; we are doing a lot of interesting things like machine learning. We are also working on augmented reality for construction sites, where you can walk around a building that's under construction and detect defects. For example, you can take photos using a helmet-mounted camera. We're also integrating AI and utilizing drones that fly around buildings and take photos. I'm the principal architect for our streaming platform.
00:01:38.000 Streaming connects services and applications; it acts like the central nervous system for our solution. We have many services and applications that need to communicate with each other. We run Procore in multiple regions, including Europe, the USA, and Canada. These regions need to share data, and I work on building that global data-sharing system, focusing on distributed systems. A fun fact about Procore: we might have the world's largest Ruby on Rails app, depending on how you count it. A year ago, we had 43,000 files and 1,123 tables.
00:02:25.640 I believe Shopify and we are neck and neck in lines of code, although we might have more models. We had 4,000 API endpoints, which is more than they have in a single app, and about 300 people working on it across the front end and back end. As I mentioned, we don't only use Ruby on Rails; we have many downstream technologies, including artificial intelligence. We have an entire platform just for search, which we are currently rebuilding to use ChatGPT for natural-language search. For reporting, we have enterprise software that needs to generate large spreadsheets and export PDFs.
00:03:06.720 This brings up interesting technical challenges as we work on this system. First, we have this huge Rails app, and we want to split this up and move to a service-oriented architecture, which is challenging. We need to handle multi-region support effectively, especially since we have data centers in Europe and North America. How do we share data between these regions, and how do we allow one region to access data from another region? A common challenge in larger Rails apps is that we originally built our search feature directly into Rails using Postgres search. While that worked fine when we were a startup, we need to evolve beyond that.
00:03:43.920 We’re also moving our reporting out. We face the question of how we effectively communicate with these new services and maintain end-to-end consistency. Let me define consistency since you'll hear me mention it often. Consistency means that when you have multiple services and multiple data stores, if you need to replicate data from one app to another, they must be 100% identical. Any change to one database must also appear in the other.
00:04:23.200 For example, if we have a new service and a user records service, we need to ensure that the users are correctly reflected in both services and databases. If we separate our search into Elasticsearch while retaining our primary Rails database, we cannot allow a record to enter our primary database without also appearing in Elasticsearch. If that happens, users will receive incorrect search results. We use Snowflake for reporting, and if something is in the main database, it must also be in Snowflake; otherwise, reports may be inaccurate, which is unacceptable. We need a solution that guarantees 100% consistency.
00:05:24.960 Now, let me address a quick question: what is the best method for making two Rails apps, or services, communicate? This appears straightforward, but it's not so simple. One approach might be to have one app send POST requests to another, but this has several drawbacks. What if the downstream service is offline or being deployed, or if network issues arise and the upstream app cannot reach it? You can retry, but eventually a failure is inevitable. At that point you've introduced inconsistency, which is exactly the problem we're trying to avoid. Using gRPC makes the calls faster and more efficient, but it carries the same problems.
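To make that failure mode concrete, here is a minimal sketch of the naive approach. The URL, payload, and retry policy are all hypothetical; the point is what happens once the retries run out.

```ruby
require "net/http"
require "json"

# Naive service-to-service call: POST the new user to the downstream
# service and retry a few times. Once the retries are exhausted, the
# write is simply lost and the two services drift apart.
def notify_downstream(user, attempts: 3)
  uri = URI("https://search-service.example.com/users") # hypothetical endpoint
  attempts.times do
    response = Net::HTTP.post(uri, user.to_json, "Content-Type" => "application/json")
    return true if response.is_a?(Net::HTTPSuccess)
  rescue Net::OpenTimeout, Errno::ECONNREFUSED, SocketError
    sleep 1 # back off and try again
  end
  false # the downstream service never saw the record: inconsistency
end
```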
00:06:04.639 What you really need is a system between the two services, commonly known as an event bus. Think of it as a bus transporting events, or a transaction log, or a stream, or pub/sub (it goes by many names). Typically, this is implemented using Apache Kafka, which is an open-source event bus system that works wonders for delivering messages reliably between different applications and services.
00:06:55.120 Kafka is not the only choice; AWS provides similar technology called Kinesis, and Google Cloud has Pub/Sub. Though I haven't used it, I've heard it's quite good for its intended purpose. However, if you’re looking for something versatile and full-featured, I recommend Apache Kafka due to its extensive tooling. In Kafka, we have a producer that publishes messages to topics, and a consumer that subscribes to those topics. Each topic acts like a queue for messages. For instance, you may have a 'user' topic where all the user records are stored.
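As a rough sketch of that producer/topic/consumer model in Ruby, here is what it might look like with the low-level rdkafka gem. The broker address, topic name, and group id are placeholders; note the message key, since messages with the same key land on the same partition and stay in order relative to each other.

```ruby
require "rdkafka"

config = Rdkafka::Config.new(
  "bootstrap.servers": "localhost:9092", # placeholder broker address
  "group.id":          "search-indexer"  # placeholder consumer group
)

# Producer side: publish a user record to the "users" topic.
# `wait` blocks until the broker confirms delivery.
producer = config.producer
producer.produce(topic: "users", key: "user-42", payload: '{"id":42,"name":"Ada"}').wait
producer.close

# Consumer side: subscribe to the topic and process messages as they arrive.
consumer = config.consumer
consumer.subscribe("users")
consumer.each do |message|
  puts "#{message.key}: #{message.payload}"
end
```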
00:07:56.000 This system provides significant benefits. Kafka offers at-least-once delivery, guaranteeing that a consumer will always receive a message as long as it's set up correctly. It ensures that messages are stored durably, which I won't delve into now due to time constraints. When you write messages to Kafka, the order in which you write them is the order in which you read them (strictly speaking, within a partition, which is why keyed messages matter), an essential property that many queuing solutions lack. When you publish a message to Kafka, you can have multiple subscribers, and you can keep adding more as needed, which facilitates growth as your organization scales. Multiple teams may find the same message useful and build their own integrations off the data.
00:09:03.760 The replayability feature of Kafka is another significant plus. Messages remain available in Kafka even after being read, for a configurable retention window (ours is set to one week), so if something goes wrong you can fix the issue by simply rereading the necessary messages. Combined with at-least-once delivery and guaranteed ordering, this results in a reliable messaging system: the message will always eventually reach its destination as long as you've configured everything properly. Kafka is also durable and highly available; we have never experienced a significant outage with it in all the years we've run it.
00:10:05.360 So returning to the question: how do you make one service communicate with another? The answer is to use Kafka for buffered communication, meaning that if your upstream app generates a flood of messages, the downstream application can process them at its own pace without losing any messages. That's why I advocate for it—you get a consistent, replayable, and democratized solution.
00:10:51.120 Now, I want to mention alternatives to Kafka, such as Sidekiq, which is backed by Redis. It's important to know that Sidekiq doesn't guarantee the order of jobs, which can lead to unwanted outcomes in your application. RabbitMQ and SQS have their own quirks, and they don't always provide the guarantees, like ordering or retention, that you find with Kafka. Careful consideration is needed when selecting your queuing technology.
00:11:47.680 When it comes to integrating Kafka into your architecture, there are several options. For instance, we use Amazon's managed Kafka service (MSK), which eliminates the need to maintain your own Kafka brokers. Heroku also offers a solid Kafka service, and we're transitioning to Confluent Cloud, run by Confluent, the company founded by Kafka's original creators. It supplies not only Kafka itself but also additional tooling and support that enhance it.
00:12:19.120 For Ruby developers, there’s an excellent gem called Karafka, which acts as a wrapper around the essential Kafka functionalities. It’s fully featured and maintained by a talented developer in Poland, who deserves a shout-out for his efforts. Using Kafka for consistent data streaming simplifies your pipeline, but it's essential to ensure a robust design.
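As a rough sketch of what Karafka usage looks like (the topic, class names, and broker address below are made up for illustration):

```ruby
require "karafka"

# A consumer class that handles batches of messages from a topic.
class UsersConsumer < Karafka::BaseConsumer
  def consume
    messages.each do |message|
      # e.g. upsert the record into Elasticsearch here
      puts message.payload
    end
  end
end

# karafka.rb-style setup and routing (Karafka 2.x)
class KarafkaApp < Karafka::App
  setup do |config|
    config.kafka = { 'bootstrap.servers': 'localhost:9092' } # placeholder
    config.client_id = 'example_app'                         # placeholder
  end

  routes.draw do
    topic :users do
      consumer UsersConsumer
    end
  end
end
```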
00:13:01.760 When producing messages to Kafka, you can choose between synchronous and asynchronous modes. The synchronous method is simpler: you write a message, wait for the broker to acknowledge it, and handle any errors right there, giving the user appropriate feedback. With the asynchronous method, messages are queued in a background thread, yielding quicker responses to users. That speed has benefits, but inconsistency remains a risk if a failure occurs after the response has already been returned.
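With Karafka's bundled WaterDrop producer, the two modes look roughly like this (the topic and payload are illustrative):

```ruby
# Synchronous produce: blocks until the broker acknowledges the write,
# so errors can be handled right here and surfaced to the user.
Karafka.producer.produce_sync(topic: "users", payload: user.to_json)

# Asynchronous produce: enqueues the message in a background thread and
# returns immediately. Faster for the caller, but a delivery failure
# after the HTTP response has gone out leaves the systems inconsistent.
Karafka.producer.produce_async(topic: "users", payload: user.to_json)
```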
00:13:51.960 Another challenge is guaranteeing absolute consistency across services and databases. If a Rails app processes an inbound request that involves both writing to a database and sending a Kafka message, you must ensure that either both actions succeed or both fail. Otherwise you face the dreaded dual-write problem: the Kafka message is delivered but the database transaction fails (or vice versa), leading to inconsistency.
00:14:25.520 To address this, some developers use the transactional outbox pattern. You create an outbox table in your database, and messages are written to it in the same transaction as your domain data before being relayed to Kafka, which keeps the two in sync. Some cloud databases offer this in an integrated way; on AWS, for example, DynamoDB can automatically emit every change to a stream (DynamoDB Streams or Kinesis), which you can then relay into Kafka.
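A minimal sketch of the outbox pattern in a Rails app, assuming a hypothetical OutboxMessage model; a real implementation would also handle relay crashes and publish retries:

```ruby
# Step 1: write the domain record and the outbox row in one database
# transaction, so either both commit or neither does.
ActiveRecord::Base.transaction do
  user = User.create!(name: "Ada")
  OutboxMessage.create!(topic: "users", key: user.id.to_s, payload: user.to_json)
end

# Step 2: a separate relay process drains the outbox (find_each batches
# in primary-key order) and publishes to Kafka, deleting each row only
# after the broker has confirmed delivery.
OutboxMessage.find_each do |msg|
  Karafka.producer.produce_sync(topic: msg.topic, key: msg.key, payload: msg.payload)
  msg.destroy!
end
```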
00:15:26.800 As we wrap up, I want to emphasize one more approach to ensuring consistency: change data capture (CDC). CDC tails the database's changelog (in Postgres, the write-ahead log) and emits every insert, update, and delete as an event. You can then map those raw row changes into a cleaner message format, separating what consumers see from your actual database structure. That flexibility lets you present cleaner APIs to your consumers while preventing your database schema from leaking out.
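For instance, a CDC tool such as Debezium emits raw before/after row images; a small mapping layer can turn those into a stable public event. The internal column and event names below are hypothetical.

```ruby
# Map a raw CDC changelog event into a small public schema so internal
# column names never leak to consumers. Debezium-style envelopes carry
# "op" (c/u/d), "before"/"after" row images, and a "ts_ms" timestamp.
def to_public_user_event(cdc_event)
  row = cdc_event.fetch("after")
  {
    event: cdc_event["op"] == "c" ? "user.created" : "user.updated",
    id: row["id"],
    name: row["full_name"], # internal column, renamed for the public API
    occurred_at: cdc_event["ts_ms"]
  }
end
```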
00:16:32.120 In conclusion, streaming provides a globally distributed, eventually consistent, near real-time way to move messages between systems. That reliability is crucial in complex applications involving multiple teams, and I believe it's an excellent way to manage an intricate enterprise architecture.
00:17:16.960 Should you set up Kafka in your own architecture, remember to keep your team's messaging standard consistent: define the headers and the serialization format up front so that message formats stay simple and every service can understand every other service's messages.
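One possible team convention, sketched below: stable Kafka headers for routing and versioning plus a versioned JSON payload. The header names are our own convention here, not a Kafka standard.

```ruby
# Attach routing and versioning metadata as Kafka headers so consumers
# can filter and dispatch without parsing the payload.
Karafka.producer.produce_sync(
  topic: "users",
  key: user.id.to_s,
  payload: { id: user.id, name: user.name }.to_json,
  headers: {
    "event-type"     => "user.created",
    "schema-version" => "1",
    "producer"       => "core-rails-app"
  }
)
```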
00:18:00.680 Lastly, whether you are an established organization or a startup, consider what the future holds for log-driven development, whereby instead of writing directly to databases, applications send data to Kafka, which is subsequently processed by downstream consumers. This asynchronous architecture shifts practices and can lead to greater scalability.
00:19:08.040 I hope this overview of event streaming has been useful for your projects. Make sure to stay informed about ongoing improvements in the Kafka ecosystem. Thank you all for joining my talk today!
00:24:14.431 Thank you! Much appreciated!