Ruby Video | Minecart - A story of Ruby at a growing company

00:00:30.519 Hello everyone, I'm Matt Wilson, and I'm an engineer at Square. So, real quick, how many people have heard of someone deploying to JRuby in production?

00:00:36.320 Alright, that's smaller than I thought—about 45% of the audience. How many people actually deploy to JRuby in production?

00:00:43.200 Okay, that's like three percent of the audience. So, I love JRuby. JRuby has mature garbage collection and native threading.

00:00:49.840 It has the promise of fine-tuning slow parts of your application, and you can use well-known crypto libraries like Bouncy Castle without having to shell out to the command line.

00:01:05.000 Now, JRuby is not a ploy to get Java into your stack. I think they have not done a great job branding it. JRuby is simply a platform that can interpret Ruby, and it does that well.

00:01:16.840 There are a couple of problems with JRuby. One is startup time, and the second is that sometimes you will get Java stack traces. But I believe Charles can fix this, and I believe in Charles.

00:01:31.040 Let's look at what we're going to talk about. I will share many of our learnings along with a couple of practical examples and pitfalls for integrating Ruby into our custom Java framework.

00:01:42.600 To begin, there is a lot you have to do to convert from a typed language to something not typed. Can everyone see my pointer? I'm just kidding. I wrote that presentation—it was really boring.

00:01:55.159 We have a lot of technical details on our blog at corner.squareup.com if you're interested in how this was done. But what was really fascinating about the talk as I developed it is that Square's Ruby stack leverages our production environment completely.

00:02:06.520 So, what is a production environment, and why couldn't Ruby leverage it before? I'm going to talk a little bit about that today. What I'd like you to take away from this talk is, first of all, that JRuby is awesome.

00:02:25.879 The JVM is something you should consider deploying to or using, even if you never intend to drop down and write Java, Clojure, or other JVM languages. As you split out into a service-oriented architecture, it's critical that you think about building a common framework that all your apps are deployed on top of.

00:02:52.159 This gives you the ability to design and engineer the production environment. For the polyglots among you, I'll show you an example of a very large Java framework that many of our Java engineers worked on. It's a good framework that allows us to seamlessly integrate Ruby on top in a way that any Ruby engineer off the street can come in and develop Ruby using all the familiar toolkits without understanding what's happening underneath.

00:03:31.599 Now, let's give a brief history of how Square began. I got very lucky. I joined Square early on and worked on a project called Wallet, which was the second application from Square.

00:03:50.160 The idea was you could walk into a store, and by proximity, we could detect when you were close to a merchant. If you had given authorization for that merchant to charge you, your name and picture would show up on the register.

00:04:06.120 You could walk up, order a coffee, and say, 'Charge Matt,' take your coffee, and leave without touching any technology. It changed how people interacted, and it was a really cool project to work on.

00:04:27.000 How this was done was we took four people out of one of the four teams, placed them in a room, and began developing. It was a three-month sprint to add the ability to store credit cards at Square, publish merchant locations for customers to find, and allow customers to search for these merchants.

00:04:43.880 The reason we were able to achieve this was highly optimized communication within the team; we came up with a common vision for where we wanted to go and made decisions autonomously. After three months, we emerged, and the company supported us for another month to push it out, and it launched successfully.

00:05:02.479 Later, we faced a common challenge that many growing companies experience: how do we depart from our monolith?

00:05:07.600 I want to emphasize that monolithic apps are great. Service-oriented architecture is a people solution; it is not a code solution or an availability solution. It's a way for many engineers to collaborate without affecting each other.

00:05:32.520 As we approached this transition, one of the key challenges you must solve is your message bus. So what is a message bus? It defines how data transitions or moves through your system.

00:05:50.040 One implementation is feeds. Feeds are basically a pagination API where you page through data via an append-only list. We wrote it in Java first, and it turns out there are some nuances around sharding because you want to horizontally scale out your consumers.
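
As a rough illustration of the feed idea described above, here is a minimal in-memory sketch, assuming a cursor that is simply the position in the append-only list. The `Feed` class, its `publish` and `page` methods, and the entry names are all hypothetical; the talk does not show Square's implementation, and sharding is left out entirely.

```ruby
# Hypothetical in-memory feed: an append-only list paged through by cursor.
class Feed
  Page = Struct.new(:entries, :next_cursor)

  def initialize
    @entries = [] # append-only: entries are never updated or removed
  end

  # Appends an entry and returns its position in the feed.
  def publish(entry)
    @entries << entry
    @entries.size - 1
  end

  # Returns up to `limit` entries starting at `cursor`, plus the cursor
  # the consumer should send with its next request.
  def page(cursor: 0, limit: 2)
    entries = @entries[cursor, limit] || []
    Page.new(entries, cursor + entries.size)
  end
end

# A consumer pages until it reaches the end of the feed.
feed = Feed.new
%w[order-1 order-2 order-3].each { |entry| feed.publish(entry) }

cursor = 0
consumed = []
loop do
  page = feed.page(cursor: cursor, limit: 2)
  break if page.entries.empty?

  consumed.concat(page.entries)
  cursor = page.next_cursor
end
```

Because the list is append-only, a consumer only has to remember its last cursor to resume where it left off.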

00:06:06.720 We found additional complexities around publishing, using auto-increment, and dealing with race conditions in the database. There was a significant amount of code involved, but we got it working in Java, and Java can consume from our Ruby services. Now we had to get it working in Ruby, but there were challenges.

00:06:45.000 Our implementation was based on ActiveRecord, and it was daunting. We had a lot of code to maintain, so many would argue it's not idiomatic Ruby.

00:06:58.560 The truth is, it is challenging to maintain feature parity across languages. If you have language one and language two, you must choose a primary to implement first and then copy over the class structure. If there’s a security vulnerability or a new feature, you need to make that change in both languages.

00:07:27.000 I don't believe there is a mythical team that can effectively leverage two stacks or build two stacks with feature equivalence over time.

00:07:47.000 But it worked! Ruby was finally able to communicate with Java, and this was essentially the only infrastructure we had in our Java stack at that point. We continued building new Ruby and Java services, but then...

00:08:07.680 Starbucks came along, which was a completely different animal. From day four, we had 50 engineers working on this project. Every time you swipe a card at a Starbucks location, it goes through Square's network. Because of the work I did on Wallet, I got to architect the Wallet integration with Starbucks.

00:08:45.240 This experience taught me about decision sequencing. What decisions needed to be made today? What could we find room to experiment with? And what decisions could we postpone until later?

00:09:03.600 A couple of engineers collaborated to define how services would communicate within the production environment. They established how we would maintain these services, provide necessary SLAs, and expose our APIs both internally and externally.

00:09:31.679 At that time, we were in a single data center with about six apps in production. After three months, we had 15 services in production across two data centers, actively processing card transactions. This challenge significantly advanced our infrastructure.

00:10:06.560 What made the Starbucks project successful was that we optimized for communication. By defining how systems would function in production, we created a framework that all teams could work within iteratively. We had a clear goal to strive toward.

00:10:32.560 This developed very differently compared to how the Wallet project evolved, but it is amazing that both development styles can coexist. Some might think they conflict, but they truly do not; day-to-day agility is how we develop at Square.

00:10:57.600 The question then became: how do we leverage our production stack? We built out technology that made Square robust, especially while other companies aimed for exceedingly high availability.

00:11:23.440 One option was to build a Ruby service container and keep it in feature parity with the Java service container, as I previously described. Alternatively, we could allow Ruby to access some resources in the service container and communicate through a Java API.

00:11:55.640 Another idea was to overlay Rails on the service container and communicate with resources via a Java API. However, I found it most appealing to create a thin API layer that adapted the concepts in the service container to be more idiomatic Ruby.

00:12:34.080 I believed this approach could succeed and remain maintainable since the service container was already optimized for application developers. It offered a few key concepts we could represent in a project we called Minecart.

00:12:56.760 Both sides would develop independently while maintaining a stable API. Now, when I refer to service containers and stability, what does that mean? I like this picture because it reminds me of something Steve Jobs once said: software is like scaffolding.

00:13:21.760 By using Objective-C and other native APIs, you can start at a higher level and build a taller building. If you start without any scaffolding, you can only build so high before it falls over. Excitingly, the production environment introduces a new layer of optimization.

00:13:46.840 During development, you often build projects, ship them, and several months later, a team may decide to create something new. However, this often leads to losing out on wisdom gained from the initial project.

00:14:07.440 Each new project may not integrate well with the previous ones, leading to confusion. In contrast, what you want in your production environment is complexity managed like a city. You want all road sizes to be uniform, power grids to supply buildings uniformly, and common fixtures across the design.

00:14:21.879 This uniformity enables infrastructure engineers to balance trade-offs between short-term and long-term goals, allowing them to predict the success of projects without running into unforeseen complications.

00:14:44.480 Regarding scaffolding, production scaffolding like Heroku and App Engine offers particular APIs for monitoring service performance. Meanwhile, environments like EC2 or Rackspace give you more control without caring about the deployed code.

00:15:01.960 We also have frameworks optimized for development environments, such as Rails or Java Play, which are not necessarily designed for production but perform well in it nonetheless.

00:15:32.320 Common features found within our framework include consistent packaging of dependencies for uniform deployment, an application lifecycle managing traffic, and rolling deployments to avoid losing requests.

00:15:44.760 Additionally, we require trust and security measures to maintain the integrity of sensitive information, monitoring for individual nodes and overall performance, and systems for job management and messages to facilitate seamless data flow.

00:16:17.040 We also implement health checks to ensure all downstream dependencies function properly, alongside distributed tracing and common logging to pinpoint errors when they occur.
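
A health check that covers downstream dependencies could look roughly like the sketch below. The `HealthCheck` class and the dependency names are hypothetical and only illustrate the idea of one report aggregating several checks; this is not the service container's API.

```ruby
# Hypothetical health check that aggregates checks for downstream dependencies.
class HealthCheck
  def initialize
    @checks = {}
  end

  # Registers a named check; the block returns true when the dependency is healthy.
  def register(name, &block)
    @checks[name] = block
  end

  # Runs every check. A check that raises is reported as failing
  # instead of crashing the health endpoint.
  def run
    results = @checks.transform_values do |check|
      begin
        check.call ? :ok : :failing
      rescue StandardError
        :failing
      end
    end
    { healthy: results.values.all? { |status| status == :ok }, dependencies: results }
  end
end

health = HealthCheck.new
health.register(:database) { true }
health.register(:kitchen_service) { raise "connection refused" }
report = health.run
```

Here `report[:healthy]` is false and `report[:dependencies]` names the failing dependency, which is the piece of information an on-call engineer needs first.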

00:16:43.200 The ultimate goal of all this work is to enable application engineers to focus on creating value for customers. As an application engineer myself, the most essential part is having an idea, building it, and shipping it without burdensome maintenance.

00:17:07.680 When we optimize for the production environment, we elevate the conversation back to the application's needs. For example, if I wanted a feature that charged a wallet credit card, I could sit down with security and application engineers to discuss the necessary security protocols.

00:17:29.840 With a common system, the security engineers can intuitively understand where the service fits into our infrastructure, leading to much more efficient collaboration and increased chances of success for complex products.

00:17:58.880 Unfortunately, optimizing for the production environment can sometimes conflict with development needs. That was the goal of Minecart: to create a system that maximizes the advantages of both.

00:18:22.080 Now, let's explore what Ruby code at Square looks like. This is a standard Rails controller, and here are two special methods: kitchen and employee. These methods serve as APIs to communicate with other services.

00:18:39.960 We have a remote kitchen service and a remote employee service. We require login, and there’s a curious current user token used for tracking user activity without having to pass around actual user objects with behaviors.

00:19:07.440 In our distributed network, a system called Multipass handles user authentication, exchanging a session for a user. It uses the session token to retrieve the user token, allowing seamless interaction throughout the system.

00:19:21.680 Next, we call kitchen delivery orders to get the order list from the corresponding response. Note that the syntax here has a specific meaning: we’re performing this request synchronously.

00:19:36.880 To comprehend what's occurring in this request, we should briefly explore the story of service-to-service interaction within our project.

00:19:58.840 There are three components crucial to our connection. The first is mutual SSL, which defines our trust and security mechanism. Then there are protocol buffers for defining the API, and finally, we have a custom protocol for service-to-service communication called Sake.

00:20:29.760 Starting with mutual SSL: it works by allowing service A to talk to service B, which then checks its access control list to ensure service A is permitted to establish a connection.

00:20:47.960 However, if service B tries to talk to service C but isn't authorized, that connection gets blocked. This creates a secure, granular framework for communications between services.
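
The access control step can be sketched as a simple lookup. In a real mutual SSL setup the caller's identity would come from its verified client certificate; this sketch assumes that has already happened and only models the ACL check. The `AccessControlList` class and the service names are hypothetical.

```ruby
# Hypothetical access control list, keyed by the service receiving the connection.
class AccessControlList
  # rules: { callee name => [names of callers allowed to connect] }
  def initialize(rules)
    @rules = rules
  end

  # True when `caller_name` is allowed to open a connection to `callee`.
  # Unknown callees allow nobody.
  def allowed?(caller_name, callee)
    @rules.fetch(callee, []).include?(caller_name)
  end
end

acl = AccessControlList.new({
  "service-b" => ["service-a"], # service A may talk to service B
  "service-c" => []             # nobody may talk to service C yet
})

a_to_b = acl.allowed?("service-a", "service-b") # permitted
b_to_c = acl.allowed?("service-b", "service-c") # blocked
```

Defaulting to an empty list for unknown services means a connection is refused unless it has been explicitly granted.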

00:21:04.960 Once the connection is established, we examine our API. Protocol buffers function as value objects, providing a snapshot of an object at a specific moment, facilitating easy data transmission.

00:21:30.040 We define our services by saying that when we call 'show' with a driver ID, the service provides a corresponding driver response. Additionally, we can specify properties within the API definition.
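
To make the "value object" idea concrete, here is a plain-Ruby sketch that uses frozen `Struct` instances in place of generated protocol buffer classes. `DriverRequest`, `DriverResponse`, `DriverService`, and the sample driver are hypothetical names for illustration; real protocol buffer messages would be generated from a schema file rather than written by hand.

```ruby
# Hypothetical request and response messages, modeled as immutable value objects.
DriverRequest  = Struct.new(:driver_id, keyword_init: true)
DriverResponse = Struct.new(:driver_id, :name, keyword_init: true)

# Hypothetical service exposing a single `show` call.
class DriverService
  # drivers: { driver_id => name }
  def initialize(drivers)
    @drivers = drivers
  end

  # Takes a DriverRequest and returns a frozen DriverResponse snapshot.
  def show(request)
    name = @drivers.fetch(request.driver_id)
    DriverResponse.new(driver_id: request.driver_id, name: name).freeze
  end
end

service = DriverService.new({ 7 => "Dana" })
response = service.show(DriverRequest.new(driver_id: 7))
```

The response is a snapshot: it carries data but no behavior, and freezing it means a caller cannot mutate it after the fact.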

00:21:56.240 What actually facilitates data movement is our protocol called Sake, a standard socket that manages requests and responses using message IDs, ensuring requestors receive the correct responses.
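
Matching responses to requests by message ID can be sketched as follows. This is not the Sake wire format, which the talk does not describe; the `Multiplexer` class and the frame layout (a hash with `:id` and `:payload`) are assumptions made for illustration.

```ruby
# Hypothetical client that matches responses to requests by message ID.
class Multiplexer
  def initialize
    @next_id = 0
    @pending = {} # message ID => callback waiting for that response
  end

  # Remembers the callback and returns the frame that would be written to the socket.
  def request(payload, &on_response)
    id = (@next_id += 1)
    @pending[id] = on_response
    { id: id, payload: payload }
  end

  # Called for each frame read from the socket; responses may arrive in any order.
  def receive(frame)
    callback = @pending.delete(frame[:id])
    raise ArgumentError, "unknown message ID #{frame[:id]}" unless callback

    callback.call(frame[:payload])
  end
end

mux = Multiplexer.new
received = {}
orders_frame  = mux.request("orders")  { |body| received[:orders] = body }
drivers_frame = mux.request("drivers") { |body| received[:drivers] = body }

# Responses arrive out of order, yet each one reaches the right requestor.
mux.receive({ id: drivers_frame[:id], payload: "driver list" })
mux.receive({ id: orders_frame[:id], payload: "order list" })
```

The message ID is what lets many requests share one socket without the responses getting crossed.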

00:22:17.280 The Sake protocol also has this unique concept of a side channel that continues to persist throughout requests. When service A makes a request to service B, which in turn requests from service C, the side channel remains active.

00:22:55.320 This allows for exception chaining. If service C encounters an error, service B will relay the same error back to service A. Thus, if you're alerted regarding service A's issues, you can swiftly identify service C as the source.
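
The error relay can be sketched in plain Ruby. The side channel itself is not modeled here; the sketch only shows the effect described above, where an error raised in service C reaches service A still tagged with its origin. `RemoteError` and `Service` are hypothetical names.

```ruby
# Hypothetical error that records which service first raised it.
class RemoteError < StandardError
  attr_reader :origin

  def initialize(message, origin:)
    super(message)
    @origin = origin
  end
end

# Hypothetical service that either does local work or calls a downstream service.
class Service
  def initialize(name, downstream: nil, &work)
    @name = name
    @downstream = downstream
    @work = work
  end

  def call
    return @downstream.call if @downstream

    @work.call
  rescue RemoteError
    raise # already tagged by a downstream service, so relay it unchanged
  rescue StandardError => e
    raise RemoteError.new(e.message, origin: @name) # tag a local failure with this service
  end
end

service_c = Service.new("service-c") { raise "disk full" }
service_b = Service.new("service-b", downstream: service_c)
service_a = Service.new("service-a", downstream: service_b)

begin
  service_a.call
rescue RemoteError => e
  origin = e.origin # the service where the failure started
end
```

An alert on service A then points straight at service C instead of at the intermediate hop.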

00:23:23.320 Now that we're sufficiently prepared, let’s see what happens when we make our request to the Pizza Kitchen service. If the Pizza Kitchen service is unavailable, it automatically redirects to another service.

00:23:45.000 You can loop through all the orders and integrate the driver names, as the data is stored across separate services. Notably, we avoid waiting on any requests, as this script will use promises to access future data as needed.

00:24:05.240 This returns a promise, which provides access to future data. When the value is actually needed, it waits for the request to conclude and returns the requested value.
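
A promise of this kind can be approximated with a thread, as in the minimal sketch below. `SimplePromise` is a hypothetical stand-in, not Minecart's promise class, and the two blocks stand in for remote calls.

```ruby
# Hypothetical promise: starts the work immediately, blocks only when the value is read.
class SimplePromise
  def initialize(&work)
    @thread = Thread.new(&work)
  end

  # Waits for the work to finish (if it has not already) and returns its result.
  def value
    @thread.value
  end
end

# Both "requests" start right away and run concurrently.
orders  = SimplePromise.new { %w[order-1 order-2] }
drivers = SimplePromise.new { %w[Dana Lee] }

# Only here, when the data is needed, does the caller wait.
rows = orders.value.zip(drivers.value)
```

Starting both requests before reading either value is what lets the caller overlap the two waits instead of paying for them one after the other.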

00:24:20.560 All we have to do is render a simple HTML template for the order list, and it integrates smoothly. It's designed in such a way that a developer accustomed to standard Ruby can easily adapt to our RPC API and develop significant products.

00:24:37.840 This system, although complex, does not complicate the development story. Regarding the database, we're still utilizing a connection pool from our service container while adhering to the SQL ORM practices we expect.

00:25:10.360 Jobs operate simply, akin to how you'd typically expect, while load balancing is distributed across the cluster for RPC. You can opt for synchronous, asynchronous, or callback styles and set properties about those connections.

00:25:50.680 We also accommodate RPC calls on Rails controllers originating from web requests. If another service invokes a request, we offer a Rails-like controller that translates a protocol buffer into a Ruby API resembling Rails.

00:26:07.960 What's remarkable is that no one externally can discern whether they're interfacing with a Ruby or Java application, enabling full flexibility.

00:26:29.840 Lastly, regarding the message bus, while substantial amounts of code underpin this system, it doesn't lead to excessive Ruby code—just a single implementation.

00:26:47.200 The developers of our feeds infrastructure can depend on the service container's job system while providing flexibility to Ruby developers, letting them use tools like Sidekiq or Resque according to their requirements.

00:27:12.680 This fulfillment carries an anticlimactic feel, but the payoff is the emergence of a new service, the Global Name Service, aimed at facilitating service discovery.

00:27:29.560 In environments where redundancy is crucial, services typically interact with resources through a load balancer, which identifies out-of-service nodes. However, once you expand to multiple data centers, the common practice is to manage routing directly in the application.

00:28:02.560 GNS resolves this challenge by integrating into the application lifecycle, allowing an app during startup to request service connections, which GNS fulfills without delaying operations.

00:28:25.840 As clusters of clients bear the load balancing logic in software, they preferentially route traffic to the local data center while handling failures. Upon signaling readiness, GNS seamlessly begins directing traffic without hassle.
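
Client-side load balancing with a preference for the local data center might look roughly like this. `ServiceNode`, `Balancer`, the host names, and the round-robin policy are assumptions for illustration; the talk does not say how GNS clients choose among nodes.

```ruby
# Hypothetical node record: where it runs and whether it is currently healthy.
ServiceNode = Struct.new(:host, :datacenter, :healthy)

# Hypothetical client-side balancer that prefers healthy nodes in the local data center.
class Balancer
  def initialize(nodes, local_datacenter:)
    @nodes = nodes
    @local = local_datacenter
    @counter = 0
  end

  # Returns the next node round-robin, falling back to remote nodes
  # when no local node is healthy.
  def pick
    healthy = @nodes.select(&:healthy)
    local = healthy.select { |node| node.datacenter == @local }
    candidates = local.empty? ? healthy : local
    raise "no healthy nodes" if candidates.empty?

    node = candidates[@counter % candidates.size]
    @counter += 1
    node
  end
end

nodes = [
  ServiceNode.new("kitchen-1", "dc1", true),
  ServiceNode.new("kitchen-2", "dc1", true),
  ServiceNode.new("kitchen-3", "dc2", true)
]
balancer = Balancer.new(nodes, local_datacenter: "dc1")
first_pick = balancer.pick.host # a node in dc1
```

If both local nodes are marked unhealthy, the next `pick` returns the node in the other data center, which is the failure handling the paragraph above describes.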

00:28:38.480 The question becomes: what does an application engineer need to do to access this functionality? Essentially, they only need to bundle update Minecart.

00:29:07.320 This reflects the promise of an integrated world that Minecart aims to deliver. The ultimate goal of Minecart is to provide a consistent production story that ensures seamless communication among services.

00:29:26.080 Currently, we have approximately 150 services registered, with around 110 deployed. The concept is that anyone can come in, have an idea, build it, and see it working through development and production.

00:29:46.000 This reduces time spent worrying about dependencies, allowing application engineers to focus on shipping remarkable products. During our last hack week, a team of three managed to develop two new services and integrate them with seven existing services without disrupting others.

00:29:59.000 I brought along Daniel Nyman and Matthew Todd, who played vital roles in bringing this vision to reality. They would be delighted to discuss their experiences if you're interested.

00:30:09.320 I would also love to talk with you about it, and with that, are there any questions?

00:30:19.720 Thank you!