Containerization

Pathfinder - Building a Container Platform in Ruby Ecosystem

by Giovanni Sakti

In this session presented at RubyKaigi 2019, Giovanni Sakti discusses the process of building a container platform within the Ruby ecosystem, particularly focusing on the challenges and learnings from this initiative at his company, Gojek. Sakti begins by introducing himself and explaining Gojek's role as a multifaceted super app in Southeast Asia. He emphasizes Gojek's heavy reliance on Ruby—specifically JRuby—and how this laid the groundwork for implementing a new container platform. The presentation transitions into technical territory, explaining the motivation for creating the container platform due to fragmentation across services and a need for centralization as the company scales. The key points discussed include:

  • Open Source Commitment: A strategic decision to create and use open-source frameworks to facilitate development and ensure seamless integration with existing production services.
  • Container Migration Planning: Challenges associated with migrating numerous services from traditional VMs to containers, especially in a hybrid infrastructure setup due to regulatory requirements for certain services like Gopay.
  • Barito Log Project: Development of a unified logging platform that abstracts various logging services and is built primarily in Ruby.
  • Pathfinder: Introduction of 'Pathfinder,' a container management platform constructed in Ruby. Sakti describes the architecture involving a state server, scheduler, and agents responsible for container management.
  • Component Interaction: The interplay between the scheduler and the state server, along with the self-registration process for new worker nodes, is elaborated upon.
  • Ruby and MRuby Use: Discussion on integrating MRuby due to its lightweight nature and the simplicity in creating executables tailored to their needs.
  • Ruby vs. Go Performance: Sakti compares the performance of Ruby and Go, noting the strengths and weaknesses as the development moved towards using MRuby.
  • Future Prospects: The potential for scaling Pathfinder within other departments at Gojek as the demand grows, stressing the importance of documentation and community engagement in improving MRuby.

In conclusion, Sakti highlights that while there are still hurdles in finding adequate libraries in Ruby, the potential for building a streamlined, efficient container platform using Ruby is promising. Moreover, he emphasizes the importance of community contributions towards enhancing the Ruby ecosystem's capabilities for container management. He ends the session inviting questions from the audience, fostering a community discussion around these advancements.

00:00:00.260 Hello, good afternoon everyone. Thank you for attending my talk today.
00:00:05.700 Today, I'm going to talk about building a container platform in the Ruby ecosystem.
00:00:11.190 I hope I have the kanji correct because I kind of used Google Translate for this.
00:00:20.900 My name is Gio, and I come from Jakarta.
00:00:26.519 If you don't know, Jakarta is located in Indonesia. A lot of people are more familiar with Bali, but actually, the capital city of Indonesia is Jakarta.
00:00:39.000 It took about seven hours to travel from Jakarta to Tokyo, and then I had to change planes for another two hours from Tokyo to Fukuoka.
00:00:51.899 I enjoy traveling, but most likely, when I go back to Tokyo, I will bring some concepts that I want to explore.
00:01:08.240 I work at a company called Gojek. Is there anyone here who knows about Gojek?
00:01:15.090 A few people, sure! Gojek is quite a hot topic in Southeast Asia. So what is it exactly?
00:01:23.729 I will summarize it, but it's quite hard because we provide a lot of services to our customers in Southeast Asia.
00:01:37.439 Some people might say that we are the Uber of Southeast Asia, but it's actually quite different.
00:01:49.340 Uber focuses primarily on ride-sharing, while we provide a lot more services.
00:02:02.399 As you can see, we operate in at least four or five countries including Thailand, Indonesia, Singapore, and Vietnam.
00:02:10.110 We provide a variety of services beyond ride-sharing, including some interesting use cases like GoMassage, which allows you to order a massage to your home.
00:02:27.200 If you want to buy tickets to a movie, you can use our platform as well. We call ourselves a super app.
00:02:39.650 What does that mean? We provide a wide range of apps within a single application.
00:02:54.530 If you open our app, you will see all these services. At Gojek, we love using Ruby.
00:03:01.150 In Gojek, we use at least five programming languages. Ruby is one of them, alongside Clojure, Go, Java, and others.
00:03:15.560 Actually, we use JRuby, not Ruby, but the syntax is the same. We prefer using JRuby because we already use other languages on the JVM platform.
00:03:26.959 This makes it easier to share libraries across those languages. In Jakarta, I also organize Ruby meetups.
00:03:37.909 We hold Ruby meetups almost every month. The Ruby community in Jakarta is quite established, with our first mailing list dating back to 2001.
00:03:49.819 If I remember correctly, 2001 was the year when we had the official documentation translated into Indonesian.
00:04:05.739 DHH met someone at a conference that year and decided to use Ruby for Rails.
00:04:20.239 We've already held several meetups and we hosted a conference two years ago at the Gojek office in Jakarta.
00:04:39.669 We will have another one soon, so if any of you want to submit a CFP, it's already open and will be held in September.
00:04:54.630 Now, moving on to my main topic, I will discuss building a container platform in the Ruby ecosystem.
00:05:02.590 What is it exactly? This is my abstract, but I want to highlight a few sentences.
00:05:20.050 The first point is that I'm trying to build a container platform where almost everything will be in Ruby or MRuby.
00:05:25.449 I will talk about the current situation and whether this ecosystem can support my efforts.
00:05:32.650 For some of you who are not yet familiar with the concepts of container platforms or orchestrators, I will also explain a bit about the generic architecture behind it.
00:05:48.880 This project started early last year due to the rapid growth we are experiencing in my company.
00:06:00.490 There has been a lot of fragmentation because each team has been moving in different directions.
00:06:08.020 We previously had more than 20 services, each maintained by different people.
00:06:18.580 To achieve rapid growth, we allowed teams to experiment independently, which led to this fragmentation in terms of platforms and support services.
00:06:31.710 Therefore, we decided to converge into a centralized platform, consolidating common tools and services.
00:06:45.280 There are a couple of principles and considerations that we think are important for this project.
00:06:50.650 The first is that it must be open source. We decided to use existing open-source projects or contribute to them.
00:07:02.790 Almost everything we create for this undertaking is open source.
00:07:14.470 We want it to be seamless—we already have many services in production and we don't want anything to break during the transition.
00:07:21.600 At that time, we were mostly using VMs, so we wanted to start exploring containers.
00:07:34.090 Moving a significant number of services to containers is challenging since we already have a large ecosystem on VMs, which includes several thousand production VMs.
00:07:54.520 Additionally, we need to support a hybrid infrastructure.
00:08:00.130 We cannot rely solely on cloud solutions. One of our services, Gopay, is a payment platform in a highly regulated industry.
00:08:13.270 One requirement is to host services within the country. Therefore, anything we build must support hybrid infrastructure.
00:08:25.420 Another project I was involved in was logging.
00:08:31.300 Initially, we allowed teams to choose any logging tools they wanted, such as Elasticsearch or Stackdriver, depending on their hosting.
00:08:45.000 However, we needed a unified platform, which is why we developed what we call Barito Log.
00:08:59.020 Barito Log serves as an Elasticsearch-based service platform, but we intend for it to be flexible enough to switch search providers.
00:09:11.560 For example, we found another project called Loki, which is also a CNCF project, and we aim to allow interchanging full-text search functionality.
00:09:30.520 Additionally, we decided to use LXD to handle components.
00:09:39.260 For those who don't know, LXC, the technology LXD builds on, was initially used by Docker as its building block.
00:10:06.220 That is different now, as Docker has created its own container runtime, but in its initial implementations, LXC served as the low-level container runtime.
00:10:17.709 We decided to use LXD due to two main reasons. First, it is a drop-in VM replacement, which is easy to use if you are accustomed to VMs.
00:10:32.230 Also, we utilize Chef for almost all of our infrastructure provisioning. We have many existing cookbooks, allowing us to minimize effort.
00:10:46.200 However, we need a system to manage these LXD containers, as manually handling many containers can become tedious.
00:11:01.900 Thus, we created a platform called Pathfinder. It can be found on GitHub.
00:11:15.280 It is a container platform written in Ruby and Go.
00:11:39.190 To clarify what a container platform is, it sometimes gets referred to as an orchestrator or manager.
00:11:54.120 To provide a clearer explanation, I found a neat online resource that uses a limited vocabulary of the most common words.
00:12:08.610 This tool allows you to define complex ideas while adhering to the constraints of a simple vocabulary, which I find quite useful.
00:12:29.980 However, I did break the rule with the word 'software,' which is not considered a common term.
00:12:42.600 If I were to define a container platform, I would say it is software that allows us to manage many computers as if they were a single big computer.
00:12:56.650 It assigns jobs to these computers so that developers deploying anything do not have to think about where to assign containers or networking.
00:13:11.930 The architecture abstracts CPU, RAM, and storage, so you're presented with a unified resource view.
00:13:28.000 We also need to abstract the network aspect, as containers need to communicate with each other.
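
The "single big computer" idea can be illustrated with a minimal Ruby sketch that sums per-node capacity into one cluster-wide view. The `Node` fields and the `cluster_capacity` helper are assumptions made for illustration; they are not taken from Pathfinder's code.

```ruby
# Hypothetical node record; the field names are assumptions for illustration.
Node = Struct.new(:name, :cpu_cores, :ram_mb, :disk_gb)

# Sum the capacity of every node into a single cluster-wide view,
# so callers can reason about the cluster as one big machine.
def cluster_capacity(nodes)
  {
    cpu_cores: nodes.sum(&:cpu_cores),
    ram_mb: nodes.sum(&:ram_mb),
    disk_gb: nodes.sum(&:disk_gb)
  }
end

nodes = [
  Node.new("node-1", 4, 8192, 100),
  Node.new("node-2", 8, 16_384, 200)
]
capacity = cluster_capacity(nodes)
puts capacity
```

A real platform would also track how much of each resource is already in use; this sketch only shows the aggregation step.
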
00:13:37.190 Now, I will detail the generic architecture of a container platform.
00:13:43.580 I will describe four key components. The first one is the state server, which I refer to as the 'consciousness' of the system.
00:14:16.960 It essentially stores the state of your containers, nodes, and everything else.
00:14:28.060 Next, we have a scheduler, which collects information from the state server and assigns homes for unscheduled containers.
00:14:47.680 The scheduler serves as the system's brain, pinpointing which node or computer a container should be spawned on.
00:15:07.550 Last but not least, there is an agent installed on all nodes that constantly queries the state server.
00:15:33.060 The agent checks whether any containers have been scheduled to its particular node but not yet started.
00:15:50.020 If there are, it will spawn those containers.
00:16:05.730 The agent communicates with the container runtime, which can be various runtimes such as Docker, containerd, LXD, or others.
00:16:26.050 To help me remember the key components, I created a mnemonic for the main subject here.
00:16:46.040 In Pathfinder, we use Go for the CLI and agent, while the state server and scheduler are written in Ruby, specifically with Rails.
00:17:11.260 Currently, the runtime supports LXD but can easily be adapted for other runtimes, such as Docker or containerd.
00:17:24.000 Next, I will explain how the self-registration process works.
00:17:37.360 When adding a new worker node, the agent must be installed on that node.
00:17:50.160 Starting the agent triggers the self-registration process, where it contacts the state server.
00:18:04.680 The state server responds with a token or secret for secure communication. It also saves the agent's information in the database.
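
The self-registration handshake described here could look roughly like the following sketch. It uses an in-memory stand-in for the state server; the class name, method names, and token format are all hypothetical, and a real state server would persist nodes in a database and expose this over HTTP.

```ruby
require "securerandom"

# In-memory stand-in for the state server's node table (hypothetical).
class StateServer
  def initialize
    @nodes = {}
  end

  # Save the agent's details and hand back a secret token
  # that the agent presents on later requests.
  def register(hostname, ip_address)
    token = SecureRandom.hex(16)
    @nodes[hostname] = { ip_address: ip_address, token: token }
    token
  end

  # True only when the node is known and presents the token it was issued.
  def authenticated?(hostname, token)
    node = @nodes[hostname]
    !node.nil? && node[:token] == token
  end

  def node_names
    @nodes.keys
  end
end

server = StateServer.new
token = server.register("worker-1", "10.0.0.11")
puts server.node_names.inspect
```
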
00:18:31.020 After the self-registration process, if you run the command line interface and type 'get nodes,' you will see all nodes listed.
00:18:55.170 Initially, since no containers have been created, the list will be empty.
00:19:08.950 Let's create a new container by specifying an image. The container is created within the state server, but nothing is scheduled yet.
00:19:23.950 It is now the scheduler's responsibility to check unscheduled containers and see which worker has the least utilization.
00:19:43.020 The scheduler marks the container with the node number, changing its status from pending to scheduled.
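
A scheduling pass of this kind can be sketched in a few lines of Ruby. The sketch assumes utilization is measured by memory only and that container records carry a `status` and `node_id`; these names are illustrative and not Pathfinder's actual schema.

```ruby
# Hypothetical records; utilization is based on memory only in this sketch.
Node = Struct.new(:id, :memory_used_mb, :memory_total_mb) do
  def utilization
    memory_used_mb.to_f / memory_total_mb
  end
end
Container = Struct.new(:name, :status, :node_id)

# Assign every pending container to the least-utilized node and
# flip its status from :pending to :scheduled.
def schedule(containers, nodes)
  containers.select { |c| c.status == :pending }.each do |container|
    target = nodes.min_by(&:utilization)
    container.node_id = target.id
    container.status = :scheduled
  end
end

nodes = [Node.new(1, 6000, 8000), Node.new(2, 2000, 8000)]
web = Container.new("web", :pending, nil)
db = Container.new("db", :scheduled, 1)
schedule([web, db], nodes)
puts "#{web.name} -> node #{web.node_id} (#{web.status})"
```

Note that this simplification does not update a node's utilization after each placement or check the container's own resource demand, both of which a production scheduler would need.
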
00:20:03.440 After this process, the agent will also check information from the state server.
00:20:18.350 If a scheduled container has yet to be started, the agent will push the information to the runtime.
00:20:33.790 The runtime then creates the container in the worker node, and if you check the CLI, you will see that it is already provisioned with its own IP.
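
The agent's side of this flow might be sketched as a single provisioning pass. `FakeRuntime` is a hypothetical stand-in for a real runtime adapter (which would talk to the LXD daemon), and the IP addresses it returns are made up for the example.

```ruby
Container = Struct.new(:name, :image, :status, :node_id, :ip_address)

# Hypothetical runtime adapter; a real one would call the container daemon.
class FakeRuntime
  def initialize
    @next_host = 2
  end

  # Pretend to launch the container and return the address it received.
  def launch(_name, _image)
    ip = "10.0.3.#{@next_host}"
    @next_host += 1
    ip
  end
end

# One pass of the agent loop: start every container that is scheduled
# to this node but not yet provisioned, then record its IP and new status.
def provision_pass(containers, node_id, runtime)
  mine = containers.select { |c| c.status == :scheduled && c.node_id == node_id }
  mine.each do |container|
    container.ip_address = runtime.launch(container.name, container.image)
    container.status = :provisioned
  end
end

containers = [
  Container.new("web", "ubuntu:18.04", :scheduled, 1, nil),
  Container.new("db", "ubuntu:18.04", :scheduled, 2, nil)
]
provision_pass(containers, 1, FakeRuntime.new)
containers.each { |c| puts "#{c.name}: #{c.status} #{c.ip_address}" }
```

Because the runtime is passed in as an object, swapping LXD for another runtime would only require a different adapter with the same `launch` method.
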
00:20:58.520 The system also has a rescheduling feature, which allows moving a container from one node to another.
00:21:14.170 If you destroy a provisioned container, the scheduler will place it on another node as long as there are adequate resources.
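
Rescheduling with a resource check could be sketched as follows. The rule used here (pick the other node with the most free memory, and give up if none has enough) is an assumption for illustration; the talk does not specify the exact policy.

```ruby
Node = Struct.new(:id, :free_memory_mb)
Container = Struct.new(:name, :memory_mb, :status, :node_id)

# Move a container to a different node that has enough free memory.
# Returns true if a new home was found, false if no node has room.
def reschedule(container, nodes)
  candidates = nodes
    .reject { |n| n.id == container.node_id }
    .select { |n| n.free_memory_mb >= container.memory_mb }
  target = candidates.max_by(&:free_memory_mb)
  return false if target.nil?

  container.node_id = target.id
  container.status = :scheduled
  true
end

nodes = [Node.new(1, 4096), Node.new(2, 1024), Node.new(3, 2048)]
app = Container.new("app", 1536, :provisioned, 1)
moved = reschedule(app, nodes)
puts "moved=#{moved} node=#{app.node_id}"
```
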
00:21:26.110 As of now, we have around 50 active worker nodes. It is a relatively small scale, but we have been live for six months.
00:21:51.690 Currently, this infrastructure serves our logging project, but other departments may soon start using it.
00:22:06.230 As Gojek expands into other Southeast Asian markets, we expect traffic and usage to significantly increase.
00:22:20.680 Now, let's discuss Pathfinder. Initially, we decided to use Go because it has libraries for communicating with LXD and can compile to a single binary.
00:22:37.430 However, I am experimenting with replacing all components with Ruby, particularly with MRuby.
00:22:51.740 MRuby is interesting for several reasons. First, it is straightforward to create executable files.
00:23:06.050 Second, it has a low footprint, which is important for the agent running on nodes.
00:23:20.240 Lastly, switching to an all-Ruby setup could streamline our project offerings.
00:23:38.410 We have a lot of Ruby developers at Gojek—more than Go developers.
00:23:50.720 However, a significant challenge remains with the Ruby ecosystem, which currently lacks some of the libraries that Go offers.
00:24:05.350 For example, we need a robust agent that interacts with the state server and the local container runtime.
00:24:16.910 The agent must handle self-registration, querying the state server, and communicating with the container daemon.
00:24:30.680 It is also responsible for sending back metrics.
00:24:42.750 We have a simple library to gather the metrics we need, focusing primarily on CPU and RAM.
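
A metrics collector limited to CPU and RAM can be quite small. The sketch below assumes a Linux host and parses text in the format of `/proc/meminfo` and `/proc/loadavg`; the talk does not say which sources the actual library reads, so treat the choice of files and the helper names as assumptions.

```ruby
# Parse MemTotal and MemAvailable (reported in kB) from /proc/meminfo-style text.
def memory_usage(meminfo_text)
  values = {}
  meminfo_text.each_line do |line|
    key, amount = line.split(":", 2)
    next if amount.nil?

    values[key] = amount.to_i # to_i skips leading spaces and the "kB" suffix
  end
  total = values.fetch("MemTotal")
  { total_kb: total, used_kb: total - values.fetch("MemAvailable") }
end

# The first field of /proc/loadavg is the one-minute load average.
def load_average(loadavg_text)
  loadavg_text.split.first.to_f
end

sample_meminfo = <<~TEXT
  MemTotal:        8000000 kB
  MemFree:         1000000 kB
  MemAvailable:    6000000 kB
TEXT
usage = memory_usage(sample_meminfo)
load = load_average("0.52 0.58 0.59 1/389 12345\n")
puts usage.merge(load_1m: load)
```

On a real node the agent would pass in `File.read("/proc/meminfo")` and `File.read("/proc/loadavg")` instead of the sample strings.
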
00:24:58.620 Creating self-contained executables with MRuby is straightforward. There are libraries that facilitate this.
00:25:12.840 Next, let's look at how to write the CLI in Ruby as well.
00:25:24.280 The CLI also needs to interact with the state server. We have created libraries to handle this communication.
00:25:36.330 Over the past couple of years, I started using Go and discovered a cool library called Cobra for structuring CLI applications.
00:25:57.140 I implemented a similar structure for our Ruby CLI.
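
A Cobra-like structure in Ruby could be a small tree of command objects, where each command either dispatches to a subcommand or runs its own action. The `Command` class and the `pathfinder` root name below are hypothetical; this is a sketch of the general pattern, not the mini-framework from the talk.

```ruby
# A node in the command tree: it has a name, optional subcommands,
# and an optional action to run when it is the last command matched.
class Command
  attr_reader :name

  def initialize(name, &action)
    @name = name
    @action = action
    @subcommands = {}
  end

  def add(command)
    @subcommands[command.name] = command
    self
  end

  # Walk the argument list down the tree, then run the matched action
  # with whatever arguments are left over.
  def execute(args)
    sub = @subcommands[args.first]
    return sub.execute(args.drop(1)) if sub
    raise ArgumentError, "unknown command: #{args.join(' ')}" if @action.nil?

    @action.call(args)
  end
end

root = Command.new("pathfinder")
get = Command.new("get")
get.add(Command.new("nodes") { |_args| "listing nodes" })
root.add(get)

output = root.execute(%w[get nodes])
puts output
```

Keeping each command as its own object makes it easy to add new verbs such as `create container` without touching the dispatch logic.
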
00:26:09.840 The state server is currently written in Rails, which is substantial but meets our current needs.
00:26:20.640 Since this project must progress quickly, we utilized Rails, but it could be rewritten to something lighter in the future.
00:26:36.480 We also have the scheduler already implemented in Ruby.
00:26:49.460 As of now, I have tested this Ruby infrastructure and may incorporate components into our production setup.
00:27:03.970 We may replace our staging environment's agent with our MRuby version once it's ready for testing.
00:27:17.290 We've extracted a library in Ruby for interacting with the LXD daemon, and everything is running well.
00:27:30.590 We also created a mini-framework for structuring the CLI.
00:27:43.870 Currently, two components in our platform that were previously in Go are now written in Ruby.
00:27:55.620 It will be interesting to improve those components, for instance, to enhance the scheduler.
00:28:06.400 Lastly, I am focusing on using a fair scheduling algorithm and found some libraries in Ruby that could help.
00:28:20.140 To wrap up, while using MRuby, I noticed some differences from standard Ruby.
00:28:30.840 We don't have a Gemfile in MRuby; instead, gems are declared in the build config.
00:28:45.050 When changing a gem, we must recompile everything, which is beneficial when creating a single executable.
00:29:01.210 To my surprise, MRuby has enough libraries for our project requirements.
00:29:12.160 Building MRuby executables is simple, and our tests confirm they are smaller than Go equivalents.
00:29:25.100 Additionally, I faced some challenges finding mocking libraries for interfacing with API servers.
00:29:39.510 As a workaround, we created a simple mock server for testing.
00:29:51.450 Another challenge was handling concurrency, as I had difficulty finding threading support in MRuby.
00:30:07.460 I may experiment with fibers for concurrency while sending metrics or creating containers.
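
As a rough illustration of the fiber idea, the sketch below interleaves two tasks cooperatively with a round-robin loop. It is written against standard Ruby's `Fiber` class; the task bodies are placeholders, and the assumption is that an MRuby build with fiber support enabled would behave similarly.

```ruby
log = []

# Each task does one unit of work, then yields so the other can run.
# The block's final value, :done, tells the loop the task has finished.
send_metrics = Fiber.new do
  3.times do |i|
    log << "metrics #{i}"
    Fiber.yield :more
  end
  :done
end

create_containers = Fiber.new do
  2.times do |i|
    log << "container #{i}"
    Fiber.yield :more
  end
  :done
end

# Round-robin: resume each task in turn and drop the ones that finish.
tasks = [send_metrics, create_containers]
tasks.reject! { |task| task.resume == :done } until tasks.empty?

puts log
```

Fibers only switch at explicit yield points, so a task that blocks on I/O would still stall the others; this gives interleaving, not parallelism.
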
00:30:21.520 Also, implementing graceful shutdown is crucial and needs attention.
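
One common way to approach graceful shutdown is to have the signal handler set a flag and let the work loop finish its current job before exiting. The sketch below assumes standard Ruby's `Signal.trap`; MRuby would need an equivalent signal-handling extension. The `AgentLoop` class is hypothetical, and the demo calls `request_stop` directly instead of sending a real signal.

```ruby
# Hypothetical agent loop that stops between jobs, never in the middle of one.
class AgentLoop
  attr_reader :completed

  def initialize
    @stopping = false
    @completed = []
  end

  # Safe to call from a signal handler: it only flips a flag.
  def request_stop
    @stopping = true
  end

  # Run jobs in order, checking the stop flag before starting each one.
  def run(jobs)
    jobs.each do |job|
      break if @stopping

      @completed << job.call
    end
    @completed
  end
end

agent = AgentLoop.new
Signal.trap("TERM") { agent.request_stop }

# The second job simulates a TERM arriving while it is running.
jobs = [
  -> { "a" },
  -> { agent.request_stop; "b" },
  -> { "c" }
]
finished = agent.run(jobs)
puts finished.inspect
```

The job that was in progress when the stop request arrived still completes, and the remaining jobs are skipped, which is the behavior an agent creating containers would want.
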
00:30:34.550 Documentation on MRuby is lacking, especially in English, but we can work on improving it together.
00:30:49.120 Thank you very much for your time. Do we have any questions?
00:30:58.829 (Audience questions and discussion)