Giovanni Sakti

Pathfinder - Building a Container Platform in Ruby Ecosystem

This session will discuss an attempt to build a container platform in the Ruby/mruby ecosystem, the current situation, and the lessons learned that we can use to improve it further.

For those who are unfamiliar, this session will also touch on container platforms/orchestrators and the generic architecture behind them, so that as developers we understand the abstraction they provide.

RubyKaigi 2019 https://rubykaigi.org/2019/presentations/giosakti.html#apr18

00:00:00.260 Hello, good afternoon everyone. Thank you for attending my talk today.
00:00:05.700 Today, I'm going to talk about building a container platform in the Ruby ecosystem.
00:00:11.190 I hope I have the kanji correct because I kind of used Google Translate for this.
00:00:20.900 My name is Gio, and I come from Jakarta.
00:00:26.519 If you don't know, Jakarta is located in Indonesia. A lot of people are more familiar with Bali, but actually, the capital city of Indonesia is Jakarta.
00:00:39.000 It took about seven hours to travel from Jakarta to Tokyo, and then I had to change planes for another two hours from Tokyo to Fukuoka.
00:00:51.899 I enjoy traveling, but most likely, when I go back to Tokyo, I will bring some concepts that I want to explore.
00:01:08.240 I work at a company called Gojek. Is there anyone here who knows about Gojek?
00:01:15.090 A few people, sure! Gojek is quite a hot topic in Southeast Asia. So what is it exactly?
00:01:23.729 I will summarize it, but it's quite hard because we provide a lot of services to our customers in Southeast Asia.
00:01:37.439 Some people might say that we are the Uber of Southeast Asia, but it's actually quite different.
00:01:49.340 Uber focuses primarily on being a ride-sharing platform, while we provide a lot more services.
00:02:02.399 As you can see, we operate in at least four or five countries including Thailand, Indonesia, Singapore, and Vietnam.
00:02:10.110 We provide a variety of services beyond ride-sharing, including some interesting use cases like GoMassage, which allows you to order a massage to your home.
00:02:27.200 If you want to buy tickets to a movie, you can use our platform as well. We call ourselves a super app.
00:02:39.650 What does that mean? We provide a wide range of apps within a single application.
00:02:54.530 If you open our app, you will see all these services. At Gojek, we love using Ruby.
00:03:01.150 In Gojek, we use at least five programming languages. Ruby is one of them, alongside Clojure, Go, Java, and others.
00:03:15.560 Actually, we use JRuby, not Ruby, but the syntax is the same. We prefer using JRuby because we already use other languages on the JVM platform.
00:03:26.959 This makes it easier to share libraries across those languages. In Jakarta, I also organize Ruby meetups.
00:03:37.909 We hold Ruby meetups almost every month. The Ruby community in Jakarta is quite established, with our first mailing list dating back to 2001.
00:03:49.819 If I remember correctly, 2001 was the year when we had the official documentation translated into Indonesian.
00:04:05.739 DHH met someone at a conference that year and decided to use Ruby for Rails.
00:04:20.239 We've already held several meetups and we hosted a conference two years ago at the Gojek office in Jakarta.
00:04:39.669 We will have another one soon, in September. The CFP is already open, so if any of you want to submit a proposal, please do.
00:04:54.630 Now, moving on to my main topic, I will discuss building a container platform in the Ruby ecosystem.
00:05:02.590 What is it exactly? This is my abstract, but I want to highlight a few sentences.
00:05:20.050 The first point is that I'm trying to build a container platform where almost everything will be in Ruby or MRuby.
00:05:25.449 I will talk about the current situation and whether this ecosystem can support my efforts.
00:05:32.650 For some of you who are not yet familiar with the concepts of container platforms or orchestrators, I will also explain a bit about the generic architecture behind it.
00:05:48.880 This project started early last year due to the rapid growth we are experiencing in my company.
00:06:00.490 There has been a lot of fragmentation because each team has been moving in different directions.
00:06:08.020 We previously had more than 20 services, each maintained by different people.
00:06:18.580 To achieve rapid growth, we allowed teams to experiment independently, which led to this fragmentation in terms of platforms and support services.
00:06:31.710 Therefore, we decided to converge into a centralized platform, consolidating common tools and services.
00:06:45.280 There are a couple of principles and considerations that we think are important for this project.
00:06:50.650 The first is that it must be open source. We decided to use existing open-source projects or contribute to them.
00:07:02.790 Almost everything we create for this undertaking is open source.
00:07:14.470 We want it to be seamless: we already have many services in production and we don't want anything to break during the transition.
00:07:21.600 At that time, we were mostly using VMs, so we wanted to start exploring containers.
00:07:34.090 Moving a significant number of services to containers is challenging since we already have a large ecosystem on VMs, which includes several thousand production VMs.
00:07:54.520 Additionally, we need to support a hybrid infrastructure.
00:08:00.130 We cannot rely solely on cloud solutions. One of our services, Gopay, is a payment platform in a highly regulated industry.
00:08:13.270 One requirement is to host services within the country. Therefore, anything we build must support hybrid infrastructure.
00:08:25.420 Another project I was involved in was logging.
00:08:31.300 Initially, we allowed teams to choose any logging tools they wanted, such as Elasticsearch or Stackdriver, depending on their hosting.
00:08:45.000 However, we needed a unified platform, which is why we developed what we call Barito Log.
00:08:59.020 Barito Log is an Elasticsearch-based logging platform, but we intend for it to be flexible enough to switch search providers.
00:09:11.560 For example, we found another project called Loki, which is also a CNCF project, and we aim to allow interchanging full-text search functionality.
00:09:30.520 Additionally, we decided to use LXD to handle components.
00:09:39.260 For those who don't know, LXD builds on LXC, which Docker initially used as its building block.
00:10:06.220 Things are different now, as Docker has created its own container runtime, but in Docker's initial implementations LXC served as the low-level container runtime.
00:10:17.709 We decided to use LXD for two main reasons. First, it is a drop-in VM replacement, which is easy to use if you are accustomed to VMs.
00:10:32.230 Also, we utilize Chef for almost all of our infrastructure provisioning. We have many existing cookbooks, allowing us to minimize effort.
00:10:46.200 However, we need a system to manage these LXD containers, as manually handling many containers can become tedious.
00:11:01.900 Thus, we created a platform called Pathfinder. It can be found on GitHub.
00:11:15.280 It is a container platform written in Ruby and Go.
00:11:39.190 To clarify what a container platform is, it sometimes gets referred to as an orchestrator or manager.
00:11:54.120 To provide a clearer explanation, I found a neat online resource that uses a limited vocabulary of the most common words.
00:12:08.610 This tool allows you to define complex ideas while adhering to the constraints of a simple vocabulary, which I find quite useful.
00:12:29.980 However, I did break the rule with the word 'software,' which is not considered a common term.
00:12:42.600 If I were to define a container platform, I would say it is software that allows us to manage many computers as if they were a single big computer.
00:12:56.650 It assigns jobs to these computers so that developers deploying anything do not have to think about where to assign containers or networking.
00:13:11.930 The architecture abstracts CPU, RAM, and storage, so you're presented with a unified resource view.
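As a rough illustration of the "single big computer" idea, the sketch below adds up the CPU and RAM of several nodes into one view. The `Node` struct, the `Cluster` class, the field names, and the numbers are all invented for this example and are not taken from Pathfinder.

```ruby
# Hypothetical node record; fields and units are assumptions for this sketch.
Node = Struct.new(:name, :cpu_cores, :ram_mb, keyword_init: true)

# Presents many nodes as one pool of resources.
class Cluster
  def initialize(nodes)
    @nodes = nodes
  end

  # Total capacity, as if the cluster were a single big computer.
  def total
    {
      cpu_cores: @nodes.sum(&:cpu_cores),
      ram_mb: @nodes.sum(&:ram_mb)
    }
  end
end

cluster = Cluster.new([
  Node.new(name: "node-1", cpu_cores: 4, ram_mb: 8192),
  Node.new(name: "node-2", cpu_cores: 8, ram_mb: 16_384)
])
p cluster.total
```

A developer asking "is there room for my container?" would consult a view like `total` rather than inspecting each machine.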
00:13:28.000 We also need to abstract the network aspect, as containers need to communicate with each other.
00:13:37.190 Now, I will detail the generic architecture of a container platform.
00:13:43.580 I will describe four key components. The first one is the state server, which I refer to as the 'consciousness' of the system.
00:14:16.960 It essentially stores the state of your containers, nodes, and everything else.
00:14:28.060 Next, we have a scheduler, which collects information from the state server and assigns homes for unscheduled containers.
00:14:47.680 The scheduler serves as the system's brain, pinpointing which node or computer a container should be spawned on.
00:15:07.550 Last but not least, there is an agent installed on all nodes that constantly queries the state server.
00:15:33.060 The agent checks whether any containers scheduled to its particular node have not been started yet.
00:15:50.020 If there are, it will spawn those containers.
00:16:05.730 The agent communicates with the container runtime, which can be any of several runtimes such as Docker, containerd, LXD, or others.
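To make the role of the state server more concrete, here is a minimal in-memory sketch of the two questions the other components ask it. Pathfinder's actual state server is a Rails application; the `StateServer` class, its method names, and the image name below are hypothetical. The `pending` and `scheduled` status names follow the talk.

```ruby
# In-memory stand-in for the state server (an assumption for illustration,
# not Pathfinder's real API).
class StateServer
  Container = Struct.new(:name, :image, :status, :node, keyword_init: true)

  def initialize
    @containers = {}
  end

  # A new container starts out pending, with no node assigned.
  def create_container(name, image)
    @containers[name] = Container.new(name: name, image: image, status: :pending, node: nil)
  end

  # What the scheduler asks for: containers that have no home yet.
  def pending_containers
    @containers.values.select { |c| c.status == :pending }
  end

  # What an agent asks for: containers assigned to its node but not started.
  def scheduled_for(node)
    @containers.values.select { |c| c.status == :scheduled && c.node == node }
  end
end

state = StateServer.new
state.create_container("web-01", "ubuntu:18.04")
puts state.pending_containers.map(&:name).inspect
```

Keeping all state in one place is what lets the scheduler and the agents stay simple: each of them only reads and updates these records.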
00:16:26.050 To help me remember the key components, I created a mnemonic for the main subject here.
00:16:46.040 In Pathfinder, we use Go for the CLI and agent, while the state server and scheduler are written in Ruby, specifically with Rails.
00:17:11.260 Currently, the supported runtime is LXD, but the platform can easily be adapted for other runtimes, such as Docker or containerd.
00:17:24.000 Next, I will explain how the self-registration process works.
00:17:37.360 When adding a new worker node, the agent must be installed on that node.
00:17:50.160 Starting the agent triggers the self-registration process, where it contacts the state server.
00:18:04.680 The state server responds with a token or secret for secure communication. It also saves the agent's information in the database.
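The registration exchange described here can be sketched as follows. This is only an illustration of the idea (register, store, return a secret); the `NodeRegistry` class, its methods, the hostname, and the IP address are assumptions and do not reflect Pathfinder's real endpoints or token format.

```ruby
require "securerandom"

# Hypothetical registry living on the state server side.
class NodeRegistry
  def initialize
    @nodes = {} # hostname => { ip:, token: }
  end

  # Called when a freshly started agent announces itself.
  # Saves the node and hands back a secret for later requests.
  def register(hostname, ip)
    token = SecureRandom.hex(16)
    @nodes[hostname] = { ip: ip, token: token }
    token
  end

  # Later requests from the agent must carry the token it was given.
  # (A production server would use a constant-time comparison here.)
  def authorized?(hostname, token)
    node = @nodes[hostname]
    !node.nil? && node[:token] == token
  end

  def hostnames
    @nodes.keys
  end
end

registry = NodeRegistry.new
token = registry.register("worker-01", "10.0.0.11")
puts registry.authorized?("worker-01", token)
puts registry.hostnames.inspect
```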
00:18:31.020 After the self-registration process, if you run the command line interface and type 'get nodes,' you will see all nodes listed.
00:18:55.170 Initially, since no containers have been created, the container list will be empty.
00:19:08.950 Let's create a new container by specifying an image. The container is created within the state server, but nothing is scheduled yet.
00:19:23.950 It is now the scheduler's responsibility to check unscheduled containers and see which worker has the least utilization.
00:19:43.020 The scheduler marks the container with the node number, changing its status from pending to scheduled.
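A least-utilization placement like the one described can be sketched in a few lines. The record types, the integer percentage used for `utilization`, and the `cost` parameter are assumptions made for this example; the talk does not say how Pathfinder measures utilization.

```ruby
# Hypothetical records; field names are assumptions for this sketch.
Worker = Struct.new(:name, :utilization, keyword_init: true)
PendingContainer = Struct.new(:name, :status, :node, keyword_init: true)

# Assigns every pending container to the currently least-utilized worker.
# `cost` is a made-up per-container load (in percent) added after each
# placement so that consecutive placements can spread across workers.
def schedule(containers, workers, cost: 10)
  containers.select { |c| c.status == :pending }.each do |container|
    target = workers.min_by(&:utilization)
    break if target.nil? # no workers registered yet

    container.node = target.name
    container.status = :scheduled
    target.utilization += cost
  end
  containers
end

workers = [
  Worker.new(name: "worker-01", utilization: 25),
  Worker.new(name: "worker-02", utilization: 10)
]
containers = [PendingContainer.new(name: "web-01", status: :pending, node: nil)]
schedule(containers, workers)
puts "#{containers.first.name} -> #{containers.first.node} (#{containers.first.status})"
```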
00:20:03.440 After this process, the agent will also check information from the state server.
00:20:18.350 If a container scheduled to its node has yet to be started, the agent will pass the information to the runtime.
00:20:33.790 The runtime then creates the container in the worker node, and if you check the CLI, you will see that it is already provisioned with its own IP.
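One polling pass of the agent might look like the sketch below. The `FakeRuntime` class stands in for a real runtime such as LXD and only records what it was asked to create; the hash keys, the IP addresses, and the `agent_tick` helper are hypothetical. The `provisioned` status name follows the talk.

```ruby
# Stand-in for a container runtime; it records requests and hands out
# made-up IP addresses instead of starting anything.
class FakeRuntime
  attr_reader :created

  def initialize
    @created = []
  end

  def create(name, _image)
    @created << name
    "10.0.3.#{@created.size + 1}" # pretend IP assigned to the container
  end
end

# One polling pass of a hypothetical agent running on `node_name`.
# `containers` stands in for the answer the state server would give.
def agent_tick(node_name, containers, runtime)
  containers.each do |c|
    next unless c[:node] == node_name && c[:status] == :scheduled

    c[:ip] = runtime.create(c[:name], c[:image])
    c[:status] = :provisioned
  end
end

containers = [
  { name: "web-01", image: "ubuntu:18.04", node: "worker-01", status: :scheduled },
  { name: "web-02", image: "ubuntu:18.04", node: "worker-02", status: :scheduled },
  { name: "web-03", image: "ubuntu:18.04", node: nil, status: :pending }
]
runtime = FakeRuntime.new
agent_tick("worker-01", containers, runtime)
containers.each { |c| puts "#{c[:name]}: #{c[:status]} #{c[:ip]}" }
```

Because the agent only acts on containers that are scheduled to its own node, running the same pass again is harmless: an already provisioned container is skipped.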
00:20:58.520 The system also has a rescheduling feature, which allows moving a container from one node to another.
00:21:14.170 If you destroy a provisioned container, the scheduler will place it on another node as long as there are adequate resources.
00:21:26.110 As of now, we have around 50 active worker nodes. It is a relatively small scale, but we have been live for six months.
00:21:51.690 Currently, this infrastructure serves our logging project, but other departments may soon start using it.
00:22:06.230 As Gojek expands into other Southeast Asian markets, we expect traffic and usage to significantly increase.
00:22:20.680 Now, let's discuss Pathfinder. Initially, we decided to use Go because it has libraries for communicating with LXD and can compile to a single binary.
00:22:37.430 However, I am experimenting with replacing all components with Ruby, particularly with MRuby.
00:22:51.740 MRuby is interesting for several reasons. First, it is straightforward to create executable files.
00:23:06.050 Second, it has a low footprint, which is important for the agent running on nodes.
00:23:20.240 Lastly, switching to an all-Ruby setup could streamline our project offerings.
00:23:38.410 We have a lot of Ruby developers at Gojek, more than Go developers.
00:23:50.720 However, a significant challenge remains with the Ruby ecosystem, which currently lacks some of the libraries that Go offers.
00:24:05.350 For example, we need a robust agent that interacts with the state server and the local container runtime.
00:24:16.910 The agent must handle self-registration, querying the state server, and communicating with the container daemon.
00:24:30.680 It is also responsible for sending back metrics.
00:24:42.750 We have a simple library to gather the metrics we need, focusing primarily on CPU and RAM.
00:24:58.620 Creating self-contained executables in Ruby is straightforward. There are libraries that facilitate this.
00:25:12.840 Next, let's look at how to write the CLI in Ruby as well.
00:25:24.280 The CLI also needs to interact with the state server. We have created libraries to handle this communication.
00:25:36.330 Over the past couple of years, I started using Go and discovered a cool library called Cobra for structuring CLI applications.
00:25:57.140 I implemented a similar structure for our Ruby CLI.
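For readers who have not used Cobra, the core idea is a tree of commands where each argument selects a subcommand. The sketch below shows one way to get that shape in plain Ruby. It is not the speaker's framework: the `Command` class, the `cli` root name, and the returned strings are all invented, and only the `get nodes` command comes from the talk.

```ruby
# A tiny command tree, loosely modelled on how Cobra nests commands.
class Command
  attr_reader :name

  def initialize(name, &action)
    @name = name
    @action = action
    @subcommands = {}
  end

  # Registers a subcommand and returns self so calls can be chained.
  def add(command)
    @subcommands[command.name] = command
    self
  end

  # Walks down the tree while the next argument names a subcommand,
  # then runs the action of the deepest command with the leftover args.
  def run(args)
    sub = @subcommands[args.first]
    return sub.run(args.drop(1)) if sub
    return "usage: #{name} <#{@subcommands.keys.join('|')}>" unless @action

    @action.call(args)
  end
end

get = Command.new("get")
get.add(Command.new("nodes") { |_args| "listing nodes" })
get.add(Command.new("containers") { |_args| "listing containers" })
root = Command.new("cli").add(get)
puts root.run(%w[get nodes])
```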
00:26:09.840 The state server is currently written in Rails, which is substantial but meets our current needs.
00:26:20.640 Since this project had to progress quickly, we used Rails, but it could be rewritten in something lighter in the future.
00:26:36.480 We also have the scheduler already implemented in Ruby.
00:26:49.460 As of now, I have tested this Ruby infrastructure and may incorporate components into our production setup.
00:27:03.970 We may replace our staging environment's agent with our MRuby version once it's ready for testing.
00:27:17.290 We've extracted a library in Ruby for interacting with the LXD daemon, and everything is running well.
00:27:30.590 We also created a mini-framework for structuring the CLI.
00:27:43.870 Two components of our platform that were previously written in Go are now written in Ruby.
00:27:55.620 It will be interesting to improve those components, for instance, to enhance the scheduler.
00:28:06.400 Lastly, I am focusing on using a fair scheduling algorithm and found some libraries in Ruby that could help.
00:28:20.140 To wrap up, while using MRuby, I noticed some differences from standard Ruby.
00:28:30.840 We don't have a Gemfile in MRuby; instead, gems are declared in the build config.
00:28:45.050 When changing a gem, we must recompile everything, which fits well with producing a single executable.
00:29:01.210 To my surprise, MRuby has enough libraries for our project requirements.
00:29:12.160 Building MRuby executables is simple, and our tests confirm they are smaller than Go equivalents.
00:29:25.100 Additionally, I faced some challenges finding mocking libraries for interfacing with API servers.
00:29:39.510 As a workaround, we created a simple mock server for testing.
00:29:51.450 Another challenge was handling concurrency, as I had difficulty finding threading support in MRuby.
00:30:07.460 I may experiment with fibers for concurrency while sending metrics or creating containers.
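As a sketch of what fiber-based concurrency could look like, the example below interleaves two tasks cooperatively: each task yields after one unit of work and a round-robin loop resumes whichever tasks are still running. The task names and step counts are invented, and the code was written against CRuby's `Fiber`. Whether it runs unchanged on MRuby depends on the fiber gem included in the build, which is not verified here.

```ruby
# Builds a task that does `steps` units of work, pausing after each one.
def make_task(label, steps, log)
  Fiber.new do
    steps.times do |i|
      log << "#{label} step #{i + 1}"
      Fiber.yield :more # pause here and let other tasks run
    end
    :done # final value returned by the last resume
  end
end

# Resumes every task in turn and drops the ones that report :done.
def run_round_robin(tasks)
  until tasks.empty?
    tasks = tasks.reject { |task| task.resume == :done }
  end
end

log = []
run_round_robin([
  make_task("send-metrics", 2, log),
  make_task("create-container", 3, log)
])
puts log
```

Fibers never run in parallel, so this only helps when tasks can be split into small steps; a blocking call inside one step still stalls everything else.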
00:30:21.520 Also, implementing graceful shutdown is crucial and needs attention.
00:30:34.550 Documentation on MRuby is lacking, especially in English, but we can work on improving it together.
00:30:49.120 Thank you very much for your time. Do we have any questions?
00:30:58.829 (Audience questions and discussion)