How We Scaled GitLab For a 30k-employee Company

by Minqi Pan

In this talk presented at RailsConf 2016, Minqi Pan discusses the complexities of scaling GitLab, an open-source alternative to GitHub, for large organizations like Alibaba. Due to GitLab's architecture, which relies on a single filesystem to store git repositories, scaling becomes a significant challenge. Instead of using traditional NAS for file storage, the decision was made to transition to a cloud-based object storage solution like Alibaba OSS, similar to Amazon S3. This transition necessitated a careful redesign of both the Ruby layer and lower-level C components to cope with performance degradation from network I/O.

Key points covered include:
- GitLab Architecture: GitLab functions as a black box that operates through HTTP and SSH, with backends like PostgreSQL and Redis. The architecture can create bottlenecks when scaling due to its reliance on a traditional filesystem for repository storage.
- Innovative Solutions: Pan introduced 'ssh-to-http', a project to translate SSH requests into HTTP requests to simplify server interactions. Load balancing was handled through IPVS for more efficient traffic management during requests.
- Cloud Migration: The need to move GitLab’s storage to a cloud solution was emphasized due to the limitations of existing architecture. Moving to Alibaba OSS aligned better with the need for scalability and ease of maintenance.
- Handling Git Operations: The talk detailed how the design connected Git to LibGit2, transforming how data retrieval operations were performed, especially dealing with packed content, which posed unique challenges due to its structure in Git.
- Performance Concerns: Initial performance benchmarks indicated slower responses post-transition due to the necessary switch from fast filesystem operations to slower HTTP communications. Caching strategies were examined as a means to mitigate these performance losses.
- Future Developments: Pan also hinted at exploring the potential of creating an AWS S3 compatible back-end to enhance deployment flexibility.

In conclusion, effective caching and innovative architectural adjustments are critical to overcoming performance issues encountered with the shift to cloud storage. This talk serves as a comprehensive overview for Rails developers facing similar scaling issues with distributed applications.

00:00:09.670 When I submitted this talk to RailsConf, it was under the track 'We're Living in a Distributed World.' However, I'm surprised to find that mine is the only talk in that track. It seems there are no other presentations addressing the scaling of Rails applications and distributed systems. I believe the reason might be that, as Rails developers, we follow certain best practices that make scaling our applications seem manageable and less problematic. But GitLab presents some unique challenges. I would say it's a tough nut to crack. In this talk, I'm going to discuss how I would address those challenges. Thank you for coming to my talk; my name is Minqi Pan, and I come from China. I work for Alibaba Group. Here’s my GitHub account and Twitter handle; feel free to follow me.
00:01:18.650 So, what is GitLab? Well, it's essentially an open-source clone of GitHub, although people rarely like to admit that. A better way to describe it is as a self-hosted Git repository manager that you can deploy on your machine. It's designed for on-premises installation. Let's do a quick survey: how many of you use GitHub in your organization? Thank you. Now, if you look closely, GitLab functions as a black box that exposes two ports: HTTP and SSH. HTTP serves two purposes: you can clone a repository and push content to a repository. More importantly, as a Rails application, it provides rich user interactions through its web interface. On the other hand, SSH is limited to git clone and git push operations. From a simplistic perspective, it operates in a way that makes scaling quite problematic.
00:02:11.060 If we examine the backend architecture of GitLab, it uses several components. For example, it relies on backends like PostgreSQL for database operations, abstracted through Active Record, allowing for flexibility in the underlying implementation. Additionally, it uses Redis for queue management and caching and employs a filesystem to store Git repositories. Opening up this black box reveals a structured environment containing various components like OpenSSH for SSH operations, Unicorn for HTTP requests, and GitLab Shell for handling Git commands. However, the architecture presents challenges when scaling and maintaining high availability, especially for a large organization like mine.
00:03:05.630 The issue primarily lies in how GitLab stores its repositories on a single filesystem. To address the scaling challenge, we need to think about the HTTP and SSH requests that come in. As Rails developers, we're familiar with using Unicorn instances to handle HTTP requests. We planned to configure Nginx to point to these Unicorn servers and distribute requests among them. However, SSH posed a problem because it interacts with the server differently from HTTP.
00:03:51.190 To resolve this, I started a project called 'ssh-to-http,' which is open-source on my GitHub account. This project essentially translates SSH requests to HTTP requests, streamlining interactions with the server. The decision for GitHub to use HTTP as the default protocol for cloning repositories stems from the complications associated with using SSH, which can introduce additional latency. While many companies may use SSH, our approach at Alibaba was different. Instead of using Nginx as the front end, we opted for something called 'IPVS,' which stands for IP Virtual Server, a feature of the Linux kernel. Unlike Nginx, which operates at the application layer, IPVS performs load balancing at the transport layer, so it can handle any TCP traffic, SSH included.
00:04:55.910 By using IPVS, we addressed some limitations associated with HTTP and SSH, albeit with trade-offs. Because IPVS balances at layer 4, we cannot perform application-level health checks on our servers: we do not get to see HTTP status codes as we would at the application layer. This limitation means we rely solely on monitoring the packets being transmitted. The SSH protocol's inherent complexity, involving key checks and authentication, poses further hurdles. When deploying, it's critical to ensure that host keys are consistent across the cluster; otherwise clients that reach a different machine will see a host key mismatch and refuse to connect.
00:05:34.220 When adding SSH keys for clients, those keys need to be propagated across the entire cluster, which can be problematic. You have to ensure that authorized keys are synchronized across all machines. In our environment, we managed to distribute these keys effectively, avoiding any issues in access or connection. Nonetheless, the real challenge comes from how GitLab stores repositories, which relies on a traditional file system.
00:06:01.370 I’d like to pause here to discuss the Twelve-Factor App principles. GitLab does not adhere strictly to these principles. Specifically, the fourth factor, Backing Services, emphasizes treating backing services as attached resources, making them easily replaceable. In contrast, GitLab stores critical data directly on its file system, leading to issues with scaling and maintainability. We aim to migrate significant content, such as Git repositories and user-generated attachments, to a scalable cloud storage solution.
00:07:06.630 When considering options for cloud storage, several choices are available. One option is GitLab Enterprise's feature called Geo, which offers replication across servers. However, this doesn't resolve our problems at Alibaba, given the vast storage needs of our repositories. The overall size significantly exceeds the capacity of a single machine. In terms of distributed system architecture, Geo relies on a one-master, many-slave replication model and does not offer sharding or effective disaster recovery. This design compromises between consistency, availability, and partition tolerance, only achieving two of the three.
00:08:32.760 Another avenue we explored involves eliminating SSH through my 'ssh-to-http' gem. By focusing on HTTP, we can utilize the routing capabilities inherent in our architecture to manage requests more effectively. The idea is to leverage the namespaces used within GitLab to create sharding logic and distribute requests across clusters of servers. For instance, a simple hashing algorithm applied to the namespace could determine which of three machines should handle an incoming request. While it's an appealing strategy, Sidekiq does not inherently support sharding, and additional complexities arise when managing tasks across shards.
00:09:30.030 Moreover, not every page in GitLab falls into a single shard. For example, if someone accesses the admin page, the request may include information from multiple repositories, complicating task management and retrieval.
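As a rough illustration of the namespace-based sharding idea described above, the following Python sketch hashes the namespace part of a repository path and maps it onto one of three machines. The host names and the `shard_for` helper are hypothetical; the talk only mentions "a simple hashing algorithm" and does not say which one was considered.

```python
import hashlib

# Hypothetical shard hosts; the talk only says "three machines".
SHARDS = ["gitlab-shard-0", "gitlab-shard-1", "gitlab-shard-2"]


def shard_for(repo_path: str, shards=SHARDS) -> str:
    """Pick a shard from the namespace part of 'namespace/project'."""
    namespace = repo_path.strip("/").split("/", 1)[0]
    # Use a stable hash (not Python's built-in hash(), which is salted
    # per process) so every front end routes a namespace the same way.
    digest = hashlib.sha1(namespace.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

All repositories in one namespace land on the same machine, which keeps per-namespace pages local to a shard. Note that plain modulo hashing remaps most namespaces whenever the number of shards changes, which is one more cost to weigh against this approach.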
00:09:48.850 Additionally, we have to consider user authentication levels since SSH commands do not have access to all repositories by default. This complexity requires modifications to application logic, introducing yet another layer of challenge. Each solution we devise comes at a cost, whether in implementation complexity or in performance trade-offs.
00:10:20.490 Now, regarding file system storage, there are several options we could consider. One approach involves creating a Twelve-Factor App by using attachable file systems. Network-attached storage (NAS) is a common solution, as are software-based options like Google’s GFS. We also have the option of using remote procedure calls (RPCs) to handle file operations at a lower level. However, after evaluating all these solutions, we determined that NAS is not feasible for our needs due to its lack of scalability and adaptability to Alibaba's infrastructure.
00:11:12.820 While these alternatives may work for some organizations, we opted for a different solution entirely by moving our storage to the cloud. Our choice was to utilize Alibaba OSS, a service similar to Amazon S3, providing object storage in the cloud. Let's delve into the technical aspects of how we made this transition.
00:11:53.330 GitLab offers several ways to access its Git repositories. One of the components we identified for removal is an old gem called 'Grit,' which is written in Ruby and not necessary for our implementation. By unplugging Grit and plugging in a newer gem that leverages libgit2, we significantly simplified our architecture. This allowed us to maintain component compatibility and modernize our integration approach.
00:12:30.049 The architectural structure we designed connects Git with LibGit2. Git has two types of storage: the object database (ODB) and the reference database (RefDB). ODB stores the actual data chunks within repositories, while RefDB contains branches and tags. When we designed the cloud-based backend, we accounted for both types of storage, ensuring that loose and packed objects were compatible with our cloud infrastructure.
00:13:21.559 For loose objects, the process remains straightforward: we make HTTP requests to read and write data directly from the cloud. RefDB works similarly, placing branch data under the refs directory. While implementing the backend, we could easily translate requests for both ODB and RefDB into HTTP requests. The complexity arises with how we handle packed content, as it demands a more sophisticated approach to data retrieval.
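To make the loose-object case concrete, here is a small Python sketch of how Git derives a loose object's ID and on-disk path, which could then double as an object-storage key. The hashing and `objects/ab/cdef...` layout are standard Git behaviour; reusing that path as the key in the cloud bucket is an assumption, since the talk does not spell out the key scheme. The `loose_object` helper is illustrative only.

```python
import hashlib
import zlib


def loose_object(kind: str, content: bytes):
    """Return (oid, key, payload) for a loose Git object.

    Git hashes "<type> <size>\\0<content>" with SHA-1 and stores the
    zlib-compressed bytes under objects/<first 2 hex>/<remaining 38 hex>.
    """
    header = f"{kind} {len(content)}".encode("ascii") + b"\x00"
    store = header + content
    oid = hashlib.sha1(store).hexdigest()
    # Assumption: the object-storage key mirrors the filesystem path.
    key = f"objects/{oid[:2]}/{oid[2:]}"
    return oid, key, zlib.compress(store)
```

With such a mapping, reading a loose object is a single GET on `key` and writing one is a single PUT of `payload`, which is why this case translates to HTTP so directly.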
00:14:20.070 Packs are crucial in Git for transferring and storing data efficiently. When developing our OSS solution, we learned that reading data from these packs encompasses a more complicated setup. Each pack is associated with an index file that specifies where to start reading, requiring multiple range HTTP requests to fetch the required data correctly. We need to consider various efficiencies to reduce the number of HTTP requests necessary during this process. For example, we can retrieve larger blocks of data to mitigate latency and speed up overall operations.
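One way to picture the "retrieve larger blocks" optimisation is to merge nearby byte ranges before issuing range requests. The Python sketch below is an assumption about how that could be done, not the implementation from the talk; `coalesce_ranges`, `range_header`, and the 4 KiB gap threshold are all hypothetical.

```python
def coalesce_ranges(ranges, max_gap=4096):
    """Merge (start, end) byte ranges, end exclusive, whose gaps are
    at most max_gap bytes, so fewer HTTP requests are needed."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it, accepting
            # a few wasted bytes in exchange for one fewer round trip.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def range_header(start, end):
    """Build an HTTP Range header value; HTTP end offsets are inclusive."""
    return f"bytes={start}-{end - 1}"
```

For example, three object reads at offsets taken from the pack index might collapse into two requests when two of them sit close together in the pack. The right gap threshold depends on the latency and bandwidth of the storage service.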
00:15:36.060 Additionally, because Git employs several commands that call upon each other during operations like fetch or clone, we refrain from altering the protocols that manage data transformation. Instead, we focused on modifying only the components interacting with the filesystem. As a result, the changes translate smoothly, without introducing significant overhead. There are specific cases where we need to re-implement certain Git commands to enable our new architecture to function seamlessly, especially regarding read and write operations across both ODB and RefDB layers.
00:17:27.060 After making these changes, we assessed performance to determine the trade-offs of transitioning to cloud storage. Initially, one might expect slower interactions since we are replacing fast file system I/O with slower HTTP I/O. We conducted benchmarks using a repository called GitLab. This repository contains over 200,000 objects and, when packed, exceeds 100 megabytes in size. In scenarios where we conduct pushes or clones, we find variances in performance due to the way cloud storage is implemented. For instance, large pushes utilize pack storage efficiently, while smaller pushes lead to more loose storage interactions.
00:18:45.950 Push performance may not represent a significant bottleneck; however, operations like 'git clone' become slower. This slowdown occurs due to the underlying range operations that must be performed in the context of the cloud. Similarly, commands that fetch data experience delays since they must process across the new cloud-based infrastructure. The implications on the Rails layer are substantial since every operation cascades down and influences page loading and other user interactions within GitLab.
00:19:34.150 Given that our real operations and Rails interactions have been affected, we recognized that utilizing caching mechanisms would be critical. We implemented cache layers across multiple aspects of our architecture to mitigate the performance impacts of moving to cloud storage. In certain specific operations, before changes, response times were around 50 milliseconds, but after the initial move, we observed that these operations could take upwards of five seconds. This highlighted the need to further optimize and include caching to return to an acceptable performance baseline.
00:21:14.830 For instance, libgit2's design allows for multiple ODB backends, each with a priority, so we can control which backend is consulted first. This means we can put a faster backend in front of the cloud backend to improve read performance. For reads we performed frequently, we could store responses in local file storage, allowing immediate access on subsequent requests rather than continually querying the remote cloud service. Because objects in the ODB are immutable, we can cache them without any concern for data expiry.
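The read path described above can be sketched as a read-through cache: check local files first, fall back to the remote store, and keep a copy for next time. This Python sketch is not libgit2's backend API; `DictBackend` and `CachedOdb` are hypothetical stand-ins used only to show the lookup order.

```python
import os
import tempfile


class DictBackend:
    """Stand-in for the remote object store; counts how often it is read."""

    def __init__(self, objects):
        self.objects = dict(objects)
        self.reads = 0

    def read(self, oid):
        self.reads += 1
        return self.objects[oid]


class CachedOdb:
    """Consult a local file cache first, then the slower remote backend."""

    def __init__(self, remote, cache_dir):
        self.remote = remote
        self.cache_dir = cache_dir

    def read(self, oid):
        path = os.path.join(self.cache_dir, oid)
        try:
            with open(path, "rb") as f:
                return f.read()  # cache hit: no network round trip
        except FileNotFoundError:
            pass
        data = self.remote.read(oid)
        # Write to a temp file and rename so readers never see a partial
        # file. No expiry logic is needed: an object ID always names the
        # same immutable content.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, path)
        return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        remote = DictBackend({"abc123": b"blob contents"})
        odb = CachedOdb(remote, cache_dir)
        odb.read("abc123")
        odb.read("abc123")
        print("remote reads:", remote.reads)
```

A production cache would also need a size limit and eviction policy, which this sketch leaves out.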
00:22:20.460 Further work is required for RefDB since it updates frequently with new commits. Therefore, we must constantly consider cache expiry and invalidation strategies to ensure data remains current. Moving forward, it remains imperative that we retain flexibility in performance-based adjustments while continuing to improve our architecture. It seems that this approach is somewhat effective for our current needs, and if you find it valuable, I am considering creating an AWS S3 version of a similar backend.
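The RefDB caching concern above can be illustrated with a small cache that combines a time-to-live with explicit invalidation on writes. This Python sketch is an assumption about one possible strategy, not the approach taken at Alibaba; the `RefCache` class, its `fetch` callback, and the 5-second default are hypothetical.

```python
import time


class RefCache:
    """Cache ref -> object ID lookups with a TTL and explicit invalidation."""

    def __init__(self, fetch, ttl_seconds=5.0, clock=time.monotonic):
        self._fetch = fetch      # callable that reads a ref from remote storage
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # ref name -> (oid, expiry time)

    def resolve(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]
        oid = self._fetch(name)
        self._entries[name] = (oid, now + self._ttl)
        return oid

    def invalidate(self, name):
        """Call after a push updates the ref so readers see the new commit."""
        self._entries.pop(name, None)
```

The TTL bounds how stale a ref can be when an update happens on another machine, while `invalidate` covers updates that pass through the same process. Unlike objects, refs are mutable, so some staleness window or cross-node invalidation is unavoidable.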
00:23:37.830 An AWS S3 version of the backend is particularly relevant because GitLab is currently not deployable on Heroku. With a backend designed for compatibility with AWS S3, users could potentially harness greater flexibility in their deployments. Concurrently, we still encounter performance costs, such as those from additional processes spawned during operations like fetching commit history. To make deployment easier, it would be beneficial to develop a backend that offers users the choice between using a filesystem or AWS S3.
00:24:23.880 We could also focus on making Rugged, the Ruby binding for libgit2, the default way GitLab accesses repositories, which would improve overall performance. Rugged has already shown advantages in several specific scenarios; thus, we want to consolidate our approach around it to maximize efficiency in GitLab's operations. I'll be sharing further developments on my GitHub account, so if you're interested in tracking progress, you are welcome to explore the repository.
00:25:46.970 Thank you very much for attending my talk, and I appreciate your attention.