
Castle On a Cloud: The GitHub Story

Ben Bleything • February 20, 2014 • Earth • Talk

The video titled "Castle On a Cloud: The GitHub Story" features Ben Bleything from GitHub, focusing on the infrastructure that supports their internal applications, primarily built on AWS and Heroku. Although GitHub.com receives significant attention, this presentation sheds light on the often-overlooked aspects of their cloud infrastructure, which includes over 300 AWS instances, numerous Heroku dynos, and various other cloud services.

Key points discussed include:
- Infrastructure Overview: GitHub relies on a diverse infrastructure with hundreds of internal applications. Most of these applications are hosted on Amazon EC2 and Heroku, highlighting the reliance on cloud services to maintain operational efficiency.
- ChatOps: GitHub's operational model leans heavily on ChatOps, allowing the distributed Ops team to automate processes and communicate effectively via Campfire, integrating operational tasks with chat capabilities through tools like Hubot.
- Data Center Specifics: The presentation describes a data center in rural Virginia that powers the main GitHub.com site, served mostly by high-density Dell C-series sled servers with no virtualization involved in serving the website.
- Resource Management in AWS: GitHub utilizes various AWS services, such as EC2, RDS, S3, CloudFront, and more, to manage their resources while facing challenges like AWS's resource limits, particularly with S3 buckets.
- Identity and Access Management (IAM): Bleything discusses utilizing IAM to manage AWS credentials more efficiently, emphasizing the ability to consolidate storage locations while ensuring proper access controls.
- Operational Challenges and Solutions: AWS's hard cap on S3 bucket creation, combined with an early internal decision to create separate buckets per application and per environment, drove workarounds such as consolidating environments into prefixes within a single per-application bucket. This adjustment significantly reduced their bucket count.
- Cloud AWS Tool: The Cloud AWS tool allows team members to interact with their EC2 resources via chat, showcasing how automation streamlines operational tasks and improves response time.
- Focus on Database Management: Bleything explains the organizational preference for MySQL (typically on RDS) over Heroku Postgres, driven by existing expertise and tooling, though developers comfortable with Postgres are free to use it.

In conclusion, Ben Bleything's talk highlights the complexities and strategies GitHub employs in managing their vast cloud infrastructure. The emphasis on automation, proper resource management, and the innovative use of ChatOps underpins their operational success. The main takeaway is that effective integration of cloud services and tools significantly enhances operational capabilities while addressing challenges associated with resource limits.


When you think "GitHub", you're probably thinking of what we lovingly refer to as GitHub Dot Com: The Web Site. GitHub Dot Com: The Web Site runs on an incredibly interesting infrastructure composed of very powerful, cleverly configured, and deeply handsome servers. This is not their story.

This is the story of the other 90% of our infrastructure. This is the story of the 350 AWS instances, 250 Heroku dynos, and dozens of Rackspace Cloud, Softlayer, and ESX VMs we run. This is a story of tooling and monitoring, of happiness and heartbreak, and, ultimately, of The Cloud.


Big Ruby 2014

00:00:19.920 Sorry about that. My name is Ben, and I work at GitHub in the Ops group. My timer didn't start, so I'm sorry about that.
00:00:30.240 I work specifically on supporting, maintaining, and building infrastructure for our internal applications. You may have heard other Hubbers talking about things like Team, which is our internal Twitter-like social networking site, or Hire, where we track candidates and interviews. We have about 150 applications that we use internally. They're not all web apps; some of them are utilities and services that enable other things. But there's a lot of them, and my team supports all of them. All of the internal tools and infrastructure are on EC2, either directly with instances that we run or via Heroku.
00:01:05.120 And that's what I want to talk to you about today. Personally, as a technologist, I really love the cloud. I think it's a super cool concept, and I actually know what it means; I don't just use it as a marketing term. I think it's awesome. I really like the idea that I can say, 'Hey, I need 25 cores right now for 10 minutes,' and then get them and throw them away when I'm done. I love the elasticity and the on-demand nature of compute resources. It's just super cool and really exciting to me.
00:01:41.680 Our use of AWS and Heroku at GitHub isn't anything groundbreaking. Almost everything we do is something that we've heard other people doing and copied. We take some notes from Netflix; they're huge AWS users. A lot of our Ops team came from Heroku, and so they know the right ways to interact with it. None of it is groundbreaking, but I still think that there's a lot of stuff in there that's kind of cool and a little bit different. And that's what I wanted to share with you today. My hope is that by the end of this talk, you'll be able to come up to me and tell me what I'm doing wrong, what things we can do better, and where we can improve. Maybe you'll also build some of the things that we haven't gotten to yet and open-source them so I don't have to build them.
00:02:22.239 I'm going to be talking about ChatOps a little bit. If you haven't heard this term before, it's how we refer to our way of working. GitHub is a little more than 60 percent distributed, meaning not everyone is in San Francisco. In fact, nobody from the Ops group is in San Francisco at all. There are 16 or 17 of us, and no city has more than three Ops people.
00:02:58.880 So while GitHub is very focused on using chat—Campfire is what we use as our primary method of communication—the Ops team tries to use it as our primary method of work as well. This means automating all of our common operational tasks through Hubot. I'll show you some examples later, but if you want a really good background on this, go look up Jesse Newland's talk from RubyFuza last year (2013). It's on YouTube, and it's a great talk and a really good introduction to the concept of ChatOps in the context of our Ops team.
00:03:44.760 For a sense of scale, I'd like to give you an idea of what our infrastructure looks like. The numbers here are deliberately vague and I apologize for that, but I've been told that I can't share the real numbers. We have a data center located in the middle of nowhere in Virginia, which has some advantages that I'll get to in a second. We have a few hundred machines there—more than a few hundred but less than a thousand. They're mostly Dell C-series, which are high-density sled servers. If you're a hardware nerd like I am, come find me later, and I'll tell you all about them because I think the C-series chassis is super cool.
00:04:18.080 The data center powers pretty much all of GitHub.com. When you go to the website, most of what you're seeing is being served out of the data center. There are a few systems that are hosted on EC2, but not many. Interestingly, we do almost no virtualization in the data center; what we do run is in service of our continuous integration and build systems. No virtualization is involved in serving GitHub.com, which surprises me every time I think about it, but that's what we do.
00:04:58.560 One of the advantages of being in the middle of nowhere in Virginia is that we are practically right next door to the US East 1 region of AWS. This is cool because it means our latency is basically a function of the speed of light. We're very close to US East 1, and I'll talk about in a few minutes why that is as exciting as I think it is. Anyway, we have instances and resources in five regions of AWS. They're primarily in US East 1, but we also have resources in Europe, US West, and a couple in the APAC region.
00:05:44.560 We have a little over 300 EC2 instances in total, a little over 40 RDS instances, and 250 terabytes of EBS across more than 400 volumes. We've got over 100 S3 buckets, which will be relevant in a moment. We also use CloudFront, DynamoDB, SQS, SNS, SES, Elastic MapReduce, ElastiCache, and a whole bunch of other AWS services in small ways. Additionally, we have 150-plus applications running on Heroku. I don't know the exact number, but it's more than 150. To be fair, a lot of those are staging instances or experiments, but they're all things that take up my time and energy. We have about 230 dynos total running those applications.
00:06:32.000 So obviously, a lot of them are single dyno apps, but some of them are actually scaled up for real use. We have more than 50 Heroku Postgres instances and, actually, I have an exact number for this one: 264 instances of 25 different add-ons. So we use Heroku pretty seriously. We also have a handful of servers in other places. We've got some physical hardware that we've leased sitting in other people's data centers. We have some VMs that we've leased from other cloud providers. Most of this stuff is pretty uninteresting, with the possible exception of the Enterprise Team's build system, which is a three-node ESX cluster that runs a few hundred VMs.
00:07:16.080 The Enterprise build product is really cool, but I don't know much about it, so I can't tell you about it. It's probably far more interesting than what I'm about to tell you. If you ever get the chance to talk to the Enterprise folks at GitHub, make sure to impress them with questions about it because it's really cool. Early on, our use of AWS was meant almost exclusively to supplement our physical infrastructure. The idea was that if we had a burst of traffic or our queues got really backed up, we could just throw some cloud resources at it and make those issues go away. And it works pretty well for that; it's sort of what EC2 was originally designed for.
00:08:09.760 At some point, however, we started moving faster than we could procure hardware, and we got to the point where we needed to deploy new applications, but we didn't have hardware for them, so we used EC2. This meant that we ended up having more permanent things living on EC2 than we had originally intended, and certainly more than we had built tooling for. These days, most of what's running on EC2 are things that were prototyped and built before we had hardware, and they're just waiting to move onto real hardware once it's available. The exception is things that actually require elasticity: our analytics team, for example, has a large cluster of compute servers in EC2 for data processing, and that's probably going to stay there forever because it just makes more sense.
00:09:07.440 Then there are apps that, for various reasons—mostly technical but sometimes not—don't work very well on Heroku. That's almost everything we've got in AWS. One of the things that has been really challenging for us is AWS's resource limits. When you create a new AWS account, one of the things you'll face is not being able to start more than 115 EC2 instances, which seems like a lot but isn't. Or 40 RDS databases; again, it seems like a lot, but those limits run out pretty quickly. These are the limits that we've actually hit: we have way more EC2 instances than that, we have more reservations, a lot more RDS instances, and more EBS storage.
00:09:50.560 But there's one limit that really hurts the most, and it's S3. The reason is that this is the only limit on AWS that you can't have raised. Everything else you can file a ticket for, and if you can justify it, they'll turn it up for you, and you can run more servers or whatever else. However, you cannot, under any circumstances, get more S3 buckets. Early in our growth, we made some bad decisions, and one of those was creating separate S3 buckets for every application—in some cases, separate buckets for every environment. So production, development, and staging would have separate S3 buckets, which means you run out really fast. If you run 30 applications, you're quickly out of buckets, and we run 150.
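The arithmetic behind that crunch is straightforward; a quick sketch using the numbers from the talk (150 apps, three environments, and the 100-bucket cap mentioned later):

```ruby
# Bucket-per-app-per-environment blows past the (then-unraisable)
# S3 bucket cap very quickly.
apps         = 150
environments = 3    # production, staging, development
bucket_cap   = 100

buckets_needed = apps * environments
puts buckets_needed               # => 450
puts buckets_needed > bucket_cap  # => true (4.5x over the limit)
```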
00:10:27.520 So we need to find ways to deal with this. Probably the thing that we do on AWS that I think is the coolest is using IAM, which is the Identity and Access Management service. It doesn't sound particularly exciting, but it lets you do some really cool stuff. Effectively, it's like LDAP; it's a glorified user manager. But what it does is it allows you to create a set of credentials that access a specific subset of resources on AWS. This lets us consolidate our buckets down and say that each application gets one bucket, and the development, production, and staging sections are prefixes in that bucket.
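As a minimal sketch of the per-prefix credentials idea, here is a Ruby helper that builds an IAM policy document allowing reads and writes only under one environment's prefix of a shared bucket. The bucket name, prefix, and exact action list are illustrative, not GitHub's actual policies:

```ruby
require "json"

# Build an IAM policy document scoping a set of credentials to a
# single environment prefix inside a shared per-app bucket.
def prefix_policy(bucket, prefix)
  {
    "Version" => "2012-10-17",
    "Statement" => [
      {
        # Object-level access only under the environment's prefix.
        "Effect"   => "Allow",
        "Action"   => ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
        "Resource" => "arn:aws:s3:::#{bucket}/#{prefix}/*"
      },
      {
        # Listing is allowed, but only for keys under that prefix.
        "Effect"    => "Allow",
        "Action"    => ["s3:ListBucket"],
        "Resource"  => "arn:aws:s3:::#{bucket}",
        "Condition" => { "StringLike" => { "s3:prefix" => ["#{prefix}/*"] } }
      }
    ]
  }
end

puts JSON.pretty_generate(prefix_policy("example-app", "staging"))
```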
00:11:06.719 Each one has a set of credentials that can only read and write from that prefix. This has helped a lot, reducing our bucket load by a factor of three where we've used it; we haven't universally rolled this out, but it opened us up to doing some even cooler things. One of the concepts in IAM is that of a role. A role is not a user, but it is an entity that you can attach policies to, and then users, or other resources like EC2 instances, can assume that role and gain the rights that it has.
00:11:50.720 So, in a way, this is kind of like sudo mode for AWS: I can say, 'I need credentials for 20 minutes that can access a certain bucket,' and I can go to STS (the Security Token Service) and request to assume that role. Assuming I have permission to do so, I get credentials that let me do whatever that role can do. Where it starts to get super cool is when you mix it with instance profiles, which are a way to attach a role permanently to an EC2 instance. When you launch an instance, it can have one of these IAM roles attached to it, and the credentials are put in the instance metadata.
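The "sudo mode" flow described above might look roughly like this in Ruby. The role ARN and session naming are hypothetical, and the actual STS exchange (shown commented) would need the aws-sdk gem and valid base credentials:

```ruby
# Build the parameters for an STS AssumeRole request: a role to
# assume, a session name, and how long the temporary credentials
# should live.
def assume_role_params(role_arn, minutes: 20)
  {
    role_arn:          role_arn,
    role_session_name: "ops-#{Time.now.to_i}",
    duration_seconds:  minutes * 60
  }
end

params = assume_role_params("arn:aws:iam::123456789012:role/example-app-staging")

# With the aws-sdk gem, the actual exchange would be along these lines:
#   sts   = Aws::STS::Client.new
#   creds = sts.assume_role(params).credentials
#   # creds.access_key_id / creds.secret_access_key / creds.session_token
```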
00:12:50.080 Your app doesn't have it; you just fetch it out of the metadata. Those credentials are automatically rotated; last time I checked, it's every 75 minutes or so. So every 75 minutes, you're getting a fresh set of credentials that you can just read and use for whatever you need. This may be a little bit confusing. The way that it all fits together is this: you have an S3 bucket for your app with prefixes for each environment, then you have IAM roles that match each environment—your app production, your app staging, your app dev. When you create this instance, down here, the staging instance, you apply the staging IAM role to it, and then it has credentials on the machine. You didn't have to put them in your application. They're just there, and that has access only to the staging prefix in that bucket.
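As a sketch, fetching and parsing those metadata credentials might look like this. The endpoint path is the real instance-metadata path, but the sample payload below is made up for illustration:

```ruby
require "json"
require "time"

# The instance metadata service is a plain link-local HTTP endpoint;
# the role's rotating credentials come back as JSON at this path.
METADATA_PATH = "/latest/meta-data/iam/security-credentials/"

def parse_instance_credentials(json)
  doc = JSON.parse(json)
  {
    access_key: doc["AccessKeyId"],
    secret_key: doc["SecretAccessKey"],
    token:      doc["Token"],
    expires_at: Time.parse(doc["Expiration"])
  }
end

# Illustrative payload; a real instance would fetch this from
# http://169.254.169.254#{METADATA_PATH}<role-name> and re-fetch
# before expires_at, since AWS rotates the credentials automatically.
sample = <<~JSON
  {
    "AccessKeyId": "ASIAEXAMPLE",
    "SecretAccessKey": "secret",
    "Token": "token",
    "Expiration": "2014-02-20T12:00:00Z"
  }
JSON

creds = parse_instance_credentials(sample)
puts creds[:expires_at]
```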
00:13:41.760 You can attach a whole bunch of other policies; in fact, almost every single action you can take on AWS can be limited by these policies. This is just the one we use the most, and we're looking to roll out a lot more as we go along. The other related thing is cross-account access controls. These allow you to have two accounts and say, 'This account can access resources in that account.' So if you absolutely need more than 100 buckets, you can just create a new account and start putting buckets over there, then write policies that allow resources in the first account to access buckets in the other account.
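A cross-account grant like this can be sketched as a bucket policy attached to the bucket in the second account; restricting the allowed action to s3:PutObject gives the write-only behavior described next. The account ID and bucket name here are made up:

```ruby
require "json"

# Bucket policy granting another AWS account write-only access:
# it may PutObject into the bucket but never read anything back.
def write_only_dropbox_policy(bucket, writer_account_id)
  {
    "Version" => "2012-10-17",
    "Statement" => [
      {
        "Sid"       => "WriteOnlyDropbox",
        "Effect"    => "Allow",
        "Principal" => { "AWS" => "arn:aws:iam::#{writer_account_id}:root" },
        "Action"    => ["s3:PutObject"],
        "Resource"  => "arn:aws:s3:::#{bucket}/*"
      }
    ]
  }
end

puts JSON.pretty_generate(write_only_dropbox_policy("backups-dropbox", "123456789012"))
```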
00:14:30.800 The way we use that is to create secure write-only dropboxes. We have our main account, but they need to ship data off to somewhere else, usually for database backups and things like that. They have credentials that can only write to a bucket in another account; they can't read the data back, only write to it, and that is super handy for us. Now I talked a little bit about ChatOps earlier, and I think some examples would be good. So the question is: how do we ChatOps EC2? We have this tool called Cloud AWS. You just say '/cloud aws' in chat, and you get tons of different things.
00:15:14.320 One of the things you can do is list all of the instances. This is just a small snippet of the 250 that are currently running. It has a whole bunch of stuff off the edge of the screen, too, about their security groups, their IAM roles, and all this other metadata; it's kind of instant access to all of our EC2 instances. You can also provision a server via chat. For example, I can say, 'Create me a server called temp-bleything-2 and have it be an m3.medium.' It took 1345 seconds to do it, but then I got all this information back with the IP addresses and everything else, and I just have a server ready to go.
00:15:59.959 One of the cool things that happened behind the scenes was that it automatically created an app and environment pair, security group, and IAM role. Those roles can only be assigned at launch time when creating instances. So we automatically create those if they don't already exist and assign them to the application when we provision the instance, so if we decide we want to use that role later, we can just go write policies into it without having to reprovision the instance to add those features. You can also delete instances, but the magic word for that changes every day.
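GitHub's real tooling here is Hubot (typically CoffeeScript scripts), but the command-matching pattern can be sketched in Ruby; the commands mirror the talk, and the handlers are toys:

```ruby
# Toy chat-command router in the Hubot style: register regexes,
# dispatch incoming messages to the first matching handler.
class ChatBot
  def initialize
    @commands = {}
  end

  def hear(pattern, &handler)
    @commands[pattern] = handler
  end

  def receive(message)
    @commands.each do |pattern, handler|
      if (m = pattern.match(message))
        return handler.call(m)
      end
    end
    "unknown command"
  end
end

bot = ChatBot.new

bot.hear(%r{\A/cloud aws list\z}) do |_m|
  "i-0abc123  m3.medium  temp-example-1  running"
end

bot.hear(%r{\A/cloud aws create (\S+) (\S+)\z}) do |m|
  "provisioning #{m[2]} instance named #{m[1]}..."
end

puts bot.receive("/cloud aws create temp-example-2 m3.medium")
# => provisioning m3.medium instance named temp-example-2...
```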
00:16:53.600 The thing about Heroku is that they're a Postgres shop, while we are primarily a MySQL shop, which isn't super handy. There are MySQL add-ons available for Heroku, but I haven't had good luck with any of them. As a policy, we rely on developers to manage their databases. If you're running on Heroku and you're comfortable with Postgres, feel free to use it. Personally, I'm a Postgres guy and I'm happy to support my developers using Postgres all day long.
00:17:44.480 However, since as an organization, all of our expertise is in MySQL, many developers prefer to stick with that. As a result, we tend to put them on RDS rather than using add-on providers because we have a lot of tooling around RDS already, as we use it elsewhere. All the Cloud AWS tools I previously mentioned—everything you can do for an EC2 instance, you can also do for an RDS instance, which makes it a more seamless choice for us.
00:18:36.800 Now, I'm out of time, and this was a briefer overview than I intended, but I hope it was somewhat interesting. Thanks for your time.