Lauren Tan
Building the World's Largest Studio at Netflix

In this talk titled "Building the World's Largest Studio at Netflix" presented by Lauren Tan at RubyConf AU 2018, the evolution and innovative processes at Netflix are discussed with a focus on building an efficient studio for original content production. Tan shares her background and traces the company's transition from DVD rentals to streaming video, along with its drive to not only entertain but to redefine content creation. Key points include:

  • Netflix's Journey: From mailing DVDs to becoming a pioneer in video streaming, Netflix has revolutionized content consumption.
  • Studio Engineering: A dedicated team (approximately 60 individuals) is tasked with building applications that manage and optimize the production of original content. Their mission is to enhance efficiency and support creative storytelling.
  • Application Development: Netflix initially relied on spreadsheets and manual processes for production data, evolving to structured applications like 'Origin Story' which collect comprehensive production information.
  • Choosing Ruby: Despite being a Java-heavy organization, Ruby was selected for its ease of use and quick iteration capabilities, enabling adaptability in an evolving technological landscape.
  • Microservices Architecture: The shift to a microservices approach was crucial for scaling operations without overwhelming complexity. Each service allows teams to work independently while maintaining efficient processes.
  • Tooling and Innovation: Tools like Newt and Spinnaker facilitate project setup and deployment, improving developer workflow and productivity.
  • Data Handling: As Netflix grows, managing vast amounts of production data becomes critical to understanding audience engagement and enhancing content development.

  • Key Takeaway: The culture at Netflix emphasizes flexibility, creativity, and a collaborative spirit, reinforcing the notion that successful technology solutions stem from a supportive team environment. Tan concludes by noting the importance of empathy in their projects, showcasing a commitment to both creators and audiences, and expresses excitement about the future of Netflix’s studio ambitions.

The talk highlights Netflix’s ongoing journey towards redefining its role as a creator, facilitated by innovative engineering practices centered around the Ruby programming language.

00:00:16.560 You can find me on the interwebs as sugarpirate_ or poteto. I'm rubbish at reserving usernames, as you can tell by the underscore and my inability to spell "potato." You might be wondering why I'm here today, and I'm here because of my many skills. One of them is philosophy, which some would say I'm pretty good at. I also have an extensive knowledge of Star Trek, so ask me more about that later. I have a bit of a strange accent; even though I'm Australian, I grew up in Singapore.
00:00:36.730 Then I moved to Melbourne with two Monash trainees, and later I moved to the States—first to Boston, then to California. Since then, I did my first ever talk at Evernote Conference in 2015, and I've been giving talks ever since, mostly about JavaScript or Elixir. This is actually my first ever Ruby conference, so hello everyone! And, of course, here's a customary dog photo. I got a dog when I went to America. Her name is Zelda, as you can tell from her costume.
00:01:04.560 Now that you know everything about me, let's start from the beginning. About 20 years ago, a little company called Netflix did something pretty crazy: we sent out DVDs in the mail, because who likes going outside anymore? Even today, we're still sending out DVDs to our customers. A decade after that, we shipped a billion DVDs. Then we did another crazy thing: we thought, 'Hey, this Internet thing is pretty cool,' so we started streaming video on demand.
00:01:30.670 Despite what other people said about it, we continued to innovate and to make things better. Even today, we continue to make big bets to give people more content that they want. Today, Netflix is clearly the best website on the Internet. We even have our own memes! In 2018, successful companies don't IPO; they have memes, and that's how you know you've made it.
00:02:00.360 In more serious terms, the work we've done on our originals has gotten more recognition. This was supposed to be a screenshot of us winning the Oscar this year for our original documentary, "Icarus," but it's not loading. As you can see, not all my pictures are missing.
00:02:24.270 Some of the work the studio engineering team has done—I'll just describe the picture to you. It was supposed to be a picture of an article from Variety, which is an American online magazine, showcasing some of the work my team has done. Despite our success, we don't want to get too comfortable; we're continuing to do crazier things and producing our own Netflix originals. We're pivoting from being just a streaming service to also becoming a studio, and it hasn't been an easy task.
00:03:33.090 Ultimately, we're doing this because we want to produce the best possible product for every single one of our customers. People's tastes are really broad, even in a single market. Members who love blockbusters, Korean dramas, anime, and zombie films will find that Netflix fills their homepage with relevant and interesting titles. But it took us a while to get to this point, and we have an even longer and more arduous journey ahead of us.
00:04:45.990 Once we started with Internet streaming, it took us a while to get really good at it; but when it came to content, it was another story. Our first project was extremely manual and inefficient. We had a ton of Google Documents, a ton of spreadsheets, and at the time, we didn't automate anything since the volume of content was so low that it didn't matter. At some point, we even had fax machines, post-it notes, and graffiti in the bathroom—any surface that we could write on, we wrote on to take notes about all these things we were making.
00:05:19.380 Eventually, after much effort, we produced our first ever Netflix original. Does anyone know what that is? It's not what you think! The common misconception is that it was "House of Cards," but our first original was actually "Lilyhammer," which was a co-production with another studio. "House of Cards" was, however, our first owned original, and it was the first project that demonstrated that we were onto something.
00:06:02.620 As our studio grew, we continued organizing everything with spreadsheets, which worked for a while but very quickly started not to scale. A couple of years ago, a tiny engineering team formed, and we got to work to answer this question: how do we scale? We released our first studio application called Origin Story a few years ago. Origin Story is a hub for production data, and we collect everything from where and when we shoot, how much it cost us, who's on the cast, the kind of camera they use, who's on the crew, and basically storing any and all kinds of production-related data.
00:07:13.620 Back when Netflix was still largely a Java shop—and it still mostly is—our first application was built with Rails and Ember, thanks to the suggestion of one of our first engineering managers. Since then, that little team has grown into the studio engineering organization. Right now, there are about 60 of us, and our mission is to transform Netflix into the most efficient studio on the planet. We want to help storytellers around the world bring the most creative and compelling original stories to life.
00:08:31.360 Our applications span the entire lifecycle of creating a Netflix original, from pitch to post. This time, we're not just turning spreadsheets into applications; we also work very closely with our data science teams to create step-function innovations—the kind of 10x improvements that really move the needle. Our strategy of betting on original, compelling content seems to be paying off. We had 100 million paying subscribers in April last year, and now we're at 118 million.
00:09:09.750 We actually have more international members than U.S. members now, which is pretty crazy considering Netflix only went global in 2016. We're now in more than 190 countries, and we're spending a ton of money on original content—about $8 billion. It's really exciting for our team because the whole company is behind this to succeed. But at the same time, it's also incredibly challenging building the studio from scratch—challenging yet satisfying, knowing we're helping to reinvent how TV and movies are made.
00:10:03.640 That leads to why I'm here at RubyConf. Why do we choose Ruby to build the Netflix studio? Before we can answer that, I need to give you some context about Netflix and the kind of environment we operate in. The first thing that you need to know about Netflix is that our culture is very unique. Some key aspects are freedom and responsibility, context not control, and being highly aligned yet loosely coupled.
00:10:32.990 Freedom and responsibility helps us avoid an unhealthy emphasis on process. In our case, we were free to choose the right tool for the job. Even though Netflix is traditionally a Java shop, we went with Ruby because we thought it would be a better fit for the problems we wanted to solve, which are very different from the high scalability issues that the Netflix.com product has. Most of our JVM-based tools and services and our open-source work are optimized for high availability and scalability, but these are not the problems we have.
00:11:20.640 Just about a week or two ago, it was Ruby's 25th birthday, and it really got me thinking about why we chose Ruby in the first place and why we continued to use it. One thing that really stood out to me is all of you—the Ruby community. We don't hire brilliant jerks, and the Ruby community is one of the nicest out there. It’s also known for many good ideas like test-driven development, ergonomics, developer productivity, and the notion that code is read much more often than it is written.
00:12:03.060 Most of you might agree that Ruby is pretty easy to read. Another factor in our decision for using Ruby is that the Netflix studio space is constantly evolving; our user base and the processes they use are evolving month by month. We need to react quickly to change, and we're still in a period of immaturity—often, we don’t really know what we're building. We’re trying to innovate, and Ruby helps us explore and push our boundaries.
00:12:42.350 Our challenges are not so much about throughput and raw requests per second, but rather that we collect a large amount of data about the TV and movies we produce. This data lives across Netflix in many different systems. Our technical challenges involve building data-intensive applications in a highly distributed environment. When it comes to technology choices, it's always a trade-off about what you are optimizing for.
00:13:17.659 The number of users we support is in the hundreds to the thousands at most; yet collectively, these hundreds of people control billions of dollars in content. So, again, it wasn't really about high scalability for us; it was really more about the amount of money involved, how little time we have to innovate, and the challenge of reducing the time from idea to business value.
00:14:08.960 To be honest, it didn't really matter what language or framework we chose to solve these problems. We could have easily chosen Python, Elixir, Go, or an obscure language like Scala. But our customers don't care what language we use. This slide comes from my favorite talk at All Things Open last year: massively scalable systems require a significant engineering effort to build and operate, but complex sociotechnical systems are even more challenging.
00:14:57.430 As an industry, we tend to try and solve complex problems with technology, but the truth is that tech alone is not a replacement for a broken culture. If I think about the reason why I choose to continue using Ruby on my team, it comes down to a very simple reason: Ruby makes my team happy, curious, and eager to innovate. How many programming languages do you know that people actually love to write in?
00:15:35.150 Most importantly, with the kind of work we do—reinventing the way TV and movies are made—it's really important that we have empathy for the people whose work relies on ours. In more concrete terms, I'd like to show you what we've done as a team. Ruby lets us solve our user problems really quickly and creatively. But as our team grows, one thing we're always contemplating is developer productivity on a larger scale.
00:16:23.170 Like Ruby on Rails, we've learned to share the burden among ourselves by using well-built and maintained conventions. When we first started out writing these studio applications, they were all extremely isolated—each one could be considered a separate monolith. There wasn't much sharing, and we often reinvented the wheel. Our mantra at the time was simply 'just ship it out.'
00:17:07.200 These days, you cannot go to a conference talk without hearing about microservices. As you probably know, Netflix has thousands of microservices in production, and our organization is no exception. After our initial mantra of 'just ship it out,' we quickly realized that wasn't going to scale for us. So, we reevaluated our approach.
00:17:56.970 One thing I want to stress is that we transitioned to microservices not because it was trendy, but because it was necessary. It only makes sense if your company is large. If you're starting out a new project with microservices, you might want to reconsider; you're adding a lot of complexity with very little utility, especially if you are in a small team.
00:18:36.840 It's really easy to assume that technical choices are the only reason why a business is successful. Just because big companies like Netflix use microservices doesn’t mean you need to use them to be successful. Like most things in technology, nothing is a silver bullet that will solve all of your problems.
00:19:08.360 Microservices are not really a technical solution; they are a way to optimize communication within your organization and help you stay productive as complexity grows. Going back to that culture memo I mentioned earlier, because our teams are highly aligned yet loosely coupled, microservices work really well for us.
00:19:51.150 It's much more efficient for a single team to own and maintain a solid service that everyone needs instead of rebuilding the wheel each time. Now, I'm going to talk about some of the things we've built and extracted. After our realization, we had to quickly scale the way we are building our applications because we weren't just building one application—we were building ten at the same time.
00:20:27.440 Scaling up this way took a lot of work, so we started investing heavily in tools to make it easier to use Ruby in production. For example, we now have a tool called Newt, the Netflix Workflow Toolkit. This command-line tool automates setting up production-ready projects. When we bootstrap a new application, Newt is what we use; it's written in Go and is actually a very simple concept.
00:21:01.830 So, Newt boots up a project with an app type, which is like a blueprint. People write and contribute app types for many different languages and frameworks, including Ruby, and Newt integrates very easily with Docker when you initialize a new project.
00:22:08.910 It sets up a Git repo in our private Bitbucket, builds pipelines, and sets up Spinnaker. Spinnaker is an open-source multi-cloud continuous delivery platform created by Netflix, and it’s really awesome! A Spinnaker pipeline is a workflow for deploying to a cloud provider, typically AWS. For instance, in this pipeline, we're configuring Spinnaker to spin up a Jenkins build to run our tests.
00:22:57.780 At the same time, we're also firing off a request to bake our code into a deployable artifact. If our tests fail, the pipeline is canceled; if they pass, we continue with the process and deploy our application. All of our applications have continuous delivery out of the box. We also set up Jenkins for you, which we use as a general build server.
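The flow described here (tests and bake started together, deploy only when the tests pass) can be modeled in a few lines of plain Ruby. This is only an illustrative sketch and not Spinnaker configuration; the `run_pipeline` method, the stage lambdas, and the artifact name are all hypothetical.

```ruby
# Plain Ruby model of the described flow. It is NOT Spinnaker configuration;
# the stage names and lambdas are hypothetical placeholders.
def run_pipeline(test:, bake:, deploy:)
  # Kick off the test run and the bake at the same time.
  test_thread = Thread.new { test.call }
  bake_thread = Thread.new { bake.call }

  tests_passed = test_thread.value
  artifact = bake_thread.value

  # A failed test run cancels the pipeline before anything is deployed.
  return :canceled unless tests_passed

  deploy.call(artifact)
  :deployed
end

deployed = []
result = run_pipeline(
  test: -> { true },
  bake: -> { "app-image-42" },
  deploy: ->(artifact) { deployed << artifact }
)
puts result
```

Running the bake in parallel with the tests means a passing build does not wait on a second sequential step, while a failing build never reaches the deploy stage.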
00:23:54.860 Once your application is in production, you have access to a whole suite of availability and monitoring tools that we've developed at Netflix. This dashboard, for example, gives us insight into the runtime performance and health of our systems. Tooling aside, Ruby is just a dynamic and fun language in general.
00:24:44.310 One of the applications I manage is called Orion. It ETLs data from multiple services and then creates calendar events which we display to our users. We regularly bulk insert a lot of these calendar events at very regular intervals. Initially, we used vanilla Active Model instances, which was not a smart idea as it was taking up too much memory.
00:25:50.060 So we wrote a tiny class to emulate its API without the overhead, which drastically reduced memory pressure during these bulk inserts. Because the API was the same, it was a drop-in replacement in our code. We could then lean on gems from the Ruby open-source community to quickly bulk insert all these records into Postgres.
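As a rough illustration of the idea, the sketch below uses a lightweight Struct that exposes only a small model-like interface and builds a single multi-row INSERT. The class name, columns, table name, and the `bulk_insert_sql` helper are assumptions made for this example; the talk does not show the actual class or the gem that was used.

```ruby
# Hypothetical stand-in for a full model instance: it exposes only the small
# slice of a model-like API the import code relies on (attribute readers,
# `attributes`, `valid?`), without callbacks or change tracking.
CalendarEventRow = Struct.new(:title, :starts_at, :source, keyword_init: true) do
  def attributes
    to_h.transform_keys(&:to_s)
  end

  def valid?
    !title.to_s.empty? && !starts_at.nil?
  end
end

# Builds one parameterized multi-row INSERT (Postgres-style $1, $2, ...
# placeholders) instead of issuing one INSERT per record.
def bulk_insert_sql(table, rows)
  return [nil, []] if rows.empty?

  columns = rows.first.attributes.keys
  binds = []
  tuples = rows.map do |row|
    placeholders = columns.map do |column|
      binds << row.attributes[column]
      "$#{binds.size}"
    end
    "(#{placeholders.join(', ')})"
  end
  ["INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{tuples.join(', ')}", binds]
end

rows = [
  CalendarEventRow.new(title: "Table read", starts_at: "2018-03-08", source: "schedule"),
  CalendarEventRow.new(title: "First day of shoot", starts_at: "2018-03-12", source: "schedule")
].select(&:valid?)

sql, binds = bulk_insert_sql("calendar_events", rows)
puts sql
```

The statement and its bind values could then be handed to whatever database driver the application already uses.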
00:26:29.850 Now, another thing we've extracted is authentication. This is actually my favorite part of the talk. A couple of months ago, we ripped out authentication from all our applications—which felt great! Deleting code is one of my favorite things. Backing up a bit, many of our users' workflows span multiple applications.
00:27:15.490 When they go from one app to another, they have to log in over and over again. This resulted in a very fragmented and frustrating user experience. If you were using like ten applications every day, you would have to log in to each one repeatedly.
00:28:07.510 We write all our UI as single page applications that are deployed separately from our APIs and microservices. We currently deploy them with a static hosting service we built called Bolt. Bolt is a lightning deploy system that integrates with tools across our company, allowing for a single sign-on experience through shared authentication.
00:28:51.520 So, Bolt is an implementation of the lightning deploy strategy that was demoed by Luke Melia at RailsConf a couple of years ago, and it’s set up with Amazon S3 along with Redis, fronted by Apache. I’m not going to get into too much detail about how this system works, but you can check out the talk for more information.
00:29:28.370 This is a very simplified diagram. Bolt integrates with Meechum, which is our Netflix internal identity service, through an Apache module. Meechum in turn integrates with multiple identity providers like Google, allowing access to our applications and providing a single sign-on experience for our employees, contractors, partners, and vendors.
00:30:16.170 More importantly, Meechum provides a single sign-on experience and supports cool features like multi-factor authentication. When a user first visits one of these applications, if they're not authenticated, they get redirected by Bolt to the Meechum login page. What happens is that the request never actually hits our applications.
00:30:48.920 If the user is authenticated, the user receives a JSON Web Token (JWT) from Meechum. This token is structured to include the appropriate scopes that allow access to all necessary services. The APIs also integrate with Meechum to validate that token. When the request finally hits our API—say, a Rails application—we receive various user claims in the request header; this acts as the current user across all our applications.
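To make the "claims as current user" idea concrete, here is a minimal sketch that reads the claims out of a bearer token and wraps them in a value object. The header name, the claim names (`sub`, `email`, `scope`), and the `CurrentUser` struct are assumptions for illustration, not Meechum's actual format. The sketch deliberately skips signature verification, on the assumption that an upstream layer has already validated the token.

```ruby
require "json"

# Hypothetical value object for the claims an application cares about.
CurrentUser = Struct.new(:id, :email, :scopes, keyword_init: true)

# Decodes a base64url segment (the JWT alphabet, with padding stripped).
def decode_segment(segment)
  padded = segment.tr("-_", "+/")
  padded += "=" * ((4 - padded.length % 4) % 4)
  padded.unpack1("m")
end

# Reads the claims out of a bearer token. It does NOT verify the signature:
# the assumption is that an upstream layer has already validated the token.
def current_user_from(headers)
  token = headers.fetch("Authorization", "").sub(/\ABearer /, "")
  _header, payload, _signature = token.split(".")
  return nil if payload.nil?

  claims = JSON.parse(decode_segment(payload))
  CurrentUser.new(
    id: claims["sub"],
    email: claims["email"],
    scopes: claims.fetch("scope", "").split(" ")
  )
end

# Build an unsigned demo token so the example runs on its own.
def encode_segment(hash)
  [JSON.generate(hash)].pack("m0").tr("+/", "-_").delete("=")
end

demo_token = [
  encode_segment({ "alg" => "none" }),
  encode_segment({ "sub" => "user-123", "email" => "pat@example.com", "scope" => "calendar:read comments:write" }),
  ""
].join(".")

user = current_user_from({ "Authorization" => "Bearer #{demo_token}" })
puts user.email
```

Because every application sees the same claim shape, the same small helper can back a `current_user` method in each of them.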
00:31:35.980 With a well-defined set of claims, it's easy for applications to make use of user-specific metadata. That’s all fine and dandy when a browser client is hitting your API, but when our APIs communicate with other APIs, we use another tool called Metatron, which establishes cryptographic identity in the Netflix ecosystem.
00:32:04.290 We also use Metatron TLS for providing certificates that simplify integration for our services and clients, making it easier to interact with numerous microservices. Because Faraday is very popular and many of our applications use it, we also wrote a simple Faraday adapter. Similarly, we've extracted authorization into a separate service.
00:32:41.720 We manage application-level roles and capabilities in a service called Protego. It provides role-based access control at the project and partner-type level, and we also use this to manage what our external vendors can do with our production data.
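A role-based check scoped to a project can be sketched as follows. The role names, capability strings, and the `AccessPolicy` and `Grant` classes are hypothetical and only illustrate the general technique; they are not the service's real data model or API.

```ruby
# Hypothetical role definitions: each role maps to a set of capabilities.
ROLE_CAPABILITIES = {
  "production_admin" => %w[schedule:read schedule:write budget:read],
  "vendor"           => %w[schedule:read]
}.freeze

# Hypothetical user-role mappings, scoped per production project.
Grant = Struct.new(:user_id, :project_id, :role, keyword_init: true)

class AccessPolicy
  def initialize(grants)
    @grants = grants
  end

  # A user may use a capability on a project only if one of their roles
  # on that specific project includes it.
  def allowed?(user_id:, project_id:, capability:)
    @grants.any? do |grant|
      grant.user_id == user_id &&
        grant.project_id == project_id &&
        ROLE_CAPABILITIES.fetch(grant.role, []).include?(capability)
    end
  end
end

policy = AccessPolicy.new([
  Grant.new(user_id: "ana", project_id: "show-1", role: "production_admin"),
  Grant.new(user_id: "vfx-vendor", project_id: "show-1", role: "vendor")
])

puts policy.allowed?(user_id: "vfx-vendor", project_id: "show-1", capability: "schedule:read")
puts policy.allowed?(user_id: "vfx-vendor", project_id: "show-1", capability: "budget:read")
puts policy.allowed?(user_id: "ana", project_id: "show-2", capability: "schedule:read")
```

Scoping each grant to a project is what lets an external vendor see one production's schedule without gaining access to any other production's data.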
00:33:36.950 For most of you, managing applications in conjunction with reports can be a chore. Unlike Devise, we didn't just delete Pundit; we stored our policies in a separate service and managed user-role mappings through Starship, which is another service. As you can tell, we have a lot of services!
00:34:27.540 When we produce content, we often work with various partners and vendors who need access to some of our production data. We don’t want to give them all our data—just what they need to do their job, of course. Starship helps us onboard these people and grant them the appropriate permissions for the production project they are working on.
00:35:27.220 Our production admins have a UI where they manage users needing access to that data. In addition to services, we’ve also built many internal gems. So far, we've only open-sourced one, but more are coming. The first one we've open-sourced deals with JSON API spec serialization.
00:36:13.750 For context, many of our single-page applications are Ember applications, while others are React applications. As a convention, most of them follow the JSON API specification for APIs built in JSON.
00:36:46.570 We used to use Active Model Serializers, but unfortunately, it's not as actively maintained as it once was, and its performance wasn’t optimized. We found that 50% of the time spent serving requests was being consumed by serializing the data into this format; we knew we had to go faster. One of our teams wrote a gem that’s optimized to serialize generic Ruby objects into this JSON API schema, and our benchmarks showed a 25 to 40 times improvement over Active Model Serializers.
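For readers unfamiliar with the format, the sketch below shows what serializing a plain Ruby object into a JSON API resource object looks like. The `ResourceSerializer` class and `Movie` struct are hypothetical and written only for this illustration; this is not the open-sourced gem's API and makes no attempt at its performance optimizations.

```ruby
require "json"

# Hypothetical plain Ruby object; anything with an `id` and readers works.
Movie = Struct.new(:id, :title, :year, keyword_init: true)

# Minimal serializer sketch: declares a type and attribute list once, then
# builds the JSON API "resource object" shape (type, id, attributes).
class ResourceSerializer
  def initialize(type:, attributes:)
    @type = type
    @attributes = attributes
  end

  def resource(record)
    {
      "type" => @type,
      "id" => record.id.to_s,
      "attributes" => @attributes.to_h { |name| [name.to_s, record.public_send(name)] }
    }
  end

  def serialize(record_or_records)
    data = if record_or_records.is_a?(Array)
             record_or_records.map { |record| resource(record) }
           else
             resource(record_or_records)
           end
    JSON.generate({ "data" => data })
  end
end

serializer = ResourceSerializer.new(type: "movies", attributes: %i[title year])
json = serializer.serialize(Movie.new(id: 1, title: "Icarus", year: 2017))
puts json
```

Declaring the attribute list up front, rather than discovering it per record, is one of the simpler ways a serializer can avoid repeated work on large collections.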
00:37:40.430 Another common application concern is feature flags, which we use heavily to let us deploy frequently without holding up work-in-progress. As you might have guessed, we have a service for this too. This generic service is known as Fast Property, and its UI was integrated into Spinnaker, the multi-cloud delivery system I mentioned earlier.
00:38:42.170 It's convenient because we can manage all our feature flags from this UI, and you can use it not only for feature flags but also to change any property dynamically at runtime should you need to. So, it's a gRPC service, which makes it really easy to write a gem that wraps it.
00:39:29.730 Increasingly, more of our services are moving toward gRPC, allowing us more freedom to choose the right language for the job. Traditionally, since all these services were written in Java, there was an expectation that other services would also be in Java.
00:40:15.890 The only client the teams would produce would be Java, but now it's much easier for us to create Ruby clients for all these different services, making interactions straightforward in a polyglot way. So, as you can see, the gem is pretty simple; the consuming applications don’t need to know how it communicates with that service.
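The shape of such a wrapper gem might look like the sketch below, where the consuming code only calls `enabled?` and the transport is injected. The `FeatureFlags` class, the flag names, and the lambda transport are hypothetical; a real gem would presumably call a generated RPC client stub instead of a lambda.

```ruby
# Hypothetical wrapper: callers ask for a flag by name and never see how the
# value is fetched. `transport` is anything responding to `call(key)`; in a
# real gem it would be an RPC client stub.
class FeatureFlags
  def initialize(transport:, defaults: {})
    @transport = transport
    @defaults = defaults
  end

  def enabled?(key)
    value = @transport.call(key)
    return @defaults.fetch(key, false) if value.nil?

    value == true || value == "true"
  rescue StandardError
    # If the property service is unreachable, fall back to the safe default.
    @defaults.fetch(key, false)
  end
end

# In-memory stand-in for the remote property service.
remote = { "new_calendar_ui" => "true", "bulk_export" => false }
flags = FeatureFlags.new(
  transport: ->(key) { remote[key] },
  defaults: { "beta_comments" => true }
)

puts flags.enabled?("new_calendar_ui")
puts flags.enabled?("bulk_export")
puts flags.enabled?("beta_comments")
```

Injecting the transport keeps the wrapper easy to test and lets the underlying protocol change without touching the consuming applications.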
00:41:01.120 Our UI can even directly access UI-specific feature flags since it exposes a RESTful interface. Another common feature we've extracted is comments. Comet is a commenting service at Netflix, allowing users to comment on anything! Some of our applications have recently started using this service while others are slowly migrating.
00:41:48.230 This application, in particular, Origin Story, even has WebSocket support for real-time comments. This is all fine when you're building Rails applications, but if you're writing a smaller microservice in Ruby, one of the things to consider is service discovery.
00:42:35.580 Service discovery is used for locating services for load balancing and failover. Netflix uses an internal service registry called Eureka. Mid-tier services that aren't already fronted by a load balancer typically need to be able to load balance via Eureka, and in the studio space, we tend to have application load balancers in front of our APIs.
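The general technique of client-side load balancing against a service registry can be sketched with an in-memory registry and a round-robin client. Both classes, the service name, and the addresses are hypothetical; this does not use or imitate Eureka's actual API.

```ruby
# Hypothetical in-memory registry standing in for a real service registry.
class Registry
  def initialize
    @instances = Hash.new { |hash, name| hash[name] = [] }
  end

  def register(name, address)
    @instances[name] << address
  end

  def deregister(name, address)
    @instances[name].delete(address)
  end

  def instances(name)
    @instances[name].dup
  end
end

# Client-side load balancer: looks up the registered instances on every call
# and rotates through them round-robin.
class RoundRobinClient
  def initialize(registry, service)
    @registry = registry
    @service = service
    @counter = 0
  end

  def next_instance
    candidates = @registry.instances(@service)
    raise "no instances registered for #{@service}" if candidates.empty?

    address = candidates[@counter % candidates.size]
    @counter += 1
    address
  end
end

registry = Registry.new
registry.register("comments", "10.0.0.1:8080")
registry.register("comments", "10.0.0.2:8080")

client = RoundRobinClient.new(registry, "comments")
picks = 3.times.map { client.next_instance }
puts picks.inspect

# Failover: once an instance is deregistered, traffic only goes to the rest.
registry.deregister("comments", "10.0.0.1:8080")
puts client.next_instance
```

Looking up instances on each call is the simplest way to pick up registry changes; a production client would more likely cache the list and refresh it periodically.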
00:43:08.470 In addition, we wrote a large number of other services, but that should give you a high-level overview of what it's like to develop studio applications at Netflix. We started out by writing very simplistic, vanilla Rails applications. We often abused Active Model callbacks, which was not fun. But increasingly, we are moving towards a highly distributed, data-intensive environment.
00:44:00.270 Our organization is still quite young, around two to three years old, and we are really just at the beginning of scaling up our engineering efforts. It's still a very long road ahead, but we're excited to scale up to tackle the challenges. There are billions of dollars worth of content, and the users we support number in the hundreds to thousands.
00:44:48.860 Some of our technical challenges ahead involve enhancing Ruby on the Netflix paved road with more platform support, contributing back to the community with our open-source work, and generally moving fast without breaking things. Again, our challenges are not primarily about throughput but rather the large amount of data we collect about the content we produce.
00:45:46.290 We need to build many different services to make sense of this data and help our creatives unlock key insights about the original content we're producing for our users. In conclusion, we covered a lot today. We first spoke about the origin story of Netflix and how we constantly improve, innovate, and take big bets.
00:46:34.050 We started producing our own original series; it went well, and we wanted to do more, but we soon realized we weren't equipped to scale this up. We built our first studio application in Ruby because we see the studio engineering organization as a well-funded internal startup within Netflix.
00:47:11.040 We want to allow our teams to be fast and creative. We continue to choose Ruby for a simple reason: it makes our team happy, curious, and eager to innovate. Most importantly, with the kind of work we do—reinventing how TV and movies are made—it's crucial we have empathy for the people whose work relies on ours.
00:47:36.020 To achieve this, we shared the pain and extracted common features into various services maintained by dedicated teams. We're building the studio on top of this platform, allowing us to move faster as we scale up our operations.
00:48:33.090 If you're interested in helping us, we're hiring! There's a link below. If you enjoy working with many microservices and tackling latency issues, come talk to me. Thank you for having me!