Surrounded by Microservices

00:00:11.970 My name is Damir Svrtan. I work in the Studio Engineering Organization at Netflix, which is often referred to as the pioneer of the microservice movement. After being there for the last year and a half, I want to share my practical experience of being surrounded by microservices and some of the patterns our team uses to tackle distributed data.

00:00:22.540 Before I delve into microservices and related topics, I want to provide a brief history of Netflix. Many of you are likely aware that Netflix existed for a decade before becoming a streaming service. Initially, it operated by shipping DVDs to homes in red envelopes, allowing users to watch movies as often as they wanted without incurring late fees. Fun fact: this DVD rental business still exists and is used by more than two and a half million people in the U.S.

00:00:41.230 The first phase of Netflix was its DVD rental service. The second phase began in 2007 when Netflix transitioned to streaming, which changed how people consumed content. This was a risky move as streaming technologies were not yet reliable. However, Netflix soon provided a way for viewers to enjoy great content without being interrupted by endless commercials. In 2013, another significant shift occurred when Netflix began producing original content, starting with 'House of Cards' and followed by hits like 'Orange Is the New Black', 'Stranger Things', and 'Black Mirror'. Today, Netflix produces hundreds of original shows and movies, surpassing the output of many major film studios combined.

00:01:14.080 This was a substantial change for Netflix, which transitioned from a fully automated technology company to a player in the entertainment industry. The movie-making business is a century-old industry and, in many ways, it has not changed significantly since its inception. Most work is still done on paper, and it requires a significant amount of manual labor. For instance, does anyone know what this is? It's a fax machine. I personally have no idea how to use one, but every movie production relies on fax machines to distribute vast amounts of paperwork.

00:01:40.409 To give you a sense of scale, some movie productions print over 50,000 pages within the first week of shooting. At the scale Netflix operates, traditional methods simply do not suffice. The challenges faced by movie productions and film studios are immense, ranging from deciding which scripts to produce to acquiring content, discovering talent, managing contracts, executing payments, and finding shooting locations worldwide. On top of that come localization, subtitles, dubbing, and marketing; the list is endless.

00:02:42.740 Netflix addresses these challenges by utilizing technology to create applications that cover key aspects of every production. This led to the establishment of a fully digital studio, designed to automate the mundane tasks that creatives prefer to avoid. We are talking about a suite of over 50 applications that manage the diverse areas mentioned earlier and much more. Netflix initially approached the entire studio space with a Rails application, allowing rapid development and quick changes, despite having no prior knowledge of the domain.

00:03:20.160 There are numerous applications at Netflix, but this Rails backend remains the largest within Netflix Studios. At one point, there were over 30 developers working on it, and it features more than 250 database tables. As time progressed, the individual areas began to develop specialized applications, which made it possible to break up the Rails monolith and migrate data into microservices. This decision was not driven by performance issues; the aim was to establish boundaries around the various domains, since many people work in each of those areas.

00:04:30.330 I joined Netflix about a year and a half ago, and upon joining, our team began working on a new app that would cross multiple business domains. We decided to build this app using Ruby on Rails because of our team's expertise and Rails' support for agile development. This approach enhances team velocity, minimizes time to market, and provides an incredible array of gems, allowing us not to reinvent the wheel.

00:05:15.990 One key characteristic of Rails is its pragmatic application of the Active Record pattern. This pattern, however, couples models tightly to the database, combining domain objects, business rules, and validations in a single class. While this is effective for rapid development, it also produces models that are hard to work with when the data is distributed.

00:06:01.990 Working in Netflix Studios presents unique challenges, especially as we deal with a considerable amount of distributed data. For projects that span multiple domains, the necessity to navigate various data sources becomes apparent. The environment is polyglot, with user data potentially residing behind REST API endpoints, while other movie information may be accessible via gRPC or GraphQL endpoints. Moreover, some data could be stored in a local database, highlighting the complexity of the integration.

00:06:46.130 At the start of this project, we recognized that we needed to refine the persistence layer in Rails. A critical consideration while building this new application is that data residing in our database today may transition to a service tomorrow. In this context, we might be the original authors of certain datasets, but the source of truth for that data could shift as the encompassing area gains broader application. Thus, rather than communicating exclusively with a database, we found ourselves interfacing with various APIs.

00:07:28.210 To address this, we determined that we did not wish to tie our architecture to any single protocol. Our goal became clear: we needed an architecture that distinctly separates business logic from implementation details and protocols. We soon realized that hexagonal architecture is the best fit for us, as it addresses many of our challenges. The premise of hexagonal architecture is to position inputs and outputs at the peripheries of our design.

00:08:09.600 Inputs, such as requests hitting our application through controllers, are clearly defined, while outputs involve data flowing into a persistence layer. By isolating these elements, we can protect the core logic—and the heart of our application—from external concerns. This modularity allows us to modify our API layer or persistence layer, or even parts of them, without altering our business logic.

00:08:55.180 Let us now delve into the core concepts we've integrated into our app. These concepts are not new or revolutionary; many may be familiar from hexagonal architecture or from the principles set forth by Uncle Bob in Clean Architecture. The heart of our domain is our entities, which at Netflix Studios include movies, productions, shooting locations, and employees. These entities, by design, have no awareness of where they are stored or whether they reside behind an API endpoint.

00:09:40.040 Next, we utilize repositories, which serve as interfaces for accessing and managing entities. Their methods typically return a single entity or a list of entities, and they orchestrate the communication with the various data sources. Data sources are adapters to different storage implementations. A data source might connect to a SQL database through an Active Record model, interface with Elasticsearch, or interact with a simple JSON file or array.

00:10:27.240 The last core concept is our interactors—classes responsible for orchestrating specific domain actions. These resemble what developers often refer to as use case objects or service objects. Interactors explicitly document their functions and encompass complex business and validation logic aligned to specific tasks. While this provides a theoretical foundation, let's now explore how we have implemented these concepts within our domain.

00:11:18.760 To begin, consider the 'Production' entity at Netflix Studios, which may represent a physical production such as a season of 'Black Mirror' or a movie project like 'Bird Box'. Each of these entities possesses an ID, title, and logline. We utilize the 'dry-struct' gem from the Dry Ruby ecosystem to define our entities, allowing us to state attributes and specify which are required and which are optional. Importantly, these entities remain oblivious to any persistence details, enabling us to model our domain without entanglement in data source implementations.
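A minimal sketch of such an entity is shown here. The talk defines entities with dry-struct; plain Ruby is swapped in so the example has no gem dependency, and treating `logline` as the optional attribute is an assumption.

```ruby
# Sketch of a persistence-agnostic entity. Plain Ruby replaces dry-struct
# here; which attributes are optional is an assumption for illustration.
class Production
  attr_reader :id, :title, :logline

  def initialize(id:, title:, logline: nil)
    # Required attributes: missing keywords raise ArgumentError on their own,
    # these checks additionally reject explicit nils.
    raise ArgumentError, "id is required" if id.nil?
    raise ArgumentError, "title is required" if title.nil?

    @id = id
    @title = title
    @logline = logline
    freeze # the entity is an immutable value with no storage knowledge
  end
end

# The logline text below is a placeholder, not the real one.
bird_box = Production.new(id: 1, title: "Bird Box", logline: "A placeholder logline.")
black_mirror = Production.new(id: 2, title: "Black Mirror") # logline omitted
```

Nothing in the class refers to a table, an endpoint, or a client, which is the property the entity layer is meant to preserve.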

00:12:14.880 For our repositories, we crafted a concise domain-specific language (DSL) spanning about 50 lines. This DSL describes our repositories, identifying the associated entity class and the default data sources. For example, a production repository manages a 'Production' entity and is injected with a movie production data source, which could be an adapter to a REST API or an Active Record model. These repositories encapsulate the methods for communicating with data sources and wrap the results into the defined entities.

00:13:02.660 A noteworthy aspect of our repository design is the injection of data sources, enabling us to swap them when needed. Each data source is required to conform to a shared interface. For instances where we connect to a SQL database, the data source functions primarily through Active Record logic—catering to operations like selections or aggregations—while the implementation details are kept separate from our business logic. Conversely, if we are interfacing with a REST API, we create a class specifically for fetching data via an appropriate REST client.
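The repository and data source relationship described above can be sketched roughly as follows. The team's actual DSL is not shown in the talk, so every class name, method name, and URL path here is hypothetical, and an in-memory array stands in for an Active Record model.

```ruby
# Hypothetical sketch: one repository, two interchangeable data sources.
Production = Struct.new(:id, :title, :logline, keyword_init: true)

# Stands in for a SQL-backed data source (e.g. one wrapping Active Record).
class InMemoryProductionDataSource
  def initialize(rows)
    @rows = rows
  end

  def find(id)
    @rows.find { |row| row[:id] == id }
  end

  def all
    @rows
  end
end

# Data source backed by a remote API. `client` is a placeholder for a
# generated API client: any object whose `get(path)` returns parsed hashes.
class ApiProductionDataSource
  def initialize(client)
    @client = client
  end

  def find(id)
    @client.get("/productions/#{id}") # path is an assumption
  end

  def all
    @client.get("/productions")
  end
end

# The repository only knows the shared interface (`find`, `all`) and wraps
# raw rows into entities, so callers never see where the data came from.
class ProductionRepository
  def initialize(data_source:)
    @data_source = data_source
  end

  def find(id)
    row = @data_source.find(id)
    row && to_entity(row)
  end

  def all
    @data_source.all.map { |row| to_entity(row) }
  end

  private

  def to_entity(row)
    Production.new(id: row[:id], title: row[:title], logline: row[:logline])
  end
end

rows = [{ id: 1, title: "Bird Box", logline: nil }]
repository = ProductionRepository.new(data_source: InMemoryProductionDataSource.new(rows))
```

Switching the source of truth then amounts to passing `ApiProductionDataSource.new(client)` as `data_source:`; the repository and everything above it stay unchanged.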

00:14:02.550 This independence from specific data storage systems grants our repositories significant flexibility. Because data sources can be swapped seamlessly, rolling back after we spot an issue requires changing only one line. This architecture keeps disruption minimal when we make such changes, whether we discover a problem during deployment or at any later point.

00:14:51.390 Now, let’s dive into how we develop our interactors. We utilize the Interactor gem, which standardizes whether an action is successful or has failed. Additionally, we have built a small DSL on top of it, specifying the arguments and dependencies the interactor requires. A primary action might involve onboarding a production into our system, defined by parameters such as production ID and vertical ID. This design completely decouples our logic from any persistence layers, which allows us not only to unit test our interactors thoroughly but also to define repositories in a manner that supports clear object-oriented principles.
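As a rough illustration, an interactor with injected dependencies might look like the sketch below. The talk uses the Interactor gem plus a small in-house DSL; this plain-Ruby version only mirrors the success/failure idea, and the `Result` struct, the `assign_vertical` method, and the in-memory repository are all assumptions.

```ruby
# Hypothetical result object standing in for the Interactor gem's context.
Result = Struct.new(:success, :value, :error, keyword_init: true) do
  def success?
    success
  end
end

# Orchestrates one domain action. It only talks to the injected repository,
# so it has no knowledge of databases or API protocols.
class OnboardProduction
  def initialize(production_repository:)
    @production_repository = production_repository
  end

  def call(production_id:, vertical_id:)
    production = @production_repository.find(production_id)
    return Result.new(success: false, error: :production_not_found) if production.nil?

    @production_repository.assign_vertical(production_id, vertical_id)
    Result.new(success: true, value: production)
  end
end

# Minimal in-memory repository so the sketch runs on its own.
class InMemoryRepository
  def initialize(productions)
    @productions = productions
  end

  def find(id)
    @productions[id]
  end

  def assign_vertical(production_id, vertical_id)
    @productions[production_id][:vertical_id] = vertical_id
  end
end

repository = InMemoryRepository.new({ 1 => { title: "Bird Box", vertical_id: nil } })
interactor = OnboardProduction.new(production_repository: repository)
result = interactor.call(production_id: 1, vertical_id: 42)
```

Callers branch on `result.success?` and read `result.error` on failure, which keeps controllers thin.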

00:15:41.300 It's important to note that although Rails does not inherently prioritize separating business logic from persistence, those interested in building applications on clean architecture principles can explore the Hanami framework. It follows those principles and uses the same terminology I've mentioned. We initially chose to work with Rails and Active Record for their extensive ecosystem of gems, covering needs such as access control lists and counter caches (counter_culture), which allowed us to deliver features rapidly. However, if we were to start this project now, we might seriously consider Hanami as a framework.

00:16:26.020 Having touched upon data sources, let’s delve deeper into implementing those that rely on external services, such as REST APIs and gRPC. Managing data across various APIs requires numerous API clients, and we have two primary approaches to tackle this issue. The first involves handcrafting API clients and packaging them into gems, but this approach often proves time-consuming, error-prone, and difficult to maintain at scale. Instead, we’ve opted for auto-generated API clients where each API we engage with can easily provide a manner for us to interface with it, whether through API specifications or useful documentation for client generation.

00:17:26.280 We typically communicate with three types of APIs: REST APIs, gRPC, and JSON APIs adhering to the JSON API specification. For REST APIs, we employ Netflix's Swagger documentation, using Swagger code generators to create clients and subsequently package them alongside other gems in our repository. For gRPC, we obtain protobuf files and generate clients accordingly, minimizing our maintenance burden. Implementing the JSON API specification simplifies our processes, as we can utilize generic clients to retrieve diverse resources directly.

00:18:32.190 Adopting this model means we face some complexities. Primarily, as we depend on a microservices architecture, we must prepare for potential failures in our downstream services. It’s essential to recognize that the network is inherently unreliable. With a single database there is only one dependency that can fail; in a microservices environment there are many, and interruptions will occur regularly. Consequently, embracing graceful failure becomes vital. We need to ensure that a single microservice outage does not bring down our entire application and that we can still serve the data that remains available to our users.
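One way to picture graceful failure is the sketch below, which builds a page from several downstream sources and lets a failing source degrade only its own section. The `DownstreamError` class, the `fetch_sections` helper, and the section names are hypothetical; a real client would raise its own error types.

```ruby
require "timeout"

# Hypothetical error raised by an API client when a downstream call fails.
class DownstreamError < StandardError; end

# Calls each fetcher in turn. A failing source yields nil for its section
# and a recorded error message instead of failing the whole page.
def fetch_sections(sources)
  sections = {}
  errors = {}
  sources.each do |name, fetcher|
    begin
      sections[name] = fetcher.call
    rescue DownstreamError, Timeout::Error => e
      sections[name] = nil
      errors[name] = e.message
    end
  end
  { sections: sections, errors: errors }
end

sources = {
  production: -> { { title: "Bird Box" } },
  locations: -> { raise DownstreamError, "locations service unavailable" }
}
page = fetch_sections(sources)
```

Only the expected failure types are rescued, so programming errors in our own code still surface instead of being silently swallowed.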

00:19:47.540 Moreover, effective error troubleshooting is imperative, particularly in scenarios where 500 errors arise in a downstream service. Without adequate logging, identifying the root cause can become quite challenging. Measurement also plays a critical role; without it, you cannot accurately diagnose pain points within applications. Maintaining metrics is essential, even straightforward ones such as success-to-failure ratios when interfacing with microservices, as well as understanding response times and latencies. Implementing error reporting systems, akin to Sentry or Airbrake, is crucial alongside careful consideration of error classifications.

00:20:25.800 It is just as important that we do not drown ourselves in noise from unimportant errors. Alerts should only stem from issues that are actionable on our side. For example, if a downstream service returns validation errors, our team shouldn’t get an alert; it is that service's responsibility to report them to us accurately. Alerting on non-critical errors leads to alarm fatigue, which undermines our ability to respond when something genuinely needs attention.

00:21:32.000 To illustrate, consider a hospital emergency room where countless alarms emit alerts regarding patients' vital signs. Because so many of those alarms are unnecessary, staff can become desensitized and may miss the genuine alerts when urgent situations arise. The same applies to software teams; we must ensure that our telemetry tools notify us only when significant events occur, especially those affecting our core functionality. Initially, we alerted on everything, including non-actionable errors, but over time we refined our alerting and now focus primarily on thresholds, such as when 20% of calls to a microservice begin failing.
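A threshold-based alert of that kind could be sketched as follows. The 20% threshold comes from the talk; the class name, the sliding window of 100 calls, and the minimum of 10 samples are assumptions chosen for illustration.

```ruby
# Tracks the outcomes of the most recent calls to one downstream service
# and reports whether the failure ratio has crossed a threshold.
class FailureRateMonitor
  def initialize(threshold: 0.2, window: 100, min_samples: 10)
    @threshold = threshold
    @window = window
    @min_samples = min_samples
    @outcomes = [] # true for success, false for failure
  end

  def record(success)
    @outcomes << (success ? true : false)
    @outcomes.shift if @outcomes.size > @window # keep only the latest calls
  end

  def failure_rate
    return 0.0 if @outcomes.empty?

    @outcomes.count(false).fdiv(@outcomes.size)
  end

  # Alert only with enough samples, so one early failure is not noise.
  def alert?
    @outcomes.size >= @min_samples && failure_rate >= @threshold
  end
end

monitor = FailureRateMonitor.new
8.times { monitor.record(true) }
2.times { monitor.record(false) }
```

With eight successes and two failures recorded, the failure rate is 2 out of 10, which meets the 20% threshold.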

00:22:27.030 Many might question why we extensively utilize external APIs and whether this practice scales well, considering potential delays due to added network calls. Before answering, it’s helpful to consider an alternative architecture strategy. For instance, we could store the necessary data from other services within our database and refresh these copies based on event changes, leveraging eventual consistency for scalability. However, this approach requires significant consideration of how it trades consistency for availability, where the latter often prevails.

00:23:57.960 To clarify this further, let’s look at Amazon’s shopping cart. For them, maintaining availability is essential; they can afford to issue refunds or discount coupons when errors occur. However, our context within Netflix is different. Our applications have fewer users, and the data we display directly influences their decisions regarding production processes. Hence, ensuring the accuracy of the data we present is crucial, often outweighing the need for caching, particularly if inaccuracies could have significant downstream consequences.

00:25:41.010 At Netflix Studios, it is vital that our services remain reliable, highly available, and responsive. If we encounter any sluggish services, we collaborate closely with relevant teams for prompt resolution. Historically, our services have maintained robust performance. Where possible, we aim to avoid synchronization within our applications. While some projects employ eventual consistency, such as syncing data via events, they tackle specialized cases distinct from our broad needs. Our applications, which engage with more than 40 different endpoints from various services, would face overwhelming complexity if we attempted to synchronize all this data.

00:26:35.930 Consider the potential future challenges we might face, particularly if we begin working with substantially larger datasets or need additional data decoration. In such scenarios, we may opt to maintain our internal copies for efficiency. The beauty of our architecture lies in its flexibility, allowing us to adapt as requirements evolve. Hexagonal architecture efficiently fosters this adaptability, as it enables us to delay vital decisions until we possess a clearer understanding of the underlying domain.

00:27:56.280 The project paradox is worth noting: we make the most critical project decisions when our knowledge of the domain is at its lowest. At the outset of a project, decisions on database types, storage mechanisms, and pub-sub implementations are made while our understanding is still shallow. As the project advances, our knowledge increases, but the range of decisions still open to us narrows because of those earlier choices. As Uncle Bob reminds us, a good architecture lets you postpone decisions, so that you have more information when they ultimately need to be made.

00:28:36.150 Currently, we use PostgreSQL for data storage. If we determine in the future that this setup no longer meets our requirements, the transition to a different solution, such as Cassandra or another NoSQL database, will not pose a challenge. Thanks to our architectural choices, we are not tightly coupled to PostgreSQL. The same principle applies if we require alternate data storage solutions or enhanced search capabilities; our architecture accommodates such transitions seamlessly.

00:29:38.750 Before concluding, I want to briefly address our testing strategy. We strongly believe that tests must be both reliable and quick; it's a necessity, not merely a luxury. Nobody enjoys running lengthy test suites solely on CI servers. Our goal is to have fast tests that facilitate efficient development workflows. Testing Rails applications often leads to frequent database calls, which can slow processes significantly. Therefore, we strive to conduct our business logic testing independently from data source implementations.

00:30:29.230 We leverage dependency injection to effectively unit test our interactors, which encapsulate our core business logic. This allows us to utilize verified doubles, isolating our core code from the persistence layer. For illustration, consider our onboarding interactor, which accepts a production ID, vertical ID, and repository. We generate these IDs independently of existing data, thus avoiding any direct database interaction. For error scenarios, we can create a verified double for our repository and instruct it to respond appropriately.
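The idea can be sketched without any test framework. The talk uses verified doubles (an RSpec feature); a hand-written fake is swapped in here so the example has no gem dependency. The interactor is restated in compact form so the block runs on its own, and the use of UUIDs for IDs is an assumption.

```ruby
require "securerandom"

# Compact interactor, restated so this example is self-contained.
class OnboardProduction
  def initialize(production_repository:)
    @production_repository = production_repository
  end

  # Returns :ok, or :not_found when the repository has no such production.
  def call(production_id:, vertical_id:)
    return :not_found if @production_repository.find(production_id).nil?

    @production_repository.assign_vertical(production_id, vertical_id)
    :ok
  end
end

# Hand-written double for the repository. It records calls and touches
# neither a database nor the network.
class FakeProductionRepository
  attr_reader :assignments

  def initialize(known_ids)
    @known_ids = known_ids
    @assignments = []
  end

  def find(id)
    @known_ids.include?(id) ? { id: id } : nil
  end

  def assign_vertical(production_id, vertical_id)
    @assignments << [production_id, vertical_id]
  end
end

# IDs are generated on the spot rather than read from existing data.
production_id = SecureRandom.uuid
vertical_id = SecureRandom.uuid
repository = FakeProductionRepository.new([production_id])
interactor = OnboardProduction.new(production_repository: repository)

# Happy path: the interactor succeeds and records exactly one assignment.
raise "expected :ok" unless interactor.call(production_id: production_id, vertical_id: vertical_id) == :ok
raise "expected one assignment" unless repository.assignments == [[production_id, vertical_id]]

# Error path: an unknown production is reported, with no new side effects.
raise "expected :not_found" unless interactor.call(production_id: "unknown", vertical_id: vertical_id) == :not_found
raise "unexpected assignment" unless repository.assignments.size == 1
```

A hand-written fake does not check itself against the real repository's interface the way a verified double does, so the two can drift apart unless the fake is covered by the same contract tests.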

00:31:25.000 Integration tests operate at two layers. The first focuses on the data sources, verifying the accurate integration with various services. The second layer encompasses full stack end-to-end tests, covering requests from controllers, through services, repositories, and out to external systems. For external service calls, we use VCR to record interactions, ensuring that our tests do not repeatedly invoke live services each time—only mimicking successful and failed calls as necessary.

00:32:01.680 This approach leads to fast and dependable tests. Our most recent testing results indicate that within one minute and 42 seconds, we completed nearly three thousand examples. This efficiency facilitates usability for our entire team, allowing developers to seamlessly run tests without hindering their daily tasks. Fortunately, we have also identified areas for further optimization, so we hope to enhance our processes in the future.

00:33:02.680 In conclusion, the key takeaway from this presentation is the significance of separating business logic from implementation details. This practice has proven immensely valuable in a microservices environment, where hexagonal architecture has become an excellent approach for tackling challenges. It compels us to think about the relationship between different layers, which is vital for maintaining clarity as our systems grow and evolve. If there is one message to emphasize, it is that delaying decisions is prudent; committing to choices based solely on limited knowledge of the project domain rarely yields favorable outcomes.

00:33:49.280 The decisions we have made to date have served us well, enabling rapid progress while preserving flexibility for future adjustments. Thank you all for your attention. My name is Damir Svrtan, and I hope you gained some interesting insights from this presentation. This topic just scratches the surface, so I’m more than happy to answer any questions you may have.

00:34:51.960 Thank you.