rubyday 2021

API Optimization Tale: Monitor, Fix and Deploy (on Friday)

by Maciek Rzasa

This video, titled "API Optimization Tale: Monitor, Fix and Deploy", features Maciek Rzasa, who shares his experiences and lessons learned from a complex API optimization project undertaken at Toptal. The main theme centers around the challenges of API deployment and the importance of monitoring and optimizing the system effectively, especially when pushing changes to production environments.

Key points discussed include:

- Initial Scenario: Rzasa faced anxiety upon seeing a green build on a Friday afternoon and the decision to push it to production despite recent experiences of breaking production.

- Service Extraction: Toptal's billing team initiated a project to extract billing code from a monolithic Rails application to create a service-based architecture while maintaining operational integrity.

- Incremental Approach: The team used feature flags and fallback mechanisms to safely roll out the new features, evaluating the API’s functioning before full deployment.

- Transition to GraphQL: Midway through the project, the team decided to switch from REST to GraphQL to enhance flexibility and performance, introducing further monitoring measures.

- Optimization Strategies: They utilized various techniques including preloading, caching, and joining to tackle performance bottlenecks, especially in relation to excessive API requests.

- Challenges Encountered: Missteps were acknowledged, such as implementing overly aggressive parameter sanitization that caused significant performance issues, showcasing the real risks involved with late Friday deployments.

- Post-Deployment Reflection: Rzasa emphasized the lessons learned from mistakes, the importance of good monitoring practices, and how these experiences contribute to refining development processes.

Concluding takeaways include the necessity for a safe deployment environment, rigorous testing, and an iterative approach to learning from failures. The talk highlights the evolution of the team's practices toward more effective and reliable API management, aiming to encourage other developers to embrace challenges and learn from their experiences with production environments.

00:00:00.120 Oh my gosh.
00:01:03.140 There he is. Our first speaker after the keynote.
00:01:08.400 For today, and I’m going to butcher your name, I heard the recording, but I hope I’m going to say this right: Maciek Rząsa.
00:01:14.100 Sorry, uh, his interests are as varied as mine. Looking at what you sent us, going from Napoleonic Wars to distributed systems.
00:01:20.520 Today, he will tell us an enthralling tale about API deployments on Fridays.
00:01:26.640 So, welcome Maciek! Thank you for joining us!
00:01:35.040 Hello! I’m really happy to be here.
00:01:41.640 We’re super happy to have you here. I think I’ll need to leave now so that you can start. See you later!
00:01:48.540 Hello, I’m Maciek. It was a Friday evening, around 4 P.M., when I saw a green build on our Jenkins pipeline.
00:01:55.259 I knew that I needed to push it to production before the weekend so that by Monday, I could see if the bug was fixed. At the same time, I had a very strong gut feeling that it was a trap I was setting up for myself.
00:02:15.239 It wasn’t a superstition about deployments or kittens on Fridays. Just a couple of weeks ago, I broke production and had to stay late to fix it.
00:02:26.879 Believe me or not, after work, I had way better things to do than fixing broken production, so I knew the risk.
00:02:39.959 But why was it so important to ship something so late on a Friday? What was the project about, and what gave me the confidence to deploy on a Friday afternoon?
00:02:58.080 Last but not least, what was I thinking to test my changes in production? I’ll answer all these questions during the presentation, but first, let’s take a few steps back.
00:03:11.819 I’m a backend engineer at Toptal. At the heart of Toptal lies the Toptal platform, a Rails application of over a million lines of code.
00:03:25.319 We have hundreds of engineers developing this application daily, investing millions of hours into it. It’s a monolith, and yes, it’s majestic.
00:03:40.120 But this monolith has its problems. I won’t dwell on them, as they’ve been well described elsewhere. The point is that we're trying to adopt a more service-based architecture.
00:03:55.980 The team I work with, the Billing Extraction Team, is at the spearhead of this effort. We aim to wrap the entire billing code and deploy it as a separate service.
00:04:06.900 We have to do this without disrupting business operations and without adding too much pain for our colleagues in other teams. So, what we did was adopt an incremental approach.
00:04:26.759 Initially, we wrapped the billing code in a Rails engine and relied on direct method calls to use it from the platform. Later, we wanted to check how the real external communication worked.
00:04:59.160 So, we implemented an asynchronous API using Kafka. It's interesting, but I won't focus on that here.
00:05:05.720 The direct communication through synchronous calls was established using HTTP. That's the topic of this presentation. To make the change safe and avoid disruption, we added two safety features.
00:05:23.699 The first was a feature flag that allowed us to enable or disable external API calls without deployment, for any fraction of traffic desired.
00:05:34.740 The second was a fallback mechanism: if an HTTP call returned an error, we would fall back to a direct call to ensure correct data was returned.
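
A minimal sketch of how a percentage-based feature flag and a fallback could be combined. The names `BillingGateway`, `RemoteError`, and `rollout_percentage` are hypothetical and do not come from the talk; the remote and local calls are passed in as plain callables so the example runs on its own.

```ruby
# Hypothetical sketch: class and parameter names are illustrative only.
class BillingGateway
  # Stand-in for whatever error the real HTTP client raises.
  class RemoteError < StandardError; end

  # rollout_percentage: share of calls (0..100) routed to the external API.
  # remote / local: callables returning the same data over HTTP or in-process.
  def initialize(rollout_percentage:, remote:, local:, random: Random.new)
    @rollout_percentage = rollout_percentage
    @remote = remote
    @local = local
    @random = random
  end

  def fetch(*args)
    # Flag disabled for this call: use the direct method call.
    return @local.call(*args) unless external_call_enabled?

    begin
      @remote.call(*args)
    rescue RemoteError
      # Fallback: the request costs more time, but the data is still correct.
      @local.call(*args)
    end
  end

  private

  # rand(100) yields 0..99, so 0 disables and 100 enables all traffic.
  def external_call_enabled?
    @random.rand(100) < @rollout_percentage
  end
end
```

In a real system the percentage would be read from a flag store at call time, so it can change without a deployment; here it is fixed at construction to keep the sketch short.
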
00:05:53.600 The requests to billing were slower, but they were still correct even if an HTTP error occurred. Let’s consider how the extraction looked on the code level. We will focus on two classes: Product and BillingRecord.
00:06:11.100 Before extraction, Product had many BillingRecords, which belonged to a Product. However, after extraction, we had to remove those ActiveRecord associations while still keeping the relevant data associated.
00:06:30.060 We employed a pattern that could be called 'Active Object.' We still had a Product that was an ActiveRecord model, while BillingRecord became a plain Ruby object. They were associated using normal method calls.
00:06:53.819 In Product, this method called a BillingQueryService, which wrapped all the logic related to external calls, the feature flags, the actual networking code, and more.
00:07:07.919 When we wanted to access product attributes, we queried the database as necessary.
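
The shape described here could look roughly like the sketch below. It is an assumption-laden illustration: in the real application `Product` is an ActiveRecord model, while here it is a plain class so the example is self-contained, and the client interface (`billing_records(product_ids:)`) is hypothetical.

```ruby
# BillingRecord is a plain Ruby object, not an ActiveRecord model.
BillingRecord = Struct.new(:id, :product_id, :amount, keyword_init: true)

# Wraps everything related to the external call. In the talk this also
# covers feature flags and networking; here it only delegates to a client.
class BillingQueryService
  # `client` is an assumed object responding to #billing_records(product_ids:)
  # and returning an array of attribute hashes.
  def initialize(client)
    @client = client
  end

  def records_for_product(product_id)
    @client.billing_records(product_ids: [product_id])
           .map { |attrs| BillingRecord.new(**attrs) }
  end
end

# Stand-in for the ActiveRecord model. The former has_many association
# becomes an ordinary method that hides a remote call.
class Product
  attr_reader :id

  def initialize(id:, billing_query_service:)
    @id = id
    @billing_query_service = billing_query_service
  end

  def billing_records
    @billing_query_service.records_for_product(id)
  end
end
```

The call site still reads like `product.billing_records`, which is convenient but also makes it easy to forget that each call crosses a service boundary.
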
00:07:14.580 We started the extraction process by creating a safe deployment environment. Besides the standard CI checks, we ensured easy deployment and reliable rollback.
00:07:28.380 We added two significant safety features—the fallback and the feature flags—to allow us to track how our REST API was functioning. We deployed and enabled it for a fraction of the traffic and yes, we failed miserably.
00:08:06.900 The performance was abysmal, and we couldn’t state that it was stable. We knew we had to optimize it, and that’s when the real story begins.
00:08:19.500 What would any developer do? I would typically start improving the REST code. Thankfully, my colleagues were smarter, and we decided we needed something different—some more flexibility.
00:08:37.979 Thus, in the middle of this radical refactoring, we made a significant decision to replace REST with GraphQL. We introduced another layer of feature flags to enable a gradual switch between REST and GraphQL, allowing us to toggle between the correct REST version and a more performant GraphQL version consistently.
00:09:07.080 We also added fallbacks to the GraphQL requests to ensure continued functionality, and with that, we felt ready to begin.
00:09:20.160 Not exactly. To avoid flying blind, we included custom request instrumentation. This meant that every request sent to billing was logged, capturing vital details such as method name, queried arguments, stack traces, and response times.
00:09:38.220 This logging allowed us to pinpoint frequently used methods to prioritize them for optimization. We monitored the slow methods by recording response times, and we could reconstruct requests locally thanks to the recorded arguments.
00:10:01.560 Moreover, we could determine which parts of the application generated these slow or numerous calls, thanks to the stack traces.
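
A rough sketch of request instrumentation along these lines, using only the Ruby standard library. The class name `BillingInstrumentation` and the JSON field names are assumptions made for illustration; the talk does not say how the log entries were structured.

```ruby
require "json"
require "logger"

# Hypothetical wrapper that logs one structured entry per billing request.
class BillingInstrumentation
  def initialize(logger)
    @logger = logger
  end

  # Runs the block, then logs method name, arguments, duration and a
  # shortened stack trace. The block's return value is passed through.
  def measure(method_name, arguments)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    entry = {
      method: method_name,                 # which billing method was called
      arguments: arguments,                # lets us replay the request locally
      duration_ms: (elapsed * 1000).round(2),
      stack: caller(1, 5)                  # who triggered the call
    }
    @logger.info(JSON.generate(entry))
  end
end
```

Usage would be something like `instrumentation.measure("billing_records", { product_ids: [1, 2] }) { client.call }`, where `client` is whatever object performs the request.
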
00:10:21.959 With this strong foundation, we were ready to dive into real optimization work, having established solid monitoring and a safe deployment environment.
00:10:34.980 We began a furious series of optimizations, starting with the most notorious issue: jobs generating excessive requests that were bringing down billing.
00:10:49.079 We were facing thousands of requests; it was unbearable. The problem was with this Active Object pattern—it was easy to overlook that we were accessing billing.
00:11:01.320 If a job iterated over N products while calling business logic on each, the underlying logic used billing records, resulting in N queries to the billing service.
00:11:21.720 To manage this, we employed three strategies. First, we moved the data-loading logic out of the product class for this specific case.
00:11:38.640 This was done by loading billing records for batches of products instead of single requests. We grouped products by billing identifier to achieve this.
00:11:52.620 With an indexed hash, we loaded records once per batch, which led to drastically reducing the number of queries—from thousands to dozens.
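
The batching idea can be sketched as follows. The preloader class, the batch size, and the client interface (`billing_records(product_ids:)` returning hashes with a `:product_id` key) are all assumptions for the sake of a runnable example.

```ruby
# Hypothetical preloader: one request per batch instead of one per product.
class BillingRecordPreloader
  def initialize(client, batch_size: 100)
    @client = client
    @batch_size = batch_size
  end

  # Returns a hash indexed by product id: { product_id => [records] }.
  def call(product_ids)
    index = Hash.new { |hash, key| hash[key] = [] }

    product_ids.each_slice(@batch_size) do |batch|
      # A single remote call covers the whole batch.
      @client.billing_records(product_ids: batch).each do |record|
        index[record[:product_id]] << record
      end
    end

    index
  end
end
```

A job iterating over products can then look records up in the returned hash instead of calling the billing service for each product. With 250 products and a batch size of 100, this sketch makes 3 requests rather than 250.
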
00:12:06.300 What we learned was that when you face a flood of requests that resembles N+1 queries, you need to implement preloading, caching, and joining. It’s simpler than it sounds.
00:12:41.579 Next, we efficiently applied caching.
00:12:52.560 However, we faced limitations where we only needed a single field for entire business logic packages. We still had an instance of over-fetching.
00:13:11.760 There was a specific field called several hundred times a day; during peak times, it was too much. We sought a better solution.
00:13:28.800 Our idea was to add this field to Kafka events, modifying data sent by the billing service. This way, we could create a consumer-side projection and read this modified model.
00:13:40.560 I was very excited to implement this, even considering speaking on it at conferences. However, a knowledgeable co-worker challenged us to explore local database alternatives.
00:14:02.640 Though I initially resisted as I wanted an innovative solution, upon investigating, we found the data was already present in our local database.
00:14:12.240 We switched from external queries to simple database queries and resolved the issue swiftly in just a few days instead of weeks, creating a more maintainable solution.
00:14:24.000 What we learned was that while clever technical tricks can win optimization battles, domain knowledge can help us avoid those battles altogether.
00:14:40.500 Now, we faced new challenges with large response sizes slowing down the entire billing process. Every response contained every possible field and association, so far more data was being sent than any caller needed.
00:14:52.680 Initially, we aimed for a minimal API with generic responses. However, we ended up returning around 40 fields, creating severe payload issues.
00:15:02.760 The situation improved once we switched to GraphQL. Custom queries allowed us to pull only the necessary fields, filtering server-side.
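
To illustrate the difference from a fixed response with around 40 fields, here is a small sketch that builds a GraphQL query selecting only the fields a caller asks for. The schema (`billingRecords`, `productIds`) and the helper name are hypothetical; the talk does not show the actual schema.

```ruby
# Builds a query whose selection set contains only the requested fields.
# The operation and field names below are assumptions, not the real schema.
def billing_records_query(fields)
  raise ArgumentError, "select at least one field" if fields.empty?

  <<~GRAPHQL
    query BillingRecords($productIds: [ID!]!) {
      billingRecords(productIds: $productIds) {
        #{fields.join(' ')}
      }
    }
  GRAPHQL
end
```

Calling `billing_records_query(%w[id amount])` produces a query that asks the server for just `id` and `amount`, so the filtering happens server-side and the payload shrinks accordingly.
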
00:15:20.880 Our feedback loop was refining: monitoring, prioritizing work, optimizing, and deploying with results confirming enhancements.
00:15:35.880 On July 28th, a hot summer day, I deployed my changes and enabled feature flags. Initially, users reported issues with Sidekiq workers, and I didn't realize my changes were involved.
00:15:50.760 CPU and memory usage skyrocketed to unforeseen levels due to my modifications. Postgres queries began taking minutes.
00:16:08.640 When I realized it was my build, I felt immediate dread. However, infrastructure resolved the problem efficiently without me even needing to step in.
00:16:24.480 The next day, my manager assured me they were working on fixes and congratulated me on the effort while emphasizing it's okay to make mistakes as long as we learn.
00:16:43.860 What went wrong? I added parameter sanitization and inadvertently sanitized too much, letting only one parameter through.
00:17:03.120 The method was overly optimistic: with no identifiers left to filter on, it fetched far too many billing records.
00:17:20.579 The fix involved adjusting parameter sanitization and adding a safety check to ensure no request would fetch all records if no specific identifier was passed.
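
A minimal sketch of such a guard, assuming hypothetical filter names (`product_ids`, `client_ids`); the real parameter names are not given in the talk. The point is that sanitization which strips every identifier should fail loudly instead of silently turning into an unfiltered query.

```ruby
# Hypothetical filter sanitizer with a safety check.
class BillingRecordsFilter
  # Assumed names of the identifiers that may be used for filtering.
  ALLOWED_KEYS = %i[product_ids client_ids].freeze

  class MissingIdentifierError < ArgumentError; end

  # Keeps only allowed, non-empty filters and refuses to continue
  # when nothing is left to filter on.
  def self.sanitize(params)
    filters = params.slice(*ALLOWED_KEYS)
                    .reject { |_key, value| Array(value).empty? }

    if filters.empty?
      raise MissingIdentifierError, "refusing to fetch all billing records"
    end

    filters
  end
end
```

With this in place, `BillingRecordsFilter.sanitize({ page: 2 })` raises instead of returning an empty filter set.
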
00:17:38.680 This issue was not caught by unit tests, which only covered single parameters and missed the multi-parameter scenarios because of how the feature flags were set up.
00:18:04.800 We kept optimizing as we routed more traffic through external calls, and we were left with errors that appeared only on Sundays.
00:18:23.140 Errors appeared as a flood because scheduled reminders sent on Sundays coincided with these queries.
00:18:40.740 In response, we spread notifications over a two-minute window to avoid overwhelming billing, deploying accordingly, yet the errors persisted.
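
One way to spread jobs over a window is to compute an evenly spaced delay per job, sketched below in plain Ruby. The helper name is hypothetical and the talk does not say how the spreading was implemented; with Sidekiq, each delay could be passed to a worker's `perform_in`.

```ruby
# Returns one delay (in seconds) per job, evenly spaced across the window.
# Hypothetical helper; the 120-second default mirrors the two-minute window.
def spread_delays(count, window_seconds: 120)
  return [] if count.zero?

  step = window_seconds.to_f / count
  Array.new(count) { |i| (i * step).round(2) }
end
```

For example, four reminders over the default window would be scheduled 0, 30, 60, and 90 seconds from now.
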
00:19:02.760 The root cause was determined: Sidekiq was executing with lower frequency than anticipated due to settings.
00:19:29.460 Implementing proper limiters allowed us to manage loads correctly, and upon waiting, we had no further errors.
00:19:44.520 We learned valuable lessons over the month due to our safety mechanisms in place. I saw a green build on a Friday, decided to deploy, and was indeed safe on the weekend.
00:20:08.640 Both of my deployments at the beginning and end were executed safely.
00:20:29.640 This triad of monitoring, optimizing, and deploying iteratively helped refine our production API.
00:20:43.560 When we finally enabled external calls fully, it was entirely unremarkable, without concerns for safety.
00:21:01.260 We achieved this through good visibility from monitoring, reliable deployment mechanisms, and various optimization strategies.
00:21:18.640 As you look at these methods, you may be familiar with them already. They aren’t new tricks in the microservice world.
00:21:37.080 However, it's essential to understand why we didn't implement these practices from the start.
00:21:55.920 Simplicity was paramount; we started with a basic solution, implementing minimal generic APIs.
00:22:14.280 Learning shaped our process, because identifying performance issues proved difficult without production traffic to exercise realistic scenarios.
00:22:30.360 Lastly, navigating trade-offs in projects takes tact, which involves political and interpersonal dynamics.
00:22:49.440 In sharing our experiences, mistakes do help others avoid retreading those paths. Go back to work, make mistakes, and join us to share your learnings.
00:23:11.520 For environments safe enough for experimentation, consider the feelings associated with deploying on Friday afternoons. Thank you!
00:23:44.039 Yeah, it felt like a roller coaster for us as well.
00:23:56.219 I really appreciate your honesty about failures. They help everyone see that things don’t always have to be perfect.
00:24:11.160 That’s absolutely right. Without failure, we don't push ourselves to grow.
00:24:27.480 Is there anything else you would have done differently regarding the project?
00:24:41.640 Given endless time, I would try to replicate our infrastructure locally to handle production-like traffic effectively, allowing for quicker resolution of issues.
00:25:15.720 Our various environments experienced different issues that occasionally resulted in errors surfaced by the proxy rather than the actual application.
00:25:30.960 However, due to time constraints, we weren't able to implement this practice.
00:25:46.680 That's a common problem, isn't it? Do you have a personal habit or practice that helps when you're under pressure?
00:26:01.680 I've learned it’s essential to maintain documentation for the group, ensuring that specific individuals are responsible for its upkeep.
00:26:22.920 We’ve also been sharing our learnings through blog posts, in an effort to educate.
00:26:34.200 We do have a structured RFC process for significant architectural changes.
00:26:46.920 Collecting feedback is vital before promoting changes, particularly if they have lasting impacts.
00:27:05.640 Do you think dev teams should have a systematic approach to learning?
00:27:25.280 Yes, it's vital that teams implement lessons learned via post-mortems after significant incidents.
00:27:41.520 This process aims to understand what went wrong, how we can improve, and record that information optimally.
00:28:02.840 Have you as a team identified key success areas yet?
00:28:14.680 Definitely. We’ve worked on outlining principles for our coding standards, automation practices, and the broader goals of our tech stack.
00:28:29.040 Sharing in this way allows us all to learn and grow from each other's experiences.
00:28:40.640 I don’t see any further questions at the moment.
00:28:57.600 Well, best of luck to you, and I hope you can join the next session.