00:01:03.140
There he is. Our first speaker after the keynote.
00:01:08.400
For today, and I’m going to butcher your name, I heard the recording, but I hope I’m going to say this right—Maciej Rząsa.
00:01:14.100
Sorry, uh, his interests are as varied as mine. Looking at what you sent us, going from Napoleonic Wars to distributed systems.
00:01:20.520
Today, he will tell us an enthralling tale about API deployments on Fridays.
00:01:26.640
So, welcome Maciej! Thank you for joining us!
00:01:35.040
Hello! I’m really happy to be here.
00:01:41.640
We’re super happy to have you here. I think I’ll need to leave now so that you can start. See you later!
00:01:48.540
Hello, I’m Maciej. It was a Friday afternoon, around 4 P.M., when I saw a green build on our Jenkins pipeline.
00:01:55.259
I knew that I needed to push it to production before the weekend so that by Monday, I could see whether the bug was fixed. At the same time, I had a very strong gut feeling that I was setting up a trap for myself.
00:02:15.239
It wasn’t a superstition about deployments or kittens on Fridays. Just a couple of weeks ago, I broke production and had to stay late to fix it.
00:02:26.879
Believe it or not, after work I had way better things to do than fix a broken production, so I knew the risk.
00:02:39.959
But why was it so important to ship something so late on a Friday? What was the project about, and what gave me the confidence to deploy on a Friday afternoon?
00:02:58.080
Last but not least, what was I thinking to test my changes in production? I’ll answer all these questions during the presentation, but first, let’s take a few steps back.
00:03:11.819
I’m a backend engineer at Toptal. At the heart of Toptal lies the Toptal Platform—a Rails application of over a million lines of code.
00:03:25.319
We have hundreds of engineers developing this application daily, investing millions of hours into it. It’s a monolith, and yes, it’s majestic.
00:03:40.120
But this monolith has its problems. I won't dwell on them, as they've been well described elsewhere. The point is that we're trying to adopt a more service-oriented architecture.
00:03:55.980
The team I work with, the Billing Extraction Team, is at the spearhead of this effort. We aim to wrap the entire billing code and deploy it as a separate service.
00:04:06.900
We have to do this without disrupting business operations and without adding too much pain for our colleagues in other teams. So, what we did was adopt an incremental approach.
00:04:26.759
Initially, we wrapped the billing code in a Rails engine and relied on direct method calls to use it from the platform. Later, we wanted to check how the real external communication worked.
00:04:59.160
So, we implemented an asynchronous API using Kafka. It's interesting, but I won't focus on that here.
00:05:05.720
The direct communication through synchronous calls was established using HTTP. That's the topic of this presentation. To make the change safe and avoid disruption, we added two safety features.
00:05:23.699
The first was a feature flag that allowed us to enable or disable external API calls without deployment, for any fraction of traffic desired.
00:05:34.740
The second was a fallback mechanism: if an HTTP call returned an error, we would fall back to a direct call to ensure correct data was returned.
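The two safety features together can be sketched in plain Ruby. This is a hypothetical illustration, not Toptal's actual code; `FeatureFlag`, `BillingQuery`, and the client objects are made-up names standing in for the real mechanism.

```ruby
class FeatureFlag
  def initialize(fraction)
    @fraction = fraction # 0.0 disables; 1.0 enables for all traffic
  end

  def enabled?
    rand < @fraction
  end
end

class BillingQuery
  class HttpError < StandardError; end

  def initialize(flag:, http_client:, direct_client:)
    @flag = flag
    @http = http_client
    @direct = direct_client
  end

  def billing_records(product_id)
    # Feature flag: skip the external call entirely when disabled.
    return @direct.call(product_id) unless @flag.enabled?

    @http.call(product_id)
  rescue HttpError
    # Fallback: slower, but the caller still gets correct data.
    @direct.call(product_id)
  end
end

# Usage: the HTTP path always fails here, so the fallback answers.
flag = FeatureFlag.new(1.0)
failing_http = ->(_id) { raise BillingQuery::HttpError }
direct = ->(id) { [{ product_id: id, amount: 100 }] }
query = BillingQuery.new(flag: flag, http_client: failing_http, direct_client: direct)
puts query.billing_records(42).first[:amount] # => 100
```

The key property is that a networking failure degrades performance, never correctness.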
00:05:53.600
The requests to billing were slower, but they were still correct even if an HTTP error occurred. Let’s consider how the extraction looked on the code level. We will focus on two classes: Product and BillingRecord.
00:06:11.100
Before extraction, Product had many BillingRecords, which belonged to a Product. However, after extraction, we had to remove those ActiveRecord associations while still keeping the relevant data associated.
00:06:30.060
We employed a pattern that could be called 'Active Object.' We still had a Product that was an ActiveRecord model, while BillingRecord became a plain Ruby object. They were associated using normal method calls.
00:06:53.819
In Product, this method called a BillingQueryService, which wrapped all the logic related to external calls, the feature flags, the actual networking code, and more.
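A minimal sketch of that 'Active Object' shape, with hypothetical names (the real `BillingQueryService` wraps feature flags and networking; here it returns canned data):

```ruby
# BillingRecord is now a plain Ruby object, not an ActiveRecord model.
BillingRecord = Struct.new(:id, :amount, :product_id)

class BillingQueryService
  # Stand-in for the real service that performs the external call.
  def records_for(product_id)
    [BillingRecord.new(1, 250, product_id)]
  end
end

class Product
  attr_reader :id

  def initialize(id, query_service: BillingQueryService.new)
    @id = id
    @query_service = query_service
  end

  # Replaces `has_many :billing_records` with an ordinary method call
  # that delegates to the query service.
  def billing_records
    @query_service.records_for(id)
  end
end

product = Product.new(7)
puts product.billing_records.first.amount # => 250
```

From the caller's point of view the association still looks like a method on `Product`; only the data source changed.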
00:07:07.919
When we wanted to access product attributes, we queried the database as necessary.
00:07:14.580
We started the extraction process by creating a safe deployment environment. Besides the standard CI checks, we ensured easy deployment and reliable rollback.
00:07:28.380
We added two significant safety features—the fallback and the feature flags—to allow us to track how our REST API was functioning. We deployed and enabled it for a fraction of the traffic and yes, we failed miserably.
00:08:06.900
The performance was abysmal, and we couldn’t state that it was stable. We knew we had to optimize it, and that’s when the real story begins.
00:08:19.500
What would any developer do? I would typically start improving the REST code. Thankfully, my colleagues were smarter, and we decided we needed something different—some more flexibility.
00:08:37.979
Thus, in the middle of this radical refactoring, we made a significant decision to replace REST with GraphQL. We introduced another layer of feature flags to enable a gradual switch between REST and GraphQL, allowing us to toggle between the correct REST version and a more performant GraphQL version consistently.
00:09:07.080
We also added fallbacks to the GraphQL requests to ensure continued functionality, and with that, we felt ready to begin.
00:09:20.160
Not exactly. To avoid flying blind, we included custom request instrumentation. This meant that every request sent to billing was logged, capturing vital details such as method name, queried arguments, stack traces, and response times.
00:09:38.220
This logging allowed us to pinpoint frequently used methods and prioritize them for optimization. We monitored slow methods by recording response times, and thanks to the recorded arguments, we could reconstruct requests locally.
00:10:01.560
Moreover, we could determine which parts of the application generated these slow or numerous calls, thanks to the stack traces.
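The instrumentation described above could be sketched as a thin wrapper around the billing client; this is an assumed shape with made-up class names, logging the method name, arguments, response time, and the first few stack frames of each call:

```ruby
require 'json'

class InstrumentedBillingClient
  def initialize(client, logger: $stdout)
    @client = client
    @logger = logger
  end

  def call(method_name, **args)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @client.call(method_name, **args)
  ensure
    elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round(2)
    @logger.puts({
      method: method_name,
      args: args,             # lets us reconstruct the request locally
      duration_ms: elapsed_ms,
      stack: caller.first(5)  # enough frames to find the calling job/controller
    }.to_json)
  end
end

# Usage with a dummy backend that echoes its input.
backend = ->(name, **args) { { name => args } }
client = InstrumentedBillingClient.new(backend)
client.call(:billing_records, product_id: 42)
```

Aggregating these log lines is what made it possible to rank methods by call count and latency.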
00:10:21.959
With this strong foundation, we were ready to dive into real optimization work, having established solid monitoring and a safe deployment environment.
00:10:34.980
We began a furious series of optimizations, starting with the most notorious issue: jobs generating excessive requests that were bringing down billing.
00:10:49.079
We were facing thousands of requests; it was unbearable. The problem was with this Active Object pattern—it was easy to overlook that we were accessing billing.
00:11:01.320
If a job iterated over N products while calling business logic on each, the underlying logic used billing records, resulting in N queries to the billing service.
00:11:21.720
To manage this, we employed three strategies. First, we moved the data-loading logic out of the product class for this specific case.
00:11:38.640
This was done by loading billing records for batches of products instead of single requests. We grouped products by billing identifier to achieve this.
00:11:52.620
With an indexed hash, we loaded records once per batch, which led to drastically reducing the number of queries—from thousands to dozens.
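The batching idea can be sketched like this; `ProductStub` and `fetch_billing_records` are illustrative stand-ins for the real models and the external call:

```ruby
ProductStub = Struct.new(:id, :billing_identifier)

# One request for the whole batch instead of N single-record requests.
def fetch_billing_records(billing_identifiers)
  # Stand-in for the external billing call.
  billing_identifiers.map { |bid| { billing_identifier: bid, amount: bid * 10 } }
end

products = [ProductStub.new(1, 101), ProductStub.new(2, 102), ProductStub.new(3, 101)]

# Load once per batch and index by billing identifier, so lookups
# inside the loop are O(1) hash reads rather than network calls.
records_by_identifier = fetch_billing_records(products.map(&:billing_identifier).uniq)
                          .group_by { |r| r[:billing_identifier] }

products.each do |product|
  records = records_by_identifier.fetch(product.billing_identifier, [])
  # ...business logic that previously triggered one billing call per product...
end
```

Three products, two billing identifiers, one external request: the same shape that took the real jobs from thousands of queries to dozens.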
00:12:06.300
What we learned was that when faced with a flood of requests resembling N+1 queries, you need to implement preloading, caching, or joining yourself. It's simpler than it sounds.
00:12:41.579
Next, we efficiently applied caching.
00:12:52.560
However, we hit a limitation: sometimes an entire piece of business logic needed only a single field, so we still had an instance of over-fetching.
00:13:11.760
There was a specific field called several hundred times a day; during peak times, it was too much. We sought a better solution.
00:13:28.800
Our idea was to add this field to Kafka events, modifying data sent by the billing service. This way, we could create a consumer-side projection and read this modified model.
00:13:40.560
I was very excited to implement this, even considering speaking on it at conferences. However, a knowledgeable co-worker challenged us to explore local database alternatives.
00:14:02.640
Though I initially resisted as I wanted an innovative solution, upon investigating, we found the data was already present in our local database.
00:14:12.240
We switched from external queries to simple database queries and resolved the issue swiftly in just a few days instead of weeks, creating a more maintainable solution.
00:14:24.000
What we learned was that while clever technical tricks can win optimization battles, domain knowledge can help us avoid those battles altogether.
00:14:40.500
Now, we faced new challenges: large response sizes were slowing down the entire billing process. Every response contained every possible field and association, which meant pushing a lot of unnecessary data.
00:14:52.680
Initially, we aimed for a minimal API with generic responses. However, we ended up returning around 40 fields, creating severe payload issues.
00:15:02.760
The situation improved once we switched to GraphQL. Custom queries allowed us to pull only the necessary fields, filtering server-side.
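To illustrate the difference (field names here are hypothetical, not the actual schema): a GraphQL query names only the fields the caller needs, and the server strips the rest before responding.

```ruby
# The query asks for two fields, not the ~40 the full record carries.
BILLING_RECORDS_QUERY = <<~GRAPHQL
  query($productId: ID!) {
    billingRecords(productId: $productId) {
      id
      amount
    }
  }
GRAPHQL

# Simulated server-side filtering of a full record.
FULL_RECORD = {
  id: 1, amount: 250, currency: 'USD', status: 'paid',
  invoice_id: 9, created_at: '2021-07-28' # ...and many more in reality
}.freeze

REQUESTED_FIELDS = %i[id amount].freeze

response = FULL_RECORD.slice(*REQUESTED_FIELDS)
puts response.inspect # prints only id and amount
```

With a generic REST endpoint, the whole `FULL_RECORD` would have crossed the wire for every call.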
00:15:20.880
Our feedback loop kept improving: monitoring, prioritizing work, optimizing, and deploying, with the results confirming each enhancement.
00:15:35.880
On July 28th, a hot summer day, I deployed my changes and enabled the feature flags. Soon, users reported issues with Sidekiq workers, and at first I didn't realize my changes were involved.
00:15:50.760
CPU and memory usage skyrocketed to unforeseen levels due to my modifications. Postgres queries began taking minutes.
00:16:08.640
When I realized it was my build, I felt immediate dread. However, infrastructure resolved the problem efficiently without me even needing to step in.
00:16:24.480
The next day, my manager assured me they were working on fixes and congratulated me on the effort while emphasizing it's okay to make mistakes as long as we learn.
00:16:43.860
What went wrong? I had added parameter sanitization and inadvertently sanitized too much, letting only one parameter through instead of several.
00:17:03.120
The query method was overly optimistic: with no identifiers available for filtering, it fetched an excessive number of billing records.
00:17:20.579
The fix involved adjusting parameter sanitization and adding a safety check to ensure no request would fetch all records if no specific identifier was passed.
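That safety check might look like this; the parameter names and error class are assumptions for the sake of the example:

```ruby
class MissingFilterError < StandardError; end

ALLOWED_PARAMS = %i[product_id billing_identifier].freeze

def sanitize_billing_params(params)
  sanitized = params.slice(*ALLOWED_PARAMS).compact
  # Safety check: never allow a query with no identifier at all,
  # which would otherwise fetch every billing record.
  raise MissingFilterError, 'refusing unscoped billing query' if sanitized.empty?

  sanitized
end

p sanitize_billing_params(product_id: 42, junk: 1) # keeps only product_id

begin
  sanitize_billing_params(junk: 1)
rescue MissingFilterError => e
  puts "blocked: #{e.message}"
end
```

Failing loudly on an empty filter turns a silent fetch-everything bug into an immediate, visible error.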
00:17:38.680
This issue wasn't caught by our unit tests, which only exercised single parameters and missed the multi-parameter scenarios because of how the feature flags were set up.
00:18:04.800
As we kept increasing the traffic going through external calls, we hit issues of a different kind: errors that appeared only on Sundays.
00:18:23.140
The errors came as a flood, because scheduled reminders sent on Sundays all triggered these queries at the same time.
00:18:40.740
In response, we spread notifications over a two-minute window to avoid overwhelming billing, deploying accordingly, yet the errors persisted.
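Spreading the notifications can be done with simple jitter. In a real Sidekiq app each job would be scheduled with something like `NotificationJob.perform_in(delay, id)`; here the scheduling is simulated in plain Ruby, and the names are illustrative:

```ruby
WINDOW_SECONDS = 120 # the two-minute window mentioned above

def schedule_with_jitter(user_ids, window: WINDOW_SECONDS)
  user_ids.map do |id|
    # Uniform jitter: each notification lands somewhere in the window
    # instead of all of them firing at the same instant.
    { user_id: id, delay_seconds: rand(0...window) }
  end
end

jobs = schedule_with_jitter((1..1000).to_a)
puts jobs.map { |j| j[:delay_seconds] }.minmax.inspect # all delays fall within 0...120
```

A thousand simultaneous requests becomes roughly eight per second, which is a very different load profile for the billing service.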
00:19:02.760
We determined the root cause: due to its settings, Sidekiq was executing the jobs at a different rate than we had anticipated.
00:19:29.460
Implementing proper rate limiters allowed us to manage the load correctly, and when the next Sunday came, there were no further errors.
00:19:44.520
Thanks to the safety mechanisms we had in place, we learned valuable lessons over that month. So when I saw a green build on a Friday and decided to deploy, my weekend was indeed safe.
00:20:08.640
Both of my deployments at the beginning and end were executed safely.
00:20:29.640
This triad of monitoring, optimizing, and deploying iteratively helped refine our production API.
00:20:43.560
When we finally enabled external calls fully, it was entirely unremarkable, without concerns for safety.
00:21:01.260
We achieved this through good visibility from monitoring, reliable deployment mechanisms, and various optimization strategies.
00:21:18.640
As you look at these methods, you may be familiar with them already. They aren’t new tricks in the microservice world.
00:21:37.080
However, it's essential to understand why we didn't implement these practices from the start.
00:21:55.920
Simplicity was paramount: we started with a basic solution, a minimal and generic API.
00:22:14.280
We also had to learn as we went: identifying performance issues proved difficult without production traffic, which we couldn't realistically simulate in testing.
00:22:30.360
Lastly, navigating trade-offs in projects takes tact, which involves political and interpersonal dynamics.
00:22:49.440
Sharing our experiences and mistakes helps others avoid retreading the same paths. So go back to work, make mistakes, and join us to share your learnings.
00:23:11.520
And in an environment safe enough for experimentation, consider how it feels to deploy on a Friday afternoon. Thank you!
00:23:44.039
Yeah, it felt like a roller coaster for us as well.
00:23:56.219
I really appreciate your honesty about failures. They help everyone see that things don’t always have to be perfect.
00:24:11.160
That’s absolutely right. Without failure, we don't push ourselves to grow.
00:24:27.480
Is there anything else you would have done differently regarding the project?
00:24:41.640
Given endless time, I would try to replicate our infrastructure locally to handle production-like traffic effectively, allowing for quicker resolution of issues.
00:25:15.720
Our various environments experienced different issues that occasionally resulted in errors surfaced by the proxy rather than the actual application.
00:25:30.960
However, due to time constraints, we weren't able to implement this practice.
00:25:46.680
That's a common problem, isn't it? Do you have a personal habit or practice that helps when you're under pressure?
00:26:01.680
I've learned it’s essential to maintain documentation for the group, ensuring that specific individuals are responsible for its upkeep.
00:26:22.920
We’ve also been sharing our learnings through blog posts, in an effort to educate.
00:26:34.200
We do have a structured RFC process for significant architectural changes.
00:26:46.920
Collecting feedback is vital before promoting changes, particularly if they have lasting impacts.
00:27:05.640
Do you think dev teams should have a systematic approach to learning?
00:27:25.280
Yes, it's vital that teams implement lessons learned via post-mortems after significant incidents.
00:27:41.520
This process aims to understand what went wrong, how we can improve, and record that information optimally.
00:28:02.840
Have you as a team identified key success areas yet?
00:28:14.680
Definitely. We’ve worked on outlining principles for our coding standards, automation practices, and the broader goals of our tech stack.
00:28:29.040
Sharing in this way allows us all to learn and grow from each other's experiences.
00:28:40.640
I don’t see any further questions at the moment.
00:28:57.600
Well, best of luck to you, and I hope you can join the next session.