RubyConf 2021

How to stop breaking other people's things

by Lisa Karlin Curtis

In her talk at RubyConf 2021, Lisa Karlin Curtis addresses the crucial topic of minimizing disruption to integrators by managing API changes effectively. She begins with a cautionary tale illustrating how a well-intentioned update can have unintended consequences, ultimately causing a significant issue for another developer in the ecosystem. Lisa emphasizes the balance developers must strike between improving their own systems and considering the impact on those who consume their APIs. The key points discussed include:

  • Understanding Breaking Changes: A breaking change occurs when an API alteration invalidates an integrator's assumptions, leading to unexpected errors or failures. Lisa encourages developers to view potential changes through the lens of how likely they are to disrupt existing integrations.
  • Common Sources of Issues: Several seemingly routine adjustments are flagged as likely to cause problems, such as modifying endpoint structures, altering rate limits, and changing response formats. For instance, she shares a story about how a seemingly harmless fix to reduce endpoint latency inadvertently overloaded another integrator's database.
  • Identifying Assumptions: Assumptions made by integrators can be explicit (through documentation) or implicit (based on industry standards), and understanding these assumptions is critical for minimizing the risk of breaking changes.
  • Best Practices for API Maintenance: To bolster integration reliability, Lisa outlines strategies, including:
    • Comprehensive and clear documentation of APIs, highlighting edge cases and potential issues.
    • Effective communication of changes through different channels, transitioning from passive documentation updates (pull communication) to proactive alerts (push communication).
    • Utilizing versioning systems like semantic versioning judiciously to signal potential breaking changes (see the short example after this list).
    • Engaging with integrators to keep abreast of their assumptions and usage patterns.
  • Mitigation Strategies: Implementing gradual changes, providing sandbox environments for testing, and maintaining rollback capabilities can greatly reduce the impact of any breaking changes.
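
As one concrete illustration of the semantic-versioning point above (the gem name here is hypothetical, not from the talk), Ruby's pessimistic version constraint is how integrators typically opt into non-breaking updates while staying protected from a major-version bump that signals breaking changes:

```ruby
# Gemfile (illustrative only; "acme_api" is a made-up client gem).
# "~> 2.1" allows any 2.x release from 2.1 upwards, but never 3.0, so a
# provider that bumps the major version for breaking changes won't be
# pulled in by accident.
gem "acme_api", "~> 2.1"
```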

In conclusion, while completely avoiding disruptions in integrations may be unrealistic, Lisa advocates for a proactive approach in managing APIs, enhancing communication, and fostering a relationship based on trust with integrators to mitigate the impact of potential breaks. This structured approach can lead to a smoother evolution of APIs without sacrificing the reliability that integrators depend on.

00:00:10.719 Hey.
00:00:11.599 Thanks so much for coming to my talk about how to stop breaking other people's things.
00:00:13.440 I'm Lisa Karlin Curtis, born and bred in London, England.
00:00:16.640 I'm a software engineer at Incident.io, where we work on being the best way to help your team manage incidents across your whole organization. You should check us out online.
00:00:23.840 Today, I'm going to be talking about how to stop breaking other people's things. We're going to start with a sad story.
00:00:34.880 A developer notices that they have an endpoint with extremely high latency compared to what they expect. After finding a performance issue in the code, they deploy a fix.
00:00:41.040 The latency on the endpoint goes down by half. The developer stares at the beautiful graph, posts it in the graph Slack channel, and feels really good about themselves before moving on with their life.
00:00:48.800 Meanwhile, in another part of the world, another developer gets paged to find their database's CPU usage has spiked, struggling to handle the load.
00:00:59.039 Upon investigation, they notice there's no obvious cause. No recent changes were made, and the request volume is pretty much what they expected.
00:01:04.960 They begin scaling down queues to relieve the pressure, which resolves the immediate issue. Despite that, they still don’t know what happened.
00:01:11.439 Then they notice something odd: they have suddenly started processing webhooks much faster than usual.
00:01:19.040 It turns out that our integrator, the developer on the other side of this story, had a webhook handler that would receive a webhook from us and then make a request back to us to check the status of a resource.
00:01:28.799 This was the endpoint that our first engineer had fixed earlier that day.
00:01:36.960 Just as an aside, I'm going to use the word 'integrator' frequently, which refers to people integrating with the API you are maintaining. This could be someone inside your company, a customer, or a partner.
00:01:46.880 Back to our story: that webhook handler spent most of its time waiting for our response before updating its own database. Essentially, our slow endpoint rate-limited the webhook handler's interaction with its own database.
00:02:03.680 As a result, each time it received a webhook it would send a GET request, wait around for us to respond, and only then update its own database, so our response time set the pace for that work.
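
A minimal sketch of that pattern, with the class, endpoint, and field names all invented for illustration: the handler blocks on our API before touching its own database, so our latency is what paces its write load.

```ruby
# Hypothetical sketch of the integrator's webhook handler; every name and
# URL here is invented for illustration.
require "net/http"
require "json"

module Database
  # Stand-in for the integrator's real persistence layer.
  def self.update_status(id, status)
    puts "UPDATE resources SET status = #{status.inspect} WHERE id = #{id}"
  end
end

class WebhookHandler
  API_BASE = "https://api.example.com" # placeholder host

  def handle(webhook)
    # The handler blocks here, waiting for the API to respond...
    resource = fetch_resource(webhook["resource_id"])

    # ...and only then writes to its own database. Halve the API's latency
    # and these writes arrive twice as fast during a webhook spike.
    Database.update_status(resource["id"], resource["status"])
  end

  private

  def fetch_resource(id)
    response = Net::HTTP.get_response(URI("#{API_BASE}/resources/#{id}"))
    JSON.parse(response.body)
  end
end
```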
00:02:11.440 The webhooks in this system are a result of batch processes—they are very spiky and would tend to arrive all at once a couple of times a day.
00:02:27.040 As the endpoint became faster during those spikes, the webhook handler began to exert more load on the database than normal, leading to a developer being paged to address the service degradation.
00:02:35.680 The fix is fairly simple: you can either scale down the webhook handlers to process fewer webhooks or beef up your database resources. This situation illustrates how easy it is to accidentally break someone else's functionality.
00:02:44.160 Even when you're trying to do the right thing by your integrators and improve the performance of your API, issues like this can arise.
00:03:03.680 To set the scene, let’s consider some examples of changes that I’ve seen break code in the past: traditional API changes, such as adding a mandatory field, removing an endpoint, or changing validation logic.
00:03:34.000 I think we all understand that these changes can easily break someone else's integration. Other examples include introducing a rate limit or changing a rate limiting strategy.
00:03:53.439 Docker recently communicated changes in their rate limiting effectively, but it still had a significant impact on many integrators.
00:04:11.920 Another example is changing an error response string. At my previous job at GoCardless, we had a bug where the code didn't respect the Accept-Language header on a few endpoints.
00:04:45.360 When we corrected the error response to provide the correct language based on the Accept-Language header, one of our integrators raised an urgent ticket, stating that we broke their software because they relied on our API not translating that error.
00:05:13.440 They had written their code to check the error string specifically, applying a regex to it to decide how their software should behave.
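
To make that failure mode concrete, here is a hedged sketch (the error text and matching logic are invented, not GoCardless's actual messages) of the kind of check that breaks the moment the API starts honouring Accept-Language:

```ruby
# Illustrative only: the integrator keyed behaviour off the English error
# text. Once the API began honouring Accept-Language, the message could
# come back translated and this branch silently stopped matching.
require "json"

def classify_error(response_body)
  error = JSON.parse(response_body)

  if error["message"] =~ /already exists/i
    :duplicate # treat as a duplicate and carry on
  else
    :fatal     # anything else is treated as a hard failure
  end
end

classify_error('{"message": "Bank account already exists"}')   # => :duplicate
classify_error('{"message": "Le compte bancaire existe déjà"}') # => :fatal, unexpectedly
```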
00:06:07.520 Now let's discuss breaking apart a database transaction. Integrators rely on our systems being internally consistent.
00:06:28.360 For instance, if you have a resource that can either be active or inactive, and you deactivate it while failing to create an event for that change, an integrator could run into issues trying to understand why the resource was deactivated.
00:06:59.360 This can lead to their UI breaking if they can't find the corresponding event, illustrating that breaking consistency is not just about our own systems but also about how it affects integrators.
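
A minimal sketch of that guarantee, assuming ActiveRecord-style models named Resource and Event (the names are illustrative, not from the talk): the state change and the event that explains it either both commit or neither does.

```ruby
# Sketch assuming ActiveRecord models Resource and Event exist (names
# invented). The status change and the event that explains it live in one
# transaction, so integrators never see a deactivated resource with no
# corresponding "resource.deactivated" event.
class ResourceDeactivator
  def deactivate(resource)
    ActiveRecord::Base.transaction do
      resource.update!(status: "inactive")

      # If this insert fails, the status change above rolls back too.
      Event.create!(
        resource_id: resource.id,
        action: "resource.deactivated"
      )
    end
  end
end
```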
00:07:22.320 Now, let's talk about the timing of batch processing. Companies often have daily batch processes in place, which are pivotal for their operations, particularly in payment transactions.
00:07:52.960 At GoCardless, there were numerous cases where integrators created payments just in time for the daily run, making it critical that we maintain a predictable timing for these runs.
00:08:09.200 If we changed the timing of the daily payment runs, all those integrations would break since they depended on our established schedule.
00:08:35.680 Lastly, reducing the latency on an API call can also have a breaking impact. This brings us back to the story we started with.
00:09:01.120 Throughout this talk, I will define a breaking change as an action that I, as the API developer, perform that inadvertently breaks an integration.
00:09:22.040 When that happens, it usually results from an assumption made by the integrator that is no longer valid. It's important to recognize that assumptions are inevitable.
00:09:38.720 Even when it's the integrator's assumption that caused the break, in practice it often becomes a problem for the API provider as well.
00:10:10.400 For many companies, when the integrators feel pain, the company will feel it too. Therefore, dismissing or pushing responsibility solely on the integrators is not productive.
00:10:38.720 Assumptions can be formed explicitly through questions and answers, as well as implicitly through industry standards or behaviors that become ingrained over time.
00:11:12.840 Integrators frequently assume that standard HTTP behaviors (like JSON content types) won’t change, and this can lead to unforeseen consequences.
00:11:52.800 For example, we faced an incident where a change in HTTP header casing (something the HTTP spec explicitly permits, since header names are case-insensitive) caused a significant outage for two key integrators who relied on the previous casing.
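
The defensive version of that assumption is to treat header names as case-insensitive, just as HTTP itself does. A small illustrative sketch (the header name is invented):

```ruby
# HTTP header names are case-insensitive, so don't index into a raw header
# hash with one exact spelling. Normalise the keys first (the header name
# here is just an example).
def fetch_header(headers, name)
  headers.transform_keys(&:downcase)[name.downcase]
end

headers_v1 = { "X-Request-Id" => "abc123" }
headers_v2 = { "x-request-id" => "abc123" } # same header, new casing

fetch_header(headers_v1, "X-Request-Id") # => "abc123"
fetch_header(headers_v2, "X-Request-Id") # => "abc123", still works
```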
00:12:29.120 When integrating, developers want reliability, but they also expect innovation, leading to a tricky balance when changes are made.
00:12:43.920 Developers tend to pattern match aggressively based on previous experiences, assuming others’ systems are flawless while often overlooking potential issues in their own.
00:13:15.680 Interestingly, even after we have each been bitten by rare edge cases ourselves, we continue to trust that other systems will behave consistently.
00:13:40.080 A great example here is the early history of MS-DOS—developers had to reverse engineer undocumented behaviors, influencing how applications operated initially, and preventing Microsoft from making changes that would break compatibility.
00:14:06.240 After a long period of stability, teams may begin to assume that change is impossible, which leaves them exposed when things do eventually change.
00:14:45.440 To avoid breaking others' things, we need to help our integrators stop making poor assumptions. One key way to accomplish this is by documenting edge cases thoroughly.
00:15:30.960 Documentation should be discoverable; consider SEO for your documentation and ensure critical information is easily accessible.
00:16:06.400 Additionally, never withhold documentation about changing features. Consistently keep your support articles and blogs updated, making it clear if certain features may change.
00:16:40.240 Keeping industry standards in mind is critical; follow them whenever possible and clearly indicate if you can't or if the industry is in transition.
00:17:16.640 Naming matters here too: what you call things shapes the assumptions integrators form about observed behavior. Developers often skim through documentation; they focus on examples over narrative.
00:18:01.040 A frequent issue arises when numerically formatted strings (like registration numbers) are treated inappropriately as numbers, leading to unexpected outcomes.
00:18:41.520 Highlighting edge cases in documentation can curb many errors, as your examples help integrators avoid misinterpretation.
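
A quick sketch of why this bites (the values are made up): coerce a registration number to an integer and its leading zeros silently disappear.

```ruby
# Registration numbers look numeric but are really strings: coerce them to
# integers and leading zeros are silently lost (values invented).
registration_number = "00123456"

registration_number.to_i       # => 123456
registration_number.to_i.to_s  # => "123456", no longer the original value

# Keep it as a string and compare it as a string.
registration_number == "00123456" # => true
```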
00:19:23.120 Using communication to counteract aggressive pattern matching is also helpful. If a change to a process is anticipated, clearly note this within the documentation.
00:19:50.480 For instance, clearly denote batch processing timings, while reserving the right to make changes without notice.
00:20:14.559 Unexpected changes, like suddenly sending additional events over webhooks, are exactly why it's paramount that all communication with the outside world is controlled and documented.
00:20:47.839 It's vital that there’s clarity around constraints when interfacing with integrators. Furthermore, be prepared to roll back changes should integrations break.
00:21:39.680 While changes can't always be avoided, managing the risk they pose is critical. We've discussed the importance of preventing mistakes and fostering robust communication.
00:22:14.880 Engaging with integrators personally is a productive way to troubleshoot potential issues ahead of time. Implementing tests in sandbox environments can also provide insight.
00:22:49.680 Rolling out changes incrementally can help prepare integrators for more significant ones, and notifying them directly through documented channels makes the transition smoother.
00:23:33.120 Interfacing with key integrators can set up enhanced communication for significant changes, ensuring the support team is ready to react if issues arise.
00:24:06.560 Keeping a live communication channel open with not only your key integrators but all integrators helps maintain a dialogue about their ongoing needs.
00:24:41.680 Empathy for their perspective can foster goodwill while also ensuring that actionable information flows between both parties.
00:25:15.440 Remember that not all changes can be categorized as simply breaking or non-breaking; they exist on a spectrum with each change presenting a likelihood of causing an issue.
00:26:00.800 So while it might not be your fault that an issue arose, it remains your problem if it impacts integrators. Your biggest customers may face significant challenges due to changes.
00:26:55.680 We must think about how we scale communication and deliver information relevant to integrations without overwhelming or under-informing our developers.
00:27:30.160 Finally, I hope you have enjoyed the talk. Thank you so much for listening.
00:27:47.680 Please find me on Twitter @paprikarti if you'd like to chat about anything we've covered today. I hope you have a great day!