How to stop breaking other people's things

by Lisa Karlin Curtis

In her talk at Euruko 2021, Lisa Karlin Curtis, a software engineer at GoCardless, addresses the challenges faced by developers in maintaining API compatibility and preventing unintended consequences that may arise when altering APIs—commonly referred to as breaking changes. She emphasizes that breaking changes can lead to significant issues for users of an API, including urgent live operations incidents that necessitate immediate fixes. The presentation outlines key strategies for developers to work more effectively with their API integrators and avoid causing disruptions.

Key Points Discussed:
- Understanding Integrators: Lisa defines integrators as those who use APIs either within the same company or as customers. Recognizing their needs and behaviors is crucial for API reliability.
- Examples of Breaking Changes: Common changes that can break integrator code include adding mandatory fields, removing endpoints, changing validation logic, adjusting rate limits, modifying error responses, and altering the timing of batch processing.
- Real-World Anecdotes: A central story involves a developer fixing a performance issue in one endpoint, which inadvertently led to database degradation for another team relying on that endpoint due to unexpected load changes.
- Documentation and Communication: Lisa stresses the need for comprehensive and accessible API documentation to minimize incorrect assumptions by integrators. This should cover edge cases and call out observed behaviors that integrators might otherwise rely on even though they could change.
- Managing Assumptions: Lisa distinguishes between explicit and implicit assumptions made by integrators. She suggests that developers communicate better, provide clear documentation, and take a collaborative approach to reduce the likelihood of assumptions leading to issues.
- Approach to Changes: Lisa proposes considering the probability and severity of breaking changes rather than a binary approach of breaking or not breaking, encouraging empathy with integrator experiences to anticipate the impact of changes.
- Versioning and Rollbacks: The presentation touches upon semantic versioning and discusses the importance of supporting previous versions of APIs to manage transitions well. Lisa highlights the utility of limited incremental changes to monitor potential impacts before a full rollout.

Conclusion and Takeaways:
Lisa concludes that while completely avoiding breaking changes is nearly impossible, steps can be taken to manage risks effectively. Building transparent communication channels, continuously updating documentation, and ensuring responsive support mechanisms are vital in fostering trust with API integrators. Understanding the developer's responsibilities to their integrators can lead to a more reliable and supportive API ecosystem.

00:00:00.560 Are we ready to move on to our next speaker? Our next speaker is Lisa Karlin Curtis, and she is going to talk about how to stop breaking other people's things.
00:00:08.720 Lisa is a full-stack developer at GoCardless. She started out as a consultant working with the HMRC and then smart meters before accidentally becoming a developer.
00:00:15.200 I would like to hear more about that. She works mainly on a Rails app with some forays into the JavaScript front end and legacy PHP applications.
00:00:21.520 She loves building stuff but is also really interested in how people interact with each other in a work environment.
00:00:26.760 Particularly in software engineering, having seen the old way at Accenture with large-scale waterfall projects, she is now looking at taking the lessons from that environment to the startup scene.
00:00:33.440 About the talk: breaking changes are sad. We’ve all been there; someone changes their API in a way you weren’t expecting, and now you have a live ops incident you need to fix urgently to get your software working again.
00:00:38.879 Of course, many of us are on the other side too—we build APIs that other people's software relies on.
00:00:44.280 All the discussions and later on questions for Lisa can be put in the stream chat. So, please, Lisa, the floor is yours; you can start your presentation.
00:02:04.320 Hi, um, thanks so much for having me. You’re okay? This is really cool! So, yeah, I'm Lisa Karlin Curtis, born and bred in London, England, and I'm a software engineer at GoCardless.
00:02:12.560 I work in our financial orchestration team. We are a payments company that focuses on recurring payments, and I’m going to be talking today about how to stop breaking other people's things.
00:02:18.319 We’re going to start with a sad story. A developer notices they have an endpoint that has a really high latency compared to what they'd expect.
00:02:28.879 They find a performance issue with the code, which is essentially an exacerbated N+1 problem, and they deploy a fix.
00:02:33.920 The latency on the endpoint goes down by half. The developer stares at the beautiful graph with a lovely cliff shape, right? You know, really high and nicely dropping, and they feel really good about themselves.
00:02:39.200 They pat themselves on the back and move on. Somewhere else in the world, another developer gets paged; their database CPU has spiked, and it's struggling to handle the load.
00:02:44.560 They’ve got a bit of service degradation, so what happened here? They start investigating; there’s no obvious cause, and no recent changes were deployed.
00:02:51.280 The request volume is pretty much what they’d expect. They start scaling down their queues to relieve the pressure, which seems to solve the problem.
00:02:56.720 The database recovers, and then they notice something strange: they’ve suddenly started processing webhooks much more quickly than they used to.
00:03:02.640 So it turns out that our integrator, which is on the right-hand side of this slide, had a webhook handler that would receive a webhook from us.
00:03:08.959 Then it would make a request back to find the status of the resource; the reason they needed to do that was that the events could be delivered out of order.
00:03:15.519 They wanted to make sure that their status reflected what was in our database, and this was actually the endpoint that we fixed earlier that day.
00:03:22.080 I’m going to use the word integrator a lot, and what I mean is people who are integrating against the API that you are maintaining.
00:03:28.000 Sometimes that will be inside your company, like another team, or sometimes it might be a customer.
00:03:34.000 So back to our story: that webhook handler spent most of its time waiting for our response; it was very I/O bound.
00:03:39.840 Then it would update its own database, so the slow endpoint was essentially rate-limiting the webhook handler's interaction with its own database.
00:03:45.120 It's worth noting that at GoCardless, our webhooks are often a result of batch processes, which means they're really spiky.
00:03:50.480 We send big sets of them a couple of times a day, so as the endpoint got faster during those spikes, the webhook handler started to apply more load to the database than normal.
00:03:56.480 To such an extent that an engineer got paged to resolve a service degradation. The fix here is fairly simple: scale down those webhook handlers so they process fewer webhooks and the database usage returns to normal.
00:04:02.159 Alternatively, beef up your database, but it shows us just how easy it is to accidentally break someone else's stuff even if you're trying to do right by your integrators.
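The effect in this story can be approximated with some napkin arithmetic. The sketch below assumes a fixed pool of handler workers that each process one webhook at a time, waiting on the API call and then writing to their own database. The worker count and timings are made-up numbers for illustration, not figures from GoCardless or the integrator.

```python
def db_writes_per_second(workers: int, api_latency_s: float, db_write_s: float) -> float:
    """Upper bound on database writes per second for a pool of blocking workers.

    Each worker handles one webhook at a time: it waits for the API call,
    then writes to its own database, so one webhook takes the sum of both.
    """
    return workers / (api_latency_s + db_write_s)


# Hypothetical numbers: 10 workers and a 50 ms database write.
before = db_writes_per_second(10, api_latency_s=0.40, db_write_s=0.05)
after = db_writes_per_second(10, api_latency_s=0.20, db_write_s=0.05)
print(f"before: {before:.1f} writes/s, after: {after:.1f} writes/s")
```

Under these assumptions, halving the API latency nearly doubles the write rate hitting the database during a spike, with no change deployed on the integrator's side.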
00:04:08.239 So to set the scene, here are some examples of changes that have broken code in the past. Traditional API changes, right? Adding a mandatory field, removing an endpoint, or changing validation logic.
00:04:15.680 I think we're all comfortable with this stuff and why it could break things.
00:04:20.400 Introducing a rate limit or even changing your rate limiting logic—Docker did this reasonably recently, and I think they communicated it very clearly, but it obviously impacted lots of integrators.
00:04:32.560 They also worked hard to provide tooling for integrators to self-serve and understand their impact, which I thought was really cool.
00:04:46.240 Changing an error response string—GoCardless had an issue where we basically found a bug in our own code.
00:04:52.400 We weren’t respecting the Accept-Language header on a few of our endpoints, so somebody would request us with 'Accept-Language: fr' and we would respond with English errors.
00:04:59.200 We noticed this, thought it was bad, and we fixed it. Then we received a call from an integrator claiming we broke their stuff.
00:05:05.280 We were confused, as we hadn’t realized we were breaking their integration.
00:05:11.040 It turned out that they were relying on the previous behavior where we ignored their Accept-Language header and always responded with an English error.
00:05:17.600 They were using that English error to match it against a string, translating it, and displaying something in the UI.
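One way an integrator could avoid this kind of coupling is to key off a machine-readable error code rather than the human-readable message. This is a minimal sketch that assumes the API returns such a code; the payload shape, the `code` and `message` keys, and the example codes are all hypothetical and not taken from the GoCardless API.

```python
# Hypothetical error codes mapped to the integrator's own UI strings.
TRANSLATIONS = {
    "insufficient_funds": "Fonds insuffisants",
    "mandate_cancelled": "Mandat annulé",
}


def display_message(error: dict) -> str:
    """Pick UI text from a stable error code, never from the message text."""
    code = error.get("code", "")
    # Fall back to the API's own message for codes we do not recognise.
    return TRANSLATIONS.get(code, error.get("message", "Unknown error"))


# The message language can change without affecting what the UI shows.
print(display_message({"code": "insufficient_funds", "message": "Insufficient funds"}))
print(display_message({"code": "insufficient_funds", "message": "Fonds insuffisants"}))
```

With this shape, fixing the Accept-Language bug would change only the `message` text, which the integrator no longer depends on.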
00:05:23.040 Breaking apart a database transaction might seem obvious in some ways when we think about our own systems.
00:05:30.080 We all know that internal consistency is really important, but it's relevant for your integrators too.
00:05:40.160 For example, let's say you have a resource that can be either active or inactive, and when it's deactivated, you create a row in an events table explaining why that happened.
00:05:46.640 It would be quite natural for an integrator to build a UI that explains why this resource was deactivated, helping the user understand what happened.
00:05:54.879 If in the past that event was created inside a database transaction with the status change, and now we break apart that transaction, it creates new scenarios for the integrator to handle.
00:06:01.759 There’s now a possibility for the integrator where the resource can be inactive, but there’s no corresponding event to tell them why.
00:06:07.520 It's entirely plausible that the integrator has assumed this would never happen because it never had, and thus their UI will error if they cannot find the corresponding event.
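On the integrator's side, the defensive version of that UI code might look like the sketch below. The resource and event shapes (`status`, `resource_id`, `action`, `reason`) are hypothetical and chosen only to illustrate handling the new "inactive but no event yet" state.

```python
from typing import Optional


def deactivation_reason(resource: dict, events: list) -> Optional[str]:
    """Return the reason a resource was deactivated, or None if it is not known."""
    if resource["status"] != "inactive":
        return None
    for event in events:
        if event["resource_id"] == resource["id"] and event["action"] == "deactivated":
            return event["reason"]
    # The status change and its event are not guaranteed to appear together.
    return None


def reason_label(resource: dict, events: list) -> str:
    """Text for the UI that never errors when the event is missing."""
    reason = deactivation_reason(resource, events)
    return reason if reason is not None else "Reason not yet available"
```

An integrator who wrote the lookup as a hard failure would see errors as soon as the transaction was split, even though every individual API response is still valid.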
00:06:13.520 Changing the timing of your batch processing can also lead to issues. As I mentioned, GoCardless is a payments company, and we have a daily batch process that submits instructions to the banks.
00:06:21.679 We can see from our logs that certain integrators create lots of payments just in time, right before our daily payment run.
00:06:29.759 So we know that if we were to change our timings without communicating with them, it could cause significant issues, as a lot of their payments might be delayed.
00:06:38.479 The last example here is reducing the latency on an API call, which is, kind of, what we discussed in that first example.
00:06:43.920 This is probably a good thing overall, but it can have some negative side effects.
00:06:50.320 So today, I'm going to define a breaking change as something where I, as the API developer, do something and someone's integration breaks.
00:06:58.080 That happens fundamentally because an assumption made by that integrator is no longer correct.
00:07:03.760 When this happens, it’s easy to criticize the engineer who made that assumption, but I don’t think that’s particularly productive for a couple of reasons.
00:07:09.840 Firstly, assumptions are inevitable; as a developer, you cannot get anywhere without them.
00:07:15.760 So if you want people to write code, they’re going to make assumptions.
00:07:22.960 Secondly, even if it’s their fault, it’s often your problem.
00:07:31.680 Possibly not if you’re Google or AWS, but for most companies, if your integrators are feeling pain, then you’ll feel it too.
00:07:38.240 Either immediately or in the long-term, when you’re trying to renew contracts.
00:07:45.440 So how do these assumptions actually develop? We can think of these in two categories: explicit and implicit.
00:07:51.520 Explicit assumptions occur when an integrator asks a question, gets an answer, and then builds their system based on that answer.
00:07:58.080 So your first step, if you’re building an integration, is to look at the documentation.
00:08:03.520 It's worth noting that people are quite lazy, and they often skip to the examples without reading any of the narrative text.
00:08:09.280 You need to make sure that your snippets are super representative of how your system is going to behave.
00:08:16.080 They might also look at support articles or blog posts, perhaps stuff you’ve published or something a third party has online.
00:08:21.760 Then, you have ad hoc communication, which includes random emails or phone calls with like a pre-sales team or your solutions engineers.
00:08:27.040 This ad hoc communication drives the assumptions that integrators make about how your software behaves.
00:08:33.280 Other assumptions are more implicit. Industry standards are quite interesting here; if you send me a JSON response, you’re going to give me a Content-Type: application/json header.
00:08:41.920 I won't need to tell my HTTP client that it's going to be JSON, because it can work it out for itself.
00:08:48.560 I, as an integrator, will assume this never changes. Similarly, I assume that you will keep my secrets safe.
00:08:55.279 If you tell me my access token was used to create something, I will assume it was probably me.
00:09:01.440 This is fine, but in some cases, you can find yourself in trouble, particularly if these standards change.
00:09:07.680 We had a bad incident at GoCardless when we upgraded our HAProxy version, which was observing the new industry standard.
00:09:14.000 The new standard was down-casing all of our outgoing HTTP headers. According to the official textbook, HTTP response headers should not be treated as case-sensitive.
00:09:20.800 But a couple of key integrators had been relying on the previous behavior and had a significant outage.
00:09:26.560 That outage was exacerbated by the fact their requests were being processed, but they weren’t processing our responses.
00:09:34.480 That meant we had two systems that were out of sync in a really unfortunate way.
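For integrators who handle raw headers themselves, a case-insensitive lookup avoids this class of failure. The helper below is a small sketch; the function name and the `X-Request-Id` header are illustrative only, and many HTTP client libraries already expose headers through a case-insensitive mapping.

```python
from typing import Optional


def get_header(headers: dict, name: str) -> Optional[str]:
    """Look up an HTTP header by name, ignoring case."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None


# The same lookup works whether or not a proxy lower-cases header names.
print(get_header({"Content-Type": "application/json"}, "content-type"))
print(get_header({"content-type": "application/json"}, "Content-Type"))
```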
00:09:38.679 Finally, let’s talk about observed behavior. As an integrator, you want the engineers running the services you use to be constantly improving them and adding features.
00:09:46.480 But you also want them to not touch anything, ensuring that its behavior won't change.
00:09:52.160 As soon as a developer sees something, whether that's an undocumented header on an HTTP response, a batch process that happens at the same time each day, or a particular API latency, they assume it's reliable.
00:09:58.560 They build their systems accordingly. Humans also pattern match aggressively, not just in software but in all walks of life.
00:10:05.760 We see this in the theory of language acquisition; we find it easy to convince ourselves that correlation equals causation.
00:10:11.040 That means that particularly if we can come up with an explanation of why A always means B, however far-fetched, we are quick to accept and rely on it.
00:10:18.560 It’s quite ironic given that we are all developers who are employed to make changes to our own systems.
00:10:25.520 We should understand that they are constantly in flux, yet we all encounter interesting edge cases every day.
00:10:31.040 Someone hits an incredibly unlikely scenario that causes our code to misbehave, but somehow we assume others' code will behave consistently and remain the same forever.
00:10:37.440 None of this stuff is new. A great example, if a bit retro, is MS-DOS, which is an old operating system from Microsoft.
00:10:43.920 MS-DOS was released with a number of documented interrupt calls, hooks, and all that retro stuff, but early application developers weren't able to achieve everything they wanted.
00:10:50.720 This was compounded because Microsoft used undocumented calls in their software, making it impossible to compete using what was only in the documentation.
00:10:57.200 So, like all good engineers, they started decompiling the operating system and wrote lists of undocumented information.
00:11:03.920 The most famous of which is probably Ralph Brown's interrupt list, which became widely shared.
00:11:10.560 Using these undocumented features became so widespread that Microsoft couldn't change anything without breaking these applications.
00:11:16.000 Particularly as an operating system, these applications were a core part of their value proposition, so breaking them clearly wasn't an option.
00:11:22.400 We can think of the interrupt list as analogous to someone writing a blog on Medium called '10 Things You Didn't Know That So-and-So's API Could Do.'
00:11:28.240 It seems innocuous at first, but it can cause problems down the line.
00:11:35.520 Some of these assumptions are also totally unconscious. Once something is stable for a while, we tend to assume it will never break.
00:11:42.080 This is particularly obvious when it comes to resource choices, such as how much CPU or memory to allocate to a particular pod.
00:11:48.480 The napkin math we do is always pretty haphazard. If we’re all being honest, we pull numbers out of thin air, watch them for a bit, and then change them until they seem happy.
00:11:55.120 That works fine as long as what that pod is being asked to do is reasonably consistent over time.
00:12:01.760 But as we've discussed, this might not be true.
00:12:07.840 We can think about this in our first story: the database had plenty of resources until our endpoint got faster.
00:12:13.760 If we want to stop breaking other people's things, we need to help our integrators stop making bad assumptions.
00:12:20.160 When it comes to your documentation, document edge cases. Discoverability is also crucial.
00:12:27.680 Think about SEO, which is search engine optimization, and also the search within your docs site.
00:12:34.240 Don’t ever deliberately leave something undocumented if it's subject to change; just call it out clearly.
00:12:40.800 This gives integrators the best chance of making a good choice.
00:12:47.520 Support articles and blog posts must be kept religiously up to date, and again, try to ensure they’re quite searchable.
00:12:54.079 If you come across third-party blogs that are incorrect, try contacting the author or commenting with the fix needed.
00:13:02.080 You can also point them to an equivalent page on your own doc site.
00:13:10.000 If you get unlucky, that third-party blog content can become the equivalent of Ralph Brown's interrupt list and can tie you to contracts you really don’t want.
00:13:16.240 When it comes to ad hoc communication, consistency is key.
00:13:23.280 If a developer wants to understand what might break someone else's stuff, they need to know what communication is going out.
00:13:30.080 Ideally, this should be in a super-searchable format so they can understand what assumptions might have been made.
00:13:37.600 Many B2B software companies just email random PDFs around or create shared Slack channels.
00:13:43.840 At that point, as an engineer, you don't stand a chance of knowing what assumptions might have been made.
00:13:50.720 If you’re able to have a central repository for those kinds of materials, it helps.
00:13:57.120 It doesn’t have to be public, but something where you’re repeatedly sharing the same information.
00:14:03.760 Ideally, this information isn't treated as static, and there's an expectation from your integrators that it might change.
00:14:10.160 When it comes to industry standards, just follow them wherever you can.
00:14:17.440 And flag loudly if you can't or where the industry hasn’t yet settled.
00:14:23.680 Also, there’s a lot to think about with observed behavior so we'll give it its own slide.
00:14:30.720 Naming is really important because developers often don’t read narrative docs and instead look at examples.
00:14:37.120 One example is numbers that begin with zero, which often get truncated, such as company registration numbers.
00:14:44.480 We have a field in our API called account number ending, and unfortunately, in Australia, some account numbers have letters in them.
00:14:50.320 Because the name says 'number', integrators tend to treat it as numeric, which causes confusion even though that field is a string.
00:14:56.000 We try to call that out clearly in our docs, even providing examples that highlight those edge cases.
00:15:01.760 You also want to use your documentation to combat pattern matching. If batch timings could change, call that out in the documentation.
00:15:09.600 If you say, 'We currently run this once a day at 11 a.m.,' make sure it’s clear that this timing is likely to change.
00:15:16.000 Expose information about your API that might change, to signal to integrators that what they see now may not always be true.
00:15:22.240 And restrict your own behavior: document a limit and implement it in code to ensure you keep that commitment.
00:15:29.760 We had an issue at GoCardless where a service we integrate with started adding a lot of extra events to its webhooks.
00:15:36.560 Our webhook handlers ran out of memory because they were trying to load way too much data.
00:15:43.760 So if we had known there was a limit on the number of items in a webhook, we could have tested against it.
00:15:50.080 We could have made sure that our pods were resourced appropriately.
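"Document a limit and implement it in code" could look like the sketch below on the sending side. The limit of 250 and the function name are assumptions for illustration, not a real limit from any API mentioned in the talk.

```python
# Hypothetical documented limit on the number of events in one webhook.
MAX_EVENTS_PER_WEBHOOK = 250


def split_into_webhooks(events: list, limit: int = MAX_EVENTS_PER_WEBHOOK) -> list:
    """Split events into payloads that never exceed the documented limit."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [events[i:i + limit] for i in range(0, len(events), limit)]


payloads = split_into_webhooks(list(range(600)))
print([len(p) for p in payloads])
```

Because the limit is enforced in code, a receiver can size and load-test its handlers against the largest payload it will ever be sent.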
00:15:55.840 Complex products make it unlikely that all your integrators will avoid bad assumptions.
00:16:02.560 We need to find strategies to mitigate the impact of our changes.
00:16:10.080 The first thing to remember is that a change isn't either breaking or not. I think this is a completely false binary.
00:16:17.360 If an integrator has done something strange enough, almost anything can be breaking.
00:16:25.760 This binary has historically been used to assign blame. If it’s not breaking, then it’s the integrator’s fault.
00:16:31.920 But as we discussed earlier, it may not be technically your fault, but it’s probably still your problem.
00:16:38.960 If your biggest customer's integration breaks, the fact that you didn’t break the rules will be little consolation.
00:16:45.040 So instead of viewing it as a yes/no question, we should think in terms of probabilities.
00:16:52.960 How likely is it that someone has made this assumption? How likely is it that this will cause an issue?
00:16:59.520 How severe do we think that issue might be? Not all breaking changes are equal.
00:17:05.680 Some changes are 100% breaking—killing an endpoint, for example. You'll have a lot of unhappy integrators.
00:17:12.160 But many changes fall somewhere between 0% and 100% breaking.
00:17:19.840 Try to empathize with your integrators about the assumptions they might have made.
00:17:26.800 Use people in your organization who are less familiar with the specifics than you are as rubber ducks.
00:17:34.240 If possible, talk to your integrators directly. The more you talk to them, the more you will understand the mistakes they might make.
00:17:40.800 If you can find ways to dogfood your APIs, this can help you find tripwires.
00:17:47.680 This is particularly good as an onboarding exercise; we ask our new joiners to build an integration against our API, putting them in the shoes of integrators.
00:17:54.080 This also helps you keep your docs and guides up-to-date, and it introduces new joiners to your product in an accessible way.
00:18:02.720 Sometimes you can measure this: add observability to help you look for people relying on undocumented behavior.
00:18:08.960 For example, I've mentioned we see a spike in payment create requests every day just before our payment run.
00:18:15.760 This approach can help identify which integrators might be impacted so you can reach out to them specifically.
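A rough version of that measurement is sketched below: count payment-create requests per integrator in a window just before the daily run. The log format, the 15-minute window, and the integrator names are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def just_in_time_creators(requests, run_time, window=timedelta(minutes=15)):
    """Count payment-create requests per integrator in the window before the run.

    `requests` is an iterable of (integrator_id, created_at) pairs from logs.
    """
    counts = Counter()
    for integrator_id, created_at in requests:
        if run_time - window <= created_at < run_time:
            counts[integrator_id] += 1
    return counts


run = datetime(2021, 5, 28, 11, 0)
log = [
    ("acme", run - timedelta(minutes=5)),
    ("acme", run - timedelta(minutes=2)),
    ("globex", run - timedelta(hours=3)),
    ("acme", run + timedelta(minutes=1)),
]
print(just_in_time_creators(log, run).most_common())
```

Integrators at the top of that list are the ones to contact individually before moving the run time.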
00:18:22.320 Some of you may be wondering, what about semantic versioning (semver)?
00:18:28.080 Now, don't get me wrong, semantic versioning is awesome, provided it's used appropriately
00:18:34.560 and the release type is identified correctly. It's a great way to release potentially breaking changes.
00:18:40.720 We should use this not just for packages but also for APIs and webhooks wherever possible.
00:18:47.040 This solves some of our problems but not all of them. As someone who maintains a public API, there are lots of changes that can't be applied this way.
00:18:53.120 For instance, the timing of our batch processing or reducing the latency on an endpoint.
00:19:00.080 Not everything can be applied on an opt-in basis at a merchant-by-merchant level.
00:19:06.720 Additionally, every new version you support increases the complexity of your system.
00:19:13.120 Complexity leads to risk; it makes it harder to debug things and can cause other issues.
00:19:19.520 There’s a trade-off to make. If a major version doesn't work for your use case, I recommend scaling your release approach.
00:19:25.520 This depends on how many integrators you think have made bad assumptions and what impact those might have.
00:19:31.840 We want different strategies at different levels; if we over-communicate, we get into a 'boy who cried wolf' situation.
00:19:38.160 No one reads the emails we send them, and their integrations end up breaking anyway.
00:19:45.120 Strangely enough, the email in their inbox that they didn’t read doesn’t seem to make them feel any better.
00:19:52.000 To handle this, start with pull communications. Update your docs or a changelog.
00:19:58.800 This is particularly useful to help integrators recover after they've found an issue.
00:20:05.040 Then you can upgrade to push communications, like a newsletter or an email.
00:20:11.440 This is where it gets tough. We all ignore many emails every day, so try to ensure the content is as relevant as possible.
00:20:18.160 Don’t tell integrators about changes to features they don’t use and resist the temptation to include marketing content.
00:20:24.720 If you’re really worried, use explicitly acknowledged communications. This works well if you have a few key integrators you want to check in with.
00:20:31.680 For instance, if these are the only people relying on this functionality or just a couple of particularly important integrators.
00:20:38.080 It’s important to make these kinds of changes often. It’s a muscle you need to practice; otherwise, both you and your integrators get scared.
00:20:45.360 You may forget how or lose the infrastructure to do it.
00:20:52.720 And if you’re really unlucky, the cultural incentive is to argue that a change isn’t breaking and release things without rigor.
00:20:59.760 We can also mitigate the impact of a breaking change by considering how to release it.
00:21:06.160 If possible, make those changes incrementally to give early warning signs to your integrators.
00:21:13.040 For example, apply the new behavior to a percentage of requests; this helps integrators avoid performance cliffs.
00:21:20.080 It could turn a potential outage into a minor service degradation.
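Applying new behavior to a percentage of requests can be sketched as below. Hashing a request ID into a bucket is one possible approach and an assumption here, as are the function and parameter names; a plain random draw per request would serve the same purpose.

```python
import hashlib


def use_new_behaviour(request_id: str, percent: int) -> bool:
    """Deterministically route roughly `percent`% of requests to the new behaviour."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket is in 0..99
    return bucket < percent


# Ramp up gradually; setting percent back to 0 turns the change off again.
for percent in (0, 5, 25, 100):
    enabled = sum(use_new_behaviour(f"req-{i}", percent) for i in range(1000))
    print(percent, enabled)
```

Because each request always lands in the same bucket, raising the percentage only ever adds requests to the new path, and dropping it to zero doubles as a simple kill switch.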
00:21:27.760 Many integrators will have near-miss alerting to help them identify problems before they cause significant damage.
00:21:35.280 If you have a sandbox environment, it's a great candidate for applying changes.
00:21:42.720 Making changes there first, as long as integrators are actively using it, can act as the canary in the coal mine.
00:21:50.360 This helps flag changes you didn’t think were dangerous but might be a little bit trickier than you thought.
00:21:57.920 Finally, think about rolling back. If your biggest integrator calls you to tell you that you’ve broken their integration, it’s nice to have a kill switch.
00:22:05.040 Whether that's possible depends on the nature of the change, but it's good to know what your kill switches are and to be clear about when they can be used.
00:22:12.080 As soon as that call comes in, you want to know your options and be able to react quickly.
00:22:18.560 The only way to truly avoid breaking other people’s things is not to change anything at all, and often even that is not possible.
00:22:25.760 So instead, we should think in terms of managing risk.
00:22:32.240 We've talked about ways to prevent these issues by helping your integrators make good assumptions in the first place.
00:22:39.760 It is crucial to build and maintain the capability to communicate when making potentially breaking changes.
00:22:46.560 But you aren’t a mind reader, and integrators are sometimes careless under pressure, just like you.
00:22:53.200 Be cautious and assume that your integrators didn’t read the docs perfectly or maybe at all and may have cut corners.
00:23:00.000 They may not have the observability of their systems that you might hope or expect.
00:23:06.320 You need to find the balance between caution and product delivery that's right for your organization.
00:23:13.600 For all the modern talk of 'move fast and break things,' it is still painful when things break.
00:23:21.440 Recovering can take a lot of time and energy.
00:23:29.680 Building trust with your integrators is critical to the success of a product.
00:23:36.480 But so is delivering features. We may not completely stop breaking people's things, but we can make it much less likely and much less severe.
00:23:43.480 I really hope you’ve enjoyed the talk. Thank you so much for listening. Please find me on Twitter @paprikati_eng if you’d like to chat about anything we’ve covered today.
00:23:50.800 I hope you all have a great day.
00:26:29.679 Thank you, Lisa! I know it's really hard to deliver a talk without an audience, so let me read some feedback for you.
00:26:35.840 People enjoyed your talk. Such great feedback! Wow! That was the best talk of the day so far.
00:26:42.320 Thank you! The audience really enjoyed your speech. Are we ready to go to the questions? Yes? Let's go!
00:27:10.240 What about decisions? How should we document past decisions, like why did we do that?
00:27:16.320 Um, I think this is interesting, as there's an internal and an external side to this.
00:27:25.600 Internally, I believe the best documentation is in git because it sticks around longer and is the easiest way to make stuff discoverable.
00:27:32.320 There are a bunch of talks about this, but include the 'why' in commit messages.
00:27:45.280 Try to ensure your commits are atomic, following the best practices learned early in development.
00:27:50.800 For larger decisions impacting your integrators, pushing that information out is key.
00:27:56.480 A blog is a great way to communicate that kind of 'point-in-time' reasoning—this is why we’re doing this.
00:28:02.200 It helps get people's buy-in. If they need to make a change to mirror what you've done, you want to convey that there’s a good reason behind it.
00:28:09.200 Communicate the benefits it will bring them while being sorry for the pain it may cause.
00:28:16.000 I think a blog really distinguishes that type of communication from your documentation, which should be static.
00:28:24.480 Thank you for the good idea.
00:28:32.320 The next question is can you apply the insights from this talk to user experience?
00:28:45.280 Certainly, users are often surprised by sudden changes on the website or app.
00:28:52.640 I think one of the best things to do is A/B testing and to roll things out to a percentage of your users.
00:29:01.280 This is particularly useful for a big user base and can provide early warning signs if something isn’t right.
00:29:07.680 When it comes to UX, the best thing you can do, as horrifying as it is, is to watch people using your tool.
00:29:14.720 Incredibly painful, but it’s truly the best way to learn about expectations.
00:29:22.480 We would love to see the best memes about this in our chat or later on Twitter.
00:29:28.720 Now, let’s move on to naming. How do you convince people that naming is critical?
00:29:37.640 It's obvious to me, but I’ve struggled to convince others.
00:29:42.720 I think it's about observed behavior. Developers don’t read documentation; they read a minimum number of words to get stuff working.
00:29:49.920 When you explain that, everyone internally goes, 'Oh, yes! I do that too.'
00:29:56.320 If you utilize examples, they will see the name as the most front-and-center information about what that field means.
00:30:02.800 If you get that wrong, you’ll spend the rest of your life putting signposts and flags everywhere.
00:30:08.240 Framing it like that really helps convey its importance.
00:30:15.200 You can also share horror stories, like the time Australian account numbers threw me for a loop.
00:30:21.440 These anecdotes help illustrate the importance of naming.
00:30:28.080 Let’s move to the topic of versioning APIs.
00:30:36.000 Breaking changes should not be in the same version of the API, right? That’s where versioning comes in.
00:30:43.440 Adding new attributes is often considered a non-breaking change.
00:30:50.080 However, what happens when a client breaks because they received unexpected attributes?
00:30:55.840 In that case, whose fault is it? I don't think asking whose fault it is is a useful question.
00:31:03.040 Increasingly, the industry standard is that adding fields should not be a breaking change, so clients should ignore unexpected keys in a JSON response.
00:31:10.160 It's essential to state in your documentation that you follow this standard.
00:31:17.200 You should also note that many libraries auto-generate clients, which helps maintain expected behavior.
00:31:23.680 If you can provide those libraries, it greatly aids integrators.
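To make "clients should ignore unexpected keys" concrete, here is a minimal sketch of a tolerant client-side parser. The `Payment` type and its fields are hypothetical and not taken from any real API; the point is only that the client picks out the keys it knows about and drops the rest, so a new attribute in the response does not break it.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Payment:
    # Hypothetical resource shape, for illustration only.
    id: str
    amount: int
    currency: str


def parse_payment(raw: str) -> Payment:
    """Build a Payment from a JSON object, ignoring keys we do not know."""
    data = json.loads(raw)
    known = {f.name for f in fields(Payment)}
    # Keep only the attributes this client version understands.
    return Payment(**{k: v for k, v in data.items() if k in known})


if __name__ == "__main__":
    # The server has added "new_field" since this client was written.
    body = '{"id": "PM123", "amount": 1000, "currency": "GBP", "new_field": "x"}'
    print(parse_payment(body))
```

A client written the strict way, for example `Payment(**data)`, would raise a `TypeError` on the same response, which is exactly the kind of break the question describes.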
00:31:30.000 Regarding the assumptions: if assumptions are unavoidable, what should we leave out of documentation?
00:31:36.640 This is a trade-off, and it's difficult to give a generic answer.
00:31:42.960 The point about communication comes down to the 'boy who cried wolf' situation.
00:31:50.080 If you receive too many emails that are unnecessary, you'll stop reading them.
00:31:56.800 So, push communications can be dangerous, especially when people are already inundated with marketing material.
00:32:03.680 Narrowing your audience is helpful; tell only the relevant people about the product changes.
00:32:10.080 You should document the most important things—the ones with the highest impact.
00:32:17.760 For example, if someone misinterprets something and that leads to double charging a customer, that is critical.
00:32:25.040 Conversely, if the outcome is simply displaying a string that isn’t quite right, you can be more relaxed.
00:32:31.920 You need to manage the likelihood of assumptions alongside potential impacts.
00:32:41.120 Now, do you ever skip nice changes because you know it will create a lot of work to communicate the change?
00:32:48.160 Of course, I would never do that! Yes, everybody has faced this situation.
00:32:54.960 The problem typically is that either you don’t change it, or you do change it but don’t tell anyone.
00:33:02.080 You need to keep the incentives aligned with treating integrators with respect, which means reducing the friction of communicating a change.
00:33:08.680 Build tools that make it easy to communicate, get feedback on that communication, and adhere to style guidelines.
00:33:15.280 Clarifying responsibility for communication can alleviate confusion.
00:33:22.600 Have the tooling and processes to reduce friction; otherwise, integrators may end up with a worse service, or their systems will break.
00:33:30.080 What about deprecated fields? How long should we keep them around?
00:33:37.480 I apologize to our integrators! We often keep them forever.
00:33:44.720 Your policy should consider the cost to the person making the change versus the risk of keeping multiple versions.
00:33:52.640 People often mistakenly think keeping deprecated fields doesn’t impact anyone, but having multiple versions introduces complexity.
00:34:00.080 Anything that can help reduce complexity is a positive thing, including eliminating deprecated elements.
00:34:08.080 Set a hard line—commit to a specific day, and if the world isn't on fire, the deprecated elements will be removed.
00:34:14.840 Make sure every team feels empowered to enforce this policy; it's critical for system health and helps everyone out.
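One way to picture the "hard line" for deprecated fields is to record a committed removal date next to each deprecated field and have the serializer stop emitting the field once that date arrives. The sketch below assumes a hypothetical field name and date, and is not a description of how any particular API implements this.

```python
from datetime import date

# Hypothetical registry: deprecated field name -> committed removal date.
DEPRECATED_FIELDS = {
    "legacy_reference": date(2030, 1, 1),
}


def serialize(resource: dict, today: date) -> dict:
    """Return the response body, dropping deprecated fields once their
    removal date has arrived."""
    return {
        key: value
        for key, value in resource.items()
        if key not in DEPRECATED_FIELDS or today < DEPRECATED_FIELDS[key]
    }


def fields_pending_removal(today: date) -> list[str]:
    """List deprecated fields that are still being served, so they can be
    called out in documentation or changelogs."""
    return sorted(
        name for name, removal in DEPRECATED_FIELDS.items() if today < removal
    )


if __name__ == "__main__":
    resource = {"id": "PM123", "legacy_reference": "abc"}
    print(serialize(resource, date(2029, 12, 31)))  # field still present
    print(serialize(resource, date(2030, 1, 1)))    # field removed
```

Keeping the date in one place gives every team the same answer to "when does this go away?", which makes the policy easier to enforce.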
00:34:24.080 Thank you, Lisa! You truly are talented, and the audience has greatly enjoyed your presentation.
00:34:30.720 Huge thanks to you, and let’s hope to see you in the chat! You can find Lisa on Twitter!