Summarized using AI

Applying SRE Principles to CI/CD

Mel Kaulfuss • October 13, 2022 • Helsinki, Finland

In this video titled "Applying SRE Principles to CI/CD" presented by Mel Kaulfuss at Euruko 2022, the speaker explores how to apply Site Reliability Engineering (SRE) principles to improve Continuous Integration and Continuous Deployment (CI/CD) processes. Kaulfuss shares personal anecdotes from their experiences in software development, highlighting common CI/CD challenges such as flaky tests, slow builds, and reliability issues, which often hinder developers' productivity.

Key Points Discussed:

  • Introduction to CI/CD and SRE:

    • CI/CD allows for automated building and testing of code, enabling teams to ship code frequently and reliably.
    • SRE, established at Google in 2003, focuses on improving operational practices and the reliability of systems.
  • Challenges in CI/CD:

    • Kaulfuss details a scenario where the CI/CD process can fail due to flaky tests and builds that take excessive time, leading to frustration among developers.
    • Shares statistical insights about the time developers spend retrying failed builds, emphasizing the need for improvement in CI/CD workflows.
  • The Role of SRE Principles:

    • Identifies the significance of understanding Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets in establishing a reliable CI/CD pipeline.
    • SLOs define acceptable reliability levels, while SLIs serve as metrics to gauge the success of SLOs, with error budgets dictating acceptable failure thresholds.
  • Measurement and Observability:

    • Advocates for the importance of measurement to establish a baseline and have informed discussions with stakeholders.
    • Encourages teams to define what "well" looks like in their CI/CD processes to drive improvements.
  • Practical Implementation:

    • Discusses customizing SLOs and SLIs based on specific needs, like ensuring builds start within a reasonable time or maintaining test suite reliability percentages.
    • Suggests utilizing tools like Datadog and Honeycomb for gathering observability metrics and performance data.
  • Continuous Improvement:

    • Emphasizes the necessity of adjusting CI/CD practices based on collected data, encouraging a proactive rather than reactive approach.
    • Encourages collaboration among teams to diagnose and resolve issues like flaky tests effectively.

Conclusions and Takeaways:

  • Applying SRE principles can significantly improve CI/CD processes and rebuild trust among stakeholders.
  • Automation, measurement, and robust observability are critical in refining deployment practices and enhancing developer experience.
  • Engaging all stakeholders in defining reliability metrics fosters better alignment and shared understanding of system performance expectations.

The session concludes with an invitation for questions from the audience, highlighting the interactive nature of the discussion and the ongoing conversation about improving CI/CD practices.

Discover how to approach CI/CD with an SRE mindset. Learn what SLOs, SLIs & error budgets are, and how to define them for your own build & deploy processes. Rebuild trust with your system’s stakeholders, and reclaim control over slow & unreliable build and deploy processes.

To watch with closed captions, view the livestream recording: https://www.youtube.com/watch?v=reVGR35H264&t=11910s

EuRuKo 2022

00:00:04.880 Here we are! I hope everyone enjoyed lunch. Oh, whoa! Let's get this party started.
00:00:20.580 Oh no, the sound didn't work! There was a sound that I really wanted to play.
00:00:33.660 The party hasn't started if we haven't heard the Mac sound from the 2000s.
00:00:41.879 So, my name's Mel. I'm a Lead Developer Advocate at Buildkite, and I come from Melbourne, Australia. Buildkite is a CI/CD platform with a hybrid model, meaning you can run your build agents on your own infrastructure while we host and maintain the APIs and a UI for you to view all your pipeline and build-related information.
00:00:53.700 Now, it's not a competition, but I think I probably flew the furthest to get here today. I'm willing to have a discussion with anyone who thinks they flew further, but my flight was 25 hours.
00:01:11.700 I know someone who caught a ferry from Germany that took 30 hours, so they win in the duration stakes but not in the distance stakes. I have my flashlight on my phone, and that's really amateurish.
00:01:35.700 I love Ruby and the Ruby community. Just a bit of background: I wrote my first line of Ruby or any code at a Rails Girls event in Melbourne in 2014. Shout out to Rails Girls! From then on, I attended my first Ruby conference, looked around, and met a whole heap of people. I felt like I had found my people.
00:01:58.680 For the first time ever, the Ruby community welcomed me, and I felt a part of it. Since then, I organized three Ruby conferences in Australia and emceed four. I can't believe I'm at Euruko. This is my first international trip since 2020, so it's a big moment for me.
00:02:17.340 I’m going to take a selfie to prove to my workplace that I’m not just here on holiday; I'm actually working. Thank you for indulging me.
00:02:30.140 Today, we're going to look at CI/CD through an SRE lens. I will cover some of the issues we might face in CI/CD and explore some SRE principles, discussing how we might apply them to CI/CD in a practical sense.
00:03:00.480 Today's venue is a beautiful place for this story to unfold. It's a drama in three parts: starting down the rabbit hole, we will end up looking to a rosy future together. If there's no drama, we aren't shipping software. Would you agree? There's always drama in what we do.
00:03:14.580 Let's set this in motion. I don't have a dramatic score to play, but you can imagine whatever music you like—something dramatic, like Game of Thrones or Lord of the Rings.
00:03:38.940 We have a monolithic codebase, a gargantuan test suite, and for some additional drama, we have microservices as far as the eye can see. This is the software landscape we operate in, in 2022. It’s complex and very difficult to navigate. But I’m probably like you—we all love a good challenge. That's why we work in tech. We love the highs and the lows. It's not about the destination; it's about the friends we make along the way.
00:04:15.299 Today is a big day for me. My adventure is coming to a conclusion: I am about to ship something I've been working on for a couple of months. I'm pretty excited; it's been tested, verified, and we are ready to go. The branch is green! I’m going to savor this huge moment, and I’ve always wanted to say this: make it so!
00:04:45.780 My build’s kicked off, and I’ve got time for a coffee. I’ll enjoy some freshly brewed vibes with some freshly shipped vibes. My day was looking good, but now it’s no longer looking good because my build failed.
00:05:07.500 Sifting through the logs, I can see that a test has failed—this is weird because everything on my branch passed. The test that failed isn't even related to the code I just shipped. So, I’m pretty blocked, heading down a rabbit hole that I just really don’t like going down.
00:05:31.139 I don’t know how you all feel about CI/CD, but I love CI/CD. I work for a company that builds CI/CD tools, so clearly I love it. But as a software developer, I’m pretty partial to not knowing or wanting to know that CI/CD exists. I want to push changes and know that my code is tested and safe, and that it will be live within 10 minutes in production, even on a Friday. I don’t think that’s too much to ask, do you?
00:06:05.460 Apparently, it is. We end up not trusting CI/CD like we should be able to, and we battle the system, adapting workflows to processes that don’t live up to what they're supposed to.
00:06:40.220 As the label says, continuous integration and deployment is the automation of building and testing code. CI/CD allows teams to ship code easily and frequently with a high level of trust that users won't be impacted by bugs.
00:07:07.259 At Buildkite, we have a button to retry a failed step, so I don’t have to kick off my entire build again. But I’m mashing the button, and the same test keeps failing. Flaky tests are a thing, and we all know this.
00:07:19.979 I know they’re a thing because when I searched Google for ‘flaky test memes,’ there were literally thousands. So, when we encounter a flaky test as software engineers, what do we do? We go and make a meme! This gave me lots of material to choose from for my talk, so thank you.
00:08:00.460 I remember, when I first started working as a software engineer, pushing a build with a tiny change, just baby steps. The build was red. I turned to the senior engineer next to me and said, 'For some reason, this test failed, and it doesn't seem related to what I pushed.' It turned out it was a smoke test that just never worked.
00:08:32.279 From that moment on, I always asked, 'Surely that's flaky?’ Every time there was a red failure, I’d think it was the tests, nothing to do with my work. But we need to be able to trust our tests. As software engineers, we want CI/CD to provide us with fast, reliable feedback about the software being delivered.
00:09:04.140 When it doesn’t, we have issues. End users of our systems, as well, no longer trust the system we’ve built to ensure what we're deploying is safe.
00:09:35.139 Last year, I did some digging into the impact of flaky tests on our users. I like going down rabbit holes when I choose to. I asked someone to run me a query to find out how long users spent hitting the retry button. I was staggered by the numbers—the very real monetary cost that retrying tests had on Buildkite users.
00:10:10.760 Over a one-month period, Buildkite users spent a cumulative 9,413 days hitting retry, which is a huge, huge number. If you think about that across multiple CI systems, that’s a lot of cost.
00:10:33.480 Think about the environmental cost and the monetary cost of developers sitting there hitting retry. It's a big number. To put it into perspective, you can get to Mars and back 17 times in 9,413 days.
00:11:03.300 Besides the time I've spent hitting retry to see if my build passes or fails, I've lost my flow. What do we do when we lose our flow? We hop onto Slack to clear all the little unread notifications, or we jump onto Twitter and dump a hot take or two. Either way, it takes time to get back into the zone.
00:11:39.600 It’s disheartening to have to ask in Slack, 'Does anyone know if this test is flaky? I swear I’ve seen that fail before.' Sometimes it passes, sometimes it fails. When I’m also battling mega slow builds, I just want to get my changes out.
00:12:02.760 I don’t need to go on; you get the picture. I’m sure you’ve all been where I’ve been. Right now, it’s quite frankly a frustrating place to be.
00:12:30.840 So, how can we minimize the impact of situations that I have just vividly and realistically lived through? Turns out SREs know a thing or two about ensuring that systems are reliable.
00:12:54.060 Just a little bit of a history lesson: the first SRE team was formed at Google in 2003. They wrote a book called Site Reliability Engineering: How Google Runs Production Systems.
00:13:16.040 It's a very good book that lays out some of the principles and practices of SRE, along with the benefits of maintaining services and associated infrastructure. There’s a lizard on the cover, and I actually thought it was a monitor—how can you let that one slip?
00:13:39.660 We can't talk about SRE without talking about DevOps first. Google's SRE book suggests that DevOps is defined as a loose set of practices, guidelines, and a culture designed to break down silos in software engineering, operations, networking, and security.
00:14:03.240 DevOps principles encourage us to remove silos and do away with barriers that exist, either organizationally or between different disciplines in teams. It also encourages us to accept that accidents are a normal part of building and maintaining software. Naturally, CI/CD plays a big role in that.
00:14:37.440 Understanding that tooling and culture are interrelated, tooling and automation are, of course, essential components of DevOps, but DevOps thinking recognizes that organizational culture and human systems are critical to adopting new ways of working.
00:15:06.720 Finally, Google's SRE handbook states that DevOps understands that measurement is crucial for success. Measurement is essential for breaking down silos, for managing production incidents and, crucially, for verifying reality.
00:15:35.760 DevOps is essentially a philosophy: a way of thinking about our work, our people, and our organizations as an ecosystem, and about keeping that ecosystem healthy and functional. SRE, on the other hand, is more practical; it focuses on improving operational practices and the reliability of our core systems.
00:16:05.160 There are many examples of SRE principles out there. While there’s no definitive list set in stone, some common characteristics appear in any definition: we work towards automation or eliminating anything repetitive to reduce costs, design systems with a bias toward reducing risks to availability, latency, and efficiency, and we should be able to ask arbitrary questions about our system without needing to know in advance what we’d like to ask.
00:16:39.360 My favorite principle? Avoiding more reliability than what’s strictly necessary. SRE focuses on ensuring reliable services, but 100% reliability is never the goal, as we know it's unattainable.
00:17:01.080 Accepting that mistakes are part of software, SRE ensures systems are only as reliable as necessary. Defining what is necessary is a practice unto itself. Thanks for joining me in this little SRE-shaped rabbit hole! It's been lovely.
00:17:33.600 However, it's time to get out because there are people wanting me to ship features. Mercury isn't in retrograde at the moment, so it’s definitely not that that’s bumming me out.
00:17:58.800 We know we have problems that keep cropping up when we least want them to. Let's look at putting some of these principles into practice. Things feel really bad in CI/CD right now—we know we have a problem with flaky tests, slow builds, and long waits for builds to kick off.
00:18:31.260 It feels bad, but what do we actually know? Remember, both DevOps thinking and SRE principles state that measurement is crucial for success. We need to practice observability and have ready access to data about our systems.
00:19:01.320 We need good metrics as an objective foundation for conversations with stakeholders. They will agree on these metrics because we will have hard facts about the state of our system, not just some anecdotes. However, how does this work in practice?
00:19:35.700 Let’s take a look: it's impossible to do your job well if you haven’t defined what ‘well’ means. How do you improve things if you don’t know how bad they are to start?
00:20:03.960 Site reliability engineering uses SLOs, SLIs, and error budgets to define what's important, how reliable things should be, and how to measure these things. When these are clearly understood and defined, it’s almost magical how well it allows teams to focus their energy on what matters.
00:20:30.420 So, what's an SLI? It’s a key metric indicating whether an SLO is being met. An SLO is a promise related to a metric that matters to users or maintainers of a system. SLOs define a level of unreliability that’s acceptable, and the error budget reflects how much and for how long a service can fail to meet the SLO before there are consequences.
00:21:04.900 How do we get everyone to agree on what SLOs, SLIs, and error budgets to set? If you’re feeling pain right now, others are probably feeling the same way. It's the perfect opportunity to come together and start working to improve things.
00:21:39.480 Let's stop being reactive and become proactive. We'll assign metrics and strive to uphold them—only when needed! It's going to be amazing.
00:22:12.600 Time for some questions: What systems are we discussing? Are we talking about CI/CD or limiting the scope to an application test suite? What’s important to the different stakeholders? It's crucial to understand everyone's expectations of the system.
00:22:56.760 Once you have a shared understanding, it’s time to agree on SLOs, SLIs, and reasonable error budgets. You might want to stop developers waiting around for builds to kick off, so a reasonable SLO could be starting builds within 30 seconds.
00:23:36.840 The SLI might be the time spent waiting for a build to start, with an error budget of 33 builds taking more than 30 seconds in a four-week period. Google's SRE book suggests a method to calculate your error budget: one minus your SLO percentage.
00:24:13.230 Error budgets are flexible, and if you want more information, you can Google how to set an error budget to find resources on that.
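The "one minus your SLO" calculation mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the 99% target and the 3,300 builds per four-week period are made-up figures (picked because they happen to produce the budget of 33 builds used in the example), and the function name `error_budget` is hypothetical.

```python
def error_budget(slo_target: float, total_events: int) -> int:
    """Number of events allowed to miss the objective: (1 - SLO) * total."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    # round() absorbs floating-point noise such as 33.00000000000003
    return round((1.0 - slo_target) * total_events)


# Assumed figures, not from the talk: a 99% SLO over 3,300 builds.
allowed_slow_starts = error_budget(0.99, 3300)
print(allowed_slow_starts)
```

A stricter target shrinks the budget, which is why the target is worth agreeing with stakeholders before anyone starts measuring.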
00:24:41.520 You might commit to development teams that their tests will complete, and that they will be notified of success or failure, within five minutes. For this, your SLI would be the total build runtime, with an error budget of 33 builds taking more than five minutes in a four-week period. For my problem today, a great SLO would be ensuring my test suite's reliability percentage is above 87%. Your error budget could be 77 test runs scoring below that threshold.
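As a rough sketch of the test suite reliability SLO described above: the code below assumes that a run's reliability is simply passed tests divided by all executed tests, which is an assumption for illustration and not necessarily how any particular analytics product computes its score. The names `reliability_percent` and `runs_below_threshold` are hypothetical.

```python
from typing import Iterable, Tuple


def reliability_percent(passed: int, failed: int) -> float:
    """Assumed definition: share of executed tests that passed, as a percentage."""
    total = passed + failed
    if total == 0:
        raise ValueError("a run needs at least one executed test")
    return 100.0 * passed / total


def runs_below_threshold(runs: Iterable[Tuple[int, int]], threshold: float = 87.0) -> int:
    """Count runs, given as (passed, failed) pairs, scoring under the threshold."""
    return sum(1 for passed, failed in runs if reliability_percent(passed, failed) < threshold)


# Illustrative data only: three runs of a 100-test suite.
sample_runs = [(95, 5), (80, 20), (87, 13)]
misses = runs_below_threshold(sample_runs)
budget_spent = misses > 77  # the budget from the example: 77 runs below 87%
```

Here only the second run counts against the budget, because a run scoring exactly 87% does not fall below the threshold.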
00:25:34.260 How do you get all these SLI metrics? Good question! For production monitoring systems, there are big players like Datadog and Honeycomb, along with many others that allow you to configure custom metrics. Honeycomb even has a product for managing SLOs along with their observability tooling.
00:26:34.560 For total wait time, your CI/CD will likely allow you to access this metric. At Buildkite, we have a CLI tool to collect our agent metrics, and we also have OpenTelemetry instrumentation built into the agent, enabling you to send traces and metrics around build wait times to Datadog, Honeycomb, and more.
00:27:52.320 Once you have that SLO being measured via the SLI, you can start tuning your agent capacity to improve total build start time. We want developers to have speedy feedback loops; for that, your SLI would again be the total build runtime. We have an API endpoint to collect this data, and our UI also shows a monthly average build time for each pipeline.
00:28:52.080 Once you’re measuring this build runtime metric, you can tune your CI/CD infrastructure as needed—but only after spending your error budget. Imagine you have nothing left in the budget; you’ve had more than 33 builds over five minutes, so you need to get those times back down into acceptable parameters.
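The "has the budget been spent?" check described above could look something like the following sketch. It assumes build records are available as `(finished_at, duration_seconds)` pairs, which is a made-up shape rather than any real API response, and `budget_status` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple


def budget_status(
    builds: Iterable[Tuple[datetime, float]],
    now: datetime,
    limit_seconds: float = 300,          # five minutes, from the example
    budget: int = 33,                    # allowed slow builds per window
    window: timedelta = timedelta(weeks=4),
) -> Dict[str, object]:
    """Count builds in the window that ran longer than the limit."""
    cutoff = now - window
    breaches = sum(
        1
        for finished_at, duration in builds
        if finished_at >= cutoff and duration > limit_seconds
    )
    return {
        "breaches": breaches,
        "remaining": budget - breaches,
        # Exhausted once there are more slow builds than the budget allows.
        "exhausted": breaches > budget,
    }


# Illustrative data only.
now = datetime(2022, 10, 13)
sample_builds = [
    (now - timedelta(days=1), 400),   # slow and inside the window
    (now - timedelta(days=40), 900),  # slow but outside the window
    (now - timedelta(days=2), 200),   # fast
]
status = budget_status(sample_builds, now)
```

While `exhausted` is false there is nothing to tune; once it flips, bringing build times back down becomes the priority.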
00:29:43.680 Since Buildkite agents are hosted on your own infrastructure, you can leverage the same cloud-hosted capabilities, like auto-scaling compute resources, as you would for production infrastructure. Pipeline steps can also be optimized to run in parallel across as many agents as needed to speed up builds.
00:30:11.640 Another way to speed up builds is to assess test performance. We have a new product called Buildkite Test Analytics that provides interesting information about your test suite's health so you can optimize.
00:30:28.320 For example, we sped up a slow test that once took 35 seconds down to under three seconds simply by switching out the Capybara matcher. That’s a huge speed gain over an entire suite.
00:30:56.880 Once you've defined what's important and understand what your stakeholders consider important, you can access those metrics and tune performance as needed. The reliability score is front and center for everyone to see, and we can easily track this for improvement.
00:31:21.840 The SLO for our test suite reliability percentage should be greater than 87%, and once the error budget is exceeded with 77 test runs under that mark, the budget is gone!
00:31:50.280 However, with continued testing and focus on quality, we can ensure long-term improvements. So, thank you for being here today.
00:32:06.120 I appreciate today’s discussion and can’t wait to share more! Also, our virtual booth will have additional resources, webinars, and our Twitter page, where I’d love for you to come say hello. I have some retro stickers and ten Buildkite Kites to give away, so come grab one before I head back to Australia.
00:32:46.860 Thank you for your time today, and enjoy the rest of the conference!
00:33:16.680 Thank you, Mel! We have a few questions. Would you be happy to answer those?
00:33:41.780 Excellent! Do you have any tips or strategies to fix flaky specs?
00:33:59.220 That can be challenging as they're all pretty different. In my experience, they often involve integration tests and are about pixel locations or buttons not being where you expect them.
00:34:29.661 I’d recommend pairing with someone on problem-solving for those, as it can often lead to new insights.
00:34:50.820 How do you differentiate between tests that fail due to code issues versus flaky tests or infrastructure problems?
00:35:15.240 That's a fantastic question! I often went to the systems engineer to blame them, but it was usually my work that needed fixing. Test analytics can help you see reliability scores and determine flakiness.
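One common heuristic for telling flaky tests apart from genuine failures is to look for a test that has both passed and failed against the same commit: the code did not change, so the outcome should not have either. The sketch below assumes results are available as `(test_name, commit_sha, passed)` tuples; that shape and the name `find_flaky_tests` are hypothetical, and this is not a description of how any specific test analytics product detects flakiness.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def find_flaky_tests(results: Iterable[Tuple[str, str, bool]]) -> List[str]:
    """Return names of tests that both passed and failed on the same commit."""
    outcomes: Dict[Tuple[str, str], Set[bool]] = defaultdict(set)
    for name, sha, passed in results:
        outcomes[(name, sha)].add(passed)
    flaky = {name for (name, _sha), seen in outcomes.items() if seen == {True, False}}
    return sorted(flaky)


# Illustrative data only.
sample_results = [
    ("checkout_spec", "abc123", False),
    ("checkout_spec", "abc123", True),   # same commit, different outcome: flaky
    ("login_spec", "abc123", False),
    ("login_spec", "abc123", False),     # consistently failing: likely a real bug
    ("search_spec", "abc123", True),
    ("search_spec", "def456", False),    # different commits: not enough evidence
]
flaky_tests = find_flaky_tests(sample_results)
```

A test that fails consistently on one commit points at the code or the infrastructure, while mixed outcomes on an unchanged commit point at the test.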
00:35:37.920 You mentioned CI reliability score. How do you define that score?
00:35:58.620 It's typically a straightforward number within the test analytics data. It’s something like a percentage of reliability.
00:36:14.550 If a team is reluctant to introduce SLOs and SLIs, could the first SLO match the current system to identify and prevent regressions?
00:36:47.040 That's a great strategy! Starting with a baseline helps provide that necessary information. It can be a good starting point to discuss whether you want to raise or lower future metrics.
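Deriving a first SLO from current behaviour, as suggested in this answer, might be sketched like this. The approach assumed here is a nearest-rank percentile over recent build durations, so the first objective is one the system already meets most of the time; the function name `baseline_threshold` and the sample durations are made up for illustration.

```python
import math
from typing import Sequence


def baseline_threshold(durations: Sequence[float], target: float = 0.95) -> float:
    """Nearest-rank percentile: the duration that `target` of builds stay at or under."""
    if not durations:
        raise ValueError("need at least one measurement to set a baseline")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be a fraction in (0, 1]")
    ordered = sorted(durations)
    rank = math.ceil(target * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Illustrative build durations in seconds.
recent_durations = [400, 100, 300, 200]
median_baseline = baseline_threshold(recent_durations, target=0.5)
worst_case = baseline_threshold(recent_durations, target=1.0)
```

With a baseline like this in place, any later regression shows up as budget being spent, and the team can decide whether to tighten or relax the threshold.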
00:37:25.080 Is it better to increase SLOs after continued success or focus on consistency?
00:37:42.120 I would remain pragmatic here. If you’ve agreed with all stakeholders, it's okay to keep things as they are for now. Focus on new challenges.
00:38:04.380 Can you share your slides online so we can access them?
00:38:20.160 Sure! I would love to share those online.