00:00:04.880
Here we are! I hope everyone enjoyed lunch. Oh, whoa! Let's get this party started.
00:00:20.580
Oh no, the sound didn't work! There was a sound that I really wanted to play.
00:00:33.660
The party hasn't started if we haven't heard the Mac sound from the 2000s.
00:00:41.879
So, my name's Mel. I'm a Lead Developer Advocate at Buildkite, and I come from Melbourne, Australia. Buildkite is a CI/CD platform with a hybrid model, meaning you can run your build agents on your own infrastructure while we host and maintain the APIs and a UI for you to view all your pipeline and build-related information.
00:00:53.700
Now, it's not a competition, but I think I probably flew the furthest to get here today. I'm willing to have a discussion with anyone who thinks they flew further, but my flight was 25 hours.
00:01:11.700
I know someone who caught a ferry from Germany that took 30 hours, so they win in the duration stakes but not in the distance stakes. I have my flashlight on my phone, and that's really amateurish.
00:01:35.700
I love Ruby and the Ruby community. Just a bit of background: I wrote my first line of Ruby or any code at a Rails Girls event in Melbourne in 2014. Shout out to Rails Girls! From then on, I attended my first Ruby conference, looked around, and met a whole heap of people. I felt like I had found my people.
00:01:58.680
For the first time ever, the Ruby community welcomed me, and I felt a part of it. Since then, I organized three Ruby conferences in Australia and emceed four. I can't believe I'm at Euruko. This is my first international trip since 2020, so it's a big moment for me.
00:02:17.340
I’m going to take a selfie to prove to my workplace that I’m not just here on holiday; I'm actually working. Thank you for indulging me.
00:02:30.140
Today, we're going to look at CI/CD through an SRE lens. I will cover some of the issues we might face in CI/CD and explore some SRE principles, discussing how we might apply them to CI/CD in a practical sense.
00:03:00.480
Today's drama is a beautiful venue to unfold this story. It's a drama that will happen in three parts: starting from the rabbit hole, we will look to a rosy future together. If there's no drama, we aren't shipping software. Would you agree? There's always drama in what we do.
00:03:14.580
Let's set this in motion. I don't have a dramatic score to play, but you can imagine whatever music you like—something dramatic, like Game of Thrones or Lord of the Rings.
00:03:38.940
We have a monolithic codebase, a gargantuan test suite, and for some additional drama, we have microservices as far as the eye can see. This is the software landscape we operate in, in 2022. It’s complex and very difficult to navigate. But I’m probably like you—we all love a good challenge. That's why we work in tech. We love the highs and the lows. It's not about the destination; it's about the friends we make along the way.
00:04:15.299
Today is a big day for me. My adventure is coming to a conclusion—I am about to ship something I've been working on for a couple of months. I'm pretty excited; it's been tested, verified, and we are ready to go. The branch is green! This is a huge moment, and I've always wanted to say this—make it so!
00:04:45.780
My build’s kicked off, and I’ve got time for a coffee. I’ll enjoy some freshly brewed vibes with some freshly shipped vibes. My day was looking good, but now it’s no longer looking good because my build failed.
00:05:07.500
Sifting through the logs, I can see that a test has failed—this is weird because everything on my branch passed. The test that failed isn't even related to the code I just shipped. So, I’m pretty blocked, heading down a rabbit hole that I just really don’t like going down.
00:05:31.139
I don’t know how you all feel about CI/CD, but I love CI/CD. I work for a company that builds CI/CD tools, so clearly I love it. But as a software developer, I’m pretty partial to not knowing or wanting to know that CI/CD exists. I want to push changes and know that my code is tested and safe, and that it will be live within 10 minutes in production, even on a Friday. I don’t think that’s too much to ask, do you?
00:06:05.460
Apparently, it is. We end up not trusting CI/CD the way we should be able to, and we battle the system, adapting our workflows around processes that don't live up to what they're supposed to deliver.
00:06:40.220
As the label says, continuous integration and deployment is the automation of building and testing code. CI/CD allows teams to ship code easily and frequently with a high level of trust that users won't be impacted by bugs.
00:07:07.259
At Buildkite, we have a retry failed step button, so I don’t have to kick off my entire build again. But I’m mashing the button, and the same test keeps failing. Flaky tests are a thing, and we all know this.
00:07:19.979
I know they’re a thing because when I searched Google for ‘flaky test memes,’ there were literally thousands. So, when we encounter a flaky test as software engineers, what do we do? We go and make a meme! This gave me lots of material to choose from for my talk, so thank you.
00:08:00.460
I remember when I first started working as a software engineer, I pushed a tiny change—baby steps. The build was red. I turned to the senior engineer next to me and said, 'For some reason, this test failed, and it doesn't seem related to what I pushed.' It turned out it was a smoke test that just never worked.
00:08:32.279
From that moment on, I always asked, 'Surely that's flaky?’ Every time there was a red failure, I’d think it was the tests, nothing to do with my work. But we need to be able to trust our tests. As software engineers, we want CI/CD to provide us with fast, reliable feedback about the software being delivered.
00:09:04.140
When it doesn’t, we have issues. And it's not just us—the end users of our systems also lose trust that what we're deploying is safe.
00:09:35.139
Last year, I did some digging into the impact of flaky tests on our users. I like going down rabbit holes when I choose to. I asked someone to run me a query to find out how long users spent hitting the retry button. I was staggered by the numbers—the very real monetary cost that retrying tests had on Buildkite users.
00:10:10.760
Over a one-month period, Buildkite users spent a cumulative 9,413 days hitting retry, which is a huge, huge number. If you think about that across multiple CI systems, that’s a lot of cost.
00:10:33.480
Think about the environmental cost and the monetary cost of developers sitting there hitting retry. It's a big number. To put it into perspective, you can get to Mars and back 17 times in 9,413 days.
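The arithmetic behind that Mars comparison can be sketched quickly—note the round-trip duration here is my own ballpark assumption, not a figure from the talk:

```ruby
# Rough arithmetic behind the "17 trips to Mars" comparison.
days_spent_retrying  = 9_413  # cumulative days Buildkite users spent hitting retry
mars_round_trip_days = 550    # assumed ballpark for one round trip (illustrative)

trips = days_spent_retrying / mars_round_trip_days  # integer division
puts trips  # => 17
```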
00:11:03.300
Besides the time I've spent hitting retry to see if my build passes or fails, I've lost my flow. What do we do when we lose our flow? We hop onto Slack to get rid of all the little unread notifications, we jump into Twitter, you know, dump a hot take or two, but it takes time to get back into the zone.
00:11:39.600
It’s disheartening to have to ask in Slack, 'Does anyone know if this test is flaky? I swear I’ve seen that fail before.' Sometimes it passes, sometimes it fails. When I’m also battling mega slow builds, I just want to get my changes out.
00:12:02.760
I don’t need to go on; you get the picture. I’m sure you’ve all been where I’ve been. Right now, it’s quite frankly a frustrating place to be.
00:12:30.840
So, how can we minimize the impact of situations that I have just vividly and realistically lived through? Turns out SREs know a thing or two about ensuring that systems are reliable.
00:12:54.060
Just a little bit of a history lesson: the first SRE team was formed at Google in 2003. They wrote a book called Site Reliability Engineering: How Google Runs Production Systems.
00:13:16.040
It's a very good book that lays out some of the principles and practices of SRE, along with the benefits of maintaining services and associated infrastructure. There’s a lizard on the cover, and I actually thought it was a monitor—how can you let that one slip?
00:13:39.660
We can't talk about SRE without talking about DevOps first. Google's SRE book suggests that DevOps is defined as a loose set of practices, guidelines, and a culture designed to break down silos in software engineering, operations, networking, and security.
00:14:03.240
DevOps principles encourage us to remove silos and do away with barriers that exist, either organizationally or between different disciplines in teams. It also encourages us to accept that accidents are a normal part of building and maintaining software. Naturally, CI/CD plays a big role in that.
00:14:37.440
Understanding that tooling and culture are interrelated, tooling and automation are, of course, essential components of DevOps, but DevOps thinking recognizes that organizational culture and human systems are critical to adopting new ways of working.
00:15:06.720
Finally, Google's SRE book states that DevOps treats measurement as crucial for success. Measurement is essential for breaking down silos, for managing production incidents, and, crucially, for verifying reality.
00:15:35.760
DevOps is essentially a philosophy—it's a way of thinking about our work, our people, and our organizations as an ecosystem, keeping it healthy and functional. An SRE, on the other hand, is more practical; it focuses on improving operational practices and the reliability of our core systems.
00:16:05.160
There are many examples of SRE principles out there. While there’s no definitive list set in stone, some common characteristics appear in any definition: we work towards automation or eliminating anything repetitive to reduce costs, design systems with a bias toward reducing risks to availability, latency, and efficiency, and we should be able to ask arbitrary questions about our system without needing to know in advance what we’d like to ask.
00:16:39.360
My favorite principle? Avoiding more reliability than what’s strictly necessary. SRE focuses on ensuring reliable services, but 100% reliability is never the goal, as we know it's unattainable.
00:17:01.080
Accepting that mistakes are part of software, SRE ensures systems are only as reliable as necessary. Defining what is necessary is a practice unto itself. Thanks for joining me in this little SRE-shaped rabbit hole! It's been lovely.
00:17:33.600
However, it's time to get out because there are people wanting me to ship features. Mercury isn't in retrograde at the moment, so it’s definitely not that that’s bumming me out.
00:17:58.800
We know we have problems that keep cropping up when we least want them to. Let's look at putting some of these principles into practice. Things feel really bad in CI/CD right now—we know we have a problem with flaky tests, slow builds, and long waits for builds to kick off.
00:18:31.260
It feels bad, but what do we actually know? Remember, both DevOps thinking and SRE principles state that measurement is crucial for success. We need to practice observability and have ready access to data about our systems.
00:19:01.320
We need good metrics as an objective foundation for conversations with stakeholders. They will agree on these metrics because we will have hard facts about the state of our system, not just some anecdotes. However, how does this work in practice?
00:19:35.700
Let’s take a look: it's impossible to do your job well if you haven’t defined what ‘well’ means. How do you improve things if you don’t know how bad they are to start?
00:20:03.960
Site reliability engineering uses SLOs, SLIs, and error budgets to define what's important, how reliable things should be, and how to measure these things. When these are clearly understood and defined, it’s almost magical how well it allows teams to focus their energy on what matters.
00:20:30.420
So, what's an SLI? It’s a key metric indicating whether an SLO is being met. An SLO is a promise related to a metric that matters to users or maintainers of a system. SLOs define a level of unreliability that’s acceptable, and the error budget reflects how much and for how long a service can fail to meet the SLO before there are consequences.
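To make the relationship between the three concepts concrete, here's a minimal sketch—every number in it is illustrative, not from any real system:

```ruby
# SLI: the measured metric; SLO: the target; error budget: the allowed misses.

# SLI: fraction of builds that started within the target window
builds_total      = 1000
builds_within_slo = 993
sli = builds_within_slo.to_f / builds_total  # measured: 0.993

# SLO: we promise 99% of builds start within 30 seconds
slo = 0.99

# Error budget: the unreliability we're allowed to spend (1 - SLO)
error_budget        = 1.0 - slo                             # 1% of builds
allowed_slow_builds = (builds_total * error_budget).floor   # 10 builds
actual_slow_builds  = builds_total - builds_within_slo      # 7 builds

budget_remaining = allowed_slow_builds - actual_slow_builds # 3 builds left
puts "SLO met: #{sli >= slo}, budget remaining: #{budget_remaining} builds"
```

As long as `budget_remaining` stays positive, the team is free to keep shipping; once it hits zero, the budget drives a conversation about consequences.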
00:21:04.900
How do we get everyone to agree on what SLOs, SLIs, and error budgets to set? If you’re feeling pain right now, others are probably feeling the same way. It's the perfect opportunity to come together and start working to improve things.
00:21:39.480
Let's stop being reactive and become proactive. We'll assign metrics and strive to uphold them—only when needed! It's going to be amazing.
00:22:12.600
Time for some questions: What systems are we discussing? Are we talking about CI/CD or limiting the scope to an application test suite? What’s important to the different stakeholders? It's crucial to understand everyone's expectations of the system.
00:22:56.760
Once you have a shared understanding, it’s time to agree on SLOs, SLIs, and reasonable error budgets. You might want to stop developers waiting around for builds to kick off, so a reasonable SLO could be starting builds within 30 seconds.
00:23:36.840
The SLI might be the time spent waiting for a build to start, with an error budget of 33 builds taking more than 30 seconds in a four-week period. Google's SRE book suggests a method to calculate your error budget: one minus your SLO percentage.
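Plugging illustrative numbers into that formula shows where a 33-build budget could come from—the 99% SLO and the ~3,300-builds-per-window traffic figure are my assumptions, not from the talk:

```ruby
# Error budget per the SRE-book formula: budget = 1 - SLO.
slo_percentage = 0.99  # assume 99% of builds must start within 30 seconds
error_budget   = 1 - slo_percentage

# Assumed traffic: ~3,300 builds in a four-week window (illustrative)
builds_per_window = 3_300
allowed_breaches  = (builds_per_window * error_budget).round

puts allowed_breaches  # => 33
```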
00:24:13.230
Error budgets are flexible, and if you want more information, you can Google how to set an error budget to find resources on that.
00:24:41.520
You might commit to development teams that their tests will run and they'll be notified of success or failure within five minutes. For this, your SLI would be the total build runtime, with an error budget of 33 builds taking more than five minutes in a four-week period. For my problem today, a great SLO would be ensuring my test suite's reliability percentage stays above 87%, with an error budget allowing 77 test runs below that threshold.
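The reliability check itself is simple to sketch. The 87% threshold is from the talk; the run history below is fabricated for illustration:

```ruby
# Reliability of a single test across recent runs, checked against the SLO.
# Run history is made up for illustration.
runs = [:pass, :pass, :fail, :pass, :pass, :pass, :pass, :fail, :pass, :pass]

passed      = runs.count(:pass)                        # 8 of 10
reliability = (passed.to_f / runs.size * 100).round(1) # 80.0%

slo_threshold = 87.0  # from the talk: suite reliability should stay above 87%
meets_slo = reliability > slo_threshold

puts "reliability: #{reliability}%, meets SLO: #{meets_slo}"
```

A run like this one would count against the 77-run error budget.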
00:25:34.260
How do you get all these SLI metrics? Good question! For production monitoring systems, there are big players like Datadog and Honeycomb, along with many others that allow you to configure custom metrics. Honeycomb even has a product for managing SLOs along with their observability tooling.
00:26:34.560
For total wait time, your CI/CD will likely allow you to access this metric. At Buildkite, we have a CLI tool to collect our agent metrics, and we also have OpenTelemetry instrumentation built into the agent, enabling you to send traces and metrics around build wait times to Datadog, Honeycomb, and more.
00:27:52.320
Once you have that SLO being measured via the SLI, you can start tuning your agent capacity to improve total build start time. We want developers to have speedy feedback loops; for that, your SLI would again be the total build runtime. We have an API endpoint to pull this data, and our UI also shows a monthly average build time for each pipeline.
00:28:52.080
Once you’re measuring this build runtime metric, you can tune your CI/CD infrastructure as needed—but only after spending your error budget. Imagine you have nothing left in the budget; you’ve had more than 33 builds over five minutes, so you need to get those times back down into acceptable parameters.
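Checking whether that five-minute budget is spent is just a count over the window. The build durations here are fabricated; only the five-minute SLO and 33-build budget come from the talk:

```ruby
# Count builds in the window that blew the five-minute SLO and compare
# against the agreed error budget (33 builds per four-week window).
SLO_SECONDS  = 5 * 60
ERROR_BUDGET = 33

# Fabricated build durations in seconds for the current four-week window
durations = Array.new(500) { |i| i.even? ? 240 : 250 } + Array.new(40, 360)

over_budget_builds = durations.count { |d| d > SLO_SECONDS }
budget_exhausted   = over_budget_builds > ERROR_BUDGET

puts "slow builds: #{over_budget_builds}, budget exhausted: #{budget_exhausted}"
```

With 40 slow builds against a budget of 33, this window is over budget—the signal to stop shipping features and tune infrastructure instead.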
00:29:43.680
Since Buildkite agents are hosted on your own infrastructure, you can leverage the same cloud capabilities, like auto-scaling compute resources, that you would for production infrastructure. Pipeline steps can also be optimized to run in parallel across as many agents as needed to speed up builds.
00:30:11.640
Another way to speed up builds is to assess test performance. We have a new product called Buildkite Test Analytics that provides interesting information about your test suite's health so you can optimize.
00:30:28.320
For example, we sped up a slow test that once took 35 seconds down to under three seconds simply by switching out the Capybara matcher. That’s a huge speed gain over an entire suite.
00:30:56.880
Once you've defined what's important and understand what your stakeholders consider important, you can access those metrics and tune performance as needed. The reliability score is front and center for everyone to see, and we can easily track this for improvement.
00:31:21.840
The SLO for our test suite reliability percentage should be greater than 87%, and once the error budget is exceeded with 77 test runs under that mark, the budget is gone!
00:31:50.280
However, with continued testing and focus on quality, we can ensure long-term improvements. So, thank you for being here today.
00:32:06.120
I appreciate today’s discussion and can’t wait to share more! Also, our virtual booth will have additional resources, webinars, and our Twitter page, where I’d love for you to come say hello. I have some retro stickers and ten Buildkite Kites to give away, so come grab one before I head back to Australia.
00:32:46.860
Thank you for your time today, and enjoy the rest of the conference!
00:33:16.680
Thank you, Mel! We have a few questions. Would you be happy to answer those?
00:33:41.780
Excellent! Do you have any tips or strategies to fix flaky specs?
00:33:59.220
That can be challenging as they're all pretty different. In my experience, they often involve integration tests and are about pixel locations or buttons not being where you expect them.
00:34:29.661
I’d recommend pairing with someone on problem-solving for those, as it can often lead to new insights.
00:34:50.820
How do you differentiate between tests that fail due to code issues versus flaky tests or infrastructure problems?
00:35:15.240
That's a fantastic question! I often went to the systems engineer to blame them, but it was usually my work that needed fixing. Test analytics can help you see reliability scores and determine flakiness.
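One common heuristic for this—my own sketch, not Buildkite's algorithm—is to flag a test as flaky when it both passes and fails against the same commit, since the code under test didn't change between those runs:

```ruby
# Flag tests that produced both pass and fail results on the same commit.
# Result data is fabricated for illustration.
results = [
  { test: "checkout_spec", commit: "abc123", status: :pass },
  { test: "checkout_spec", commit: "abc123", status: :fail },
  { test: "login_spec",    commit: "abc123", status: :fail },
  { test: "login_spec",    commit: "def456", status: :pass },
]

flaky = results.group_by { |r| [r[:test], r[:commit]] }
               .select { |_, runs| runs.map { |r| r[:status] }.uniq.size > 1 }
               .keys.map(&:first).uniq

puts flaky.inspect  # => ["checkout_spec"]
```

Note that `login_spec` is not flagged: it failed on one commit and passed on a different one, which is exactly what a genuine code fix looks like.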
00:35:37.920
You mentioned CI reliability score. How do you define that score?
00:35:58.620
It's a straightforward number surfaced in the test analytics data—essentially a percentage expressing how reliably a test passes.
00:36:14.550
If a team is reluctant to introduce SLOs and SLIs, could the first SLO match the current system to identify and prevent regressions?
00:36:47.040
That's a great strategy! Starting with a baseline helps provide that necessary information. It can be a good starting point to discuss whether you want to raise or lower future metrics.
00:37:25.080
Is it better to increase SLOs after continued success or focus on consistency?
00:37:42.120
I would remain pragmatic here. If you’ve agreed with all stakeholders, it's okay to keep things as they are for now. Focus on new challenges.
00:38:04.380
Can you share your slides online so we can access them?
00:38:20.160
Sure! I would love to share those online.