Incident Management

Elasticsearch 5 or Bust

Molly Struve • November 18, 2019 • Nashville, TN

In her talk 'Elasticsearch 5 or Bust' at RubyConf 2019, Molly Struve recounts the daunting experience her team faced during the Elasticsearch upgrade at Kenna Security in 2017. She emphasizes the importance of preparation and teamwork in navigating software upgrades. The narrative serves as a cautionary tale, illustrating the consequences of assuming smooth transitions based on past experiences. Struve provides a detailed account of their upgrade process, which involved a critical outage that lasted almost a week. During this time, the team encountered severe performance issues, crashes, and the daunting realization that rolling back the upgrade would be challenging.

Struve highlights several key lessons learned from the ordeal:
- Have a Rollback Plan: Preparation is essential; understanding how to revert upgrades can save valuable time and resources should problems arise.
- Perform Thorough Performance Testing: Assumptions about software stability can lead to dire consequences; comprehensive performance testing is necessary.
- Don’t Ignore Small Warning Signs: Early indicators of trouble should never be overlooked. Each warning should be investigated thoroughly.

In addition to technical lessons, she discusses the importance of leveraging community support, emphasizing that outreach for help can drastically cut down the time spent troubleshooting. The role of leadership in these high-stress scenarios was underscored as crucial, with Struve crediting their VP of Engineering for his unwavering support during the crisis. She reflects on the camaraderie developed within her engineering team, highlighting that strong character and team dynamics matter significantly during crises. Finally, she points to embracing mistakes as a way forward, advocating for companies to learn from outages rather than hide from them. Struve hopes her experiences will help others avoid similar pitfalls, framing the talk as a guide to making future upgrades smoother and more manageable.


RubyConf 2019 - Elasticsearch 5 or Bust by Molly Struve

Breaking stuff is part of being a developer, but that never makes it any easier when it happens to you. The Elasticsearch outage of 2017 was the biggest outage our company has ever experienced. We drifted between full-blown downtime and degraded service for almost a week. However, it taught us a lot about how we can better prepare and handle upgrades in the future. It also bonded our team together and highlighted the important role teamwork and leadership plays in high-stress situations. The lessons learned are ones that we will not soon forget. In this talk, I will share those lessons and our story in hopes that others can learn from our experiences and be better prepared when they execute their next big upgrade.


00:00:12.389 Excellent! Hi everyone, my name is Molly Struve.
00:00:14.170 Before I get started, I want to point out that my Twitter handle is in the lower right-hand corner of all the slides: @Molly_Struve. I have already tweeted out this slide deck, so if you'd like to follow along with the presentation, head over to Twitter and click on the SlideShare link.
00:00:20.920 Welcome to 'Elasticsearch 5 or Bust.' Currently, I am the Lead Site Reliability Engineer for a blogging platform called DEV.
00:00:22.690 However, the story I'm about to tell you today is from my time at my previous employer, Kenna Security.
00:00:26.350 This story is a cautionary tale of what happens when you naively assume that your next upgrade will go just as smoothly as your previous one.
00:00:30.690 TLDR: It's a bad idea to assume that. I could end the talk right there, but what fun would that be? Since I know most of you came here for the gory details, let's dive in.
00:00:34.460 Before I delve into the juicy part of the story, I want to provide some background on Elasticsearch and Kenna Security, so you can better understand the gravity of the situation.
00:00:39.310 First, let's talk about Elasticsearch and the role it plays at Kenna Security. Elasticsearch is a data store that allows you to search over huge amounts of data—think millions of data points in seconds.
00:00:44.410 So, how does Kenna use Elasticsearch? Kenna is a cybersecurity company that helps Fortune 500 companies manage their cybersecurity risk. One of the defining features of Kenna is that we allow our clients to search through all their cybersecurity data in seconds, thanks to Elasticsearch.
00:00:51.009 Elasticsearch is really the cornerstone of Kenna's platform and is what sets us apart from our competitors.
00:00:55.160 This fact is crucial to keep in mind as I narrate this story so you can grasp just how significant the events I'm about to describe were for Kenna.
00:01:02.410 Now that you understand how important Elasticsearch is for Kenna, I want to quickly cover some Elasticsearch terminology so that you can follow along.
00:01:09.520 First, there may be times when I refer to Elasticsearch as 'ES.' Anytime I say ES, I'm simply using shorthand for Elasticsearch. You will also hear me mention the term 'node.' A node is a server that is running Elasticsearch.
00:01:17.620 Elasticsearch has several different types of nodes, but for the purpose of this presentation, all you need to know is that a node is a server running Elasticsearch. These nodes are then grouped together to form what is called a 'cluster.'
00:01:25.190 A cluster is a group of nodes that all work together to serve Elasticsearch requests. So now that you know about nodes and clusters, the last piece that needs a little explanation is the actual version upgrade itself.
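As a concrete illustration of nodes and clusters: Elasticsearch's cluster health API (`GET /_cluster/health`) reports how many nodes a cluster has and whether all shards are allocated. The sketch below parses a response of that shape in Ruby; the numbers are invented for illustration and are not Kenna's actual cluster:

```ruby
require 'json'

# Hypothetical response body in the shape returned by Elasticsearch's
# cluster health API (GET /_cluster/health); values are made up.
raw = <<~JSON
  {
    "cluster_name": "example-cluster",
    "status": "green",
    "number_of_nodes": 21,
    "number_of_data_nodes": 18,
    "active_primary_shards": 120,
    "active_shards": 240
  }
JSON

health = JSON.parse(raw)

# "green" means every shard is allocated; "yellow" and "red" signal
# unassigned replicas or unassigned primaries respectively.
puts "#{health['number_of_nodes']} nodes, status: #{health['status']}"
```

Watching this endpoint is often the first step in judging whether a cluster of nodes is healthy after a restart or upgrade.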
00:01:32.410 During this story, you will hear me talk about upgrading from Elasticsearch 2 to Elasticsearch 5. Now, despite the fact that the version numbers jump by a value of three, Elasticsearch version 5 is actually only a single version ahead of 2.
00:01:40.169 At that time, Elastic decided to align Elasticsearch's version number with the rest of the products in the Elastic Stack, such as Kibana and Logstash, which is why there is such a jump in the numbering system.
00:01:48.169 Now you know the lingo, and you've got some background on Elasticsearch and Kenna. Let's get to the good stuff—the story.
00:01:56.769 The year was 2017. It was late March, and the time had come to upgrade our massive 21-node Elasticsearch cluster. Preparation had been underway for weeks, getting the codebase ready for the upgrade, and everyone was super excited.
00:02:12.479 The reason we were all so excited was that when we upgraded from version one, we saw huge performance gains. So we couldn't wait to see what this upgrade had in store for us.
00:02:20.889 We chose to run the upgrade on the evening of March 23rd, which was a Thursday. We would often do upgrades on Thursdays so that we would have a workday Friday to sort out any issues without anyone having to work into the weekend.
00:02:33.290 What did this upgrade involve? This upgrade involved a few different steps: first, we had to shut down the cluster; then, we had to upgrade Elasticsearch on all the nodes. After that, it was my job to deploy all of our Elasticsearch 5 code changes for our application. Last but not least, we had to start Elasticsearch back up again.
00:02:45.160 All these steps went off without a hitch, and as the night wore on, we were feeling pretty good about everything. Once all the nodes were turned back on, we took our application out of maintenance mode and decided to take it for a spin.
00:02:58.150 However, when we did this, we started to see a bunch of CPU and load spikes on our nodes. This was concerning and not a great sign, but the cluster was still doing some internal balancing, so we chalked it up to that and figured it would be sorted out by morning.
00:03:06.630 We decided to call it a night. This brings us to Friday, March 24th. Much to our dismay, we woke up to find that some of our Elasticsearch nodes had crashed overnight.
00:03:10.350 We were unsure why but thought once again it might have been due to that internal balancing. Once again, we dismissed it. We restarted the crashed nodes and continued to monitor the cluster.
00:03:20.840 Everything seemed pretty good until about 9 a.m., when site traffic started to pick up. Once again, we saw more CPU and load spikes, and then, just like overnight, we saw nodes crash.
00:03:29.490 However, this time, before we could restart the nodes, the entire cluster went down. At this point, we couldn't deny it anymore—something was very wrong.
00:03:35.220 We immediately jumped into full-on panic mode to try to figure out why our cluster was suddenly hanging on by a thread. We started combing through logs and reading stack traces, trying to find any hint as to why this was happening.
00:03:45.850 We tried googling things like 'Elasticsearch 5 upgrade followed by cluster crash,' 'Elasticsearch 5 cluster stability,' and 'Why does Elasticsearch 5 suck so much?'—well, maybe not that last one, but seriously, we googled everything we could think of.
00:03:50.850 No matter what we tried, we couldn't find anything that explained what we were seeing. We came up with a few theories on our own and tried some fixes, but none of them worked.
00:04:01.270 By Friday afternoon, we decided to reach out for help and create a post on the Elastic Discuss forum. The Elastic Discuss forum is a website I had used in the past to find answers when I had issues with Elasticsearch. I thought it might be a good place to ask a question and see if anyone else had any ideas.
00:04:10.356 After about an hour of gathering data, I put together our post: 'ES 2 to 5 upgrade followed by major cluster instability.' We had no idea if anyone would answer, as most of you know that when you ask a question like this on a forum or Stack Overflow, you're either going to get an answer or you're going to get crickets.
00:04:23.270 Lucky for us, it was the former. Much to our surprise, someone did answer—not just anyone, but one of the senior engineers at Elastic, Jason Tedor. Not going to lie, we were a bit starstruck when we found out one of the core developers of Elasticsearch was working on our case.
00:04:36.179 The discussion in the post turned into a private email between us and Jason, with us sharing everything we could to try to help him figure out what was going on. This back-and-forth lasted all weekend and into the following week.
00:04:48.220 I'm not going to sugarcoat it: during this time, our team was in a special level of hell. We were working 15-plus hour days trying to figure out what was causing all these issues while simultaneously doing everything we could to keep the application afloat.
00:04:55.130 However, no matter what we did, that ship just kept sinking. Every time ES went down, we had to boot it back up again. As I mentioned earlier, Elasticsearch is the cornerstone of Kenna's platform, so as you might imagine, customers and management were not happy during this time.
00:05:03.950 Our VP of Engineering was constantly receiving phone calls and messages asking for updates, and with no solution in sight, we started discussing the 'R-word.' Was it time to roll back? Unfortunately, we had no plan for this, but we figured it couldn't be that hard, right? Oh, how wrong we were.
00:05:13.550 We soon learned that once we had upgraded, we couldn't actually roll back. Surprise!—though not quite the surprise we were hoping for. In order to return to Elasticsearch 2, we would have to stand up an entirely new cluster with Elasticsearch 2 and then copy all of our data from the Elasticsearch 5 cluster into it.
00:05:20.180 We calculated that moving all of our data would take us five days to accomplish, which was not good news, considering we had already been down for almost a week. However, with no other options, we stood up the new cluster with Elasticsearch 2 and began the slog of copying all of our data into it.
00:05:30.860 Then Wednesday, March 29th rolled around. By this point, the team was exhausted, but we were still working hard to get Kenna back on its feet. That's when it happened.
00:05:37.290 We finally received the news we had been waiting for. Jason, the Elasticsearch engineer assisting us, sent a message saying he had found a bug. Hallelujah!
00:05:45.878 It turned out that the bug he found was in the Elasticsearch source code, and he had discovered it thanks to the information we had provided to him. He issued a patch for Elasticsearch and gave us a workaround to use until that patch was merged and officially released.
00:05:54.310 When I implemented the workaround, it was like night and day. The cluster immediately stabilized, which you can see in these two graphs. The top graph shows our JVM heap levels; the calm, healthy pattern is what you see to the right of the line marking the deploy. The bottom graph shows our garbage collection times.
00:06:02.370 Garbage collection should not take a lot of time, and as you can see, after the workaround was deployed, it returned to normal levels where it should be. At this point, our team was ecstatic. I think everyone cried. The battle we had been fighting for nearly a week was finally over.
00:06:16.310 However, even though the incident was over, the learning had just begun. While this makes for a great story to tell over drinks, as you can imagine, our team learned several critical lessons from that upgrade.
00:06:22.520 That’s the real reason why I’m here today—to share those lessons learned with you so you don’t have to go through the nightmare my team and I experienced during your next big software upgrade.
00:06:29.530 So without further ado, let's get right to it. First lesson learned: have a rollback plan. When doing any sort of upgrade, you must know what rolling back entails in the event of a problem.
00:06:36.170 Can you roll back the software in line? If you can’t roll it back in line, how would you handle rolling back to the original version? How long and hard will a rollback be? If it’s going to take a long time, that is something you want to prepare for ahead of time.
00:06:44.770 Basically, you need to worst-case scenario the outcome of that upgrade so you're prepared for anything. That way, if and when something comes up, you have a plan.
00:06:54.470 It can be really easy for us as software engineers to focus solely on our code. The code for the Elasticsearch 5 upgrade could have been rolled back with a simple revert PR, but we never considered that the Elasticsearch software itself could be rolled back, and that was a huge mistake.
00:07:02.990 So the first lesson learned is to have a rollback plan. The second lesson we learned was to perform thorough performance testing. We approached this upgrade with some very wrong assumptions, one of which was that since the last upgrade was great, this one would be too.
00:07:12.109 Software only gets better, right? Yeah, I hear some of you laughing out there because you know as well as I do that that's a load of crap. But because of that blind assumption, we never conducted any heavy performance testing.
00:07:20.779 We validated that all the code worked with the new version of Elasticsearch and that was it. Always test new software. I don't care what type of software you're upgrading to; I don't care how stable and widely used that software is. You have to performance test it.
00:07:29.799 While the software as a whole might look good and stable to 99% of its users, you never know if your use case might be the small piece that has a bug or might be the small piece that is unoptimized.
00:07:39.340 Many companies had been running Elasticsearch 5 when we upgraded to it, but none of them encountered the particular bug we did. Since the outage, we've implemented a testing strategy that allows us to send requests to multiple clusters to test their performance.
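Struve doesn't detail how that multi-cluster testing strategy was implemented, but the core idea can be sketched: mirror the same request to each cluster and compare how long each takes. In this illustrative Ruby stand-in, the "clients" are plain lambdas rather than real Elasticsearch connections, so the timings are not meaningful benchmarks:

```ruby
require 'benchmark'

# Stand-ins for per-cluster search clients; in a real setup each lambda
# would issue the query against a different Elasticsearch cluster.
clusters = {
  'es2-cluster' => ->(query) { query.upcase }, # placeholder for a search call
  'es5-cluster' => ->(query) { query.upcase },
}

# Send the same query to every cluster and record how long each takes,
# so a new version's performance can be compared before cutting over.
def mirror_query(clusters, query)
  clusters.map do |name, client|
    elapsed = Benchmark.realtime { client.call(query) }
    [name, elapsed]
  end.to_h
end

timings = mirror_query(clusters, 'ip:10.0.0.0/8')
timings.each { |name, secs| puts format('%s: %.6fs', name, secs) }
```

Running representative production queries through a harness like this against the old and new versions side by side is one way to catch performance regressions before an upgrade goes live.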
00:07:47.990 Now, whenever we're making big cluster changes, we use that strategy to avoid any surprises during our next upgrade. The next lesson we learned from this whole debacle was to not ignore small warning signs.
00:07:53.530 This is something I didn’t cover in the initial story because it actually happened well before the upgrade took place. When we were initially working on the code changes for this upgrade, we did all the testing locally.
00:08:01.670 During that testing, we crashed Elasticsearch in our local environments multiple times. Once again, we figured there was no way it could be the software; it had to be our issue with our configuration or settings.
00:08:09.670 So we tweaked our settings and configuration to stabilize it and then went about our way making additional code changes.
00:08:16.639 In hindsight, ignoring that small warning sign was probably our biggest mistake. Yet again, our trust in the software and our past experiences biased us so much that we never considered the problem could lie with Elasticsearch.
00:08:24.000 If you get small warning signs, investigate further. Don’t just get it into a working state and assume it will be fine because it probably won’t be. Make sure you understand what is happening before you dismiss a warning sign as unimportant.
00:08:32.490 So, this gives us three lessons learned. Unfortunately, all of these lessons were learned the hard way by not following them during this upgrade.
00:08:40.769 Now I want to shift gears because the next three lessons are things that actually went incredibly well during the upgrade.
00:08:43.940 When I look back on the whole experience, it's apparent that if these next three lessons had not been followed, the fallout from the upgrade could have been much worse.
00:08:53.310 I'm sure some of you are thinking, 'Wait, worse? You were down for nearly a week! How does it get worse than that?' Oh, you'll see how it could have been.
00:09:02.800 So number four on my list of lessons learned is to use the community. Never discount the help that can be found within the community, whether it's online or simply calling up a former coworker.
00:09:06.730 It can be hard and scary to ask for help sometimes. No one wants to be the one to ask that question that seems like a one-line answer.
00:09:12.760 However, don’t let that stop you. Wouldn't you rather have that one-line answer than to chase your tail for hours or days? This is one area where I believe our team excelled.
00:09:22.747 Immediately after the upgrade, we reached out to the community for help by posting on Elastic's Discuss forum. It probably took me an hour to write that post, but it was the most valuable hour of the entire upgrade because that is what got us our answer.
00:09:31.540 Many will reach out to the community as a last resort. I’m here to tell you: don’t wait until you’re at the end of your rope. Don’t wait until you’ve suffered through days of downtime before asking for help.
00:09:40.740 Ask when you need it because you might save yourself and your team a whole lot of struggle and frustration in the process. I’m very proud that we turned to the community so quickly, and it definitely paid off.
00:09:51.200 Sure, it might not pay off every time, but it’s so simple to ask—why not do it? Worst case scenario, it allows you to organize your thoughts around your debugging efforts.
00:09:55.610 So, before your back is against the wall, remember to take advantage of the community. The fifth lesson learned is one we successfully took advantage of during this upgrade, and that is that leadership and management support are crucial during any high-risk software upgrade.
00:10:01.860 The leadership and management team backing your engineers is extremely important. When you look at this outage story as a whole, it's easy to focus on the engineers and everything they did.
00:10:08.680 One of the key reasons we engineers were able to do our jobs was because of our VP of Engineering. It wasn’t just the engineers in the trenches on this one; our VP of Engineering was right there with us the entire time.
00:10:15.130 He was up with us late into the evenings and online at the crack of dawn every single day during the incident. He was not only there to offer technical help, but more importantly, he was our cheerleader.
00:10:23.800 There were many times when our team just wanted to throw in the towel and give up, but he kept encouraging us and pushing us because he believed we would figure it out.
00:10:27.580 In addition to being our cheerleader, he also acted as our defender. He fielded all calls and messages from upper management, which allowed us to focus on fixing things.
00:10:34.170 We never had to worry about explaining ourselves because he handled everything. He shielded us from all the additional worry and panic that other stakeholders were experiencing during the outage.
00:10:41.130 Above all else, our VP never wavered in his trust that we would eventually figure it out. He was the epitome of calm, cool, and collected throughout the entire situation, which is what kept pushing us forward.
00:10:49.230 His favorite saying is 'fail forward,' and he demonstrated that belief clearly during this Elasticsearch outage. Today, 'fail forward' is one of the cornerstone values of Kenna's engineering culture, and we use this story as a shining example.
00:10:56.610 If we had a different VP, I am sure things would have turned out very differently for us. So for those of you sitting in this room who are VPs, managers, or C-suite executives, listen up.
00:11:04.190 You may not be the ones pushing the code, but I guarantee the role you play is much larger and more crucial in these scenarios than you likely realize.
00:11:10.930 How you react will set the example for the rest of the team. Be their cheerleader, be their defender, and be whatever they need you to be—but above all else, trust that they can do it.
00:11:17.190 That trust goes a long way towards helping the team believe in themselves, which is crucial for keeping morale up, especially during a long-running incident such as this one.
00:11:25.310 Lastly, I want to highlight the incredible team of engineers I worked with throughout this upgrade. It goes without saying that during any significant software upgrade, your team matters.
00:11:32.250 Being a developer or engineer is not just about working with computers; you also have to work with people. This outage drove that point home for me.
00:11:40.670 There was a team of three of us who were working 15-plus hour days trying to get everything fixed, and it was brutal. We went through every emotion in the book—from sad to angry to despondent.
00:11:48.130 Rather than these emotions breaking us down, they bonded us together. Everyone helped one another out—we all supported each other when needed.
00:11:55.760 This was a significant 'aha' moment for me because it made me realize that character is everything in a time of crisis. You can teach people tech, show them how to code, and instill good architectural principles, but you can’t teach character.
00:12:04.740 When you are hiring, look at the people you're interviewing. Get to know them and assess whether they are someone you would want by your side in a crisis.
00:12:15.260 Will they jump in when you need them to, no questions asked? If the answer to that is yes, hire them because that is not something you can teach.
00:12:22.320 And with that, our list of lessons learned is complete. On the technical side, have a rollback plan; do performance testing; and don't ignore the small warning signs.
00:12:28.270 On the non-technical side of the equation, use the community; remember that leadership and management support are crucial; and finally, know that your team matters.
00:12:34.370 If I’m being honest, while the technical side is important, I firmly believe that these last three points are the most critical. The reason I say this is because you can have a rollback plan, you can do performance testing, and you can note and investigate every little warning sign that pops up.
00:12:43.900 But, in the end, there will still be times when things go wrong. It’s inevitable in our line of work; things are going to break. But if, when things go wrong, you leverage the community and have the right team and leadership in place, you will be fine.
00:12:56.920 You can survive any outage or high-stress scenario that comes your way if you remember to put into practice these three non-technical lessons.
00:13:06.030 The Elasticsearch outage of 2017 is infamous at Kenna Security these days, but not in the way that you would think. As brutal as it was for the team and as bad for business as it was at the time, it helped us build the foundation for the kind of engineering culture we have today.
00:13:13.370 It gave everyone a story to point to and say, 'This is us. What happened there? That’s who we are.' With that in mind, I think there’s a bonus lesson here: embrace your mistakes.
00:13:24.030 When I say mistakes here, I mean you personally, and I also mean your team and your company. This outage was caused by a team mistake, and we own that—we embrace that.
00:13:31.780 Often in our industry, outages and downtimes are taboo. When they happen, someone gets blamed—maybe even fired. We write a postmortem about it and then quickly shove it under the rug and hope everyone forgets about it.
00:13:39.170 That’s not how it should work. I firmly believe that embracing outages and mistakes is the only way we can really learn from them. Sharing these stories and being open about them benefits everyone.
00:13:50.350 I’m sincerely grateful that RubyConf chose to have the hindsight track because learning from these mistakes and outages is going to make us all better engineers.
00:14:00.850 The next time you have a big upgrade looming in your future, remember these six lessons so that you can prepare your code and your team to make the upgrade the best experience possible.
00:14:09.230 I sincerely hope that what we learned at Kenna will prevent others from experiencing an outage like we did. But in the event that an outage does occur, whether it’s during an upgrade or not, embrace it, learn from it, and share it with others. Thank you all for your time and attention.
00:26:35.890 Great question! So, the question was how long Elasticsearch v5 had been out when we ran into that problem. I believe at the time, it had been just under a year.
00:26:40.560 We ended up upgrading to version 5.2 once the patch was pushed and released. We then upgraded to version 5.4 to fix everything.
00:26:48.500 In hindsight, it was probably a little early, so this next upgrade that we're going to undertake will wait a little longer for ES6 to bake a bit.
00:26:57.760 What was the bug? So, for those of you who use Elasticsearch, there are two kinds of queries you can execute: scoring queries, and non-scoring queries that simply filter documents without ranking them.
00:27:05.490 This bug took all of our non-scoring queries and turned them into scoring queries, meaning that Elasticsearch had to do all this extra work that it never had to do in the past, which is why it kept failing.
00:27:20.150 How did no one notice this before? Well, the actual bug was in filter aggregations, a feature you can use to group documents in Elasticsearch.
00:27:31.520 Not many people filter their aggregations, or they don't filter a lot of them, so no one else had run into it until we did. Jason found the bug mainly because we gave him a heap dump; he was able to look through it and realize that queries were being scored when they weren't supposed to be.
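For readers unfamiliar with filter context, here is a sketch of the kind of request shape involved: a bool `filter` clause plus a filter aggregation, both of which are supposed to skip relevance scoring entirely. The field names are hypothetical and this is not Kenna's actual query; it only illustrates the query DSL constructs the bug affected:

```ruby
require 'json'

# Illustrative Elasticsearch search body. Clauses inside bool "filter" and
# the "filter" aggregation answer yes/no only, so no _score should be
# computed for them; the 5.x bug described in the talk caused these
# non-scoring queries to be scored anyway, burning CPU and heap.
search_body = {
  query: {
    bool: {
      # filter context: match/no-match only, no relevance scoring
      filter: [{ term: { status: 'open' } }]
    }
  },
  aggs: {
    high_severity: {
      # a filter aggregation buckets only the documents matching its filter
      filter: { range: { severity: { gte: 7 } } },
      aggs: { by_asset: { terms: { field: 'asset_id' } } }
    }
  }
}

puts JSON.pretty_generate(search_body)
```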
00:27:42.910 Okay! I want to talk to anyone who has questions after the session. Thank you.