00:00:12.389
Excellent! Hi everyone, my name is Molly Struve.
00:00:14.170
Before I get started, I want to point out that my Twitter handle is in the lower right-hand corner of all the slides: @Molly_Struve. I have already tweeted out this slide deck, so if you'd like to follow along with the presentation, head over to Twitter and click on the SlideShare link.
00:00:20.920
Welcome to 'Elasticsearch 5 or Bust.' Currently, I am the Lead Site Reliability Engineer for a blogging platform called DEV.
00:00:22.690
However, the story I'm about to tell you today is from my time at my previous employer, Kenna Security.
00:00:26.350
This story is a cautionary tale of what happens when you naively assume that your next upgrade will go just as smoothly as your previous one.
00:00:30.690
TLDR: It's a bad idea to assume that. I could end the talk right there, but what fun would that be? Since I know most of you came here for the gory details, let's dive in.
00:00:34.460
Before I delve into the juicy part of the story, I want to provide some background on Elasticsearch and Kenna Security, so you can better understand the gravity of the situation.
00:00:39.310
First, let's talk about Elasticsearch and the role it plays at Kenna Security. Elasticsearch is a data store that allows you to search over huge amounts of data—think millions of data points in seconds.
00:00:44.410
So, how does Kenna use Elasticsearch? Kenna is a cybersecurity company that helps Fortune 500 companies manage their cybersecurity risk. One of the defining features of Kenna is that we allow our clients to search through all their cybersecurity data in seconds, thanks to Elasticsearch.
00:00:51.009
Elasticsearch is really the cornerstone of Kenna's platform and is what sets us apart from our competitors.
00:00:55.160
This fact is crucial to keep in mind as I narrate this story so you can grasp just how significant the events I'm about to describe were for Kenna.
00:01:02.410
Now that you understand how important Elasticsearch is for Kenna, I want to quickly cover some Elasticsearch terminology so that you can follow along.
00:01:09.520
First, there may be times when I refer to Elasticsearch as 'ES.' Anytime I say ES, I'm simply using shorthand for Elasticsearch. You will also hear me mention the term 'node.' A node is a server that is running Elasticsearch.
00:01:17.620
Elasticsearch has several different types of nodes, but for the purpose of this presentation, all you need to know is that a node is a server running Elasticsearch. These nodes are then grouped together to form what is called a 'cluster.'
00:01:25.190
A cluster is a group of nodes that all work together to serve Elasticsearch requests. So now that you know about nodes and clusters, the last piece that needs a little explanation is the actual version upgrade itself.
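To make nodes and clusters a little more concrete, here is a minimal sketch (not from the talk) of reading the kind of summary Elasticsearch reports through its `GET _cluster/health` API. The response below is an illustrative sample; in a real setup you would fetch it over HTTP from any node.

```ruby
require "json"

# Illustrative response shape from Elasticsearch's GET _cluster/health API.
# In a real setup you'd fetch this from http://<node>:9200/_cluster/health.
health_json = <<~JSON
  {
    "cluster_name": "kenna-es",
    "status": "green",
    "number_of_nodes": 21,
    "number_of_data_nodes": 18
  }
JSON

health = JSON.parse(health_json)

# A cluster is just a named group of nodes serving requests together.
puts "Cluster #{health["cluster_name"]}: #{health["number_of_nodes"]} nodes, status #{health["status"]}"
```

The `number_of_nodes` field is exactly the head count of servers running Elasticsearch that make up the cluster.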
00:01:32.410
During this story, you will hear me talk about upgrading from Elasticsearch 2 to Elasticsearch 5. Now, despite the fact that the version numbers jump by a value of three, Elasticsearch version 5 is actually only a single version ahead of 2.
00:01:40.169
At that time, Elastic decided to update how they numbered their versions to align Elasticsearch with the rest of the Elastic Stack, which is why there is such a jump in the numbering system.
00:01:48.169
Now you know the lingo, and you've got some background on Elasticsearch and Kenna. Let's get to the good stuff—the story.
00:01:56.769
The year was 2017. It was late March, and the time had come to upgrade our massive 21-node Elasticsearch cluster. Preparation had been underway for weeks, getting the codebase ready for the upgrade, and everyone was super excited.
00:02:12.479
The reason we were all so excited was that when we upgraded from version one, we saw huge performance gains. So we couldn't wait to see what this upgrade had in store for us.
00:02:20.889
We chose to run the upgrade on the evening of March 23rd. March 23rd was a Thursday, and we would often do upgrades on Thursdays so that we had a workday Friday to sort out any issues without anyone having to work into the weekend.
00:02:33.290
What did this upgrade involve? This upgrade involved a few different steps: first, we had to shut down the cluster; then, we had to upgrade Elasticsearch on all the nodes. After that, it was my job to deploy all of our Elasticsearch 5 code changes for our application. Last but not least, we had to start Elasticsearch back up again.
00:02:45.160
All these steps went off without a hitch, and as the night wore on, we were feeling pretty good about everything. Once all the nodes were turned back on, we took our application out of maintenance mode and decided to take it for a spin.
00:02:58.150
However, when we did this, we started to see a bunch of CPU and load spikes on our nodes. This was concerning and not a great sign, but the cluster was still doing some internal balancing, so we chalked it up to that and figured it would be sorted out by morning.
00:03:06.630
We decided to call it a night. This brings us to Friday, March 24th. Much to our dismay, we woke up to find that some of our Elasticsearch nodes had crashed overnight.
00:03:10.350
We were unsure why but thought once again it might have been due to that internal balancing. Once again, we dismissed it. We restarted the crashed nodes and continued to monitor the cluster.
00:03:20.840
Everything seemed pretty good until about 9 a.m., when site traffic started to pick up. Once again, we saw more CPU and load spikes, and then, just like overnight, we saw nodes crash.
00:03:29.490
However, this time, before we could restart the nodes, the entire cluster went down. At this point, we couldn't deny it anymore—something was very wrong.
00:03:35.220
We immediately jumped into full-on panic mode to try to figure out why our cluster was suddenly hanging on by a thread. We started combing through logs and reading stack traces, trying to find any hint as to why this was happening.
00:03:45.850
We tried googling things like 'Elasticsearch 5 upgrade followed by cluster crash,' 'Elasticsearch 5 cluster stability,' and 'Why does Elasticsearch 5 suck so much?'—well, maybe not that last one, but seriously, we googled everything we could think of.
00:03:50.850
No matter what we tried, we couldn't find anything that explained what we were seeing. We came up with a few theories on our own and tried some fixes, but none of them worked.
00:04:01.270
By Friday afternoon, we decided to reach out for help and create a post on the Elastic Discuss forum. The Elastic Discuss forum is a website I had used in the past to find answers when I had issues with Elasticsearch. I thought it might be a good place to ask a question and see if anyone else had any ideas.
00:04:10.356
After about an hour of gathering data, I put together our post: 'ES 2 to 5 upgrade followed by major cluster instability.' We had no idea if anyone would answer, as most of you know that when you ask a question like this on a forum or Stack Overflow, you're either going to get an answer or you're going to get crickets.
00:04:23.270
Lucky for us, it was the former. Much to our surprise, someone did answer—not just anyone, but one of the senior engineers at Elastic, Jason Tedor. Not going to lie, we were a bit starstruck when we found out one of the core developers of Elasticsearch was working on our case.
00:04:36.179
The discussion in the post turned into a private email between us and Jason, with us sharing everything we could to try to help him figure out what was going on. This back-and-forth lasted all weekend and into the following week.
00:04:48.220
I'm not going to sugarcoat it: during this time, our team was in a special level of hell. We were working 15-plus hour days trying to figure out what was causing all these issues while simultaneously doing everything we could to keep the application afloat.
00:04:55.130
However, no matter what we did, that ship just kept sinking. Every time ES went down, we had to boot it back up again. As I mentioned earlier, Elasticsearch is the cornerstone of Kenna's platform, so as you might imagine, customers and management were not happy during this time.
00:05:03.950
Our VP of Engineering was constantly receiving phone calls and messages asking for updates, and with no solution in sight, we started discussing the 'R-word.' Was it time to roll back? Unfortunately, we had no plan for this, but we figured it couldn't be that hard, right? Oh, how wrong we were.
00:05:13.550
We soon learned that once we had upgraded, we couldn't actually roll back. Surprise!—though not quite the surprise we were hoping for. In order to return to Elasticsearch 2, we would have to stand up an entirely new cluster with Elasticsearch 2 and then copy all of our data from the Elasticsearch 5 cluster into it.
00:05:20.180
We calculated that moving all of our data would take us five days to accomplish, which was not good news, considering we had already been down for almost a week. However, with no other options, we stood up the new cluster with Elasticsearch 2 and began the slog of copying all of our data into it.
00:05:30.860
Then Wednesday, March 29th, rolled around. By this point, the team was exhausted, but we were still working hard to get Kenna back on its feet. That's when it happened.
00:05:37.290
We finally received the news we had been waiting for. Jason, the Elasticsearch engineer assisting us, sent a message saying he had found a bug. Hallelujah!
00:05:45.878
It turned out that the bug he found was in the Elasticsearch source code, and he had discovered it thanks to the information we had provided to him. He issued a patch for Elasticsearch and gave us a workaround to use until that patch was merged and officially released.
00:05:54.310
When I implemented the workaround, it was like night and day. The cluster immediately stabilized, which you can see on these two graphs. The top graph shows our JVM heap levels; the healthy pattern is the calm, steady one you can see to the right of the line, after the workaround went out. The bottom graph shows our garbage collection times.
00:06:02.370
Garbage collection should not take a lot of time, and as you can see, after the workaround was deployed, it returned to normal levels where it should be. At this point, our team was ecstatic. I think everyone cried. The battle we had been fighting for nearly a week was finally over.
00:06:16.310
However, even though the incident was over, the learning had just begun. While this makes for a great story to tell over drinks, as you can imagine, our team learned several critical lessons from that upgrade.
00:06:22.520
That’s the real reason why I’m here today—to share those lessons learned with you so you don’t have to go through the nightmare my team and I experienced during your next big software upgrade.
00:06:29.530
So without further ado, let's get right to it. First lesson learned: have a rollback plan. When doing any sort of upgrade, you must know what rolling back entails in the event of a problem.
00:06:36.170
Can you roll the software back in place? If you can't, how would you handle returning to the original version? How long and hard will a rollback be? If it's going to take a long time, that is something you want to prepare for ahead of time.
00:06:44.770
Basically, you need to worst-case scenario the outcome of that upgrade so you're prepared for anything. That way, if and when something comes up, you have a plan.
00:06:54.470
It can be really easy for us as software engineers to focus solely on our code. The code for the Elasticsearch 5 upgrade could have been rolled back with a simple revert PR, but we never considered that the Elasticsearch software itself could be rolled back, and that was a huge mistake.
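The "stand up a new cluster and copy everything over" style of rollback we faced can be sketched in Ruby. This is not what we actually ran; it is a hedged illustration of the core loop, with the two clusters represented by stand-in callables (in real life these would be scroll searches against the source and `_bulk` requests against the destination).

```ruby
# Sketch of the "stand up an old-version cluster and copy everything over"
# rollback: stream documents out of the source cluster in batches and
# bulk-index them into the destination. The callables are hypothetical
# stand-ins for real Elasticsearch client calls.
def copy_index(scroll_source, bulk_dest, batch_size: 2)
  copied = 0
  loop do
    batch = scroll_source.call(batch_size)  # next batch from the source cluster
    break if batch.empty?
    bulk_dest.call(batch)                   # bulk-index into the rollback cluster
    copied += batch.size
  end
  copied
end

# Stub "clusters" for illustration: five docs on the source, an array as dest.
docs = (1..5).map { |i| { "id" => i } }
dest = []
source = lambda { |n| docs.shift(n) }
sink   = lambda { |batch| dest.concat(batch) }

puts copy_index(source, sink)  # copies all five documents
```

At our scale, looping like this over every index was what produced the five-day estimate: the copy is bounded by how fast you can read, ship, and re-index every document.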
00:07:02.990
So the first lesson learned is to have a rollback plan. The second lesson we learned was to perform thorough performance testing. We approached this upgrade with some very wrong assumptions, one of which was that since the last upgrade was great, this one would be too.
00:07:12.109
Software only gets better, right? Yeah, I hear some of you laughing out there because you know as well as I do that that's a load of crap. But because of that blind assumption, we never conducted any heavy performance testing.
00:07:20.779
We validated that all the code worked with the new version of Elasticsearch and that was it. Always test new software. I don't care what type of software you're upgrading to; I don't care how stable and widely used that software is. You have to performance test it.
00:07:29.799
While the software as a whole might look good and stable to 99% of its users, you never know if your use case might be the small piece that has a bug or might be the small piece that is unoptimized.
00:07:39.340
Many companies had been running Elasticsearch 5 when we upgraded to it, but none of them encountered the particular bug we did. Since the outage, we've implemented a testing strategy that allows us to send requests to multiple clusters to test their performance.
00:07:47.990
Now, whenever we're making big cluster changes, we use that strategy to avoid any surprises. The next lesson we learned from this whole debacle was to not ignore small warning signs.
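The "send requests to multiple clusters" idea can be sketched as follows. This is an illustration of the pattern, not our actual tooling: each cluster is a callable here, where a real version would issue the same query over HTTP to each cluster and compare the timings.

```ruby
# Sketch of multi-cluster performance testing: run the same query against
# each cluster, time it, and compare. Clusters are stand-in callables here;
# a real version would use HTTP clients pointed at each cluster.
def compare_clusters(query, clusters)
  clusters.map do |name, run_query|
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    run_query.call(query)
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    [name, elapsed]
  end.to_h
end

# Stub clusters for illustration; one is artificially slower.
fast = lambda { |_q| sleep 0.01 }
slow = lambda { |_q| sleep 0.05 }

timings = compare_clusters({ match_all: {} }, { "es2" => fast, "es5" => slow })
timings.each { |name, secs| puts format("%s: %.3fs", name, secs) }
```

Running real production queries through something like this before cutting over is exactly the kind of heavy performance testing we skipped the first time.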
00:07:53.530
This is something I didn’t cover in the initial story because it actually happened well before the upgrade took place. When we were initially working on the code changes for this upgrade, we did all the testing locally.
00:08:01.670
During that testing, we crashed Elasticsearch in our local environments multiple times. Once again, we figured there was no way it could be the software; it had to be an issue with our configuration or settings.
00:08:09.670
So we tweaked our settings and configuration to stabilize it and then went about our way making additional code changes.
00:08:16.639
In hindsight, ignoring that small warning sign was probably the biggest mistake. Yet again, our trust in the software and our past experiences twisted our bias so much that we never considered the problem could lie with Elasticsearch.
00:08:24.000
If you get small warning signs, investigate further. Don’t just get it into a working state and assume it will be fine because it probably won’t be. Make sure you understand what is happening before you dismiss a warning sign as unimportant.
00:08:32.490
So, this gives us three lessons learned. Unfortunately, all of these lessons were learned the hard way by not following them during this upgrade.
00:08:40.769
Now I want to shift gears because the next three lessons are things that actually went incredibly well during the upgrade.
00:08:43.940
When I look back on the whole experience, it's apparent that if these next three lessons had not been followed, the fallout from the upgrade could have been much worse.
00:08:53.310
I'm sure some of you are thinking, 'Wait, worse? You were down for nearly a week! How does it get worse than that?' Oh, you'll see how it could have been worse.
00:09:02.800
So number four on my list of lessons learned is to use the community. Never discount the help that can be found within the community, whether it's online or simply calling up a former coworker.
00:09:06.730
It can be hard and scary to ask for help sometimes. No one wants to be the one to ask the question that seems like it has a one-line answer.
00:09:12.760
However, don’t let that stop you. Wouldn't you rather have that one-line answer than chase your tail for hours or days? This is one area where I believe our team excelled.
00:09:22.747
Immediately after the upgrade, we reached out to the community for help by posting on Elastic's Discuss forum. It probably took me an hour to write that post, but it was the most valuable hour of the entire upgrade because that is what got us our answer.
00:09:31.540
Many will reach out to the community as a last resort. I’m here to tell you: don’t wait until you’re at the end of your rope. Don’t wait until you’ve suffered through days of downtime before asking for help.
00:09:40.740
Ask when you need it because you might save yourself and your team a whole lot of struggle and frustration in the process. I’m very proud that we turned to the community so quickly, and it definitely paid off.
00:09:51.200
Sure, it might not pay off every time, but it’s so simple to ask—why not do it? Worst case scenario, it allows you to organize your thoughts around your debugging efforts.
00:09:55.610
So, before your back is against the wall, remember to take advantage of the community. The fifth lesson learned is one we successfully took advantage of during this upgrade, and that is that leadership and management support are crucial during any high-risk software upgrade.
00:10:01.860
The leadership and management team backing your engineers is extremely important. When you look at this outage story as a whole, it's easy to focus on the engineers and everything they did.
00:10:08.680
One of the key reasons we engineers were able to do our jobs was because of our VP of Engineering. It wasn’t just the engineers in the trenches on this one; our VP of Engineering was right there with us the entire time.
00:10:15.130
He was up with us late into the evenings and online at the crack of dawn every single day during the incident. He was not only there to offer technical help, but more importantly, he was our cheerleader.
00:10:23.800
There were many times when our team just wanted to throw in the towel and give up, but he kept encouraging us and pushing us because he believed we would figure it out.
00:10:27.580
In addition to being our cheerleader, he also acted as our defender. He fielded all calls and messages from upper management, which allowed us to focus on fixing things.
00:10:34.170
We never had to worry about explaining ourselves because he handled everything. He shielded us from all the additional worry and panic that other stakeholders were experiencing during the outage.
00:10:41.130
Above all else, our VP never wavered in his trust that we would eventually figure it out. He was the epitome of calm, cool, and collected throughout the entire situation, which is what kept pushing us forward.
00:10:49.230
His favorite saying is 'fail forward,' and he demonstrated that belief clearly during this Elasticsearch outage. Today, 'fail forward' is one of the cornerstone values of Kenna's engineering culture, and we use this story as a shining example.
00:10:56.610
If we had a different VP, I am sure things would have turned out very differently for us. So for those of you sitting in this room who are VPs, managers, or C-suite executives, listen up.
00:11:04.190
You may not be the ones pushing the code, but I guarantee the role you play is much larger and more crucial in these scenarios than you likely realize.
00:11:10.930
How you react will set the example for the rest of the team. Be their cheerleader, be their defender, and be whatever they need you to be—but above all else, trust that they can do it.
00:11:17.190
That trust goes a long way towards helping the team believe in themselves, which is crucial for keeping morale up, especially during a long-running incident such as this one.
00:11:25.310
Lastly, I want to highlight the incredible team of engineers I worked with throughout this upgrade. It goes without saying that during any significant software upgrade, your team matters.
00:11:32.250
Being a developer or engineer is not just about working with computers; you also have to work with people. This outage drove that point home for me.
00:11:40.670
There was a team of three of us who were working 15-plus hour days trying to get everything fixed, and it was brutal. We went through every emotion in the book—from sad to angry to despondent.
00:11:48.130
Rather than these emotions breaking us down, they bonded us together. Everyone helped one another out—we all supported each other when needed.
00:11:55.760
This was a significant 'aha' moment for me because it made me realize that character is everything in a time of crisis. You can teach people tech, show them how to code, and instill good architectural principles, but you can’t teach character.
00:12:04.740
When you are hiring, look at the people you're interviewing. Get to know them and assess whether they are someone you would want by your side in a crisis.
00:12:15.260
Will they jump in when you need them to, no questions asked? If the answer to that is yes, hire them because that is not something you can teach.
00:12:22.320
And with that, our list of lessons learned is complete. On the technical side, have a rollback plan; do performance testing; and don't ignore the small warning signs.
00:12:28.270
On the non-technical side of the equation, use the community; remember that leadership and management support are crucial; and finally, know that your team matters.
00:12:34.370
If I’m being honest, while the technical side is important, I firmly believe that these last three points are the most critical. The reason I say this is because you can have a rollback plan, you can do performance testing, and you can note and investigate every little warning sign that pops up.
00:12:43.900
But, in the end, there will still be times when things go wrong. It’s inevitable in our line of work; things are going to break. But if, when things go wrong, you leverage the community and have the right team and leadership in place, you will be fine.
00:12:56.920
You can survive any outage or high-stress scenario that comes your way if you remember to put into practice these three non-technical lessons.
00:13:06.030
The Elasticsearch outage of 2017 is infamous at Kenna Security these days, but not in the way that you would think. As brutal as it was for the team and as bad for business as it was at the time, it helped us build the foundation for the kind of engineering culture we have today.
00:13:13.370
It gave everyone a story to point to and say, 'This is us. What happened there? That’s who we are.' With that in mind, I think there’s a bonus lesson here: embrace your mistakes.
00:13:24.030
When I say mistakes here, I mean you personally, and I also mean your team and your company. This outage was caused by a team mistake, and we own that—we embrace that.
00:13:31.780
Often in our industry, outages and downtimes are taboo. When they happen, someone gets blamed—maybe even fired. We write a postmortem about it and then quickly shove it under the rug and hope everyone forgets about it.
00:13:39.170
That’s not how it should work. I firmly believe that embracing outages and mistakes is the only way we can really learn from them. Sharing these stories and being open about them benefits everyone.
00:13:50.350
I’m sincerely grateful that RubyConf chose to have the hindsight track because learning from these mistakes and outages is going to make us all better engineers.
00:14:00.850
The next time you have a big upgrade looming in your future, remember these six lessons so that you can prepare your code and your team to make the upgrade the best experience possible.
00:14:09.230
I sincerely hope that what we learned at Kenna will prevent others from experiencing an outage like we did. But in the event that an outage does occur, whether it’s during an upgrade or not, embrace it, learn from it, and share it with others. Thank you all for your time and attention.
00:26:35.890
Great question! So, the question was how long Elasticsearch v5 had been out when we ran into that problem. I believe at the time, it had been just under a year.
00:26:40.560
We ended up upgrading to version 5.2 once the patch was pushed and released, and we later upgraded to version 5.4, which fixed everything.
00:26:48.500
In hindsight, it was probably a little early, so this next upgrade that we're going to undertake will wait a little longer for ES6 to bake a bit.
00:26:57.760
What was the bug? So, for those of you who use Elasticsearch, there are two types of queries you can execute: scoring queries and non-scoring queries.
00:27:05.490
This bug took all of our non-scoring queries and turned them into scoring queries, meaning that Elasticsearch had to do all this extra work that it never had to do in the past, which is why it kept failing.
00:27:20.150
How did no one notice this before? Well, the actual bug was in filters inside aggregations, which are a special feature you can use to group documents in Elasticsearch.
00:27:31.520
Not many people filter their aggregations, or they don't filter a lot of them, so no one else had run into it until we did. Jason found it mainly because we gave him a heap dump; he was able to look through that and realize that queries were being scored that weren't supposed to be.
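For anyone unfamiliar with the distinction, the two query shapes look roughly like this in the Elasticsearch query DSL, expressed here as Ruby hashes. The `severity` field is a hypothetical example; the point is that clauses under `filter` run in non-scoring filter context, while the same clause under `must` gets scored, which is extra work per matching document.

```ruby
# A non-scoring query: the term clause runs in filter context, so
# Elasticsearch only answers yes/no and can skip relevance scoring.
non_scoring = {
  query: {
    bool: {
      filter: [{ term: { severity: "high" } }]
    }
  }
}

# A scoring query: the same clause under `must` is scored, which costs
# extra work for every matching document.
scoring = {
  query: {
    bool: {
      must: [{ term: { severity: "high" } }]
    }
  }
}
```

The bug effectively treated queries written in the first shape as if they were the second, which is why the cluster was suddenly doing so much extra work.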
00:27:42.910
Okay! I want to talk to anyone who has questions after the session. Thank you.