Buuuuugs iiiiin Spaaaaace!

by Colin Fulton

The talk "Buuuuugs iiiiin Spaaaaace!" presented by Colin Fulton at RubyConf 2017 explores the intriguing intersection of software bugs and space missions, showcasing the importance of resilient software engineering practices. The presentation is structured around several riveting stories from space exploration that illustrate how software failures can lead to significant consequences and how creative solutions have been applied in high-pressure environments.

Key points discussed in the talk include:
- Miscommunication in Systems: The crash of a NASA Mars mission illustrates the disastrous effects of using inconsistent measurement units within software, leading to the loss of a multimillion-dollar satellite.
- Innovative Fixes: During Apollo 14, engineers had to rapidly create a patch to prevent an abort trigger from mistakenly executing before landing on the moon, showcasing real-time problem-solving under pressure.
- Critical Impact of Code Quality: The failure of the Ariane 5 rocket due to a single line of code, an unchecked conversion from a 64-bit floating-point number to a 16-bit integer that overflowed, highlights the necessity of thorough testing and the dangers of assuming code operates correctly based on previous experiences.
- Agile Development Lessons: The Soviet Union's approach to space engineering emphasized rapid testing and iteration. It featured the NK-33 engine, built so that a rocket could use many smaller engines instead of a few large ones, leading to innovative results despite challenges.
- The Fragility of Space Systems: The malfunction of the Galileo probe’s magnetometer due to a single corrupted bit illustrates how delicate and complex space systems can be. The creative use of Forth programming led to an effective patch that saved the system.

Fulton's conclusion encourages an appreciation for the lessons learned from past failures in technology and engineering, urging developers to heed historical insights. He suggests resources such as the documentary 'Moon Machines' and the book 'Digital Apollo' for further exploration into the topics discussed.

00:00:10 All righty, I think it's about time to get started. If people want to keep filing in, there are plenty of seats to go ahead and sit down wherever. There's no need to stay in the back. My name is Colin Fulton, and I'm a front-end developer at Duo Security. If you want to reach me on Twitter, you can hit me up. I don’t really use Twitter, so if you actually want to reach me personally, just email me at my Gmail, which is just [email protected]. If I don't respond, just send an email again, sometimes I forget to respond to things.

00:00:27 All the slides, as well as a number of other resources, are going to be available on GitHub. They are up and ready for you to browse at github.com/justcolin. So many of you may remember the time NASA sent a rocket to Mars, and it crashed. The reason it crashed was because they had one piece of software that used pounds per second while the rest of their software used Newtons per second. This was very embarrassing for NASA.

00:00:53 But how many of you have ever done a multi-corporation project that takes years, if not decades, to complete, with many people working on specifications that are hundreds of pages long, without writing a single bug? In this case, the bug was pretty severe because they lost a multi-million dollar satellite. Still, I think we can all relate to this kind of problem.
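The unit mix-up behind the Mars crash is the kind of bug that a value carrying its own unit can help catch. Here is a minimal sketch, assuming the two quantities are impulses measured in pound-force seconds and newton-seconds; the class and function names are illustrative and are not taken from any NASA software.

```python
from dataclasses import dataclass
from typing import List

# One pound-force expressed in newtons.
LBF_TO_N = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse that is always stored in newton-seconds."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # Convert once, at the boundary, so the rest of the code sees one unit.
        return cls(value * LBF_TO_N)


def total_impulse(burns: List[Impulse]) -> float:
    """Sum thruster burns; every value is already in newton-seconds."""
    return sum(burn.newton_seconds for burn in burns)


burns = [
    Impulse.from_newton_seconds(10.0),
    Impulse.from_pound_force_seconds(10.0),  # hypothetical data in the other unit
]
print(total_impulse(burns))
```

Adding the two raw numbers would give 20, which silently treats pound-force seconds as newton-seconds. Forcing every value through a named constructor makes the unit explicit at the point where the data enters the system.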

00:01:35 All of you may also remember when we initially launched the Hubble Space Telescope into orbit. The images it took initially were really blurry, and they couldn't do anything with them. This was because they had actually gotten the geometry of the mirror wrong. They had to send up a number of astronauts in the space shuttle to install effectively a pair of glasses to correct it.

00:02:05 What a lot of people don't know is that they were actually able to measure how blurry the images were. The glasses not only got them back to how good the images would have been if they had initially gotten it right; they actually were able to correct for a number of other things and get sharper images than they would have if they hadn't had this bug. This was an expensive way to discover a method to achieve sharper images, but it worked, nonetheless.

00:02:29 Today, I want to tell you four stories about bugs in space programs, because there's a lot we can learn from bugs, and space programs are really cool. The bugs are more exciting. I'm going to tell you right up front that I'm not some guru on top of a mountain, handing out sage advice. I am going to editorialize a little bit, but I'm not going to be able to give you grand lessons in this talk. I'm just a workaday developer like all of you.

00:02:57 In fact, many of you here, if not most, are probably smarter than me and know more about these subjects than I do. I'm going to share these stories for you to reflect upon your own experiences, and see what insights you can glean from them. I'll provide a couple of small editorial comments to point out where I think it's interesting to ponder these stories.

00:03:29 So many of you have probably heard the phrase: 'We can send a man to the moon, but we can't get a printer to work.' I don’t know if any of you have ever dealt with printer firmware, but it's actually more complicated than it may seem. Getting to the moon wasn’t done perfectly either; there were numerous bugs along the way.

00:03:55 This is just one of them, specifically with Apollo 14. No, this isn't a typo – this isn't about Apollo 13, the famous one with the explosion that you may have seen in the exciting movies. No, this was a much more subtle, but still almost dangerous bug. Now, in Apollo 14, they managed to orbit the moon, and it was less than four hours before they were about to land when the crew on the ground noticed that the abort bit was set in the computer.

00:04:29 That's the bit that gets set when you push the abort button, which the computer then checks and says, 'Oh no, the abort bit is set; we have to abort whatever we are doing and get back into lunar orbit.' Thankfully, they were not yet in one of the procedures that actually checked the abort bit, so the computer was ignoring it. What they ended up discovering was that there was probably a little bit of metal or solder floating around inside the abort button.

00:05:05 It was shorting occasionally, which meant the abort button was acting as if it were randomly being pushed every so often. This is fine while you're orbiting the moon, but while landing, that meant there was a percentage chance that the rocket thruster would turn on unexpectedly and send you back into orbit. You only have enough fuel to land on the moon once, because it's really difficult to get all that fuel into space.

00:05:32 This was a serious problem. They contacted the people who had written the Apollo guidance computer code and asked how they could help. They had about three to four hours to figure out a way to patch around the problem so that the Apollo guidance computer would ignore this particular abort button. They also needed to make sure there was another way to perform an abort.

00:06:10 They had those three to four hours to create the patch, test it, send it over to Houston, where they would test it again, and then relay that patch up to the astronauts for them to manually input it. The problem was that the Apollo guidance computer really did not want to ignore the abort button for a variety of reasons.

00:06:46 This is some Apollo guidance computer source code. If you’ve never looked at assembly code, this may look a little strange, but it's actually very similar to assembly programs written today. The first column optionally contains a label that you can jump to, which is like a name that serves as documentation or directs elsewhere in your code.

00:07:13 Then the next bit is what's called the operator. These are short strings representing different commands. In this case, 'CA' means clear and add, and the operand indicates what you are going to operate on. Essentially, we're going to clear a register and then add a particular flag word to it. Interestingly, the comments in the code use hashtag comments just like Ruby.

00:07:49 The first line of code clears a register and then sets it to a word containing a bunch of different flag bits. The 'MASK' command masks out all but the abort bit, so every other bit becomes zero and the result is nonzero only if the abort bit is set to one. There's a little bit of magic involved here: the instruction set didn't have enough room for all the commands, so they first have to call this magic 'EXTEND' command.

00:08:13 Next, there's a branch if zero command, which checks if the number currently in the register is zero. If it is, it jumps to the landing procedure; if not, it continues with the abort sequence. So the abort flag is stored in memory, and what we need to do is set that flag ourselves.
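The mask-then-branch pattern described above can be sketched in Python. This is an illustration only, not the Apollo guidance computer code: the bit position and the names `ABORT_BIT` and `next_routine` are assumptions made for the example.

```python
# Hypothetical position of the abort bit within a word of flag bits.
ABORT_BIT = 1 << 3


def next_routine(flag_word: int) -> str:
    """Mask out everything but the abort bit, then branch on zero."""
    masked = flag_word & ABORT_BIT  # every other bit becomes zero
    if masked == 0:
        return "landing"  # branch taken: carry on with the landing
    return "abort"        # fall through: start the abort sequence


print(next_routine(0b0101_0010))  # abort bit clear
print(next_routine(0b0101_1010))  # abort bit set
```

Because the mask zeroes every other flag, the branch depends on the abort bit alone, no matter what the rest of the word contains.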

00:08:56 Let’s go ahead and set it. If we set it to zero, this code will run, and an abort will not happen. The problem, however, is that it isn’t that simple. There are various programs that get run which also change that abort flag. They need to set that flag during the landing to indicate, ‘Hey, we're currently in the middle of a landing.’ This means they may want to abort.

00:09:30 Consequently, there is a narrow window of time between when the computer sets the flag and when they manually reset it, during which that abort button could theoretically short out, creating a significant problem. Some of you may be thinking we can just send up a patch to the code that ignores the abort button. However, the code itself is woven out of metal wires by a team in Massachusetts, so patching it would require weaving a bunch of wires together, packaging them in epoxy, and sending them up to the moon.

00:10:09 You can't really achieve that in four hours, which is why editing this code was incredibly difficult. The solution they found came from looking at another line of code. Immediately after checking the abort flag, the computer checks MODREG, the mode register, to determine whether it is in program 71, which is one of the abort programs; program 70 is the other one.

00:10:42 If the computer senses it's already in an abort sequence, it won't go into another abort sequence. So instead of clearing the abort flag, they set MODREG. By doing that, they could trick the computer into thinking it was already in an abort sequence while allowing the landing to continue as intended.

00:11:22 This solved the problem, but it complicated things because other programs check that same register to know which phase of the landing they are in. Therefore, the sequence they created started with setting MODREG to program 71, which tells the computer to ignore aborts from then on.

00:11:52 After setting that, they wait until ignition, which automatically sets the LETABORT flag. They can ignore that because they have already changed MODREG. At that point, they push the throttle to full by hand, as the computer won't do this itself since it believes it's in the middle of an abort sequence.

00:12:32 Next, they set the zoom flag, letting the computer know it can throttle up the rocket in the future. After that, they can reset the LETABORT flag back to zero; it is safe to do so now, as it won't be set again.

00:13:00 They then set MODREG back to program 63, which is the program the computer should be in, and this allows it to check the zoom flag and throttle up the engine appropriately for landing. They can then reduce the manual throttle and allow the computer to take control.
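The order of those steps matters, and a toy model makes it easier to see why. The sketch below is not the flight software: the dictionary keys, the `abort_fires` rule, and the step names are simplifications assumed for illustration. It pretends the faulty button is shorted at every step and records whether an abort would fire.

```python
ABORT_PROGRAMS = {70, 71}


def abort_fires(state: dict, button_pressed: bool) -> bool:
    """Toy rule: an abort starts only if aborts are enabled, the button
    reads as pressed, and no abort program is already running."""
    return (
        state["abort_flag"]
        and button_pressed
        and state["modreg"] not in ABORT_PROGRAMS
    )


state = {"modreg": 63, "abort_flag": False, "zoom_flag": False, "throttle": "low"}
log = []


def step(name: str, **changes) -> None:
    """Apply a change, then check the worst case: the button is shorted."""
    state.update(changes)
    log.append((name, abort_fires(state, button_pressed=True)))


step("crew sets the mode register to 71", modreg=71)
step("ignition: computer sets the abort flag", abort_flag=True)
step("crew pushes the throttle to full", throttle="full")
step("crew sets the zoom flag", zoom_flag=True)
step("crew resets the abort flag", abort_flag=False)
step("crew sets the mode register back to 63", modreg=63)

for name, fired in log:
    print(f"{name}: abort fires = {fired}")
```

In this model no step lets the shorted button start an abort. Skipping the first step would leave the mode register at 63 when ignition sets the abort flag, and the same check would then return `True`.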

00:13:38 If you are confused, don’t worry; it is a little convoluted, but all this combined allowed them to safely ignore that abort button for the lunar landing. On future missions, they set up additional register routines to do this in a slightly less convoluted manner in case this situation arose again.

00:14:15 I'm not sure about you, but I think monkey-patching is really great when you need it. You can fix amazing problems, like this one; if they didn’t have the ability to manually intervene and edit all the registers and memory while the program was running, they would have been in serious trouble.

00:14:49 They might have aborted midway through the flight. While monkey-patching is beneficial, it should really only be used when absolutely necessary, which is almost never. I don’t land people on the moon very often; most of the time, I have more than four hours to fix a bug.

00:15:23 The next subject I want to cover is how much damage one line of code can do, specifically regarding the Ariane 5 rocket. This is one of three heavy-lifting rockets in the world, capable of launching tens of thousands of pounds into space. These rockets are big, expensive, and complicated, and we don’t often launch them.

00:15:55 Flight 501 was set to be the first launch of the rocket. The European Space Agency spent nearly a decade developing this rocket, going through careful testing of all their systems, as this project cost around seven billion dollars. If something went wrong, it would be a massive loss.

00:16:33 These rockets are disposable; they're not reused every single time. The first launch, however, was going to be an actual mission, launching satellites into space. If this mission failed, they would lose the rocket and all the satellites they were trying to launch.

00:17:01 Initially, everything seemed normal during flight 501. A few minor hiccups occurred, but this is typical with a new rocket. Less than a minute into the launch, the thrusters suddenly gimbaled, causing the rocket to turn fiercely. Rockets have a pointy end; you want to keep that pointy end into the wind because otherwise, the wind can impact forcefully against the side, causing the rocket to disintegrate.

00:17:35 As expected, the rocket began to disintegrate; there's video footage of this occurring. Eventually, the rocket exploded, which is typically seen as a negative outcome. However, in this situation, it is a fortunate event, because rockets like this have a self-destruct sequence to prevent worse outcomes during a dangerous situation.

00:18:09 The rocket detected it was falling apart, so it exploded intentionally to prevent further disasters from occurring; when a rocket breaks apart and scatters bits everywhere, that's bad, but even worse is when a functional rocket crashes down to the ground.

00:18:45 This is the line of code that caused the failure. While I don't know Ada, this section was written in Ada. It appears relatively easy to read, but they didn't use the most intuitive variable or function names. Understanding the line that failed requires some context. In this specific bit, they were taking a 64-bit floating-point number and converting it into a 16-bit integer.

00:19:21 Sixty-four is bigger than 16, and they failed to check for overflow. This isn't like Ruby, where you can add arbitrarily large numbers together and magically get accurate results. To their credit, they had gone looking for the possible overflows in their system before this rocket launched.
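To make the overflow concrete, here is a minimal Python sketch of narrowing a floating-point value to a 16-bit signed integer. It is not the Ada code from the rocket: the function names and sample values are made up, and it shows two common ways of handling a value that does not fit.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if it does not fit."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in 16 bits")
    return result


def to_int16_saturating(value: float) -> int:
    """Clamp out-of-range values to the nearest limit instead of failing."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


print(to_int16_checked(1234.5))      # fits, so it converts normally
print(to_int16_saturating(40000.0))  # too large, so it is clamped
```

Which behavior is right depends on what the caller can tolerate. The point of the sketch is only that the out-of-range case has to be decided on purpose rather than left to chance.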

00:20:01 The rocket engineers were doing actual rocket science, so they were aware of the potential for overflow and had fixed nearly all cases, except for one. They had established a rule that their computer should not exceed 80% of its capacity, as it's a resource-constrained system, and every extra check costs processing time. Working the computer too hard would cause failures.

00:20:38 An interesting reason they didn’t check around this specific bit of code was that they ran their tests with actual data. The catch is that they tested using data from the Ariane 4 rocket, which had a different flight profile; for the Ariane 4, this code would never overflow.

00:21:15 Unfortunately, the Ariane 5 rocket had a significantly larger horizontal velocity. As a consequence, converting the floating-point number overflowed. Luckily for them, they had two copies of the computer running this code, and the backup was the first to fail due to the overflow.

00:21:55 The problem was that they were both getting the same data, so just milliseconds after the backup failed, the primary computer failed as well. Consequently, both redundant systems went down. A team was formed to investigate what happened and what went wrong.
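Redundancy of this kind protects against one unit breaking, not against both units running the same flawed logic on the same input. The sketch below illustrates that with two identical software "units"; the names and readings are hypothetical.

```python
def convert(reading: float) -> int:
    """The same narrowing conversion runs on both units."""
    result = int(reading)
    if not -32768 <= result <= 32767:
        raise OverflowError("reading does not fit in 16 bits")
    return result


def run_redundant(units, reading):
    """Try each unit in turn; return (name, value), or None if all fail."""
    for name, unit in units:
        try:
            return name, unit(reading)
        except OverflowError:
            continue  # this unit is down, so fall over to the next one
    return None


units = [("primary", convert), ("backup", convert)]
print(run_redundant(units, 1200.0))   # a reading both units can handle
print(run_redundant(units, 64000.0))  # a reading that takes down both
```

The second call returns `None` because the fallback hits exactly the same error. A spare copy helps with a random hardware fault, but it does nothing for a design fault shared by both copies.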

00:22:40 The committee reached several conclusions. One of them was that the computer itself was set up to protect against random faults, which is why they had two computers. If one happened to fail due to a bad capacitor or solder joint, the other computer could take over.

00:23:14 They weren't protecting against design faults: they assumed the software was correct unless it was proven otherwise. The committee recommended the opposite assumption, that software is faulty until it is shown to be correct. If you apply test-driven development principles, this is similar to assuming that your code is flawed until it passes its tests.

00:23:52 The next issue identified was that the computer assumed everything would operate smoothly. When it encountered this error, it simply shut down and sent out an error message. That message reached the rocket's gimbals, which interpreted it as a movement command, resulting in the failure.

00:24:30 Another key takeaway from the investigation reinforced the necessity of developing both thorough code and quality documentation. Much of the investigation depended on the detailed documentation provided about their processes to understand the systemic issues.

00:25:06 The final recommendation was to eliminate dead code. This is a significant point in the story, because the code involved was calibration code, intended to calibrate the gyroscopes inside the rocket to determine the direction of 'up.' This program ran while the rocket sat on the launch pad, continually calibrating those gyroscopes.

00:25:49 However, once the rocket had launched and begun to turn, keeping that calibration running made no sense. On the previous Ariane 4 rocket, leaving the calibration running was a clever hack: if there was a delay before the launch, the gyroscopes would still be calibrated.

00:26:26 However, in the case of the Ariane 5, this dead code ended up leading to its explosion. The committee concluded that while clever shortcuts might be useful, dead code should be removed, as it led to negative outcomes. The first launch of the Ariane 5 rocket was not good.

00:27:06 Despite the initial failure, later launches were successful, leading to a record of 81 consecutive successful launches in a total history of nearly a hundred launches. This rocket is undoubtedly a success, in part because of the investigation, which helped fix many of the issues in their systems.

00:27:46 Next, I want to discuss agile development in the Soviet Union. Although the US won the space race, it's crucial to acknowledge that the USSR achieved many milestones beforehand. They accomplished many feats leading up to the US landing on the moon.

00:28:29 Despite those achievements, they never managed to send a person to the moon. Why not? The answer is that getting to the moon is incredibly difficult. A significant aspect of this is needing to launch an immense rocket capable of carrying cargo from low earth orbit to the moon.

00:29:06 The engine powering the US rocket was the F-1 engine, an engineering marvel. For perspective, this engine consumed fuel at a rate so high that it could drain an average-sized bathtub in approximately 200 milliseconds. They had five of these engines on the first stage.

00:29:44 The Soviet Union, however, did not have the industrial capability to create engines of that size and power. As a result, they took a different approach and designed the NK-33 engine, focusing on efficiency. They aimed for an exceptional thrust-to-weight ratio. Because it was lighter, they could afford to use more engines.

00:30:24 While their engines lacked the power of the F-1 engines, they compensated by including 30 of them on the first stage. Keeping in mind the need for simplicity and reliability, the US engineers were careful with their designs to minimize risk and ensure a greater chance of success with each launch.

00:31:04 Meanwhile, the Soviet Union adopted a hardware-centered process, allowing for numerous test launches to iterate and improve their designs based on what was learned. Their approach led to some failures; however, they were not hindered by the press the way the United States was.

00:31:45 They conducted many of their test launches in isolated areas without media oversight, allowing them to modify and improve their rockets without public scrutiny. The NK-33 engine was developed as a successor to the earlier engines used on the N1 rocket, which was designed for lunar missions.

00:32:21 The first four launches exploded; still, the NK-33 was slated for use on the fifth launch, but that mission was ultimately canceled after the US had successfully landed on the moon. Ironically, the Soviet government ordered the destruction of NK-33 engines that were prepared but unused. Fortunately, some engineers hid them away in a warehouse.

00:33:09 The engines stayed hidden until the 1990s, when these engineers contacted American aerospace engineers to showcase their NK-33 engines. The American engineers found it hard to believe, as these engines provided a thrust-to-weight ratio unprecedented in US rocket engineering.

00:33:51 Furthermore, they utilized closed cycle technology; this was deemed impossible in the US due to the risks associated with high temperatures leading to metal corrosion. After American engineers tested these engines, they found that the Soviet Union had, indeed, designed an extraordinary engine.

00:34:31 This is comparable to someone walking up today and offering powerful servers designed on technology deemed impossible, claiming to sell them at a premium. The Soviet engineers sold their NK-33 engines for $1.1 million each, and the engines were used into the 90s; their derivatives can still be seen in modern rockets.

00:35:12 Despite their superior engineering capabilities, the Soviet Union could not complete the journey to the moon, partially due to political issues and lack of political will. Lastly, I want to tell you about one single bit.

00:36:06 We've seen how much damage one line of code can do. Now, let's examine what can happen with just one little bit: specifically, one bit in the Galileo space probe. Galileo was designed and built in the 1980s, launched in 1989, and operated successfully in the Jovian system.

00:36:37 Gathering beautiful images, the Galileo probe expanded our understanding of Jupiter and its moons, but it faced several challenges along the way. If you look at illustrations of Galileo from its early days, you’d see a large antenna dish on top.

00:37:08 In later illustrations, however, you'd instead find a much smaller dish that resembled a half-open umbrella, as the original dish failed to expand fully. That’s another story for another day. What I want to focus on is the magnetometer, which measures the magnetic field strength around Jupiter.

00:37:42 The magnetometer on the Galileo probe malfunctioned because one tiny bit of its program in memory became corrupt. Space environments, particularly around Jupiter, are incredibly unforgiving. The high radiation levels there likely caused a gamma ray to hit a memory bit and fry it.
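To see how small a single-bit fault is, here is a short Python sketch that flips one bit in a block of memory. The bytes are made up for illustration and are not Galileo's magnetometer program.

```python
def flip_bit(memory: bytearray, byte_index: int, bit_index: int) -> None:
    """Invert a single bit in place, as a stray particle might."""
    memory[byte_index] ^= 1 << bit_index


program = bytearray([0x30, 0x7A, 0xC4, 0x00])  # hypothetical program bytes
original = bytes(program)

flip_bit(program, byte_index=1, bit_index=5)

# List which bytes differ between the original and the corrupted copy.
changed = [i for i, (a, b) in enumerate(zip(original, program)) if a != b]
print(changed, hex(original[1]), hex(program[1]))
```

Only one byte differs, and only by one bit, yet if that byte is an instruction or an address, the processor will now do something its authors never wrote.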

00:38:10 Imagine you're at JPL. You have the faulty code for the magnetometer alongside the original source code to determine what went awry. The catch is that JPL had eliminated all their Apple IIs, the older computer system where the code had been developed.

00:38:48 How would you feel? You need to fix a bug but lack the computer or environment for compiling the code. A dilemma arose because the engineers deemed the effort to recreate the development tools wildly uneconomical.

00:39:27 However, Ron Garrett believed he could resolve it more quickly than anticipated. He had a couple of advantages; one was that the magnetometer ran on a straightforward 1802 processor. This processor's simplicity makes it easy to write code for.

00:40:06 The second advantage stemmed from the program's use of Forth, a relatively straightforward programming language that's often found in space applications. The simplicity of Forth allowed him to work out a way to compile the code.

00:40:59 In about a month, with some aid from the magnetometer team at JPL to validate his patch, he managed to write a corrective solution. This experience reminds us all that sometimes, with creativity and determination, we can save whole systems, even in dire situations.

00:41:21 In conclusion, I won’t overwhelm you with my takeaways, but I want to emphasize that history is rich in lessons related to technology, engineering, and programming. If we apply the knowledge learned over decades, even centuries, we can cultivate a deeper understanding of our current challenges.

00:41:58 I encourage you to explore the stories of failures and successes from the past. One resource I'd recommend is the Google Tech Talk by Ron Garrett on the remote agent experiment, where he dives deeper into Lisp programming and organizational challenges.

00:42:32 I also suggest reading 'Digital Apollo,' which discusses the Apollo guidance computer and its fascinating design processes. Furthermore, if you want a closer examination, the Apollo 11 source code is available on GitHub for you to view.

00:43:06 Finally, if you want some engaging visuals, I recommend seeking out the documentary series 'Moon Machines,' which covers various subjects within the space program, including the Apollo guidance computer and F-1 rocket engines.

00:43:44 In closing, I would like to express my gratitude to Ron Garrett for his contributions, sharing his code on GitHub. A special thanks to Rachel McGuffin for the remarkable illustrations featured in this presentation. Please join me in applauding both for their hard work.

00:44:29 If you have any questions at all, I'm more than happy to discuss them. Thank you again!