00:00:23.980
Yes, hello! Thank you. Hello! I am Chad.
00:00:31.990
As he said, I need no introduction, so I won't introduce myself any further. I may be the biggest non-Indian fan of India.
00:00:40.770
So this year, I’m back in Bangalore! Wow!
00:00:51.160
My Bengali is bad, and my Hindi is worse, but I'm trying my best.
00:00:57.850
My German is okay, but I mix it with Hindi sometimes. I’ll switch back now, sorry if you don’t understand Hindi. I said nothing of value, and it was all wrong, but I was trying to say my Hindi is bad because I’m learning German.
00:01:10.900
Currently, I’m working at 6 Wunderkinder on a product called Wunderlist. It is a productivity application that runs on every client you can think of. We have native clients, a backend, and millions of active users.
00:01:30.460
I’m telling you this not so that you’ll go download it, although you can do that too. I want to share the challenges I face and how I’m starting to think about systems architecture and design. That’s what I’m going to talk about today.
00:01:55.720
I’ll show you some real things we are implementing and some ideas that may sound like fantasy but hopefully will help you think about system architecture and how to build things that can last a long time.
00:02:25.660
First, I want to mention a graph from the Standish Chaos Report. I’ve stripped out the years and raw data because they don’t matter for this point. Each bar in the graph shows how software projects turned out.
00:02:49.150
The green bars represent successful projects, the silver or white ones are challenged projects, and the red ones are failed. Challenged means significantly over time or budget, which, to me, means failed.
00:03:02.439
So, basically we're terrible. We, all of us here, are terrible. We call ourselves engineers, but it's a disgrace. We very rarely actually launch things that work. It’s kind of sad.
00:03:30.819
Once software is launched, anecdotally, as you might experience in your own work lives, business software gets killed after about five years. So, you barely ever launch it successfully, and within about five years, you find yourself needing a big rewrite, throwing everything away and replacing it.
00:03:49.150
There is always that project to get rid of the old Java code you wrote five years ago, and in five years, you’ll be replacing your old Ruby code that didn’t work with something else.
00:04:05.139
You probably all know the term legacy software, right? I’m sure you think of it negatively—as that ugly code that doesn’t work and is brittle. You can’t change it, and you’re all afraid of it. But there’s also a positive connotation of the word legacy; it’s about leaving behind something that future generations can benefit from.
00:04:30.370
But if we rarely launch successful projects and those we do tend to die within five years, none of us are actually creating a legacy in our work. We are just creating things that get thrown away—kind of sad.
00:04:49.150
We create this legacy software that’s hard to change, which is why it ends up getting thrown away. If the software worked and could be changed to meet business needs, you wouldn’t need to perform a big rewrite.
00:05:05.139
We create large, tightly coupled systems. I don’t just mean one application, but many applications that are all tightly coupled. You have this thing talking to the database of another system, so if you change the columns to update the view of a webpage, you ruin your billing system.
00:05:26.000
This makes it hard to change. The way we work is the default setting. If we were robots churning out code and had a preferences panel, the default would be to create terrible software that gets thrown away in five years.
00:05:42.000
It’s just how we work as human beings. When we write code, our instincts lead us to create systems that are tightly coupled, hard to change, ultimately thrown away, and unable to scale.
00:06:00.880
We try to implement tests, and adopt test-driven development (TDD), but we end up with test suites that take 45 minutes to run. I’m sure many teams have faced this situation. You start focusing on speeding up the test suite instead of making meaningful progress.
00:06:29.700
You might think, 'If it only fails 90% of the time, that’s okay, right?' Right now it’s taking 45 minutes, and we want to reduce that to 10 minutes. The test suite becomes a liability instead of a benefit because everything is so tightly coupled.
00:06:53.360
You're terrified to deploy! I recall the last big Java project I worked on. Once a week, we deployed with 15 people working all night, copying class files and restarting servers. Today’s systems are better, but it’s still terrifying.
00:07:14.320
You deploy code; you change it in production and are unsure what might break because testing these large integrated components is very challenging. Upgrading technology stacks is intimidating.
00:07:33.980
How many of you have been using Rails for more than three years? Anyone still have Rails 2 apps in production? That's a lot of people. Wow, that’s terrifying! I’ve recently encountered situations with Rails 2 apps in production.
00:07:52.200
Security patches were rolling out, and we applied our own versions out of fear. We’d rather hack the code than upgrade because we didn't know what would happen.
00:08:11.240
You re-implement everything yourself, wasting time and burning out on obsolete software when you should be utilizing the new patches.
00:08:29.890
This is a challenge I see Ruby has inflicted on us. I've been using Ruby for 13 years now, and we create these mountains of abstractions, burying logic in them.
00:08:42.640
In Java, it was static classes and design pattern soup, but in Ruby, it’s modules and mix-ins. All these complex ways of hiding reality from us.
00:09:00.480
When you look at the code, it becomes opaque. This complexity creates a software-specific problem. Cars built long ago, which are older than any software you run, still drive just fine.
00:09:18.720
How do they function? Our bodies, despite being abused, still work. We can survive long flights. How does that happen? It’s homeostasis.
00:09:34.500
I won’t define homeostasis beyond saying it’s essentially maintaining balance through various components that help regulate the system.
00:09:51.470
For instance, if the liver overperforms or malfunctions, another component kicks in and corrects it. Our bodies thrive because we have internal agents managing various risks.
00:10:11.100
This balancing act, known as homeostasis, is crucial. An inability to do this can lead to severe health problems.
00:10:30.010
Good news? We're all dying constantly. Our bodies have about 50 trillion cells, and they die at a rate of around 3 million per second.
00:10:50.770
Physically, you aren’t the same person you were a few years ago. Yet, you're still the same system.
00:11:06.230
You can think of software similarly; if components can be replaced, like cells, the overall system continues to survive.
00:11:24.140
Focus on constant small changes to ensure longer-term survival. This talk is about the solution: mimic the characteristics of living organisms.
00:11:37.090
One key takeaway I’ll emphasize is that small things are good. Small projects, small commitments, small classes, small teams—these are beneficial.
00:11:56.390
If we see software as an organism, what is a "cell" in that context? A cell is a tiny component. That’s a subjective term, but it’s helpful for thinking. If you make your software from tiny components, each one can die, yet the system remains.
00:12:18.300
You don’t need your code to live forever. The function of the system can take precedence over durability.
00:12:37.100
Ten years ago, we created RubyGems at RubyConf 2003 in Austin. I haven't touched it in years, yet it continues to exist, much to others' chagrin.
00:12:53.580
I’m not sure if my original code survives, but that's not important. What’s significant is that the system still operates. I ventured to question on Twitter: 'What are some of the oldest surviving software systems still in use?'
00:13:13.160
Responses often related to UNIX systems. The enduring old systems I've seen tend to consist of components or tiny programs.
00:13:33.680
For example, 'grep' is a tiny program that performs one function. Many old systems conform to this metaphor.
00:13:53.950
In my previous work with GE, we had a system called the 'Bull,' an aging mainframe. Despite various attempts to replace it, users preferred it.
00:14:12.420
The system's longevity stemmed from its clear interfaces and tiny components, which sustained operations despite unsuccessful replacement efforts.
00:14:30.160
Now, how do I approach the task of building systems to survive long-term? One inspiration is Fred George from ThoughtWorks, who shares experiences with microservice architectures.
00:14:50.210
He highlights the importance of tiny components that perform singular tasks and can be replaced when necessary.
00:15:06.800
At 6 Wunderkinder, where I work, we adopted similar principles. Our rules aim to minimize coupling, ensuring fear-free deployments.
00:15:22.680
We strive to reduce cruft, that nasty leftover code in our systems. Our focus is on making code changes trivial and allowing ourselves the freedom to accelerate development.
00:15:40.160
I think no developer wants to work slowly. When it happens, it is usually because the system constrains our progress, and that often stems from messy architecture.
00:15:57.600
One less controversial rule states that comments are a design smell. Anyone disagree? Comments often signal you should investigate further.
00:16:11.420
Inline comments are particularly suspect, often indicating you may need to create separate methods for clarity.
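To make the inline-comment point concrete, here is a minimal sketch in Python (the idea carries over to Ruby unchanged). The order-total example and every function name in it are invented for illustration; none of this comes from the talk or from Wunderlist's code.

```python
# Hypothetical example: names and business rules are invented for illustration.

def total_with_comments(prices, is_member):
    # add up the prices
    subtotal = 0
    for price in prices:
        subtotal += price
    # members get 10% off
    if is_member:
        subtotal = subtotal * 0.9
    return round(subtotal, 2)


# The same logic, with each inline comment turned into a named function.
def subtotal_of(prices):
    return sum(prices)


def apply_member_discount(amount, is_member):
    return amount * 0.9 if is_member else amount


def total(prices, is_member):
    return round(apply_member_discount(subtotal_of(prices), is_member), 2)
```

In the second version the function names carry the information the comments used to carry, and each piece is small enough to read at a glance.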
00:16:28.380
Another idea, albeit more controversial, is that tests can be a design smell. If a test suite is slow and brittle, it signals a flawed system.
00:16:40.360
I’ve found when reviewing slow, complex test suites, they often reveal a poor state of the overall system, leading developers to write excessive tests in fear.
00:16:56.400
A simplified system wouldn’t require numerous test files that take a long time to run. Focus instead on developing small, trivial systems.
00:17:10.600
In my approach, you can write code in any language, as long as each component is small enough to be understood easily.
00:17:30.060
Ultimately, every component should be small, stand-alone, easily maintained, and in its own repository.
00:17:46.800
The idea is that if you can look at a component and understand it immediately, it reduces risk by lowering complexity.
00:18:02.700
Our systems are heterogeneous by default. Different languages enhance system design. By utilizing varied programming languages, we decrease tight coupling.
00:18:18.780
For instance, I have worked with Objective-C, Ruby, Scala, and more—all without tightly coupling those components, as different languages act as natural barriers.
00:18:36.600
Furthermore, server nodes should be disposable. In my previous jobs, we were overly proud of server uptime.
00:18:53.480
Long uptimes breed fear: you feel hesitant to change or upgrade. Consequently, we embrace disposability over longevity.
00:19:07.260
We deploy new versions of services by creating fresh servers and swapping them in behind the load balancers, which eliminates the uncertainty about what is actually on each server.
00:19:24.040
Decisions get simpler because we don't have to worry about the specifics of any one server; it is straightforward to recreate.
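The replace-instead-of-patch idea can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `Server`, `LoadBalancer`, and `deploy` are hypothetical names, not a real cloud or load balancer API, and the talk does not describe the actual tooling at this level of detail.

```python
import itertools

# Hypothetical in-memory model; Server and LoadBalancer are invented for
# illustration and do not correspond to any real infrastructure API.


class Server:
    _ids = itertools.count(1)

    def __init__(self, version):
        self.id = next(Server._ids)
        self.version = version
        self.alive = True

    def handle(self, request):
        return f"v{self.version} handled {request}"


class LoadBalancer:
    def __init__(self):
        self.pool = []
        self._next = 0

    def route(self, request):
        # Simple round-robin over whatever servers are currently in the pool.
        server = self.pool[self._next % len(self.pool)]
        self._next += 1
        return server.handle(request)


def deploy(balancer, version, count):
    """Build fresh servers, swap them into the pool, then discard the old ones."""
    fresh = [Server(version) for _ in range(count)]
    old, balancer.pool = balancer.pool, fresh
    for server in old:
        server.alive = False  # old nodes are thrown away, never patched in place
    return old
```

Because a deploy never modifies an existing server, nobody needs to know what state an old node had drifted into.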
00:19:39.500
Additionally, provisioning new services should be simple. We’ve transitioned from complex config management to straightforward shell scripts.
00:19:56.700
Instead of overengineering with tools like Chef, we utilize uncomplicated scripts that execute efficiently.
00:20:14.060
We focus on continuous deployment, upholding the idea that if something feels hard, you should do it consistently to turn daunting tasks into routine operations.
00:20:31.860
For instance, where deploying once a week took real effort, we made it a rule that every change goes live as soon as it is ready.
00:20:52.880
Deploy frequently; it minimizes fear, allowing teams to confidently manage changes. The average uptime of our servers is around 17 hours.
00:21:08.860
With distributed systems, failures are inevitable, so design for resiliency. I recommend studying Joe Armstrong's philosophy about failure and recovery in Erlang.
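As a rough illustration of designing for failure, the sketch below restarts a worker when it crashes instead of trying to repair it. It is loosely inspired by the restart-on-failure idea associated with Erlang supervisors, but it is not Erlang/OTP's API; `supervise`, `make_worker`, and `max_restarts` are hypothetical names chosen for this example.

```python
# Hypothetical sketch: supervise() and its parameters are invented for
# illustration and only loosely mirror Erlang-style supervision.

def supervise(make_worker, jobs, max_restarts=3):
    """Run jobs on a worker; on a crash, discard the worker and start a fresh one."""
    worker = make_worker()
    restarts = 0
    results = []
    for job in jobs:
        try:
            results.append(worker(job))
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise  # too many crashes: let the failure propagate upward
            worker = make_worker()  # replace rather than repair
            results.append(None)    # the crashed job is recorded as failed
    return results, restarts
```

The worker itself contains no defensive code; recovery lives in one small place outside it.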
00:21:25.060
Testing shouldn’t overshadow monitoring; effectively track when things go wrong, measure faults, and aim to resolve them quickly.
00:21:44.430
When monitoring comprehensively—beyond just tracking memory and disk space—you can gain insights about your business metrics.
00:22:03.160
For example, if user signups decrease, it might indicate a larger underlying issue.
00:22:19.930
Understanding business impact often matters more than merely knowing server status.
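A business-level check like the signup example could look roughly like this. It is a minimal sketch: the function name, the 24-hour window, and the 50% threshold are assumptions made for illustration, not values from the talk.

```python
from statistics import mean


def signup_alert(hourly_signups, window=24, drop_ratio=0.5):
    """Return True when the latest hour falls below drop_ratio of the recent average.

    window and drop_ratio are illustrative defaults, not values from the talk.
    """
    if len(hourly_signups) < 2:
        return False  # not enough history to compare against
    *history, latest = hourly_signups
    baseline = mean(history[-window:])
    return latest < baseline * drop_ratio
```

For example, `signup_alert([100, 110, 90, 30])` compares 30 against an average of 100 and flags the drop, even if every server still reports healthy memory and disk.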
00:22:36.260
Finally, to leave you with something to think about, foster a culture of urgency and readiness. You may need to overcome fears about potential disasters.
00:22:53.240
Use practices like 'canary in the coal mine' deployments to test new versions incrementally, watching the effects carefully.
00:23:10.630
Gradual change alleviates fears, allowing for continuous deployment without worry that everything might crash at once.
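One common way to do this kind of incremental rollout is to send a fixed percentage of users to the new version. The sketch below assumes a hash-based split; `bucket` and `pick_version` are hypothetical helpers, and the talk does not say how the routing was actually implemented.

```python
import hashlib


def bucket(user_id):
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def pick_version(user_id, canary_percent):
    """Send roughly canary_percent of users to the canary, the rest to stable."""
    return "canary" if bucket(user_id) < canary_percent else "stable"
```

Hashing keeps each user on the same version between requests, and raising `canary_percent` only adds users to the canary group, so the rollout can be widened step by step while you watch the metrics.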
00:23:30.060
As we discuss resilience, envision scenarios where testing offloads burdens, leading to a more spontaneous response to failures.
00:23:47.300
This concept can guide you in architecting software and systems resilient to change: designing systems that can evolve or dissolve without chaos.
00:24:03.500
Experimental ideas, like chaos testing, can test system resilience by purposefully inducing failure to improve future reactions.
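A toy version of a chaos test, under heavy assumptions: the cluster is just a list of in-memory nodes, and `Node`, `call_with_failover`, and `chaos_round` are hypothetical names for this sketch, not a real chaos-testing tool.

```python
import random

# Hypothetical in-memory cluster; all names are invented for illustration.


class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name}:{request}"


def call_with_failover(nodes, request):
    """Try each node in turn; fail only if every node is down."""
    for node in nodes:
        try:
            return node.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("all nodes are down")


def chaos_round(nodes, rng):
    """Kill one random live node, then check that the cluster still answers."""
    victim = rng.choice([n for n in nodes if n.alive])
    victim.alive = False
    return victim, call_with_failover(nodes, "ping")
```

Running `chaos_round` repeatedly in a test environment shows how many failures the system tolerates before a request finally fails.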
00:24:23.050
Additionally, consider whether you can incorporate 'spot instances' where your system remains flexible, adapting responsively to real-time availability.
00:24:41.390
Approach homeostasis in software design: configure systems to shift and scale based on environmental fluctuations.
00:25:01.060
Lastly, spread knowledge. If anyone’s interested in JSON schema and high-performance asynchronous validation, let's connect.
00:25:19.450
Your insights or involvement would be greatly appreciated. I think that’s my time.
00:25:47.440
Thank you very much and let’s continue this conversation during the conference.