00:00:08.970
Thank you for the kind words; they nearly made me cry. If you want to contact me later, my name is Denys Medynskyi, and my nickname is Danesco. You can find me on GitHub, Facebook, and Telegram. I also work at Toptal, and we have a table here with lots of stickers, bags, and t-shirts. Please come by and talk to us. We don't have recruiters; we have regular developers who are happy to chat with you.
00:00:19.710
Today, I will talk about reliability. Specifically, I’ll share my journey from being a production breaker to becoming a reliable engineer. I started working at Toptal three years ago, believing I was a cool guy who knew everything. I thought I was very fast and productive—pushing three pull requests when others were making only one. I mistakenly believed they were slow, but the truth was quite the opposite.
00:00:54.539
I made many mistakes and caused production failures multiple times. The pattern kept repeating: I would deploy code, the page would suddenly break, and I would be left disheartened. I had assumed I had enough experience to avoid such mistakes. This led me to ponder what makes a reliable engineer and how I could improve my own reliability and performance.
00:01:58.579
I reached out to people for their thoughts, created a survey, and gathered a cloud of ideas about reliability and its associations. From this, I devised my own definition: a reliable engineer is someone who delivers good quality work without errors and on time.
00:02:12.480
Now, let’s delve into the topic of quality. I’d like to reference a humorous line from Uncle Bob Martin: 'Don’t ship.' It’s a blunt reminder that quality should never be sacrificed just to get something out the door. Let’s discuss metrics you can use to monitor your project's health.
00:02:37.860
For example, consider your team’s velocity if you measure development work in story points. If you notice a decline in velocity compared to previous months, it may signal a problem. Perhaps a decision you made needs reevaluation. If you find that the number of bugs is increasing or that deployments result in more issues, that's a clear indicator of decreasing quality. One suggestion is to monitor the number of errors with each production release; if this number goes up, it’s a sign your quality is slipping.
00:03:59.880
Another metric to consider is whether you would recommend your project to a friend. If the answer is no, that’s a troubling indication about your project’s quality. Additionally, utilize tools that help you track and address errors. For instance, you can leverage automated testing or exception handling services that gather data from multiple projects to identify common errors.
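To make that concrete, here is a minimal sketch of the idea in Python: a decorator that reports any unhandled exception before re-raising it. The `report_to_tracker` function is a hypothetical stand-in for whatever error-tracking client you actually use; it is not tied to any particular vendor.

```python
import logging
import traceback
from functools import wraps

logger = logging.getLogger("error-tracking")

def report_to_tracker(exc: Exception) -> None:
    # Hypothetical stand-in: a real setup would forward the exception to an
    # error-tracking service's client here instead of just logging it.
    logger.error("unhandled error: %s\n%s", exc, traceback.format_exc())

def tracked(func):
    """Report any exception raised by the wrapped function, then re-raise it."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            report_to_tracker(exc)
            raise
    return wrapper

@tracked
def risky_operation():
    return 1 / 0  # an example failure that would be reported
```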
00:04:30.150
In terms of handling errors, I acknowledge I’ve made my fair share. Acknowledging mistakes is crucial, and a trend in my surveys indicated that data-related errors were common. When working with data, instead of deleting records, it’s wiser to update them to keep a clear history. This practice can mitigate the chaos that ensues following an accidental deletion.
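As a minimal sketch of soft deletion, assuming a SQL table with a `deleted_at` column (the table and column names here are only illustrative):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, deleted_at TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('someone@example.com')")

def soft_delete_user(user_id: int) -> None:
    # Mark the row as deleted instead of removing it, so history is preserved
    # and the record can be restored if the deletion turns out to be a mistake.
    conn.execute(
        "UPDATE users SET deleted_at = ? WHERE id = ?",
        (datetime.now(timezone.utc).isoformat(), user_id),
    )

def active_users():
    # Normal queries simply filter out soft-deleted rows.
    return conn.execute("SELECT id, email FROM users WHERE deleted_at IS NULL").fetchall()

soft_delete_user(1)
print(active_users())  # the user no longer appears, but the row still exists
```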
00:04:47.340
It's also essential to ensure that your staging environment mirrors production. I've faced problems when the data in staging was out of date. To counteract this, we now regularly dump production data into staging so everything stays aligned. This helps avoid mismatches between what is tested and what is deployed.
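As an illustration, assuming PostgreSQL, a staging refresh can be little more than a scheduled script around pg_dump and pg_restore. The connection strings below are placeholders, and in a real setup you would anonymize sensitive data before it ever reaches staging.

```python
import subprocess

# Placeholder connection strings; real values would come from configuration,
# and sensitive fields should be anonymized before the data reaches staging.
PRODUCTION_DB = "postgresql://readonly@prod-db.internal/app"
STAGING_DB = "postgresql://admin@staging-db.internal/app"
DUMP_FILE = "/tmp/production.dump"

def refresh_staging() -> None:
    # Dump production in PostgreSQL's custom format...
    subprocess.run(
        ["pg_dump", "--format=custom", "--file", DUMP_FILE, PRODUCTION_DB],
        check=True,
    )
    # ...then restore it over staging so tests run against realistic data.
    subprocess.run(
        ["pg_restore", "--clean", "--no-owner", "--dbname", STAGING_DB, DUMP_FILE],
        check=True,
    )

if __name__ == "__main__":
    refresh_staging()
```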
00:05:32.880
Let’s also discuss the importance of zero-downtime migrations. Use tools that help you run migrations without affecting live users. For instance, when you remove a database column, make sure your application no longer references it before the migration runs. Tools that monitor your exceptions can alert you to potential errors before they affect your users.
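Here is a rough sketch of that two-step column removal, assuming SQLAlchemy and Alembic (the model and column names are made up): first ship application code that no longer touches the column, then drop the column in a later migration.

```python
import sqlalchemy as sa
from alembic import op
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

# Step 1 (first deploy): the application model simply stops mapping the column,
# so the running code never reads or writes "legacy_nickname" any more.
class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    email: Mapped[str]
    # "legacy_nickname" is intentionally no longer mapped here.

# Step 2 (second deploy, once the code above is live): drop the column in an
# Alembic migration. Because nothing references it, the schema change cannot
# break requests that are in flight during the deploy.
def upgrade() -> None:
    op.drop_column("users", "legacy_nickname")

def downgrade() -> None:
    op.add_column("users", sa.Column("legacy_nickname", sa.String(), nullable=True))
```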
00:06:36.330
Moving on, it’s inevitable that mistakes will happen. As an example, a GitLab engineer once accidentally deleted six hours of production data under the assumption he was working on a backup server. The key here is to accept that mistakes occur and focus on establishing a post-mortem approach. Implement root cause analysis by asking 'why' multiple times to get to the core of issues that arise.
00:07:40.560
After facing numerous errors, we began implementing checks across all our public URLs with each deployment. By validating our sitemap and ensuring URL functionality, we improved our confidence in production releases. This approach prevents broken pages from making it to production and has significantly enhanced project reliability.
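A rough sketch of such a smoke test, assuming the site publishes a standard XML sitemap (the sitemap URL below is a placeholder):

```python
import sys
import xml.etree.ElementTree as ET

import requests

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder
NAMESPACE = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def broken_urls(sitemap_url: str) -> list[str]:
    """Fetch the sitemap and return every listed URL that does not answer 200."""
    sitemap = requests.get(sitemap_url, timeout=10)
    sitemap.raise_for_status()
    urls = [loc.text for loc in ET.fromstring(sitemap.content).findall(".//sm:loc", NAMESPACE)]

    failures = []
    for url in urls:
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            failures.append(f"{url} -> {response.status_code}")
    return failures

if __name__ == "__main__":
    failures = broken_urls(SITEMAP_URL)
    for failure in failures:
        print(failure)
    # A non-zero exit code lets the deployment pipeline block the release.
    sys.exit(1 if failures else 0)
```

Run as a deployment step, a single broken URL is enough to stop the release before users ever see it.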
00:08:59.550
Time management is another challenge developers face, often due to overly optimistic deadlines. To counter this, we can use techniques like Planning Poker, where the team collectively estimates story points for tasks. This method helps us to take advantage of collective wisdom and leads to more accurate estimations.
00:10:21.000
Another technique is maintaining a historical log of estimations and completed tasks. With this data, we can improve future estimations. Furthermore, I personally adopted the Pomodoro Technique for managing my tasks and enhancing my productivity, which helped me understand the amount of effort required for different tasks.
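As a simple illustration, such a log can be nothing more than estimated versus actual effort written to a file and consulted when sizing similar work; the field names and numbers below are made up.

```python
import csv
from statistics import mean

HISTORY_FILE = "estimation_history.csv"  # columns: task, estimated, actual

def log_task(task: str, estimated: float, actual: float) -> None:
    with open(HISTORY_FILE, "a", newline="") as f:
        csv.writer(f).writerow([task, estimated, actual])

def average_overrun() -> float:
    """Ratio of actual to estimated effort; 1.0 means the estimates were spot on."""
    with open(HISTORY_FILE, newline="") as f:
        ratios = [float(actual) / float(estimated) for _, estimated, actual in csv.reader(f)]
    return mean(ratios)

log_task("Add password reset email", estimated=3, actual=5)
log_task("Fix sitemap generation", estimated=2, actual=2)
print(f"On average, tasks take {average_overrun():.1f}x their estimate")
```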
00:11:03.810
Finally, the Chaos Monkey technique, used by Netflix, involves randomly terminating servers to uncover hidden issues. As engineers, we should proactively search for problems in our workflows. By adopting a mindset of continuous improvement, we can enhance both our skills as engineers and the quality of our work.
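For completeness, here is a very rough sketch of that idea, assuming AWS EC2 and boto3. The tag filter is purely illustrative, and you would only run something like this deliberately, against infrastructure designed to survive it.

```python
import random

import boto3

ec2 = boto3.client("ec2")

def terminate_random_instance(tag_key: str = "chaos-eligible") -> str | None:
    """Pick one opted-in, running instance at random and terminate it."""
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instance_ids:
        return None  # nothing opted in, nothing to break
    victim = random.choice(instance_ids)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim
```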
00:12:14.520
Thank you all for your attention. Let’s continue to strive for excellence in software quality. I would be happy to answer any questions after this.