Late, Over Budget, & Happy: Our Service Extraction Story

by Amy Newell and Nat Budin

The video titled "Late, Over Budget, & Happy: Our Service Extraction Story", presented by Amy Newell and Nat Budin at RubyConf 2019, discusses the challenges and successes of extracting a GraphQL service from a long-standing Rails monolith at PatientsLikeMe. The project, initially estimated to take three months, stretched to nine, prompting a reflection on the lessons learned along the way. Key points of the presentation include:

Project Background: PatientsLikeMe is a social network for patients with chronic conditions, and its monolithic architecture struggled with slow loading times for user feeds.
Initial Solutions: Caching user feeds in Redis improved load times but capped growth, because Redis had not been set up for clustering and feeds had to be truncated to save space.
Service Extraction Rationale: The need to transition from a monolithic structure to a microservices architecture to enable more efficient changes and updates, sparked by issues with the pre-existing system's coupling and clutter.
Emotional and Management Challenges: The presentation highlights the emotional complexities involved in leadership, especially when friends work together, and how personal mental health challenges impacted the project timeline and team dynamics.
Project Lessons: Important lessons included the necessity of selling and marketing the project upfront, maintaining a collaborative team environment, and avoiding the sunk cost fallacy by ensuring transparency and support from the start.
Deployment Success: By mid-2017, despite delays, the Newswire service was successfully shipped, demonstrating the significance of resilience in projects, and the eventual positive outcomes, including improved user feeds and new application features.
Future Implications: The experience underscored that engineering projects can feel risky but highlighted the importance of learning from both successes and setbacks, recognizing how adequate resource management and emotional awareness can influence project outcomes.

Overall, Newell and Budin concluded that while their project faced significant challenges, it ultimately succeeded and transformed their services, leading to greater flexibility and responsiveness for users.

00:00:12.230 We both used to work at a company called PatientsLikeMe. I was formerly the Director of Engineering there. I'm now at Wistia, which is an excellent place to work. I'm based in Boston.

00:00:19.020 I was formerly a Principal Software Engineer at PatientsLikeMe on Amy's team, and I am now at ActBlue, which is another excellent place to work. I'm based in Seattle and work remotely. I was also working remotely when I worked for Amy.

00:00:35.429 This talk is about a project that we undertook while we were both at PatientsLikeMe. Throughout this presentation, you'll see rectangular boxes that Keynote provided us, which we call hindsight boxes. They represent lessons that we learned from this experience, and we would like to share them with you.

00:00:54.149 I want to give a little content warning. If you've seen me speak before, you know I tend to speak rather frankly about topics that are relevant to the story we're telling. If this is difficult for you, please take care of yourself in whatever way you need to. We won't be offended.

00:01:15.720 Now, let me tell you a little bit about PatientsLikeMe. It is a social network for people with serious and life-changing, sometimes fatal, chronic conditions. It's a place where users can track their symptoms, treatments, and the progress of their conditions. However, most users come to connect with others going through similar experiences, seeking support and information to help lead better lives.

00:01:43.080 This screenshot I found on my laptop from roughly 2013 shows what the development version of the site looked like. It features the news feeds on the website, which help connect users with each other. PatientsLikeMe uses Facebook-like social news feeds as one of its key features for connecting people. They were originally built around 2010 or 2011.

00:02:00.870 Originally, the idea was based on a concept called data-driven journaling (DDJ). Instead of making a free-form post on a news feed, users would update their medical history, and the site would prompt them to describe what had changed. This way, data donation becomes a method of connecting with others.

00:02:36.879 Interestingly, the news feed would post updates regardless of whether the user opted to write something or not. This classic Rails monolith stored user-generated content as well as metadata about medical changes in a table called stream_events in our Postgres database.

00:03:10.689 When the controllers rendered a user’s newsfeed, they made queries to determine what content to present based on their follows. However, this began to slow down considerably, taking 30 seconds to one minute for our heaviest users to log in and view that feed.

00:03:39.519 To address this, we attempted caching the slowest parts of the request. We used Redis and created sorted sets for each user and then precalculated the results. When new posts were added, we added them to the front of the correct user's sorted set. This solution improved load times dramatically to less than a second.
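
The caching approach described here can be sketched in a few lines. This is only an illustrative model, not the PatientsLikeMe code: a plain Python list kept in sorted order stands in for each user's Redis sorted set, and the class name, method names, and the use of a timestamp as the sort score are all assumptions made for the example.

```python
import bisect
from collections import defaultdict


class FeedCache:
    """In-memory stand-in for one Redis sorted set per user (hypothetical)."""

    def __init__(self):
        # author -> set of users following that author
        self.followers = defaultdict(set)
        # user -> list of (-timestamp, post_id), kept sorted so newest is first
        self.feeds = defaultdict(list)

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def publish(self, author, post_id, timestamp):
        # Precalculate at write time: push the post into every follower's feed.
        for follower in self.followers[author]:
            bisect.insort(self.feeds[follower], (-timestamp, post_id))

    def read(self, user, limit=10):
        # Reading is a cheap slice; no query over follows is needed.
        return [post_id for _, post_id in self.feeds[user][:limit]]


cache = FeedCache()
cache.follow("bob", "alice")
cache.follow("bob", "carol")
cache.publish("alice", "p1", timestamp=100)
cache.publish("carol", "p2", timestamp=200)
print(cache.read("bob"))
```

The trade-off is that the expensive work moves from read time to write time, and every follower holds a copy of each entry, which is where the storage cost comes from.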

00:04:11.319 However, there was a growth cap because, at the time, we had not set Redis up to be clustered properly. When we started running out of room for users and could not take everything down to set up clustering, we ended up truncating people's feeds to save space.
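
As a rough illustration of what truncating a feed means, the sketch below keeps only the newest entries of a cached feed. The function name, the `(timestamp, post_id)` entry shape, and the cap value are assumptions for the example; the talk does not say how the truncation was actually implemented.

```python
def truncate_feed(feed, max_entries):
    """Return only the newest max_entries (timestamp, post_id) pairs.

    Hypothetical helper: older entries are dropped from the cache to
    save space, so they no longer appear in the user's feed.
    """
    newest_first = sorted(feed, key=lambda entry: entry[0], reverse=True)
    return newest_first[:max_entries]


feed = [(1, "a"), (3, "c"), (2, "b")]
print(truncate_feed(feed, max_entries=2))
```

The cost of this workaround is visible to users: anything older than the cap simply disappears from the cached feed.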

00:04:55.770 This situation arose during the summer of 2016. We thought of using a column store like Cassandra as a potential solution, despite not knowing much about it. After a fortuitous email about a conference in Seattle focused on column data stores, I attended and gained insights from various speakers about different options.

00:06:12.610 I learned that many folks faced challenges with new data stores too. They struggled with maintenance and configuration, which ultimately led to them reverting to relational databases. One case was the Art Genome Project, which transitioned back to MySQL but indexed everything in Elasticsearch.

00:06:53.250 This sparked a realization for me. We already had operational knowledge of managing Elasticsearch, so I decided to build a proof of concept. I moved our entire stream_events table into an Elasticsearch instance, creating a microservice named Newswire.

00:07:13.800 The new system worked similarly to the old one but moved the stream events to Newswire. It would index them in Elasticsearch while also storing them in Postgres. Recognizing that there was a demand to experiment with feed algorithms, I wanted to provide a querying API to make it flexible.
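
The shape of that design, one store of record plus a search index that serves feed queries, might look roughly like the sketch below. It is only a model under stated assumptions: two dictionaries stand in for Postgres and Elasticsearch, and the `StreamEvent` fields and the `Newswire` method names are invented for the example rather than taken from the real service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamEvent:
    event_id: int
    author: str
    timestamp: int
    body: str


class Newswire:
    """Toy model of the service: every event is stored and also indexed."""

    def __init__(self):
        self.primary_store = {}  # stand-in for Postgres: event_id -> event
        self.search_index = {}   # stand-in for Elasticsearch: author -> [event_id]

    def record(self, event):
        # Write to the store of record, then make the event searchable.
        self.primary_store[event.event_id] = event
        self.search_index.setdefault(event.author, []).append(event.event_id)

    def feed_for(self, followed_authors, limit=10):
        # Answer the feed query from the index, newest events first.
        ids = [
            event_id
            for author in followed_authors
            for event_id in self.search_index.get(author, [])
        ]
        events = [self.primary_store[event_id] for event_id in ids]
        events.sort(key=lambda event: event.timestamp, reverse=True)
        return events[:limit]


wire = Newswire()
wire.record(StreamEvent(1, "alice", 100, "started a new treatment"))
wire.record(StreamEvent(2, "carol", 200, "updated symptoms"))
wire.record(StreamEvent(3, "dave", 300, "not followed"))
print([event.event_id for event in wire.feed_for(["alice", "carol"])])
```

Because feeds are computed by a query rather than precalculated per user, changing the feed algorithm means changing the query, which is what makes a flexible querying API attractive.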

00:07:51.970 Instead of reinventing GraphQL, I decided to use it. A key takeaway is that sometimes the answer to technical issues is right in front of you, but it can go unnoticed due to preconceived notions.

00:08:34.189 Hearing the conference talk expanded my perspective on potential solutions. So, I continued developing this proof of concept with minimal oversight, as I was working remotely with Amy in another location.

00:09:02.790 Why did we pursue this service extraction? We had a monolithic Rails application that made it difficult to implement changes efficiently. As people began to wonder why engineering seemed so slow, the popular answer became to 'try microservices.' We wanted to see if this helped resolve some of our challenges.

00:09:45.480 We decided to rewrite feeds as a service because we recognized existing coupling within the system. The original stream events had been built to tag medical history changes but began to clutter the feed.

00:10:09.480 We had to ensure that only pertinent stream events would show up in people's feeds. The project grew complicated, as we had to balance keeping the queries intact against the real boundaries that services add.

00:10:44.780 I want to highlight that our friendship made it difficult to uphold those boundaries. As close friends, Amy and I felt those pressures. We had to navigate emotions carefully on our leadership journey, realizing that leading with realistic optimism was crucial.

00:11:19.140 The added anxiety of project delays took a toll on my mental health, which ultimately affected how effectively I could lead. I learned that, as a manager, part of your responsibility is to manage the emotional climate of your team.

00:12:15.300 When I made an estimate on how long our project would take, I thought three months. However, project timelines in engineering can often be unreliable, and my number ended up getting lost in translation.

00:12:41.880 In retrospect, we shouldn't have started this as a skunkworks project. This led us to several hindsight lessons. First, it's crucial to market your project and get buy-in upfront. If I see a significant technical project that needs to get done without adequate resources, I’m now more likely to defer.

00:13:18.760 Secondly, it's critical to have a collaborative team. When other developers finally joined the project, they faced challenges understanding the code. The situation was cumbersome because someone held all the context in their head, leading to the need for substantial rewrites.

00:13:53.330 We also realized it would have been beneficial to involve specialist QA and infrastructure teams early to guide decisions. Unfortunately, not having a formal project structure made it difficult to engage those resources.

00:14:26.210 We fell victim to the classic blunder of the sunk cost fallacy. While skunkworks projects may appear to cost less, they often lead to higher costs in the long run. By starting on a project secretly, you create hurdles down the line when attempting to gather resources.

00:15:03.050 That approach makes it harder to get buy-in and complete your project later, because scope creep and the need for collaboration eventually arise. Successful projects require visibility and support, first and foremost.

00:15:46.960 So, after three months, we weren't ready to ship. However, a turning point came when a project manager who wanted to increase engagement on the website arrived. He pushed more fervently for progress, leading to a small team gaining traction to complete the project.

00:16:32.940 Despite rewriting most of the code, we began to feel some momentum entering early 2017. By this time, two additional requirements had emerged from the process: zero downtime for the deployment, and continued support for the old iOS app during the rebuild.

00:17:39.490 These requirements were unexpected, adding layers of complexity. I had been grappling with personal challenges, including a depressive episode while waiting on treatment. Our timeline began to stretch in ways I hadn't anticipated.

00:18:20.670 It’s important to connect decisions to actual business needs rather than arbitrary standards. We sought zero downtime because we believed good engineering practice demanded it, even though some downtime wouldn't truly have harmed our business.

00:19:35.060 By June 2017, our originally estimated three-month project had crossed the nine-month mark. However, two significant events transpired: we successfully shipped Newswire to production, and I began my ketamine infusions as part of mental health treatment.

00:20:15.370 During the migration, we found that our Elasticsearch cluster was under-provisioned. Fortunately, our expert was able to add nodes, and the system adapted efficiently to the increasing load.

00:21:28.030 Moreover, the system we had built proved resilient during the migration as our release did not cause complete disruption in user feeds. We learned an important lesson: as long as you can showcase the successful outcomes of a lengthy project, people will appreciate the success you achieved.

00:22:00.630 In mid-2017, we were internally frustrated due to project delays. However, over time, positive changes were noticed as we enhanced feed algorithms and launched new iOS and Android applications, eventually receiving positive feedback.

00:23:07.230 Feedback noted that the substantial initial investment allowed us to rapidly release new beta features, and that more than a year of integration work had left our services more flexible and responsive.

00:23:36.730 In conclusion, our experiences showcased that many engineering projects can feel risky, and while we were successful, mostly due to sheer luck, it’s essential to recognize that not every attempt will lead to favorable outcomes. Ultimately, it’s crucial to learn from your journey.

00:24:59.030 I want to acknowledge those who worked hard on this project, including Stephanie, who picked things up after I left the company. Thank you for your contributions, and I appreciate everyone for being attentive! Please feel free to reach out with any questions.