00:00:12.230
We both used to work at a company called PatientsLikeMe. I was formerly the Director of Engineering there. I'm now at Wistia, which is an excellent place to work. I'm based in Boston.
00:00:19.020
I was formerly a Principal Software Engineer at PatientsLikeMe on Amy's team, and I am now at ActBlue, which is another excellent place to work. I'm based in Seattle and work remotely. I was also working remotely when I worked for Amy.
00:00:35.429
This talk is about a project that we undertook while we were both at PatientsLikeMe. Throughout this presentation, you'll see rectangular boxes that Keynote provided us, which we call hindsight boxes. They represent lessons that we learned from this experience, and we would like to share them with you.
00:00:54.149
I want to give a little content warning. If you've seen me speak before, you know I tend to speak rather frankly about topics that are relevant to the story we're telling. If this is difficult for you, please take care of yourself in whatever way you need to. We won't be offended.
00:01:15.720
Now, let me tell you a little bit about PatientsLikeMe. It is a social network for people with serious and life-changing, sometimes fatal, chronic conditions. It's a place where users can track their symptoms, treatments, and the progress of their conditions. However, most users come to connect with others going through similar experiences, seeking support and information to help lead better lives.
00:01:43.080
This screenshot, which I found on my laptop from roughly 2013, shows what the development version of the site looked like. It features the news feeds on the website, which help connect users with each other. PatientsLikeMe uses Facebook-like social news feeds as one of its key features for helping people. They were originally built around 2010 or 2011.
00:02:00.870
Originally, the idea was based on a concept called data-driven journaling (DDJ). Instead of making a free-form post on a newsfeed, users would update their medical history, and the site would prompt them to describe what had changed. This way, data donation became a method of connecting with others.
00:02:36.879
Interestingly, the newsfeed would post updates regardless of whether the user opted to write something or not. The site was a classic Rails monolith that stored user-generated content, as well as metadata about medical changes, in a table called stream_events in our Postgres database.
00:03:10.689
When the controllers rendered a user's newsfeed, they made queries to determine what content to present based on whom that user followed. However, this began to slow down considerably, taking 30 seconds to one minute for our heaviest users to log in and view their feed.
00:03:39.519
To address this, we tried caching the slowest parts of the request. We used Redis, creating a sorted set for each user and precalculating the results. When new posts were added, we added them to the front of the correct user's sorted set. This solution improved load times dramatically, to less than a second.
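A minimal sketch of the precalculated, per-user feed idea described above. It is not the code from the project: a plain Python dict stands in for one Redis sorted set per user, and the follow graph, user names, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for one Redis sorted set per user:
# feeds[user] maps post_id -> score (here, the post's timestamp).
feeds = defaultdict(dict)

# Hypothetical follow graph: author -> set of followers.
followers = {"amy": {"bob", "cat"}, "bob": {"amy"}}

def publish(author, post_id, timestamp):
    """Fan out on write: add the post to every follower's precomputed feed."""
    for user in followers.get(author, ()):
        # Analogous to adding a member with a score to that user's sorted set.
        feeds[user][post_id] = timestamp

def read_feed(user, limit=10):
    """Return up to `limit` post ids, newest (highest score) first."""
    ranked = sorted(feeds[user].items(), key=lambda kv: kv[1], reverse=True)
    return [post_id for post_id, _ in ranked[:limit]]

publish("amy", "post-1", 100)
publish("bob", "post-2", 200)
publish("amy", "post-3", 300)
print(read_feed("bob"))
```

The design trade is that the expensive work moves from read time to write time: rendering a feed becomes a single ordered lookup, at the cost of storing one copy of each post reference per follower.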
00:04:11.319
However, there was a growth cap because, at the time, we had not set Redis up to be clustered properly. When we started running out of room for users and could not take everything down to set up clustering, we ended up truncating people's feeds to save space.
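Truncating a feed to save space can be sketched as keeping only the newest N entries. This is an illustration under assumptions, not the project's code: the feed is a plain dict of post id to timestamp, and `truncate_feed` and `max_items` are hypothetical names. In Redis, this kind of trim on a sorted set is commonly done with the ZREMRANGEBYRANK command; the talk does not say which approach was used.

```python
def truncate_feed(feed, max_items):
    """Keep only the max_items newest entries of a {post_id: timestamp} feed."""
    newest = sorted(feed.items(), key=lambda kv: kv[1], reverse=True)[:max_items]
    return dict(newest)

# Hypothetical feed with four entries; keep the two most recent.
feed = {"a": 1, "b": 5, "c": 3, "d": 4}
trimmed = truncate_feed(feed, 2)
print(trimmed)
```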
00:04:55.770
This situation arose during the summer of 2016. We were thinking of using a column store like Cassandra as a potential solution, despite not knowing much about it. Then I happened to get an email about a conference in Seattle focused on column data stores, so I attended and gained insights from various speakers about different options.
00:06:12.610
I learned that many folks faced challenges with new data stores too. They struggled with maintenance and configuration, which ultimately led them to revert to relational databases. One case was the Art Genome Project, which transitioned back to MySQL but indexed everything in Elasticsearch.
00:06:53.250
This sparked a realization for me. We already had operational knowledge of managing Elasticsearch, so I decided to build a proof of concept. I moved our entire stream_events table into an Elasticsearch instance, creating a microservice named Newswire.
00:07:13.800
The new system worked similarly to the old one but moved the stream events to Newswire. It would index them in Elasticsearch while also storing them in Postgres. Recognizing that there was a demand to experiment with feed algorithms, I wanted to provide a querying API to make it flexible.
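The write-and-index flow described above can be sketched as follows. This is a simplified illustration, not Newswire's actual code: a list stands in for the Postgres table, a dict stands in for the Elasticsearch index, and the field names (`id`, `author_id`, `created_at`) and function names are hypothetical.

```python
# Hypothetical stand-ins: a list plays the Postgres table (the record of truth),
# and a dict keyed by event id plays the Elasticsearch index (the query side).
postgres_rows = []
search_index = {}

def record_stream_event(event):
    """Store the event in the table of record, then index a copy for querying."""
    postgres_rows.append(event)
    search_index[event["id"]] = event

def query_feed(followed_ids, limit=10):
    """Answer a feed query from the index: newest events by followed authors."""
    hits = [e for e in search_index.values() if e["author_id"] in followed_ids]
    hits.sort(key=lambda e: e["created_at"], reverse=True)
    return hits[:limit]

record_stream_event({"id": 1, "author_id": "amy", "created_at": 10})
record_stream_event({"id": 2, "author_id": "bob", "created_at": 20})
record_stream_event({"id": 3, "author_id": "amy", "created_at": 30})
feed_ids = [e["id"] for e in query_feed({"amy"})]
print(feed_ids)
```

Keeping the relational table as the source of record means the index can be rebuilt from it, which is the pattern the Art Genome Project example pointed toward.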
00:07:51.970
Instead of reinventing GraphQL, I decided to use it. A key takeaway is that sometimes the answer to technical issues is right in front of you, but it can go unnoticed due to preconceived notions.
00:08:34.189
Hearing the conference talk expanded my perspective on potential solutions. So, I continued developing this proof of concept with minimal oversight, as I was working remotely and Amy was in another location.
00:09:02.790
Why did we pursue this service extraction? We had a monolithic Rails application that made it difficult to implement changes efficiently. As people began to wonder why engineering seemed so slow, the popular answer became to 'try microservices.' We wanted to see if this helped resolve some of our challenges.
00:09:45.480
We decided to rewrite feeds as a service because we recognized existing coupling within the system. The original stream events had been built to tag medical history changes but began to clutter the feed.
00:10:09.480
We had to ensure that only pertinent stream events would show up in people's feeds. The project grew complicated, as we had to balance maintaining the integrity of the queries against adding real boundaries between services.
00:10:44.780
I want to highlight that our friendship made it difficult to uphold those boundaries. As close friends, Amy and I felt those pressures. We had to navigate emotions carefully on our leadership journey, realizing that leading with realistic optimism was crucial.
00:11:19.140
The added anxiety of project delays took a toll on my mental health, which ultimately affected how effectively I could lead. I learned that, as a manager, part of your responsibility is to manage the emotional climate of your team.
00:12:15.300
When I estimated how long our project would take, I thought three months. However, project timelines in engineering are often unreliable, and my number ended up getting lost in translation.
00:12:41.880
In retrospect, we shouldn't have started this as a skunkworks project. This led us to several hindsight lessons. First, it's crucial to market your project and get buy-in upfront. If I see a significant technical project that needs to get done without adequate resources, I’m now more likely to defer.
00:13:18.760
Secondly, it's critical to have a collaborative team. When other developers finally joined the project, they had trouble understanding the code. The situation was cumbersome because one person held all the context in their head, which led to substantial rewrites.
00:13:53.330
We also realized it would have been beneficial to involve specialist QA and infrastructure teams early to guide decisions. Unfortunately, not having a formal project structure made it difficult to engage those resources.
00:14:26.210
We fell victim to the classic blunder of the sunk cost fallacy. While skunkworks projects may appear to cost less, they often lead to higher costs in the long run. By starting on a project secretly, you create hurdles down the line when attempting to gather resources.
00:15:03.050
That approach can lead to difficulties when it's time to gather buy-in and complete your project because you'll find that scope creep and necessary collaboration eventually arise. Successful projects require visibility and support—first and foremost.
00:15:46.960
So, after three months, we weren't ready to ship. However, a turning point came when a project manager arrived who wanted to increase engagement on the website. He pushed more fervently for progress, and a small team gained the traction to complete the project.
00:16:32.940
Despite rewriting most of the code, we began to feel some momentum entering early 2017. By this time, two additional requirements had emerged from the process: zero downtime for the deployment, and continued support for the old iOS app during the rebuild.
00:17:39.490
These requirements were unexpected, adding layers of complexity. I had been grappling with personal challenges, including a depressive episode while waiting on treatment. Our timeline began to stretch in ways I hadn't anticipated.
00:18:20.670
It's important to connect decisions to actual business needs rather than arbitrary standards. We sought zero downtime out of a belief that it was good engineering practice, even though some downtime wouldn't truly have harmed our business.
00:19:35.060
By June 2017, we had passed the nine-month mark on a project originally estimated at three months. However, two significant events transpired: we successfully shipped Newswire to production, and I began my ketamine infusions as part of mental health treatment.
00:20:15.370
During this migration, we found that our Elasticsearch cluster was under-provisioned. Fortunately, our expert was able to add nodes, and the system adapted to our increasing load.
00:21:28.030
Moreover, the system we had built proved resilient during the migration, as our release did not completely disrupt user feeds. We learned an important lesson: as long as you can showcase the successful outcomes of a lengthy project, people will appreciate what you achieved.
00:22:00.630
In mid-2017, there was internal frustration about the project delays. However, over time, people noticed positive changes as we enhanced feed algorithms and launched new iOS and Android applications, and we eventually received positive feedback.
00:23:07.230
Feedback noted that the substantial initial investment allowed us to rapidly release new beta features, and that the integration, the result of over a year's work, positioned our services to be more flexible and responsive.
00:23:36.730
In conclusion, our experiences showcased that many engineering projects can feel risky, and while we were successful, mostly due to sheer luck, it’s essential to recognize that not every attempt will lead to favorable outcomes. Ultimately, it’s crucial to learn from your journey.
00:24:59.030
I want to acknowledge those who worked hard on this project, including Stephanie, who picked things up after I left the company. Thank you for your contributions, and I appreciate everyone for being attentive! Please feel free to reach out with any questions.