00:00:18.740
Well, thank you everyone for being here on this last afternoon of the conference.
00:00:24.539
It's been an amazing conference so far. It's my first time attending a Rails conference and speaking at one.
00:00:38.219
Thank you! I'm quite stoked, excited, caffeinated, and sufficiently nervous.
00:00:45.000
Incidents happen—unpredictable things occur with computers in production, and humans are paged to help resolve them.
00:00:51.660
Some incidents are quite stressful, while others are relatively alright. Now, engaging with a small audience activity—how many of you have written code as part of your application that has behaved unpredictably or resulted in an outage, catching you off guard? Great, almost all of us! That's great because bad things happen in production all the time.
00:01:16.799
Is anyone on call right now? Okay, hopefully not, but if you are, and if you get paid, it could be a fun conference activity to live debug a page together. I'm just kidding.
00:01:28.619
Today, I'm not going to tell you how to write the most perfect Rails code or PostgreSQL queries so that your application never faces an incident. I don’t claim to be an expert in either of these technologies; however, I love Rails and I love PostgreSQL. Of course, I don't love outages, especially those resulting in customer impact or business implications.
00:01:55.979
However, I think that's what makes these incidents so meaningful. They provide a unique opportunity to improve your current systems and processes. If you reflect deeply on all the contributing factors that lead to an outage, you often come out learning something new about your system.
00:02:02.399
Today, I'll share a bit about a specific production incident that, surprisingly, I’m glad occurred. It was an incident that forced us to rethink our approach from first principles, helped identify interesting edge cases, and prompted us to design our system in a way that scales better using some intriguing PostgreSQL and Rails concepts.
00:02:14.459
We'll start by discussing an incident where we lost all our dynamically scheduled customer jobs. I will share the lessons we learned and how we went back to first principles in designing a more predictable and scalable system using PostgreSQL techniques. I'll also cover different kinds of locking mechanisms in PostgreSQL and what worked well for us, especially skip locks, and some interesting ways we found it useful.
00:02:51.480
Next, we’ll talk a bit about advisory locking and how we can use that to synchronize concurrent application workloads without running into lock contention. If you attended the ‘Off to the Races’ talk right before this one, some of this might feel familiar. Lastly, we’ll conclude by discussing timeouts, one of my favorite topics, and some useful patterns for auto-healing applications in production when a connection issue occurs in ActiveRecord.
00:04:01.860
My goal for today is to leave you with really short, simple, and actionable takeaways about Rails and PostgreSQL. While it's possible that everything I share today may not directly apply to you, I hope you’ll find something new and informative.
00:04:13.920
I'm Shayon. I've been programming in Ruby and Ruby on Rails for about 10 years now. My very first experience with Rails was working on a Google Summer of Code project, and from there, there was no stopping me.
00:04:20.340
Here are my social handles if you’d like to stay in touch. I live in Cambridge, and that's my dog, Ali. I just need an excuse to insert his photo wherever I can! His name may or may not have been inspired by the shorthand form '011y,' which I thought would be fun given my interest in observability at the time.
00:05:02.340
On a more serious note, I work at Times on the platform engineering team, tied to the no-code automation platform for security teams. We have customers ranging from startups to Fortune 1000 companies, and we like to use Rails for all our business-critical operations.
00:05:20.880
So, let’s dive right in. This incident happened in the fall of last year when our support team was alerted by a curious customer expecting a specific job of theirs to run on time at the hour mark, but it didn’t.
00:05:33.600
Before diving into the details of the incident, let’s give some context about how scheduling works in our application.
00:05:44.460
At a high level, our time setup isn't very different from your modern-day Rails applications. We have a front-end with React, Relay, and GraphQL, and on the back end, it's Ruby on Rails. It’s primarily a monolith with different invocation points for certain application workloads, like web services and various worker services. It is deployed across multiple containers that can be horizontally scaled out and vertically scaled up.
00:06:02.039
For databases, we use PostgreSQL, and for caching, we use Redis. We also use Sidekiq for processing all our asynchronous workloads. Notably, Times is a multi-tenant application, and for our enterprise customers, we have the ability to spin up Times in a single clustered mode, providing each customer with their own dedicated database, Redis, and VPC subnet.
00:06:20.940
This is a screenshot of how one of the product screens looks within Times, referred to as a story. A story has the ability to have many actions, each with different characteristics. You can connect one action to another, passing along event data generated from any previous action, allowing you to perform and receive HTTP requests.
00:06:39.060
You can massage the data in various formats using built-in formulas, communicate with third-party APIs, send emails, and execute many other functions, all without having to write any code. I promise that's the last time I'll be talking about Times as a product today.
00:07:10.380
Now, let's take an example: a story that is responsible for querying a list of Cisco Security Advisories, looping over them, deduping any advisories based on certain logic, and then depending on the result, triggering an email to either the WebEx customers or the UCS Director customers. In this case, it's very simple.
00:07:33.600
Let’s say I want the story to run every hour because my compliance team has instructed that we must respect our SLAs for any critical advisories we've discovered. It’s crucial to ensure that emails are dispatched in a timely manner.
00:07:55.680
Now, this is what the data stored in the database to look like during the actions table, along with additional information related to the actions' scheduling, including the cron equivalent of when that action should run and how often.
00:08:08.520
Let’s talk about how an action that needs to run at a specific time is scheduled. There are several ways to schedule a job with Sidekiq—we use a popular gem called Static Scheduler. When scheduling jobs, there are two types: static jobs and dynamic jobs.
00:08:39.600
Static jobs are a finite set of Sidekiq jobs defined in a YAML file, primarily internal jobs that run on a schedule. For example, we run a job that keeps the Gravatars up-to-date every hour, and every five minutes, we send some data to our analytics store. Dynamic jobs allow you to tell the Sidekiq scheduler to manage jobs by setting the dynamic attribute to true.
00:09:09.420
By setting the dynamic attribute to true, the Static Scheduler spins up an internal thread that runs every five seconds to see if any new jobs need to be scheduled. If so, it enqueues the job in Sidekiq. The Sidekiq scheduler uses Rufus Scheduler, another popular gem.
00:09:36.480
When a new dynamic job is added or updated by a customer through the UI, we append the same in Redis just as we do in our primary database in PostgreSQL. We have a helper function in the Sidekiq module called set_schedule which appends the job in a Redis set. The Rufus scheduler checks the Redis set every five seconds for newly scheduled jobs to be enqueued.
00:10:07.680
An interesting thing to note is that the Sidekiq scheduler clears the schedule for both static and dynamic jobs in Redis whenever a Sidekiq worker shuts down. When a Sidekiq worker starts up, it reads the schedule and populates the set in Redis. Moreover, we have an internal job that ensures all dynamic jobs in PostgreSQL are reflected in Redis whenever a Sidekiq process starts as a precautionary measure.
00:10:45.840
This internal job first clears the schedule and then re-adds both the static and dynamic jobs. So, during a deployment, there are three parts of the system that are updating the schedule set: first, when the Sidekiq server starts up, it prompts the static scheduler to clear the schedule from Redis.
00:11:30.540
Second, when the Sidekiq server starts up, our internal job resets the schedule first and then repopulates the static and dynamic jobs. Finally, the additional thread spun up by the Static Scheduler runs every five seconds to reconcile the schedule and enqueue any scheduled jobs to ensure they run at the correct time.
00:11:46.860
Even though these operations are intended to be atomic, a lot can happen, especially during a deployment. If anyone in the audience has experience with distributed systems, your spidey senses might be tingling now.
00:12:07.620
Now that you have a clearer understanding of the setup, let’s look at the incident. On the day of the incident, around 12:56 PM, a deployment was conducted. A deployment typically performs a rolling restart of all our containers in batches, and around 12:59 PM, old containers were shutting down while new containers were coming up.
00:12:56.380
At 12:59 and 49 seconds, one of the new processes experienced a fatal exception and crashed. At one second past 1 PM, a new container came online and registered the schedule in Redis. During this time, any scheduled action that was supposed to run at 1 PM was missed.
00:13:14.760
This was due to a 12-second window after 12:59:49 where there was no schedule registered in Redis, resulting in the skipping of any hourly job. Even if it was just one second past 1 PM, jobs that were supposed to run at this time would be pushed to the next hourly schedule.
00:13:41.880
The fatal exception occurred in the Sidekiq scheduler due to a race condition caused by our internal job. This job, responsible for updating the schedule, ran every five seconds, reconciling any differences and enqueuing static and dynamically scheduled jobs.
00:14:01.260
During the startup of a static container, the reconciliation thread pulled the list of jobs from Redis just as a new container, part of the deployment, began to enqueue dynamic jobs. The internal job started by clearing the list from Redis first.
00:14:09.660
The timing of the internal job clearing the list from Redis overlapped with the reconcile thread trying to update the schedule, causing the reconcile thread to crash when it could not correctly reference the list of jobs.
00:14:26.580
At that moment, there were no active scheduler instances, and no scheduled jobs were queued at the hour mark. The reconcile thread crashed because it could not correctly reference the scheduled jobs due to the absence of registered jobs in Redis.
00:15:01.020
The crash and lack of jobs registered meant that any job required to run missed its time altogether, which was critical because this incident forced us to evaluate the various contributing factors.
00:15:39.720
We realized that the scheduling system had its shortcomings and that we needed to improve promptly. A new scheduler system was essential to address these issues and enhance reliability.
00:16:00.600
Moving back to first principles, we started by gathering requirements for a new potential system. As a product offering, the lowest denomination we wanted to support was running action jobs per minute, meaning customers could schedule jobs to run as frequently as once a minute, or less.
00:16:24.360
We aimed to have fewer moving pieces that update a central part of the system to reduce unintended code complexities. Also, we needed to ensure that the system would be suitable for customers who host Times in self-hosted environments or on-premise setups.
00:16:35.220
Furthermore, I prefer ‘boring’ technology; that is, introducing new technology incurs operational costs, engineering time, and training, so ‘boring’ solutions force innovation by encouraging the reuse of existing technologies.
00:17:12.279
Finally, we recognized that it’s acceptable for action runs to be delayed slightly instead of completely dropping them. For instance, if an hourly job runs 10 seconds late, it's better than missing it entirely.
00:17:25.020
With this understanding, we decided to continue using the Sidekiq scheduler for all our static jobs, which performed adequately, and implement a new scheduler for dynamically scheduled customer jobs.
00:17:41.520
We’re now using PostgreSQL as the primary source for all schedules and their subsequent runs. Every minute, we run a Sidekiq job queued by the Sidekiq scheduler that acts as a fan-out job, querying the database for action runs scheduled for the current minute and queuing those jobs for processing.
00:18:09.960
This new static job, let’s call it the fan-out job, runs every minute, pulling all scheduled customer jobs in the table that are meant to be executed at that moment. It invokes an instance function named ‘run’ for each action schedule object.
00:18:37.260
To ensure accurate processing, this function checks if the ‘next_run_at’ column for the action schedule instance is past the current time. If it is, the job is skipped, ensuring it will be picked up during the next run.
00:19:01.860
Then we calculate when the job should be run next, persist that in the action schedule row, and enqueue the job. We do this while applying optimistic concurrency control to avoid scheduling the same job twice.
00:19:15.660
For example, if two copies of the fan-out job were running at the same time, only one should enqueue the job successfully. We achieve this by ensuring that any update succeeds only if we fetch the same row.
00:19:45.720
Now, let’s zoom out and understand how optimistic and pessimistic locking works with PostgreSQL. Optimistic locking allows a specific record in the database table to be open for all users and transactions. It does not lock any database records upon reading but checks for changes at the time of updates.
00:20:25.470
In contrast, pessimistic locking explicitly locks records during the current transaction. Other transactions must wait for the current transaction to finish. If not designed carefully, this can lead to deadlock scenarios.
00:20:45.660
Our new scheduler system exemplifies optimistic concurrency control. In our implementation, we query the row by several attributes and call ‘update_all’ on the same row with a future date. If no results yield from the select call, the update call effectively becomes a no-op, allowing us to skip the enqueue step.
00:21:14.580
In contrast, the pessimistic locking method would explicitly lock the record, which poses downsides: the risk of lock contention and the potential for enqueuing jobs twice, where one job waits for the first job to release the lock.
00:21:42.780
Thus, we opted for optimistic concurrency control to ensure an at-most-once delivery of jobs with the new system. We can now confidently assert that even if the fan-out job is slightly late, the hourly job won't be missed like before.
00:22:00.120
The original fan-out query retrieves all relevant rows from the database without missing any. We were able to prototype the change within a few days because the code required was minimal.
00:22:20.100
Once we deployed this to some internal stacks, it performed much better than the previous system, with no incidents reported. Our monitoring metrics indicated no missed scheduling runs.
00:22:42.420
After a week, we rolled this out to all customers without encountering any issues since then. Moving on, let’s explore the Skip Lock feature in PostgreSQL—one of my favorite aspects of the database.
00:22:58.560
Skip Lock was introduced back in PostgreSQL version 9.5 and serves a significant function: building reliable concurrent workloads within job queueing systems. Traditionally, when an update query runs, PostgreSQL locks each row until the transaction is committed or rolled back.
00:23:19.920
For most cases, this behavior ensures concurrent updates are applied correctly or throw deadlock errors. However, if only the first or last update matters, you can utilize Skip Lock to bypass contention.
00:23:39.960
Consider the Actions table we discussed earlier. Every action generates an event that captures relevant metadata, including columns like last_event_at and last_error_log. Regular workloads typically don’t pose issues, but with large lists of vulnerabilities processed simultaneously, we risk contention.
00:24:00.000
Let’s say multiple Sidekiq workers update the same action record concurrently, such as calling ‘agent.touch’ for the last_event_at column. This can lead to block contention and reduced database performance.
00:24:20.220
To resolve this, we can introduce a function called ‘update_columns_with_skip_lock’ that safely updates a column without encountering lock contention. By leveraging this function, we can avoid blocking situations and allow for greater throughput on concurrent workloads.
00:24:44.160
Here’s how it would function: we take advantage of an existing record method on the application record class, appending the skip lock clause to the select statement from the find_by method.
00:25:05.040
Therefore, we proceed to update the row without concern that another transaction is already modifying it. With this new function, we can utilize it across all model instances and reduce lock contention, leading to better performance.
00:25:25.680
This leap in efficiency allowed us to lower our P99 latency from 1.5 seconds to 100 milliseconds and enabled processing of higher event volumes. These performance metrics are from our Honeycomb instrumentation, revealing the impact of the changes.
00:25:44.760
Next, let’s discuss advisory locks, which provide a controlled means for managing locks directly through the application. These locks are ideal where standard MVCC mechanisms may not suffice.
00:26:00.480
By using advisory locks, we can synchronize access to shared resources or third-party APIs, preventing simultaneous operations across various system components. To acquire an advisory lock, you can provide a unique 64-bit number to the relevant function.
00:26:35.760
Alternatively, you can utilize two 32-bit numbers. Instead of working directly with SQL, we can leverage Ruby to achieve this. These two gems are among my favorites for managing advisory locks.
00:27:00.840
For example, you would wrap a Sidekiq job in an advisory lock block with a timeout. If another Sidekiq job is scheduled while the lock is held, it won’t perform actions inside the block, effectively preventing duplicate processing.
00:27:38.760
Now for the last chapter, I want to provide lightweight operational tips and best practices for PostgreSQL in production, starting with timeouts. It's far better to timeout a single query than allow a slow query to occupy 100% of database resources and hinder other system components.
00:28:06.600
If you're using PgBouncer or another connection pooler, you can set query timeouts to manage performance effectively. For instance, I like setting timeouts in the database.yaml file for consistency across application connections.
00:28:24.960
Moreover, ensure you have session timeouts in place to prevent lingering idle transactions, which can be detrimental to performance. These figures serve as examples, and you can calibrate them based on your application needs.
00:28:44.880
Next are connection pools. If a database failover occurs, and it self-recovers, application connection pools may still be pinned to the previous writer instance. In such cases, connections might hang waiting for a response, which necessitates a manual intervention, such as a rolling restart of the application.
00:29:15.360
To facilitate self-healing, we can monkey patch ActiveRecord to handle exceptions raised by PostgreSQL during a failover. If a common exception occurs, we flush the local Connection, aiming to eliminate stale connections. We also retry non-transactional queries with an exponential backoff to avoid being stuck in infinite loops. If a query was in progress, we re-raise the error post-clear.
00:30:25.200
Thus, when the same worker or thread attempts a new query, it can create a connection from the pool that establishes a fresh connection. This approach helps achieve auto-healing in production as it allows applying new queries until the database becomes fully operational again.
00:30:43.920
If you want to explore this further, we've recently open-sourced our solution as a gem, available at github.com/time-rails/pg-adapter. The behavior resembles what I've outlined here, but there are some interesting advancements being made thanks to Rails core and Shopify.
00:31:02.400
In conclusion, we've covered a complex incident that I hope was engaging. We examined practical uses of optimistic and pessimistic locking, skip locks, and advisory locks while exploring operational best practices such as timeouts and auto-reconnects.
00:31:36.600
Thank you for sticking with me throughout this session! A special thanks to Noel, Serena, Connor, and my team at Times for their feedback and support. If you're curious about our team's name, Team Carbon, feel free to find me afterward—I’m happy to share.
00:31:53.520
Lastly, if you found value in what I've discussed today, we have a few open roles at Times. Significant work is happening across our product and engineering teams, so check out our careers page! Once again, if you'd like to get in touch, these are my handles. I'll be here in the room on that side to answer any questions.