From Request To Response And Everything In Between

00:00:11.200 Thank you for being here. Your next presenter is a senior software technical lead for Home Chef and hosts the Day as a Dev podcast where he interviews developers from all walks of life. This offers students and those interested in tech a practical look at life in the industry, so please welcome Kevin Lesht.

00:00:18.199 Thank you, and thank you all for coming! For those of you who don't know, my name is Kevin Lesht, and my talk today is titled "From Request to Response and Everything in Between." Well, most everything in between. Okay, okay, some stuff in between. It turns out there's a lot.

00:00:31.880 Today we're mostly going to focus on how requests move through our app environment to receive a response and how we can configure our app environment to handle many requests coming in all at once. But first, a little about me. I work at Home Chef where I'm a member of our engineering team. I've been with Home Chef for eight years, so it's a pretty great place to work. It's also given me the experience to see a whole lot of different traffic patterns as our business has grown over the years.

00:01:05.320 I followed the work that our platform team was doing to manage all this traffic, but it really wasn't until producing this talk that I felt like I had a solid grasp on everything going on. What I've bundled up here is what I wish I had known years ago. Two other quick notes before we get started. First, in 2022, I gave a talk titled "Geolocation EXPLAINed," which explained IP address networking fundamentals, database operations, and query performance tuning. The deck for that talk came in at 187 slides and I didn't think there was any chance that I would ever go beyond that, but here we are today with 359 slides for this talk. So buckle up and try not to blink!

00:01:59.280 Second note: credit where credit is due. This talk was built from a ton of research from resources provided by the community, particularly Nate B's work on and around Puma. All of that research will be linked at the end, and I definitely encourage you to check it out if you're interested in learning more about any of the topics we cover. Now, onto the show.

00:02:51.440 So, I woke up this morning and, as I do every day, made myself an omelet. I took a picture and posted that picture to my omelet blog, FiveMinutes.com. The plating was not so great on this one, but we'll talk more about that later. Now, Five Minutes is a Rails app—why not? To produce that web page, what we're all probably familiar with is a request coming into our app, the router pointing that request to the controller, the controller running any logic, and then a response, such as a view render, being returned.

00:03:26.440 But what's really happening here? How does that request even make it through our app environment to receive a response? And what happens when we have a lot of them coming in all at once? To begin, we're going to start by looking at a Rails environment configured to only handle one request at a time. In this setup, when a request is sent to a DNS record (like FiveMinutes.com) and directed to the backing IP address, our hosting provider's load balancer will accept that request and direct it to a server, where it's received by the server socket.

00:04:38.000 Our web application server will then send that request to our Rails process, where it is serviced by a thread that returns the response. Only after we return a response will our process be available to accept another request for work. If other requests come in while we're in the process of serving a request, those requests are held on the server socket in a queue. If you're like me, you've probably experienced this without ever really appreciating it. Going back to the omelet blog (I was really trying to make this thing go viral, by the way, so tell your friends!), my plating abilities aren't the only issue going on with the site; we do have a few bugs.

00:06:22.680 We decide to give one of those a look by throwing a debugger breakpoint like binding.pry into our favorite problematic part of the app. Then we send a web request to catch it. But just then we get a little distracted and want to visit some other page on our site, so we open up a new tab and try to go to localhost:3000, and nothing happens. If you're luckier than me, you'll quickly remember that you have a pry in place, and that it will keep holding that earlier request open until you release the pry and allow the request to return a response. All other incoming requests, like the later visit, are held at the socket and wait until our process is available to handle new work.

00:07:29.800 Once we finally let go of that pry, our process can finish servicing that first request to return a response and signal that it's ready for new work, at which point our socket will provide our process with the next request. So, although our pry example here of a request being left open for a long amount of time and blocking other requests from being accepted until a response can be returned is something that we really only experience in our development environment, it's a good illustration of what might happen in a system where only one request can be processed at a time.
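The blocking behavior described here can be sketched in a few lines of plain Ruby. This is only a toy model, not how Puma or the operating system actually implement the socket queue: a `Queue` stands in for the server socket's request queue, one worker thread stands in for the single Rails process, and the request names and sleep durations are made up for illustration.

```ruby
# Toy model (an assumption for illustration, not Puma's implementation):
# a Queue plays the role of the request queue on the server socket,
# and a single worker thread plays the role of our one Rails process.
request_queue = Queue.new
completed = []

worker = Thread.new do
  # Take one request at a time, in arrival order. `pop` returns nil
  # once the queue has been closed and drained, which ends the loop.
  while (request = request_queue.pop)
    sleep(request[:service_seconds]) # pretend to do the work
    completed << request[:name]
  end
end

# The first request is slow, like one held open by a debugger.
request_queue << { name: "request held by a debugger", service_seconds: 0.2 }
request_queue << { name: "second browser tab", service_seconds: 0.01 }
request_queue << { name: "third browser tab", service_seconds: 0.01 }
request_queue.close
worker.join

puts completed
```

Even though the later requests need almost no time themselves, they finish only after the slow one, because nothing else is available to pick them up.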

00:09:13.679 Now, I do want to call out that request queues are a good thing. Without a queue, if our process wasn't in a position to accept a new request, any other incoming request would just be rejected. Our queue helps us hold on to any incoming requests while our process is busy, until it's in a position to accept new work. To frame this as a real-world example, let's imagine a RailsConf scenario. In our example, we'll have one of our fantastic volunteers working at a station for moving attendees into the conference while there's a big long line of attendees.

00:10:30.760 The volunteer accepts an attendee who presents their ticket for scanning. The volunteer then grabs the attendee's name badge and t-shirt, hands them off, and sends the attendee on their way. Once that attendee has moved through the station, the volunteer can now accept another attendee, where the same process would then repeat. The line of attendees can be thought of as requests waiting on our server. Our station can be thought of as a Rails process, and our volunteer can be thought of as a thread within that process.

00:11:47.679 Now, our volunteers at RailsConf are amazing! They’re able to get our attendees into the conference super quickly, and just like that, a performant Rails app will be able to process a request super quickly too. However, RailsConf is a pretty popular conference, and I’m sure all of our Rails apps are popular as well. So, even with the best volunteers and the best Rails app, if we're limited to only handling one attendee (or request) at a time, and the traffic we receive outpaces the time it takes to process that traffic, our queue of requests is just going to grow longer and longer.

00:13:05.520 We really want to be able to handle more than one request at a time, but we set our environment up as a system only able to service one request at a time. So how do we then work on multiple requests all at once? Let's think about how we'd approach this in our RailsConf example. Right now we've only got one station and one lonely volunteer responsible for managing everybody trying to get into RailsConf, and they're great, but they need some help. So we call a friend, and now we've got two volunteers, each with their own station.

00:14:14.680 We still have our big long line of attendees trying to get into the conference, but now any open station can accept an attendee, where a volunteer can scan their ticket, get their name badge, t-shirt, and send the attendees they’re working with into the conference. Then, again, they can accept a new attendee to repeat the process. By adding more stations, we’ve expanded our ability to take attendees off of the line, and we’ve increased the throughput of attendees able to enter the conference. Just like that, we’ve doubled our capacity for work.

00:15:29.160 So, if we were looking to apply this to our Rails environment, we simply spin up more processes. Yes, it’s time to talk about web application servers, which for our conversation is going to be Puma. There are other web application servers out there that you can use to manage your Rails infrastructure, like Unicorn, Passenger, and others. For the most part, what we’ll talk about today would apply when using those as well.

00:16:50.720 Puma is a web server that handles requests and offers a system for managing the processes that will service those requests. I say processes here because with Puma, it doesn't have to be just one process per server. Puma is capable of running multiple processes per server. There's a general rule that you can run one Rails process for each CPU core provided by your server. Now, each one of these processes will still only be working on one request at a time, but at least now we can handle a request for each process that we have running, all at the same time.
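As a rough sketch of the one-process-per-core rule of thumb, the snippet below asks Ruby how many cores the machine reports and caps a planned process count at that number. The `processes_wanted` value is a made-up figure for illustration. In a real Puma configuration file, the process count is set with the `workers` option and the thread bounds with `threads`; the script here only does the arithmetic and does not configure Puma.

```ruby
require "etc"

# Number of CPU cores the operating system reports for this machine.
cpu_cores = Etc.nprocessors

# Hypothetical figure: how many single-threaded processes we think
# our traffic calls for.
processes_wanted = 6

# Rule of thumb from the talk: at most one Rails process per core.
processes_on_this_server = [processes_wanted, cpu_cores].min

# If one server cannot hold them all, more servers are needed.
servers_needed = (processes_wanted.to_f / cpu_cores).ceil

puts "cores reported:           #{cpu_cores}"
puts "processes on this server: #{processes_on_this_server}"
puts "servers needed:           #{servers_needed}"
```

The printed numbers depend on the machine the script runs on, so no particular output is shown.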

00:18:08.680 As requests arrive at our server, Puma will direct those requests to any process available for handling. Should requests come in while all of our processes are busy, they’ll be held on the server socket in the request queue until any one of our processes becomes available to accept new requests for work. We now have a system capable of handling multiple requests all at the same time. But how do we know how many requests we might need to handle at the same time?

00:19:38.240 To answer that question, let's re-center around our number of processes, whatever that number might be. Knowing that each process is running a single thread, and each thread can handle one request at a time, our process count gives us the total number of requests that can be worked on at any given point in time. To ensure that we don't back up, we need to service requests faster than the pace at which we receive them. There are two parameters to look at for reaching an equation that provides the optimal number of processes. Those parameters are the number of requests that we're receiving per second (our request arrival rate) and the average total time it takes to process one of those requests.

00:21:05.640 Here, the total time is the amount of time a request spends in the queue plus its service time. To visualize this, let's go back to our example of getting into RailsConf. Say we've got five attendees arriving per second, and a volunteer at one of our stations can process an attendee in 75 milliseconds, which is totally reasonable. At this pace, the first attendee wouldn't spend any time in line; they walk right into a station where our volunteer scans their ticket, grabs their name badge and t-shirt, and moves them through the door in 75 milliseconds.

00:22:25.500 The next attendee has been waiting in line for 75 milliseconds while that first attendee was being serviced. Now that our station is open for work, our second attendee moves through the door in another 75 milliseconds. The third attendee has now been waiting in line for 150 milliseconds and makes it through the door in another 75 milliseconds. The fourth attendee has now been waiting in line for 225 milliseconds and makes it through the door in another 75 milliseconds.

00:24:05.040 Finally, our fifth attendee has now been waiting in line for 300 milliseconds and makes it through the door in 75 milliseconds. As we carry on here, working to get our attendees into the conference, at each second, another five attendees show up and need to be handled. To make sure we're able to keep up with our arrival rate of attendees and that our queue doesn't grow to a critical state, we need to ensure that the time it takes to clear each second's worth of arrivals is less than one second. In our example, if five attendees are showing up every second, then we need to make sure that we're moving those five attendees through the doors in at most one second of time. Any longer than that, and our volunteers wouldn't be able to keep up.

00:25:49.200 Now, if we were to frame this around web requests, here we have a breakdown of five requests, the time they're spending in the queue, and, after being served, the total time it took from the moment the request arrived at our server to the moment that request received a response. If we were to sum up all the total response times and then divide by the number of requests that occurred in our one-second window, we see that, in this case, our average response time is 225 milliseconds, which isn't terrible for an average response time. However, our total time taken to process those five requests comes in at 1,125 milliseconds, which is 125 milliseconds over one second. So with a request arrival rate of five requests per second, when the next set of five requests comes in, we are already running behind schedule.
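The breakdown referred to here is on a slide that isn't part of the transcript, but the numbers can be rebuilt from the figures given. A minimal sketch, assuming all five requests arrive together at the start of the second and a single process serves each one in 75 milliseconds (variable names are illustrative):

```ruby
service_ms = 75      # time the single process spends on one request
request_count = 5    # assumed to arrive together at the start of the second

process_free_at = 0  # ms at which the lone process can take the next request
breakdown = request_count.times.map do |index|
  wait_ms = process_free_at        # time spent sitting in the queue
  total_ms = wait_ms + service_ms  # queue time plus service time
  process_free_at = total_ms
  { request: index + 1, wait_ms: wait_ms, total_ms: total_ms }
end

sum_of_totals = breakdown.sum { |row| row[:total_ms] }
average_ms = sum_of_totals / request_count

breakdown.each do |row|
  puts format("request %d: waited %3d ms, total %3d ms",
              row[:request], row[:wait_ms], row[:total_ms])
end
puts "sum of response times: #{sum_of_totals} ms" # 1125 ms
puts "average response time: #{average_ms} ms"    # 225 ms
```

The queue waits come out as 0, 75, 150, 225, and 300 milliseconds, matching the walkthrough of the five attendees.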

00:27:30.080 That first request from our first set of traffic was accepted right away and experienced a wait time of 0 milliseconds. Now in this next set of traffic, that first request will need to wait 125 milliseconds before it begins being processed. Summing up our total response times and dividing that number by the total requests, we can see that our average response time is now 350 milliseconds. If we again take the sum of total response times, we're now at 1,750 milliseconds for five requests, or 750 milliseconds over that one-second mark.

00:29:05.920 As we compound this issue, in the next second, another set of requests arrives and now the first request from this new set is waiting 750 milliseconds in the queue before it can be serviced. Once again, summing total response times and dividing by the number of requests, now our average response time lands at 975 milliseconds, with our total response time landing at about 5 seconds. We are now about four seconds over that one-second limit.
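The way the backlog compounds can be reproduced with a short loop. This sketch follows the talk's own bookkeeping rather than a formal queueing model: whatever amount a batch's summed response time exceeds the one-second window by is treated as the starting wait for the next batch. The variable names are illustrative.

```ruby
service_ms = 75
requests_per_second = 5
window_ms = 1_000

first_wait_ms = 0 # queue time faced by the first request of a batch
history = []

3.times do |second|
  # Each later request in the batch waits one more service time.
  totals = requests_per_second.times.map do |i|
    first_wait_ms + (i * service_ms) + service_ms
  end
  sum_ms = totals.sum
  history << { second: second + 1, first_wait_ms: first_wait_ms,
               average_ms: sum_ms / requests_per_second, sum_ms: sum_ms }
  # The talk's bookkeeping: the overage becomes the next batch's head wait.
  first_wait_ms = [sum_ms - window_ms, 0].max
end

history.each do |row|
  puts format("second %d: first wait %3d ms, average %3d ms, sum %4d ms",
              row[:second], row[:first_wait_ms], row[:average_ms], row[:sum_ms])
end
```

Under these assumptions the averages come out at 225, 350, and 975 milliseconds, and the third batch's summed response time is 4,875 milliseconds, which is the "about 5 seconds" mentioned above.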

00:30:25.000 So, an overage of just 125 milliseconds has quickly run away from us! Luckily, we can use our parameters of request arrival rate and average total response time to derive a number of processes that will put us in a place where we can keep up with the traffic. Here's that equation: request arrival rate multiplied by average total response time equals the number of processes needed. Plugging in our numbers from that first set of requests, a request arrival rate of five requests per second and an average response time of 225 milliseconds (0.225 seconds), we get 5 × 0.225 = 1.125, so we would want a little over one process available. We would want to round up here because otherwise we'd fall behind.
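The equation is small enough to write as a function. The sketch below uses an illustrative method name, takes the response time in milliseconds, and converts it to seconds so the units match the per-second arrival rate. The equation has the same shape as Little's law from queueing theory, though the talk doesn't name it.

```ruby
# processes = arrival rate (requests/second) * average response time (seconds)
# The method name is illustrative, not from any library.
def processes_needed(requests_per_second, average_response_ms)
  requests_per_second * average_response_ms / 1000.0
end

raw = processes_needed(5, 225)
puts "raw estimate: #{raw}"      # 1.125
puts "rounded up:   #{raw.ceil}" # 2
```

Rounding is always upward: a fractional process can't be run, and rounding down would leave less capacity than the traffic calls for.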

00:31:40.360 To manage our traffic, we want to bring in a second process to handle the load. Now, with two processes available, two requests can be accepted right away with no wait time in the queue, and each one can be served in 75 milliseconds. Then, the next two requests can be picked up and so on. Again, taking the sum of our total response times and dividing by the number of requests, we land at an average response time of 135 milliseconds and a total time to process our set of traffic at 675 milliseconds, putting us well on track at a faster pace than our arrival rate. Our queue doesn't back up.
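To check the two-process numbers, here is a small simulation under the same assumptions as the talk's example: five requests arriving together, 75 milliseconds of service each, and every request going to whichever process frees up first. The method and parameter names are illustrative.

```ruby
# Returns each request's total response time (queue wait plus service), in ms,
# for requests that all arrive together and are served by identical processes.
def response_times(request_count:, service_ms:, processes:)
  free_at = Array.new(processes, 0)    # when each process is next available
  request_count.times.map do
    slot = free_at.index(free_at.min)  # the earliest-available process takes it
    wait_ms = free_at[slot]
    free_at[slot] = wait_ms + service_ms
    wait_ms + service_ms
  end
end

one_process = response_times(request_count: 5, service_ms: 75, processes: 1)
two_processes = response_times(request_count: 5, service_ms: 75, processes: 2)

puts "1 process:   sum #{one_process.sum} ms, average #{one_process.sum / 5} ms"
puts "2 processes: sum #{two_processes.sum} ms, average #{two_processes.sum / 5} ms"
```

With two processes the response times are 75, 75, 150, 150, and 225 milliseconds, giving the 675 millisecond sum and 135 millisecond average quoted above.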

00:32:54.360 There’s one other component of our equation to call out, and that is that in real-world scenarios we're going to deal with some variability—maybe a lot of variability. When looking at performance monitoring services, a request arrival rate is usually presented as requests per minute, and we won’t really have eyes on any spikes or downturns in traffic within that one-minute timeframe. Translating that requests per minute to requests per second can be a little fuzzy, and that number of requests per minute might be changing minute to minute.

00:34:24.400 On the serving side of things, we might have fast endpoints and slow endpoints, with some in the middle. The things happening inside our application might be subject to change. For example, if our traffic were to start behaving a little differently (like moving from primarily hitting our fast endpoints to our slow ones), or if our software started behaving differently (like a third-party API request starting to return more slowly), our service time will also change. To navigate around this variability, we add a buffer into our equation that offers us some space to handle the uncertainties that come from facing the real world.

00:35:57.080 Nate B, in his research, found that we can effectively handle requests when we're operating at about 75% processing capacity. So the multiplier we'll introduce into our equation will be 1.333 (that is, 1 divided by 0.75). Running our formula one more time with our multiplier taken into account, with a request arrival rate of five requests per second and an average response time of 225 milliseconds, we multiply the request arrival rate by the average response time and then by the buffer factor of 1.333. This yields approximately 1.5 processes that we should have available, which means that we should round up to 2 processes to manage our traffic efficiently.
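Here is a hedged sketch of the buffered calculation. It starts from requests per minute, since that is how monitoring tools usually report traffic, and divides by a target utilization of 0.75, which is the same as multiplying by 1.333. The method name, keyword argument, and the second traffic figure are made up for illustration.

```ruby
# Estimates how many single-threaded processes to run, leaving headroom
# so that the processes sit at roughly the target utilization.
def processes_with_buffer(requests_per_minute, average_response_ms, target_utilization: 0.75)
  requests_per_second = requests_per_minute / 60.0
  raw = requests_per_second * average_response_ms / 1000.0
  (raw / target_utilization).ceil
end

# The talk's numbers: 300 requests/minute is 5 requests/second.
puts processes_with_buffer(300, 225)   # 5 * 0.225 / 0.75 = 1.5, rounds up to 2

# A hypothetical busier period: 2,400 requests/minute at 250 ms each.
puts processes_with_buffer(2_400, 250) # 40 * 0.25 / 0.75 = 13.33, rounds up to 14
```

Because the per-minute figure hides spikes within the minute, the result is best read as a starting point to verify against real queue times rather than an exact answer.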

00:37:30.680 What's great about this formula is that by calculating the number of processes needed for consistently handling our traffic, we can provide our site visitors the best response time possible. By knowing how to handle traffic as we see the traffic parameters change, we can scale our number of processes up and down to best accommodate our visitors while also managing our resources. Should we get hit by a large influx of traffic or see response times slow down, we can scale up our number of processes; and when that traffic dissipates or our response times decrease, we can scale back down to make sure we don't overcommit resources.

00:39:00.080 Now we have a good handle on how a process running a single thread operates and an understanding of how many of those processes we might need online to handle various traffic patterns. Let’s talk about another scaling option: multi-threaded processes.

00:40:05.600 So, if we set up a process running a single thread, it is a system that is only capable of working on one request at a time. A request sent to our site hits our server socket. Our process accepts that request for work; a thread services that request, and until a response can be returned to that request, any other incoming request is held in the socket in a queue. In a multi-threaded system, we can configure our process to accept and service many requests all at the same time.

00:41:43.080 In this design, as requests arrive at our server, each thread that’s available for work can accept a request off the queue and service it. Just the same, as in our single-threaded process, if all of our threads are busy, any other requests at our server will be held until responses are returned and threads become available for work. So, if we want to increase our capacity for handling requests, we can just add more threads, right? We can, but there are a couple of important things to clarify here.

00:43:15.200 The first is that in order to run a multi-threaded process, your application code and all the libraries you might be using must be thread-safe. We don’t have time to get deep into what makes code thread-safe, but I do have some resources included at the end that you can refer to to determine if that is something applicable to your own application.

00:44:50.480 What we'll focus on here is the other important thing to note. Rails, being built on Ruby, comes with what's called the Global VM Lock (GVL), a mechanism that keeps our Ruby code running safely. The way it does that is by only allowing one thread within a process to execute Ruby code at a time. So if we have a process running multiple threads, and we've accepted multiple requests for work, how are those threads supposed to service our requests simultaneously?

00:46:35.720 Well, in a typical Rails application, and in particular a Rails application where you might benefit from running multiple threads, there are things happening within our requests, as we produce responses, that aren't Ruby code: things we might call IO operations, like database queries or waiting on network data from a request to a third-party API. To visualize this, let's go back to our RailsConf example one more time.

00:48:05.760 In our example demonstrating a single-threaded process, when an attendee gets accepted for servicing, our volunteer scans their ticket, grabs their name badge and t-shirt, and sends the attendee into the conference, at which point our volunteer can accept the next attendee from the queue. Now, let's say we have three volunteers all working the same station. Each one can accept an attendee for servicing at the same time. Here's the catch: our station has only one ticket scanner, so while our volunteers are working, the ticket scanning machine needs to be passed between volunteers to scan each attendee's ticket. However, while an attendee is waiting for their ticket to be scanned, the volunteer can run off to get their name badge.

00:49:50.720 So even though our one ticket scanning machine presents a limitation, each volunteer still manages to accomplish other tasks while waiting on the scanning machine. Framing this against our Rails environment, the GVL can be thought of as our single ticket scanning machine in a process running multiple threads. When a thread needs to run some Ruby code, it requests the GVL, which is then held by that thread. Meanwhile, other threads servicing requests within that process can be doing other non-Ruby things.
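A quick way to see waiting overlap across threads is to time a few threads that do nothing but wait. In this sketch, `sleep` stands in for an IO wait such as a database query; like blocking IO in CRuby, it releases the GVL while the thread waits. The durations and thread count are arbitrary illustration values.

```ruby
# `sleep` is a stand-in for IO (an assumption for illustration); it releases
# the GVL, so other threads are free to run while this one waits.
io_wait_seconds = 0.3
thread_count = 3

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
threads = thread_count.times.map do
  Thread.new { sleep(io_wait_seconds) }
end
threads.each(&:join)
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts format("%d waits of %.1fs each finished in %.2fs total",
            thread_count, io_wait_seconds, elapsed)
```

Run one after another, the three waits would take about 0.9 seconds; run in threads, they finish in roughly the time of a single wait. Replacing `sleep` with pure Ruby computation would not show the same overlap, because only one thread can hold the GVL at a time.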

00:51:59.520 I’ve chosen three as the number of threads for our example because if you’re running a multi-threaded Puma process, that’s the default value that the Rails community has landed on as a good number of threads. It took a long time to determine a good number, but to better understand how many threads we might want to run within our process, let’s look back at a traffic pattern being handled by one process running a single thread.

00:53:53.680 When a set of requests is served by a single process, single thread, each request is processed one after the other, and every request must wait for the preceding request to finish before it can start processing. In this example here, our total time to process those five requests lands at just over one second. But looking at that same traffic pattern as handled by one process running three threads, our process can now accept three requests from the socket right out of the gate to service them in parallel.

00:55:51.480 In this example, our total time to process those five requests lands at 560 milliseconds, and this is excellent. Our multi-threaded process is now handling requests at about twice the speed of our single-threaded process. What you might have already noticed, though, is that in our multi-threaded example, there is increased service time as well. That's to simulate the introduction of wait time due to queueing for the GVL.
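The slide behind these numbers isn't in the transcript, so the following is a reconstruction under a stated assumption: a per-request service time of 80 milliseconds (the original 75 plus about 5 spent waiting on the GVL), chosen because it reproduces the 560 millisecond total. With three threads and equal service times, requests are handled in waves of three.

```ruby
thread_count = 3
service_ms = 80          # assumption: 75 ms of work plus ~5 ms waiting on the GVL
requests = (1..5).to_a   # five requests arriving together

# Requests are picked up in waves of `thread_count`; a request in wave N
# finishes after N service times have passed.
totals = requests.each_slice(thread_count).each_with_index.flat_map do |wave, wave_index|
  wave.map { (wave_index + 1) * service_ms }
end

single_threaded_sum = 1_125 # from the one-process, one-thread example
puts "response times: #{totals.join(', ')} ms"
puts "sum: #{totals.sum} ms, average: #{totals.sum / requests.size} ms"
puts format("speedup over single thread: %.1fx", single_threaded_sum.to_f / totals.sum)
```

Under that assumption the summed response time drops from 1,125 to 560 milliseconds, roughly a twofold improvement, even though each individual request takes slightly longer to service.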

00:57:36.180 Remember, in a Rails process, the GVL limits us to only running Ruby code one thread at a time. So at some point, each thread will have to request the GVL to execute Ruby code, and while it waits, it can't make progress on its request. It's an essential point to highlight because increased queuing for the GVL can cause your response time to skyrocket. To find the optimal number of threads for your application, you need to perform some benchmarking.

00:59:30.360 If you do find yourself in a spot running multiple threads, and as long as your response time is acceptable, the questions to ask yourself would be: do I want response times to be as quick as they can be, at the risk of needing to bring more servers online to run more processes with fewer threads? Or do I want to be able to accept more requests with fewer processes, which means I don't have to bring as many server resources online but might lead to longer response times? This is something you'll need to answer for your own application and business, but hopefully the material here provides a starting point for understanding this tradeoff.

01:01:42.360 Now, pivoting just a bit here, there’s one other element of our system where we need to manage requests: the database. Earlier, we established that for each process we're running, we have the capability to bring online any number of threads within that process, where each thread can perform work on a request. As part of working on a request, a thread might need to query our database, and at any given moment, it might not just be one thread calling for data but potentially many in parallel.

01:03:50.640 To illustrate how this works, let’s talk about Active Record's connection pool. The connection pool is a service that lives within our Rails process, and its job is to synchronize thread access to a limited number of database connections. Whenever a thread needs to retrieve data, it will call the connection pool to check out a database connection, use that connection to run a query, and then return it back to the pool when the result for the query has been retrieved.

01:05:34.600 Now, a connection can only be used by one thread at a time, so if we have threads asking for connections that are unavailable, the connection pool will queue those connection requests until a connection does become available. If a connection request waits too long, it will error out when the timeout expires. Therefore, to ensure that we don't introduce an additional point of contention within our system that could lead to increased response times and even request timeouts, we want to provide our Rails process with a database connection for each thread that's running.

01:07:15.680 Or, said another way, when setting the database connection pool for a process, we want our number of available database connections for that process to be equal to the maximum number of threads running. This will ensure that every thread in a Puma process can connect to the database in parallel and without any time spent waiting.
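To make the pool-size rule concrete, here is a toy pool built on Ruby's `Queue`. It is a sketch for illustration, not Active Record's implementation: `TinyPool` and its methods are hypothetical names, and where Active Record would wait for a connection and raise an error after its checkout timeout, this version simply reports that no connection was free. In a Rails app the pool size comes from the `pool` setting in `config/database.yml`.

```ruby
# Hypothetical stand-in for a connection pool; not Active Record's code.
class TinyPool
  def initialize(size)
    @available = Queue.new
    size.times { |i| @available << "connection-#{i + 1}" }
  end

  # Returns a connection, or nil when every connection is checked out.
  def try_checkout
    @available.pop(true) # non-blocking pop raises ThreadError when empty
  rescue ThreadError
    nil
  end

  def checkin(connection)
    @available << connection
  end
end

# Models every thread in a process asking for a connection at the same moment.
def simultaneous_checkouts(pool_size:, threads:)
  pool = TinyPool.new(pool_size)
  threads.times.map { pool.try_checkout }
end

undersized = simultaneous_checkouts(pool_size: 2, threads: 3)
matched = simultaneous_checkouts(pool_size: 3, threads: 3)

puts "pool of 2, 3 threads: #{undersized.count(nil)} thread(s) left waiting"
puts "pool of 3, 3 threads: #{matched.count(nil)} thread(s) left waiting"
```

With the pool sized to the thread count, no thread is ever left without a connection, which is the point of matching the two numbers.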

01:08:41.080 There is one important note here, and it’s something to keep in mind while scaling: your database technology is most likely going to come with a connection limit. This is something you need to consider whenever determining how many processes and threads your system might be able to run. Consider a database where the hosting provider of that database has enabled 20 connections for you.

01:09:51.560 In a system like we've set up for this talk, where we might have four processes running within a server and three threads executing within each of those processes, we would want to make 12 connections available in total across our server. If our traffic pattern reaches a point where we need to bring another server online, calling for 12 more connections, we'd then want 24 connections in total. Yet we only have access to 20. So what happens? Remember, connections in a pool are not held open at all times. Connections are checked out by threads only when needed.

01:11:09.080 So we won't necessarily run into trouble right away. But if we were maxing out every single connection available to us, we would hit the point where requests error out because no connection is available. So the connection limit is an important consideration when determining how far we are able to scale by the number of processes and threads. Additionally, apart from the database connections we may need to support our web servers, we also want to factor in an appropriate number of connections for servers running background jobs, and we'll want to factor in some extra space for one-off processes like cron jobs and other tasks.
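The connection budget is simple multiplication, sketched below. The web figures (four processes per server, three threads per process, a limit of 20) come from the talk; the background job and one-off allowances are made-up numbers, and the method name is illustrative.

```ruby
# Worst-case number of database connections the system could ask for at once.
def connections_required(servers:, processes_per_server:, threads_per_process:,
                         background: 0, one_off: 0)
  (servers * processes_per_server * threads_per_process) + background + one_off
end

connection_limit = 20 # what the database hosting plan allows

one_server = connections_required(servers: 1, processes_per_server: 4, threads_per_process: 3)
two_servers = connections_required(servers: 2, processes_per_server: 4, threads_per_process: 3)
# Hypothetical allowances: 5 connections for background jobs, 2 for one-off tasks.
with_extras = connections_required(servers: 2, processes_per_server: 4, threads_per_process: 3,
                                   background: 5, one_off: 2)

{ "one web server" => one_server,
  "two web servers" => two_servers,
  "two web servers plus jobs" => with_extras }.each do |label, needed|
  status = needed <= connection_limit ? "fits" : "exceeds the limit"
  puts "#{label}: #{needed} connections (#{status})"
end
```

One server fits comfortably at 12 connections, while two servers call for 24 and already exceed a limit of 20 before background jobs are counted.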

01:13:23.680 Thus, if it appears that you may be approaching the maximum number of connections supported by your database, it’s time to work with your database hosting provider and upgrade to an offering with a higher number of connections. Otherwise, should we receive a large traffic pattern, we wouldn't be able to manage that traffic if we’re limited by available connections versus the number of workable threads.

01:15:25.920 Alright, we've come a long way! Let's put it all together. Starting from the top, when a request is sent to our site, like FiveMinutes.com, our hosting provider's load balancer will accept that request and direct it to a listener socket on our server. Our web application server will then direct that request to our Rails process, where a thread can service it and return a response.

01:16:11.880 Only after we return a response can our process accept another request for work. Should other requests arrive while our request is being processed, those requests are held in a request queue on our server socket until our process is available to accept new work.

01:16:57.600 To keep pace with our traffic, we can multiply our average response time by our request arrival rate, while factoring in some buffer, to arrive at how many processes we need to bring online to process requests quickly enough to keep our queue from backing up.

01:18:32.320 We can bring processes online up to the number of CPU cores available on our server. If we find that we need more processes online than our server can support, we can bring more servers online. We can also run multiple threads within each process.

01:19:24.640 If we’re running multiple threads within a process, we need to remember that Ruby's GVL limits us to only running Ruby code on one of those threads at a time. However, other non-Ruby tasks can happen across all threads simultaneously, with one of those being queries to the database.

01:20:26.720 For each thread we're running, we want to ensure that a database connection is available, since each thread checks out a connection to run queries. To prevent connection requests from timing out, we want to make sure that our database actually has enough connections available.

01:21:31.600 That’s the full picture! A totally easy diagram to digest that illustrates everything that happens from request to response. There is still so much more to cover out there, and that's where you come in.

01:21:58.880 One closing note, on some of the other topics that we weren’t able to cover in this talk—each one of these things would probably be a talk in itself, so give them a look and bring what you’ve learned back to the community. With that, I’ll see you all at the next RailsConf. Thank you!