APIs

You have 2 seconds to respond

by Justin Powers

In his talk titled "You Have Two Seconds to Respond" at RailsConf 2022, Justin Powers discusses the critical nature of real-time transaction approval in the trucking industry, specifically regarding fuel card usage at gas stations. He represents A to B, a company focused on improving financial infrastructure for commercial fleets by providing a modern fuel card solution that works universally at gas stations and offers significant savings for truckers.

Key points of the presentation include:
- The Importance of Speed: The core of the talk revolves around the need to respond to Visa's authorization requests for fuel card swipes within two seconds to avoid auto-approvals that could lead to fraud.
- Understanding Constraints: Powers outlines how various constraints, particularly time-related, demand careful consideration in system design, emphasizing the need for correctness to ensure truckers do not get stranded due to declined transactions.
- Innovation within Limits: Innovations such as tracking truck locations via legally required tracking devices, unique PINs for drivers, and SMS unlock features are introduced to mitigate fraud risk without compromising the response time.
- Building Confidence in Changes: Powers discusses the significance of isolation in system components, advocating for the single responsibility principle to simplify testing and maintain clear boundaries between different functionalities.
- Asynchronous Processing: He recommends using tools like Sidekiq to offload tasks that may delay the transaction response, allowing for concurrent checks without slowing down approval times.
- Observability and Monitoring: The presentation highlights the need for robust monitoring to track transaction metrics, including the ratio of approvals to denials, thus ensuring that any anomalies are addressed swiftly.
- Shadow Mode Implementation: Powers introduces the concept of shadow mode where new code is tested against production requests in parallel, allowing for comparisons of outputs from both old and new systems.

By the end of the talk, Powers emphasizes the continuous improvement and innovation required to navigate the complexities of real-time systems while maintaining a commitment to service reliability. He invites participants to explore job opportunities at A to B, indicating ongoing growth and the need for talented individuals to join their team.

00:00:12.799 Well, uh, it looks like this is a pretty popular talk. We could wait a few minutes for more people to trickle in, but I think we'll go ahead and get started.
00:00:18.840 Um, thank you so much everyone for coming to my talk. I know there's a lot of really great presentations, especially in this time slot, and I really appreciate you all taking the time to come and have this conversation with me. The talk's titled "You Have Two Seconds to Respond."
00:00:30.960 Um, and I know what you're all thinking: "Oh, that's that guy that had that absurdly large mask walking around!" Now you know why: I'm deathly afraid of my beard catching COVID. If it has to isolate for two weeks, it's really going to damage our working relationship.
00:00:43.020 I don't know if any of you got to check out the lightning talks; they were downstairs for a good portion of the day yesterday. There was a fellow towards the beginning that had some really great ideas for sales—he was selling ideas on how to give your talk.
00:01:06.119 So I'm speaking on behalf of A to B. It's pronounced A to B, as in we're allowing trucks the ability to go from point A to point B. At A to B, we're building the financial infrastructure for commercial fleets. To get a little more specific, our flagship product is a fuel card for trucking fleets. So that as these trucks are going down the road, if they need to stop and get gas, they swipe our fuel card to purchase their fuel.
00:01:31.320 There are a few reasons why this is needed in the industry. Legacy fuel cards don't work at most gas stations; you have to go to very specific gas stations. They have very limited tooling and are kind of unreliable. Most truckers I've spoken to really despise their legacy fuel cards, and so we're kind of reinventing it on the Visa network so that it'll work at any gas station, not just specific ones.
00:01:50.700 We also don't charge any fees. Not only do we not charge fees, but we also provide discounts, so they're actually saving more money when they use our card. We offer a lot more tooling, and we can get a lot more sophisticated about how our customers can use our fuel cards, allowing them to use it for other expenses such as maintenance, repairs, and so on.
00:02:09.300 We've also launched a payroll product. It's a great fit for our truckers, many of whom are underbanked: they don't have access to banking services, they only have small banks in rural areas, or they don't have banks at all. Our payroll product allows them to get paid directly while they're on the road without dealing with paper checks that may get mailed back to their home. That's a really great product, and we're continuously adding more financial products to our portfolio.
00:02:24.300 A to B has only been around for about a year now, and we already have over 25,000 transportation businesses that are using our products. We've got about 20 to 25% month-over-month growth, and we raised Series A and Series B funding, with over 100 million dollars in venture capital raised so far. Also, we're hiring!
00:02:40.680 Just a brief bit about me: I'm Justin Powers. Depending on where you saw the talk or where it was printed or posted online, you might have seen like three different names, but I'm Justin Powers, and I'm giving the talk today. I've been working remotely in the Southern Sierras for the last decade or so, and I've enjoyed it so much that I created a co-working space so other people could work while they're playing. Feel free to come join me in Kernville; spend all the time you’d like.
00:03:03.000 Let's change gears and talk about the title of the talk. We called it "You Have Two Seconds to Respond." What we're referring to is when the customer is at the gas station, they swipe their card at the fuel pump. Visa gives us two seconds to respond with an approval or a denial, and in some cases, how much we approve it for.
00:03:37.440 Keep in mind that they haven't actually pumped the gas yet. We don't want them to pump an unlimited amount; depending on how much credit they have available, we can authorize a very specific amount. This one endpoint that Visa hits when they're looking for that approval or denial is unquestionably the most critical part of our infrastructure. You can imagine that if we don't respond within two seconds, what's going to happen? The way we have it set up is it will automatically approve the transaction, which opens us up to a lot of risk and potential fraud.
00:04:26.820 So we just can't have that. We can't let these transactions auto-approve. If we make some mistake and approve transactions that we shouldn't, then that opens us up to a lot more risk. What's even scarier to me is if we decline transactions that we should be approving! That can lead to a trucker being stuck on the side of the road without a way to get to the next gas station because they can't get any fuel.
00:04:51.900 Um, that one in particular is my personal nightmare. I don't want to be pushing out changes and working on this infrastructure while worrying about what might happen if I mess something up and some poor person is out in the middle of the night at 3 AM, just trying to get across the country, and they can't because our fuel card's not responding.
00:05:03.000 This endpoint cannot be blocked for any reason, and what that means is it creates some very specific constraints around this endpoint in different ways—not just the time constraint. There are other constraints we have to be aware of as well. This talk is about how you build when you have constraints and how that changes the way you build and deliver functionality.
00:05:40.200 Some of the constraints we have to think about concern the inputs and how much they can vary. Others concern the outputs of whatever functionality you're developing, which can be highly constrained. Sometimes you have time constraints. Oftentimes we're lucky and have more leeway, but in some cases, like ours, there is a very specific time constraint.
00:06:06.420 Then there's correctness, which is usually the one we focus on. What are the consequences if something goes wrong? Is that going to directly impact your business? How do you build things if that is a very tight, important constraint?
00:06:22.440 So, talking about the inputs, for this card swipe event, when somebody swipes the card, we receive a request from Visa. There are actually two different kinds of inputs that we're concerned about: Visa gives us some input, saying, "Hey, we have an authorization request that's coming from a merchant with this name." It may come with a merchant category code indicating what kind of merchant it is.
00:06:48.420 As I mentioned before, our cards work for fuel stations, maintenance, repairs, but you can't go and buy something off Amazon with them; the cards are highly restricted based on the merchant category. Sometimes it comes with a dollar amount, and sometimes it doesn't. Visa wants us to actually give them the proper dollar amount for how much we would authorize.
00:07:06.240 The other sources of inputs are the actual state of the customer and their account. Do they have certain merchant categories enabled for them? How much available credit do they have? That part is a bit more complicated than you might expect. You might think it's a straightforward calculation to just look at how much they've spent this month and how much they have left, but our payment terms are on a net-seven basis.
00:07:26.880 At the end of the week, we look at what your balance was, how much you spent during that week, and you have a week to make a payment on that, which changes how much credit you have for the next week. So, it’s a little complicated, and there are a lot of edge cases that can affect what your available credit might be.
00:07:51.419 Now, talking about outputs, in our situation, there aren't as many considerations. Outputs aren't as constrained as inputs. We do want to pass back some metadata to Visa when we're approving these transactions, and if we push incorrect metadata or if it's formatted incorrectly, it'll raise an error and the transaction gets declined.
00:08:03.900 So, we have to be very careful about any change we're making, ensuring we're not changing something accidentally that could cause those transactions to get declined. Of course, with the outputs, you have to worry that they're correct. We've got a large variability in inputs with many different edge cases, and in our customers' accounts with how much they paid and when they paid, along with extra payments they've made to try to raise their credit limit.
00:08:25.080 It's just a lot more than your typical test cases can easily cover. There are many edge cases that we're not aware of until we actually see them in production.
00:08:37.440 As I alluded to before, we have a hard constraint of two seconds. First, I want to address the elephant in the room: in the world of web APIs, two seconds is like forever!
00:08:51.780 If you went to any conference other than RailsConf and told them that we're trying to keep an API endpoint under two seconds, they'd probably laugh you off the stage. I'm not saying that we're doing anything innovative by consistently being under two seconds, but I do want to talk about how those constraints actually change how you design your system and how you deliver functionality.
00:09:11.220 I mentioned this is the most critical part of our infrastructure, but it's also the part that we want to invest the most time in. It's where we want to make the most changes and innovate on the way that we approve or deny these transactions.
00:09:29.760 For example, we want to add a feature: if your truck's in California but the card is swiped in Tennessee, we can say, "Oh, well that wasn't you; you must have been filling up your personal vehicle or something like that!" We can do this because every truck in the United States has to have a legally required tracking device, and we can integrate with those devices to ensure the truck is within two miles of the gas station where the card was swiped.
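As an illustration of the "within two miles" check described above, here is a minimal sketch using the haversine great-circle formula. It is not code from the talk: the method names, the hash shape for coordinates, and the 2.0-mile default are assumptions made for the example.

```ruby
# Hypothetical helper (not from the talk): great-circle distance in miles
# between two latitude/longitude points, using the haversine formula.
EARTH_RADIUS_MILES = 3958.8

def miles_between(lat1, lng1, lat2, lng2)
  to_rad = ->(deg) { deg * Math::PI / 180 }
  dlat = to_rad.(lat2 - lat1)
  dlng = to_rad.(lng2 - lng1)
  a = Math.sin(dlat / 2)**2 +
      Math.cos(to_rad.(lat1)) * Math.cos(to_rad.(lat2)) * Math.sin(dlng / 2)**2
  2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a))
end

# Assumed input shape: { lat: Float, lng: Float } for both truck and merchant.
def truck_near_merchant?(truck, merchant, max_miles: 2.0)
  miles_between(truck[:lat], truck[:lng], merchant[:lat], merchant[:lng]) <= max_miles
end

truck    = { lat: 35.37, lng: -119.02 }  # roughly Bakersfield, CA
merchant = { lat: 36.16, lng: -86.78 }   # roughly Nashville, TN
truck_near_merchant?(truck, merchant)    # false: the two are far apart
```

A check like this takes plain numbers in and returns a boolean, so it can be tested without any database or network access.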
00:09:54.540 However, there's a lot of complexity and edge cases. For instance, Visa could report the wrong location for a merchant, placing it across the country from where it actually is. There are also other security features where companies can send an SMS to A to B to unlock the card for the next 30 minutes. That's a great protection against fraud.
00:10:15.000 We have features such as allowing each driver to have an individual PIN specific to their card. Additionally, we're integrating machine learning and automated fraud detection into our authorization endpoint.
00:10:36.480 As we're adding all these features, we're terrified to make changes because it might leave a driver stranded on the side of the road. How can we change the way we build things so that we're much more confident in that?
00:10:49.800 There are several things we can do to safely make changes with a lot of confidence. One of the things to look out for when building is isolation; we don't want a failure in one part of our system to take out the rest of the system.
00:11:06.660 If one part of our system is being really hard on the database or has a memory leak or something like that, you don't want it to impact the most critical part of your system. There are many strategies for isolation; this is a broad topic that could fill entire talks.
00:11:27.420 Another factor to ensure confidence in your changes is to be very devoted to the single responsibility principle. Build your classes and code in such a way that each piece of code is only doing one thing.
00:11:41.640 One piece of advice I would give is to avoid mixing your business logic with your models or your database layer. You want to be able to have specific inputs and specific outputs that you can test without having to worry about the state of the database in the middle of it all.
00:11:54.480 This simplifies testing in isolation and enhances many of the strategies we'll touch on towards the end of the talk. While I could spend a whole talk on just designing your classes to make them easy to test, I think there have been other talks this week that have covered that pretty well.
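To make the "specific inputs, specific outputs" advice concrete, here is a small sketch of an authorizer that takes plain data and returns plain data. Everything in it is hypothetical: the struct names, the fields, and the rule of approving up to the available credit are assumptions for illustration, not A to B's actual logic.

```ruby
# Hypothetical value objects: plain data in, plain data out, no database access.
AuthRequest     = Struct.new(:amount_cents, :merchant_category, keyword_init: true)
AccountSnapshot = Struct.new(:available_credit_cents, :enabled_categories, keyword_init: true)
Decision        = Struct.new(:approved, :amount_cents, :reason, keyword_init: true)

class CreditAuthorizer
  def call(request, account)
    unless account.enabled_categories.include?(request.merchant_category)
      return Decision.new(approved: false, amount_cents: 0, reason: :category_not_enabled)
    end
    if account.available_credit_cents <= 0
      return Decision.new(approved: false, amount_cents: 0, reason: :no_available_credit)
    end

    # The request may arrive without an amount; in that case this sketch
    # authorizes up to the available credit (an assumed policy).
    requested = request.amount_cents || account.available_credit_cents
    approved_amount = [requested, account.available_credit_cents].min
    Decision.new(approved: true, amount_cents: approved_amount, reason: :ok)
  end
end

account  = AccountSnapshot.new(available_credit_cents: 50_000, enabled_categories: [:fuel])
decision = CreditAuthorizer.new.call(AuthRequest.new(merchant_category: :fuel), account)
decision.approved      # true
decision.amount_cents  # 50000
```

Because the account state is passed in as a snapshot, each edge case becomes a one-line test setup rather than a set of database fixtures.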
00:12:10.560 Also, only do the work that you must at that particular moment. If there's any work that can be done later, use some sort of asynchronous process. There are many tools out there for this. Personally, I like to fawn over Sidekiq; Mike Perham is here, and has been a great part of our community and RailsConf.
00:12:27.840 It's a fantastic library that enables us to offload tasks that would otherwise block that two-second transaction. For example, in our system, when you swipe your card, I mentioned you get discounts. Some calculation takes place, reaching out to third-party vendors.
00:12:42.360 We don't want to block that two-second transaction, so we kick off a Sidekiq job and say we'll deal with that later.
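The shape of "respond now, do the slow work later" can be sketched with the standard library alone. The class below is an in-process stand-in for a job system such as Sidekiq, not Sidekiq's API, and every name in it is hypothetical. A real deployment would enqueue to a durable queue processed by separate workers.

```ruby
# In-process stand-in for a background job queue (NOT Sidekiq's API).
class DeferredJobs
  attr_reader :completed

  def initialize(&work)
    @jobs = Queue.new
    @completed = Queue.new
    @worker = Thread.new do
      while (job = @jobs.pop) != :stop
        @completed << work.call(job)
      end
    end
  end

  # Returns immediately; the work runs later on the worker thread.
  def enqueue(job)
    @jobs << job
    nil
  end

  # Waits for queued jobs to finish; only needed for demos and tests.
  def drain
    @jobs << :stop
    @worker.join
  end
end

# Hypothetical endpoint body: decide now, compute discounts later.
def authorize_swipe(transaction, jobs)
  jobs.enqueue(transaction[:id])
  { approved: true, amount_cents: transaction[:amount_cents] }
end

jobs = DeferredJobs.new { |id| "discount-for-#{id}" }
authorize_swipe({ id: "txn-1", amount_cents: 5_000 }, jobs)
jobs.drain
```

The response is built without waiting for the discount calculation, which is the property the two-second budget depends on.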
00:13:02.220 Concurrently, we have a lot of different things happening at the same time, ensuring the transaction's validity. As mentioned, we need to calculate their running balance and how much they have available this week, but we're also checking the location of the truck, checking the location of the vendor, and checking if they've unlocked the card via SMS.
00:13:16.900 We need a mechanism that, as we're adding more checks, doesn’t increase the time to respond to Visa. Therefore, we decided that running these checks in parallel was the best approach.
00:13:34.680 A word of warning: concurrency can make things complicated. To quote the philosopher Aaron Patterson, "Just because you can, doesn’t mean that you should." We should strive to keep our systems simple, easy to understand, and maintain. Multi-threading is fantastic for frameworks, but if you add it to your application code, you're introducing complexity that will be difficult to mitigate.
00:13:57.420 Clara, and I may mispronounce her name, did a great talk about using concurrency in Ruby on Tuesday. If you get a chance, watch that talk on YouTube—it was very good. Clara mentioned some different strategies for using concurrency and some typical bugs you might run into, as well as some risks.
00:14:11.640 `Thread.new` has been around for a long time, but there are newer features in Ruby like fibers and Ractors. We're using `Thread.new`, but we're exploring whether one of these other solutions might be a good fit for our tasks.
00:14:27.600 If you're considering spawning threads in your application code, pay attention to your connection pools. If you’ve calculated how many database connections you need for your Puma workers, and then you start spawning extra threads connecting to the database, those will get out of control.
00:14:43.380 And, be mindful of shared context. You also need to be more purposeful about observability. For example, since we're now doing things in separate threads, it's essential to know not just how long it takes for the endpoint to return, but how long each of these threads is taking.
00:15:03.600 We need to be proactive about dealing with issues. Testing this kind of functionality can be challenging, especially if you don't want to rely on sleep functions to test timeouts.
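One way to test timeout behavior without sleeping for a guessed duration is to use a fake that blocks until the test releases it. The sketch below is an assumption about how such a test double could look, not the speaker's test code. It relies only on `Queue#pop` blocking and on `Thread#join(limit)` returning `nil` when the limit elapses.

```ruby
# Hypothetical test double: "slow" is deterministic because the call blocks
# until the test explicitly releases it, instead of sleeping for a fixed time.
class BlockingAuthorizer
  def initialize
    @gate = Queue.new
  end

  def call(_request)
    @gate.pop            # blocks until release is called
    { approved: true }
  end

  def release
    @gate << :go
  end
end

slow   = BlockingAuthorizer.new
thread = Thread.new { slow.call({}) }

timed_out = thread.join(0.05).nil?  # join returns nil when the limit elapses
slow.release                        # let the blocked call proceed
thread.join                         # now the thread completes
thread.value                        # the authorizer's result hash
```

The short limit passed to `join` is the only real waiting in the test, and the outcome does not depend on how fast the machine is.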
00:15:20.160 Here's a humorous XKCD about regexes, but I believe it applies to threads too: "I've got 99 problems, so I use threads, and now I have 100 problems!" So be cautious.
00:15:34.260 Here’s a quick sketch of what that endpoint is doing. We have this authorizer runner that gets the request, and we call `Thread.new` for each different kind of sub-authorizer that we have.
00:15:48.600 In this example, there's a credit authorizer, one that checks the truck's location, and another that checks the SMS unlock status. Anyone who's worked with Ruby knows about `Thread#join`, which waits for the thread to finish.
00:16:07.020 I recently realized you can actually pass a timeout to that method, which specifies the number of seconds. In this example, we pass 1.5 seconds: `join` waits up to that long and returns `nil` if the thread hasn't finished, leaving the thread running rather than killing it.
00:16:26.520 For example, let's say the credit authorizer took about a second; it said yes, we want to approve it based on their available credit for $500 worth of fuel. However, the location authorizer looked and indicated that the truck was nowhere near the gas station, which means we deny it.
00:16:57.840 If something happens with the SMS unlock and it takes too long—over 1.5 seconds—we don't care about the result because we use what we've got to make the best decision possible.
00:17:09.900 We’re going to deny it based on the truck not being in the appropriate location. The code that accomplishes this is a bit intricate and challenging to detail in a short talk, but we will provide some code at the end for another feature we're working on.
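The talk does not show the runner's code, so the following is only a sketch of the pattern described: one thread per sub-authorizer, `Thread#join` with a time limit, and a decision made from whatever finished in time. The class name, the hash-based results, the shared 1.5-second deadline, and the decision policy are all assumptions. Error handling is omitted, so an exception inside a sub-authorizer would propagate out of `join`.

```ruby
# Hypothetical runner: sub-authorizers are callables returning a result hash.
class AuthorizerRunner
  def initialize(authorizers, timeout: 1.5)
    @authorizers = authorizers  # e.g. { credit: ->(req) { ... }, location: ... }
    @timeout = timeout
  end

  def call(request)
    deadline = now + @timeout
    threads = @authorizers.transform_values do |authorizer|
      Thread.new { authorizer.call(request) }
    end

    results = threads.transform_values do |thread|
      remaining = [deadline - now, 0].max
      # join returns nil on timeout; the thread keeps running and its
      # result is simply ignored for this decision.
      thread.join(remaining) ? thread.value : nil
    end

    decide(results.compact)
  end

  private

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end

  # Assumed policy: any completed denial wins; otherwise approve with the
  # amount from the credit check (nil if that check did not finish).
  def decide(results)
    denial = results.values.find { |result| !result[:approved] }
    return denial if denial

    { approved: true, amount_cents: results.dig(:credit, :amount_cents) }
  end
end

stuck = Queue.new  # never written to until the end, so the SMS check blocks
runner = AuthorizerRunner.new(
  {
    credit:     ->(_req) { { approved: true, amount_cents: 50_000 } },
    location:   ->(_req) { { approved: false, reason: :truck_not_nearby } },
    sms_unlock: ->(_req) { stuck.pop; { approved: true } }
  },
  timeout: 0.2
)
runner.call({})  # denied because the truck is not near the merchant
stuck << :go     # release the blocked thread
```

The sketch measures one deadline across all threads rather than giving each `join` its own 1.5 seconds. Joining three threads in sequence with separate limits could otherwise add up to more than the two-second budget.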
00:17:25.680 As we focus on working with these constraints, how do you gain the observability you need to know if that business-critical part of your system is functioning well?
00:17:41.820 The biggest factor will be observability. You must be sure that this critical part of your functionality is working correctly; you need to know how it’s performing. For that, monitoring is key.
00:18:03.060 We use a tool that's not sponsoring this conference, so I will only mention the sponsors. New Relic is in the room; they're a fantastic solution for monitoring. Scout is also sponsoring this conference, along with Honeybadger.
00:18:23.880 You can use all of these tools to gain better insights into those critical parts of your application. Great talks detail how to implement these systems effectively, but just keep in mind that it’s crucial to know not just how long something takes.
00:18:39.900 You also want to monitor things like the ratio of denials to approvals. If we suddenly start approving 100% of our transactions, something's probably wrong, and we should page the on-call team and roll back whatever code was just released.
00:18:56.640 If it goes the other way, and we're denying a large portion of transactions, we need to page someone and roll it back. Observing approval amounts and applying some anomaly detection can also be vital for catching issues.
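A ratio check of this kind can be very small. The sketch below is illustrative only: the thresholds, the minimum sample size, and the alert names are made-up values, and a real setup would more likely live in the monitoring tool's alert configuration than in application code.

```ruby
# Hypothetical alert rule over a recent window of decisions. The thresholds
# are placeholders; real values would come from historical traffic.
def approval_alert(approved:, denied:, min_ratio: 0.50, max_ratio: 0.98, min_sample: 50)
  total = approved + denied
  return nil if total < min_sample  # too little traffic to judge

  ratio = approved.to_f / total
  return :approving_too_many if ratio > max_ratio
  return :denying_too_many if ratio < min_ratio

  nil
end

approval_alert(approved: 100, denied: 0)   # => :approving_too_many
approval_alert(approved: 10, denied: 90)   # => :denying_too_many
approval_alert(approved: 80, denied: 20)   # => nil
```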
00:19:11.640 With higher-end monitoring tools like New Relic, you can get anomaly detection features that keep an eye on potential problems. You can instrument specific parts of your code to see how long they're taking to respond.
00:19:27.480 Create easy-to-use, easy-to-understand dashboards that allow your support teams to see if something is wrong within seconds. Designing dashboards is something we often overlook, but it's critical.
00:19:42.840 If something goes wrong, ensure you have proactive alerts. You want someone paged when a critical part of your business is malfunctioning.
00:20:03.600 Another strategy we're using—this one is quite fun, and we'll spend the rest of our time on it—is shadow mode. Monitoring isn't enough; we don't want to push code and find out that it isn't working—resulting in drivers potentially stranded on the side of the road, which is not acceptable.
00:20:18.840 We want to continuously introduce new functionalities. One common approach is called shadow mode, which can be implemented in various ways. We've chosen a method where every time we receive a request in production, we handle it using the existing code and then push the result back to Visa.
00:20:32.520 Simultaneously, we kick off a Sidekiq job that runs the same request against both the existing code and the new code we want to roll out, allowing us to compare the outputs.
00:20:48.120 If we're making refactoring changes, those outputs should remain identical. If they change, we'll analyze why and what is different, maintaining the observability required.
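The comparison step can be sketched as follows. This is not the speaker's implementation: the method names and the hash-shaped outputs are assumptions, and in the setup described above this would run inside a background job after the real response has already gone back to Visa.

```ruby
# Hypothetical helper: returns only the keys whose values differ.
def diff_outputs(control, variant)
  (control.keys | variant.keys).each_with_object({}) do |key, diff|
    next if control[key] == variant[key]

    diff[key] = { control: control[key], variant: variant[key] }
  end
end

# Runs the existing code (control) and the candidate code (variant) against
# the same request. A crash in the variant is recorded, never raised.
def shadow_run(request, control:, variant:)
  control_out = control.call(request)
  variant_out = begin
    variant.call(request)
  rescue StandardError => e
    { error: e.class.name }
  end
  { request: request, diff: diff_outputs(control_out, variant_out) }
end

old_code = ->(_req) { { approved: true, amount_cents: 50_000 } }
new_code = ->(_req) { { approved: true, amount_cents: 45_000 } }
shadow_run({ card: "example" }, control: old_code, variant: new_code)[:diff]
# => { amount_cents: { control: 50000, variant: 45000 } }
```

For a pure refactoring the diff should be empty for every request, so any non-empty diff is worth a look.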
00:21:05.760 I've mentioned the single responsibility principle; while we're not perfect at it, we continue improving and refactoring our functionality. Tracing functionality has been useful, as it lets you see the logic state while stepping through the code and reviewing the data.
00:21:22.380 Another popular strategy is to record all your API traffic and replay it against a non-production environment to see if everything operates correctly. However, we've found our approach works best.
00:21:42.960 Before we step through the code I've written, I want to mention two tools we can't do without: Sidekiq and Flipper. If you're not using the Enterprise version of Sidekiq, talk to your employers about supporting this open-source project.
00:22:01.620 Flipper, a feature flag framework with Enterprise options, is also invaluable. If any maintainers of these projects are here today, I would like to buy them a beverage of their choice, as they’ve saved me a lot!
00:22:19.080 I only have a few minutes left, so I will quickly run through this. Our real-time authorizer kicks off a Sidekiq job. In that job, we'll run the same code as before but additionally instrument it.
00:22:31.560 We’ll also capture the outputs and add tracing for the values within the code. Each of these records is saved to the database using Active Record.
00:22:41.640 We are capturing inputs and outputs, and for each run, we save whether it's a control or variant and analyze the differences in outputs, allowing for future comparison.
00:22:56.460 The first thing we do is define a list of hashes that represent our interests. As I step through the code, when we reach the credit calculator, for instance, we want to know what variables were at that point.
00:23:09.780 We're interested in that for both control and variant runs.
00:23:15.900 We run the same block of code twice, once with the feature flag turned on and once with it off.
00:23:21.960 This piece of code can be complex, as the tracer we created has a method called capture. We pass that a block and run the code corresponding to the credit authorizer.
00:23:31.980 We check whether the feature is enabled while also running the given request from Visa. Afterward, we’ll perform cleanup and save the information to Active Record.
00:23:43.560 This includes whether it’s a variant, the associated inputs, outputs, and what we captured, which will then enable us to calculate the differences.
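The slides are not reproduced here, so the following is only a guess at how a tracer with a `capture` method could work. It uses Ruby's standard `TracePoint` to record named local variables when a method of interest returns. The `CreditCalculator` class, the shape of the points-of-interest hashes, and the return format are all assumptions. Saving to Active Record and toggling the feature flag are left out.

```ruby
# Hypothetical class standing in for the code being traced.
class CreditCalculator
  def available_credit(limit_cents, spent_cents)
    remaining = limit_cents - spent_cents
    remaining.negative? ? 0 : remaining
  end
end

# Points of interest: which method to watch and which locals to record.
POINTS_OF_INTEREST = [
  { class: CreditCalculator, method: :available_credit, locals: [:remaining] }
].freeze

class Tracer
  def initialize(points)
    @points = points
  end

  # Runs the block and returns [block result, captured local values].
  def capture
    captured = []
    trace = TracePoint.new(:return) do |tp|
      point = @points.find do |p|
        p[:class] == tp.defined_class && p[:method] == tp.method_id
      end
      next unless point

      values = point[:locals].to_h do |name|
        [name, tp.binding.local_variable_get(name)]
      end
      captured << { method: tp.method_id, locals: values }
    end

    result = nil
    trace.enable { result = yield }  # tracing is active only inside the block
    [result, captured]
  end
end

tracer = Tracer.new(POINTS_OF_INTEREST)
output, captured = tracer.capture do
  CreditCalculator.new.available_credit(50_000, 20_000)
end
# output   => 30000
# captured => [{ method: :available_credit, locals: { remaining: 30000 } }]
```

Calling `capture` once with the flag off and once with it on would yield two sets of outputs and captured values, which is the material for the control-versus-variant comparison.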
00:23:56.640 I realize I may have deleted some code details, as some functionality will perform a bit differently than expected. That's one of the tools we've used to confidently push changes and verify their functionality.
00:24:10.260 I’m out of time, but I want to let you know we're hiring. Please come see us down at the Expo Center as we’re looking for candidates for various roles.
00:24:21.480 If you're attending this conference, you're the right audience for us—we want to hire you! Thank you for your time, and while I can't take questions here, I'll be available on the side if anyone has any questions.