Debugging Your Services with Distributed Tracing

Ruby

Stella Cotton

7 talks

#distributed-tracing

#microservices

#application-performance-monitoring

#heroku

#ruby

#instrumentation

#performance-optimization

Debugging Your Services with Distributed Tracing

by Stella Cotton

In the presentation titled 'Debugging Your Services with Distributed Tracing' at RubyConf Australia 2018, Stella Cotton discusses the importance and implementation of distributed tracing within software systems, specifically within the context of Heroku. Distributed tracing is introduced as a method to trace requests across distributed systems, particularly when dealing with microservices. Stella illustrates key concepts through relatable examples, notably an e-commerce application that transitions from a monolithic architecture to a microservices architecture.

Key points discussed include:
- What is Distributed Tracing?: It allows teams to visualize and track the flow of requests between services, making it easier to identify bottlenecks in performance.
- Examples of Distributed Systems: The presentation details a scenario in an e-commerce site where requests might take longer than expected due to slow responses from a recommendation service, emphasizing the complexities added by multiple services.
- Types of Tracing: Stella describes 'black box' and 'white box' tracing. While black box tracing is useful, it often does not provide sufficient detail about system operations, particularly with event-driven systems. She advocates for white box tracing, which logs and tracks information about service requests and their relationships.
- Key Components: She explains various components of a tracing system: the tracer (instrumentation in the application), transport (how data is sent), and UI (how results are visualized).
- Implementation Considerations: Important factors for implementing distributed tracing include ecosystem compatibility, infrastructure requirements, and security concerns regarding sensitive data.
- Suggested Next Steps: Stella encourages teams to start small by experimenting with tools like Zipkin, providing practical advice on how to integrate tracing into existing applications.

The presentation emphasizes that as systems grow, proper tracing becomes essential for operational success, enabling better decision-making and debugging processes. Stella reinforces the idea that distributed tracing can democratize knowledge of system performance across teams, ultimately leading to more efficient and understandable services.

00:00:05.460 This is actually my third year at RubyConf Australia, and it is my favorite conference in the whole wide world. Such a treat!

00:00:11.219 So, like Mel said, I'm an engineer and today I'm going to talk a little bit about distributed tracing. I know it's basically the end of the conference and everybody is super tired and really fried.

00:00:20.580 I use a really cheap trick to keep people awake but also to encourage myself to drink water, so it's like the worst thing ever but I love it.

00:00:27.630 When I take a sip of water, I would like everyone to clap and cheer for me like I'm doing something awesome, so we're going to try it out!

00:00:34.230 Although, like Amy mentioned earlier, we ran out of Heroku swag, but if you would like us to mail you some, we will actually mail it to you if you go to this URL.

00:00:41.070 So, let's talk a little bit about distributed tracing. This is my team. I work on the tools team, and we spent some time implementing distributed tracing at Heroku for our internal services.

00:00:48.120 I will say that a lot of the trial, effort, and discovery that went into this talk was actually a team effort from these awesome folks.

00:00:53.670 Let's start with the basics because I feel like a lot of engineers don't know what distributed tracing is. When I mention it, I get a lot of blank looks, which is great because you are in the right place.

00:01:06.030 I just want to get a little raised hands: who actually knows what distributed tracing is here? This is great! Awesome! Okay, a couple of people.

00:01:12.070 For those who do, how many have it at their company? Amy? All right, we've got a couple.

00:01:20.170 Distributed tracing gives us the ability to trace requests across a distributed system boundary.

00:01:26.320 I'm going to give a simple use case: let's say you have an e-commerce site and you want to show users all of the recent orders they've placed.

00:01:32.050 In a monolithic architecture, you have one web process that's going to return information in that web request.

00:01:37.930 You will have users who place orders, and orders have items. It's pretty standard Rails stuff.

00:01:43.570 So, we're going to finish fetching a user's orders and render them on a page with a single app and single process.

00:01:50.920 But because software is always changing, you suddenly get a mobile app, and now you need authentication on two different sites.

00:02:02.229 So, let's break that out into a service. This could be running in separate processes on the same server, a different server, or a totally different server.

00:02:09.310 Now you have two applications, and let's say things change again.

00:02:14.350 Now you want to show users recommendations for things to buy based on their purchases.

00:02:20.830 So, the team in charge of making recommendations includes data scientists and people who do machine learning.

00:02:26.819 As your engineering teams grow and your products grow, you don’t have to be unlike a microservices bandwagon to find yourself in this situation.

00:02:33.900 A service might be written in a different language or have infrastructure requirements than another.

00:02:39.599 As our web apps grow larger, they start to look less like a garden and more like a collection of different plants in different pots.

00:02:47.640 So, you might find yourself needing to change your tooling.

00:02:53.940 Where does distributed tracing fit into this bigger picture?

00:03:04.599 Let's say one day your e-commerce page starts loading really, really slowly.

00:03:10.890 From the perspective of you as the e-commerce team, you can look in your application performance monitoring software, like New Relic.

00:03:17.579 You can profile your app with your profiling tool, and you can see this recommendation service is taking a really long time to return recommendations.

00:03:23.220 So, with this single process monitoring tool, all the other services in your system look just like third-party API calls.

00:03:30.570 You get just as much information about calls to Stripe to charge users' credit cards as you do for a service running in-house by a different team.

00:03:36.990 From a user's perspective, it takes an extra 500 milliseconds to get recommendations, but you don't know why.

00:03:43.919 Without reaching out to the team managing the recommendation service, you have to get access to their profiling tools.

00:03:49.410 You also have to get time on their roadmap to start digging into your issue, and it becomes increasingly complicated.

00:03:55.320 In the end, you can't tell a coherent macro story about your application just by monitoring individual processes.

00:04:02.300 If you have ever done performance work, you will find that people are terrible guessers. It's always in the last place you look.

00:04:11.700 You may be very convinced about one thing, but a lot of times you have no idea where the bottleneck is.

00:04:18.690 So what can we do to increase visibility into these systems and have a macro-level story?

00:04:25.779 That’s where distributed tracing can help you out. It’s a way of commoditizing knowledge.

00:04:30.879 Adrienne Cole said this. He’s one of the systems maintainers, and he talks about how in these increasingly complex systems,

00:04:36.989 you want to give everyone involved the tools to understand the system as a whole without having to have experts in each individual silo do the investigation.

00:04:43.989 This is awesome! Maybe you’re like, cool, sell! I’m on board. Let’s do this.

00:04:50.289 You’ve got your services running in different processes, so why is distributed tracing not just easy?

00:04:57.089 There are a couple of different things. One big thing for me is that it is not a Ruby-specific thing.

00:05:05.770 The Ruby community doesn't talk a lot about tracing, and the people working in this ecosystem are largely in Java, Go, and Python.

00:05:12.309 It's not that there isn't Ruby tracing; it's that a lot of the examples you see online are written in other languages.

00:05:19.719 So that could be a bit of a hurdle. There’s also a lot of domain-specific vocabulary.

00:05:26.959 Reading through the docs can feel really slow, even if you're a very senior engineer.

00:05:32.899 Finally, the most frustrating thing for me was that the ecosystem is very fractured.

00:05:39.949 It’s still changing pretty rapidly, and it's about tracing everything everywhere.

00:05:47.079 So a distributed tracing solution has to support tons of different languages and different kinds of frameworks.

00:05:54.849 Navigating all those solutions to find out which one works for your company is not trivial.

00:06:00.669 How can you get past these hurdles? We're going to talk a little bit about the theory of how distributed tracing works, about the fundamentals.

00:06:07.089 Then we'll have a little short checklist for how you can maybe take this from theory to practice.

00:06:12.309 The basics: I'm going to start by talking about black box tracing.

00:06:19.879 The idea of a black box is that you can change anything you don't know about your system itself.

00:06:26.449 If you capture and log incoming traffic and outgoing traffic at a lower level than your application, like at the TCP level,

00:06:35.289 And then all that data from your internal services goes into a single log, you can guess the behavior based on lots of processing.

00:06:43.969 However, I'm not actually going to talk about black box tracing.

00:06:49.329 For us at Heroku and for many other companies, it turns out it’s not a great fit, and we'll talk about why that might be.

00:06:55.000 One reason is that you need a lot of data because it relies on statistical inference.

00:07:02.210 Also, because it uses statistical analysis, there could be a delay in returning results.

00:07:10.810 The biggest problem with black box tracing is that in an event-driven system, like if you use something like Sidekiq for async work,

00:07:18.130 or if you have a multi-threading system, you can't guarantee causality.

00:07:24.900 What does that mean from a coding perspective? There are serious papers on this topic.

00:07:31.640 To illustrate, let's look at a simple example. You've got Service 1 kicking off an asynchronous job.

00:07:36.710 Then immediately, you're going to make a synchronous request to Service 2.

00:07:43.450 If there’s no queue delay in your job, the timestamps will look like Service 1 calls the async job first, and then Service 2.

00:07:50.680 But if you start to have a delay, like simulating this with some sleep time, your timestamps will actually make it look like the second service made that call instead of the first service.

00:08:01.149 We can solve this problem using white box tracing.

00:08:07.430 White box tracing assumes that not only do you understand the system you're using but that you’re actually able to make changes to the systems themselves.

00:08:14.289 You can be sure about the path a request takes through your full system using something called metadata propagation.

00:08:21.990 In this type of tracing, you explicitly pass along information about where that request is and where it’s going.

00:08:28.000 This approach allows for real-time tracing without statistical inference being run.

00:08:36.440 Let’s take a little history lesson. The most well-known example for anyone starting to read about tracing is Dapper.

00:08:42.930 The open-source library that it inspired, Zipkin, came from this.

00:08:48.990 The Dapper paper was published by Google in 2010, and it wasn’t the first distributed system debugging tool built,

00:08:55.200 but it was super interesting because it ran at Google scale for years before publishing the paper.

00:09:01.210 It showed that this kind of tracing was not only viable but also super valuable.

00:09:08.230 At Google, they found it integral to their process, and thus, Zipkin came from that.

00:09:15.560 It started as a happy project during a hack week at Twitter, and it was open-sourced in 2012.

00:09:23.700 The project is currently led and maintained by Adrian Cole.

00:09:30.130 So when I mention distributed tracing, I’ll be using that term specifically to refer to Dapper and Zipkin-like systems.

00:09:37.130 White box metadata propagation tracing linked systems is too hard to say and I would screw it up every time.

00:09:44.230 So let's see how we actually do this. I’m going to walk us through the major components that power most distributed tracing systems.

00:09:52.390 First of all, we have the tracer, which is the instrumentation you actually install in your application.

00:09:58.600 You can use Bundler with a Ruby gem, then there's a transport component.

00:10:05.210 That takes the data collected in your application and transports it to a collector.

00:10:11.380 Finally, there’s a UI component that allows you to view all that collected data.

00:10:17.110 Let's talk about the level closest to your application— the tracer.

00:10:23.360 This is where we're actually tracing the individual requests, and it typically lives inside your application.

00:10:30.040 You install it using a gem, just like you would with New Relic.

00:10:36.400 It will start tracing your application and record data so we can tell the story as the request goes through each application.

00:10:43.920 Think of a request in our e-commerce application as a single trace, and the components that make it up are called spans.

00:10:50.700 A trace is made up of many spans. Each piece of the story is a span.

00:10:57.240 For example, calling out to the order service and getting a response back is a single span.

00:11:03.130 Any discrete piece of work can be traced, including your database calls, and it doesn’t have to be network-related.

00:11:09.520 We want to map out the system and understand what's happening. We start with an ID, the easiest thing.

00:11:15.700 This ID says this is trace 1, 2, 3, and it went all the way through our system.

00:11:22.580 You can query your logs later to see that trace 1, 2, 3 did all of these things.

00:11:29.930 However, you’ll run into the same issue you've seen with black box tracing, which is that you can’t guarantee causality strictly based on your logs.

00:11:35.800 You need to explicitly create a relationship between each of the components using a parent-child relationship.

00:11:42.010 The first request that comes into our system doesn’t have a parent, so we know that's the top of the tree.

00:11:50.650 When the auth process talks to the e-commerce process, it modifies the request headers and passes a randomly generated ID.

00:11:56.950 This parent ID might be set to one, but it could be anything, and it keeps doing that for each request.

00:12:02.570 A trace is ultimately made up of many parent-child relationships.

00:12:09.050 This relationship forms a directed acyclic graph. With all of this information, you’re able to describe the system with data structure.

00:12:16.200 You can visualize this graph to see how the components work together.

00:12:22.420 What you really want to know in your service world is what’s happening— what problems are occurring?

00:12:29.900 By passing along timestamp annotations, you can begin to understand latency in your system.

00:12:38.160 You shouldn’t see a full second of latency between each of these things.

00:12:45.320 Let’s look at how these annotations fit into our system.

00:12:52.070 The authentication process will annotate that it sent the request.

00:12:57.990 The e-commerce app will annotate that it received a request.

00:13:04.320 By comparing the timestamps, you can see the network latency between the two.

00:13:09.020 If you have many requests queuing up, you’ll see that time go up between them.

00:13:16.080 On the other hand, if you compare the time between the server receiving the information and sending it back,

00:13:22.300 this will show you how long it actually took to process the request.

00:13:29.330 If your app is really slow, that’s where you’d see your time start going up.

00:13:36.880 Finally, the client received annotation will close out that span, indicating that it’s complete.

00:13:44.050 This data needs to be transported for you to look at it, sending it via a transport layer to a separate application.

00:13:50.960 It’s going to aggregate all that data and display it.

00:13:56.490 How does this not add latency to your system?

00:14:03.590 First, it only propagates IDs in-band; it passes along those IDs in the headers.

00:14:10.540 But it reports out-of-band, using an additional thread to communicate with the system, so it doesn't block your requests.

00:14:16.780 Zipkin uses the sucker punch gem to make a threaded call-out to the server.

00:14:24.230 Finally, it's ready to be viewed in a UI. What does that UI actually look like?

00:14:32.020 The graph we saw earlier is great for showing the flow of the request.

00:14:40.470 It effectively illustrates how our systems are connected, but it doesn’t good for solving latency questions.

00:14:48.449 Distributed tracing systems use swim lanes or Gantt charts instead.

00:14:54.710 The open tracing documentation has really good examples of this.

00:15:02.180 In this format, you’ll see the different services, while the visualization helps you see how much time is spent in each sub-request.

00:15:09.170 Those gray bars represent things instrumented inside the same service.

00:15:15.450 The billing service in green is blocked by the authorization service in red.

00:15:23.210 If you see one of those bars widening—such as the red bar— it suggests that your service itself is getting slower.

00:15:29.620 If you see a widening gap between two adjacent blocking services, that could mean there’s queuing behind other network requests.

00:15:44.700 Now we talked about what we want; that's what we want to see; how is it done? We know we want to record information when a request comes into our systems.

00:15:51.920 How do we do that programmatically? Rack middleware is a pretty common way.

00:15:58.220 If you're running a Ruby app, you'll probably be using Rack. Ruby uses this interface called Rack.

00:16:04.170 It serves as a single entry and exit point for client requests that come in and out of your applications.

00:16:10.440 You add middleware that sits between your server and your application, allowing you to customize requests and responses.

00:16:17.870 Don’t worry too much if you're not familiar with what a Rack app looks like; this is a TL;DR.

00:16:24.760 Your middleware will mimic a Rack app—it’s a neat object that responds to the 'call' method.

00:16:31.590 It takes one parameter, the environment hash, and returns a response—the status, headers, and body.

00:16:38.360 Let’s write some pseudocode for the tracing middleware. We’ll have a method called trace that takes a block and does some tracing.

00:16:45.270 The trace method will start a new span on every request because we want to tell a new story for each service.

00:16:52.380 It will record that it received the request, yield to the Rack app to let it operate, then record that the server sent the response.

00:16:59.300 This is pseudocode, but if you want to see a great example of what this looks like, the Zipkin documentation is open source and super easy to read.

00:17:06.200 We would tell our Ruby app to use our middleware to instrument all incoming and outgoing requests.

00:17:12.410 You never want to sample every request; this is one common confusion.

00:17:19.280 People want to see every single request and drill down, but it’s really not what distributed tracing was designed for.

00:17:26.030 You’ll slow down your application or take down your tracing infrastructure if you send every single request.

00:17:32.620 Start with a sample rate like 0.1, and always err on the side of a lower sample rate.

00:17:38.970 So now that our requests are coming in, we also need to record annotations on the requests leaving our system.

00:17:45.090 These can be requests to external APIs like Stripe or GitHub.

00:17:51.600 However, if your company controls the next service, you'll be able to explore deeper into that tree.

00:17:58.510 You’ll want to use a middleware that incorporates tracing.

00:18:05.010 If you're using an HTTP client that supports middleware, it’s pretty easy to incorporate tracing.

00:18:12.390 Faraday is a good example with a familiar structure to Rack middleware.

00:18:18.930 You'll pass in your HTTP client, do some tracing, and call down the chain in your middleware.

00:18:26.320 On the tracing side, you’ll manipulate headers to pass along the tracing IDs.

00:18:34.000 If you call out to an internal service that’s already been instrumented, that service will know it’s being traced.

00:18:41.610 If it's a third-party API like GitHub or Stripe, they will ignore your headers, and you'll see the start and end.

00:18:48.530 Each of these colors would represent an instrumented application.

00:18:55.910 With the incoming instrumentation, we record that the server sent and received requests.

00:19:02.370 We also want to report that we’re starting a client request and ensure that we receive it.

00:19:09.980 As before, we’re adding our middleware, so here we inject it into Faraday.

00:19:17.200 These are the basics of how distributed tracing works.

00:19:24.080 Now let's talk about how you can choose the right tracing solution for you.

00:19:31.190 I don’t have any answers on this; spoiler alert! I wish I did.

00:19:38.000 This part is harder, but I hope to at least raise your awareness of some issues and trade-offs.

00:19:43.800 First major question: how do you get this working? The ecosystem is changing.

00:19:51.820 Even since I wrote this, new tracing infrastructure has been introduced.

00:19:58.770 Should you pay for an out-of-the-box system? Yes, if you can afford it and don't have a team to run it.

00:20:05.520 It’s always nicer to let others manage the software for you.

00:20:12.580 Companies like Traceview or Datadog may come to mind.

00:20:19.340 Your APM provider may actually offer a distributed tracing solution.

00:20:27.310 You may also want to adopt an open-source solution.

00:20:32.800 If your team controls the underlying framework and has capacity for managing infrastructure, this might work.

00:20:39.400 We ended up doing just that, partially because it’s really nice to run things on Heroku.

00:20:47.460 That made it easier for us to get systems up and running.

00:20:53.800 The main open-source options are Zipkin and OpenCensus, which partly came out of the Google ecosystem.

00:21:01.490 One thing that tripped me up early on was OpenTracing. Everyone was talking about it.

00:21:07.910 It took a while to understand that OpenTracing is not an implementation; it’s not something you install.

00:21:15.120 It’s actually an API to standardize instrumentation across providers.

00:21:22.290 If you want to switch from one open-source provider to another or from a paid to another, it theoretically works.

00:21:29.600 However, in practice, it hasn’t been 100% adopted—it’s a work in progress.

00:21:37.020 Each tracing provider has to change their API to match OpenTracing.

00:21:43.920 For example, only LightStep has built a bridge to the OpenTracing API, which makes things complicated.

00:21:51.130 And it doesn't standardize interoperability, so if you end using different solutions, it won’t work smoothly together.

00:21:59.550 Another important factor is to check language support when picking a tracing solution.

00:22:06.930 The minimum support means starting and ending a trace inside of a Ruby app.

00:22:14.630 While a provider may support Ruby, you’ll usually need to write a custom Rack middleware.

00:22:21.360 You might need to repeat this process for every language your ecosystem has.

00:22:28.440 Some big companies build their own systems in-house.

00:22:35.520 However, most companies are trying to contribute back to the ecosystem.

00:22:43.970 If you’re curious about what companies use for their tracing tools, there’s a paper revealing what 26 companies are using.

00:22:51.050 But be weary— it could be outdated.

00:22:57.850 Another part of your checklist should cover infrastructure requirements.

00:23:04.880 Does your ops team have any concerns about maintaining new infrastructure?

00:23:12.380 Make sure all transport layers and the collection layer are compatible with every service you want to implement.

00:23:18.990 For example, my team primarily programs in Ruby, but we also use a Java service.

00:23:26.250 You should be comfortable running something that may be a hindrance compared to your current setup.

00:23:33.280 Some tracing solutions require you to run a collection agent on the actual server.

00:23:38.230 For us, that’s a total deal-breaker, so make sure this is not required.

00:23:45.400 You also need to determine how you can secure who can submit and view data in your system.

00:23:51.760 Zipkin doesn't have built-in authentication, so you'll need to implement something.

00:23:59.640 You might run it inside a VPN or use a reverse proxy for authentication.

00:24:06.120 We ended up using basic auth to secure our Zipkin instance.

00:24:14.350 If your organization has PII or sensitive data, keep an eye on security issues.

00:24:20.770 One is custom instrumentation. Be careful not to send PII inadvertently while logging.

00:24:27.430 Understand what happens if data leaks. If you're using third-party services, know how they manage data.

00:24:35.240 If you’re running services yourself, make sure you can delete sensitive data as required.

00:24:42.510 Lastly, consider the cognitive load of distributed tracing.

00:24:49.640 Getting everyone on board can be challenging if your teams are distributed in time and space.

00:24:56.750 Set yourself up for success. Don’t skip the roadmap, or you’ll be opening PRs.

00:25:04.170 This is a checklist I’ll tweet out later.

00:25:10.370 You might be thinking: this is too much information; where do I start?

00:25:17.990 Try running a local version of Zipkin.

00:25:24.700 It has a test version of Cassandra built-in, even though you need to get the Java app running.

00:25:32.660 It’s relatively easy, and you can deploy it on Heroku.

00:25:39.470 I encourage you to play around with it. Try to instrument an application that makes a single API call.

00:25:46.540 It can be an external API call, and just try instrumenting one application.

00:25:53.190 It will help you become more fluent with how all this works.

00:25:59.660 One final plug for Heroku: thank you for helping me get here.

00:26:05.430 We have another guide, other than Amy’s, who is in our Sydney office. If you’d like to talk business or have any questions, feel free to ask!

00:26:12.510 Thank you!

Stella Cotton

7 talks

RubyConf AU 2018