00:00:05.460
This is actually my third year at RubyConf Australia, and it is my favorite conference in the whole wide world. Such a treat!
00:00:11.219
So, like Mel said, I'm an engineer and today I'm going to talk a little bit about distributed tracing. I know it's basically the end of the conference and everybody is super tired and really fried.
00:00:20.580
I use a really cheap trick to keep people awake but also to encourage myself to drink water, so it's like the worst thing ever but I love it.
00:00:27.630
When I take a sip of water, I would like everyone to clap and cheer for me like I'm doing something awesome, so we're going to try it out!
00:00:34.230
Although, like Amy mentioned earlier, we ran out of Heroku swag. If you'd like some, go to this URL and we'll actually mail it to you.
00:00:41.070
So, let's talk a little bit about distributed tracing. This is my team. I work on the tools team, and we spent some time implementing distributed tracing at Heroku for our internal services.
00:00:48.120
I will say that a lot of the trial, effort, and discovery that went into this talk was actually a team effort from these awesome folks.
00:00:53.670
Let's start with the basics because I feel like a lot of engineers don't know what distributed tracing is. When I mention it, I get a lot of blank looks, which is great because you are in the right place.
00:01:06.030
I just want a quick show of hands: who actually knows what distributed tracing is here? This is great! Awesome! Okay, a couple of people.
00:01:12.070
For those who do, how many have it at their company? Amy? All right, we've got a couple.
00:01:20.170
Distributed tracing gives us the ability to trace requests across a distributed system boundary.
00:01:26.320
I'm going to give a simple use case: let's say you have an e-commerce site and you want to show users all of the recent orders they've placed.
00:01:32.050
In a monolithic architecture, you have one web process that's going to return information in that web request.
00:01:37.930
You will have users who place orders, and orders have items. It's pretty standard Rails stuff.
00:01:43.570
So, we're going to finish fetching a user's orders and render them on a page with a single app and single process.
00:01:50.920
But because software is always changing, you suddenly get a mobile app, and now you need authentication on two different sites.
00:02:02.229
So, let's break that out into a service. This could be running as a separate process on the same server, or on a different server entirely.
00:02:09.310
Now you have two applications, and let's say things change again.
00:02:14.350
Now you want to show users recommendations for things to buy based on their purchases.
00:02:20.830
So, the team in charge of making recommendations includes data scientists and people who do machine learning.
00:02:26.819
As your engineering teams grow and your products grow, you don't have to be on the microservices bandwagon to find yourself in this situation.
00:02:33.900
A service might be written in a different language or have different infrastructure requirements than another.
00:02:39.599
As our web apps grow larger, they start to look less like a garden and more like a collection of different plants in different pots.
00:02:47.640
So, you might find yourself needing to change your tooling.
00:02:53.940
Where does distributed tracing fit into this bigger picture?
00:03:04.599
Let's say one day your e-commerce page starts loading really, really slowly.
00:03:10.890
From the perspective of you as the e-commerce team, you can look in your application performance monitoring software, like New Relic.
00:03:17.579
You can profile your app with your profiling tool, and you can see this recommendation service is taking a really long time to return recommendations.
00:03:23.220
So, with this single process monitoring tool, all the other services in your system look just like third-party API calls.
00:03:30.570
You get just as much information about calls to Stripe to charge users' credit cards as you do for a service running in-house by a different team.
00:03:36.990
From a user's perspective, it takes an extra 500 milliseconds to get recommendations, but you don't know why.
00:03:43.919
Without distributed tracing, you have to reach out to the team managing the recommendation service and get access to their profiling tools.
00:03:49.410
You also have to get time on their roadmap to start digging into your issue, and it becomes increasingly complicated.
00:03:55.320
In the end, you can't tell a coherent macro story about your application just by monitoring individual processes.
00:04:02.300
If you have ever done performance work, you will find that people are terrible guessers. It's always in the last place you look.
00:04:11.700
You may be very convinced about one thing, but a lot of times you have no idea where the bottleneck is.
00:04:18.690
So what can we do to increase visibility into these systems and have a macro-level story?
00:04:25.779
That’s where distributed tracing can help you out. It’s a way of commoditizing knowledge.
00:04:30.879
Adrian Cole said this. He's one of the Zipkin maintainers, and he talks about how in these increasingly complex systems,
00:04:36.989
you want to give everyone involved the tools to understand the system as a whole without having to have experts in each individual silo do the investigation.
00:04:43.989
This is awesome! Maybe you’re like, cool, sold! I’m on board. Let’s do this.
00:04:50.289
You’ve got your services running in different processes, so why is distributed tracing not just easy?
00:04:57.089
There are a couple of different things. One big thing for me is that it is not a Ruby-specific thing.
00:05:05.770
The Ruby community doesn't talk a lot about tracing, and the people working in this ecosystem are largely in Java, Go, and Python.
00:05:12.309
It's not that there isn't Ruby tracing; it's that a lot of the examples you see online are written in other languages.
00:05:19.719
So that could be a bit of a hurdle. There’s also a lot of domain-specific vocabulary.
00:05:26.959
Reading through the docs can feel really slow, even if you're a very senior engineer.
00:05:32.899
Finally, the most frustrating thing for me was that the ecosystem is very fractured.
00:05:39.949
It’s still changing pretty rapidly, and distributed tracing is about tracing everything, everywhere.
00:05:47.079
So a distributed tracing solution has to support tons of different languages and different kinds of frameworks.
00:05:54.849
Navigating all those solutions to find out which one works for your company is not trivial.
00:06:00.669
How can you get past these hurdles? We're going to talk a little bit about the theory of how distributed tracing works, about the fundamentals.
00:06:07.089
Then we'll have a little short checklist for how you can maybe take this from theory to practice.
00:06:12.309
The basics: I'm going to start by talking about black box tracing.
00:06:19.879
The idea of a black box is that you can't change, and don't need to know anything about, the system itself.
00:06:26.449
If you capture and log incoming and outgoing traffic at a lower level than your application, like at the TCP level,
00:06:35.289
and then send all that data from your internal services into a single log, you can infer the system's behavior with a lot of statistical processing.
00:06:43.969
However, I'm not actually going to talk about black box tracing.
00:06:49.329
For us at Heroku and for many other companies, it turns out it’s not a great fit, and we'll talk about why that might be.
00:06:55.000
One reason is that you need a lot of data because it relies on statistical inference.
00:07:02.210
Also, because it uses statistical analysis, there could be a delay in returning results.
00:07:10.810
The biggest problem with black box tracing is that in an event-driven system, like if you use something like Sidekiq for async work,
00:07:18.130
or if you have a multi-threading system, you can't guarantee causality.
00:07:24.900
What does that mean from a coding perspective? There are serious papers on this topic.
00:07:31.640
To illustrate, let's look at a simple example. You've got Service 1 kicking off an asynchronous job.
00:07:36.710
Then immediately, you're going to make a synchronous request to Service 2.
00:07:43.450
If there’s no queue delay in your job, the timestamps will look like Service 1 calls the async job first, and then Service 2.
00:07:50.680
But if you start to have a delay, like simulating this with some sleep time, your timestamps will actually make it look like the second service made that call instead of the first service.
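To make that concrete, here's a small Ruby sketch. A thread with a `sleep` stands in for a job sitting in a busy queue; the event names are made up for illustration. Sorted by timestamp alone, the logs suggest the async job happened after the call to Service 2, even though Service 1 kicked it off first:

```ruby
log = []

log << ["service_1_starts_work", Time.now]

# Simulated async job stuck behind a queue delay (think: a busy Sidekiq queue)
job = Thread.new do
  sleep 0.2
  log << ["async_job_runs", Time.now]
end

log << ["service_2_called", Time.now] # the synchronous call goes out immediately
job.join

# Ordering events by timestamp alone misattributes causality:
# the async job now appears to come after the call to Service 2
ordered = log.sort_by { |_, t| t }.map(&:first)
```

Here `ordered` ends with `"async_job_runs"`, so the logs alone can't tell you which service actually enqueued the job.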
00:08:01.149
We can solve this problem using white box tracing.
00:08:07.430
White box tracing assumes that not only do you understand the system you're using but that you’re actually able to make changes to the systems themselves.
00:08:14.289
You can be sure about the path a request takes through your full system using something called metadata propagation.
00:08:21.990
In this type of tracing, you explicitly pass along information about where that request is and where it’s going.
00:08:28.000
This approach allows for real-time tracing without statistical inference being run.
00:08:36.440
Let’s take a little history lesson. The most well-known example for anyone starting to read about tracing is Dapper.
00:08:42.930
That, and Zipkin, the open-source library it inspired.
00:08:48.990
The Dapper paper was published by Google in 2010, and it wasn’t the first distributed system debugging tool built,
00:08:55.200
but it was super interesting because it ran at Google scale for years before publishing the paper.
00:09:01.210
It showed that this kind of tracing was not only viable but also super valuable.
00:09:08.230
At Google, they found it integral to their development process, and Zipkin grew out of that work.
00:09:15.560
It started as a hack week project at Twitter, and it was open-sourced in 2012.
00:09:23.700
The project is currently led and maintained by Adrian Cole.
00:09:30.130
So when I mention distributed tracing, I’ll be using that term specifically to refer to Dapper and Zipkin-like systems.
00:09:37.130
Because "white box metadata propagation distributed tracing systems" is too hard to say, and I would screw it up every time.
00:09:44.230
So let's see how we actually do this. I’m going to walk us through the major components that power most distributed tracing systems.
00:09:52.390
First of all, we have the tracer, which is the instrumentation you actually install in your application.
00:09:58.600
In Ruby, you'd install it with Bundler as a gem. Then there's a transport component.
00:10:05.210
That takes the data collected in your application and transports it to a collector.
00:10:11.380
Finally, there’s a UI component that allows you to view all that collected data.
00:10:17.110
Let's talk about the level closest to your application— the tracer.
00:10:23.360
This is where we're actually tracing the individual requests, and it typically lives inside your application.
00:10:30.040
You install it using a gem, just like you would with New Relic.
00:10:36.400
It will start tracing your application and record data so we can tell the story as the request goes through each application.
00:10:43.920
Think of a request in our e-commerce application as a single trace, and the components that make it up are called spans.
00:10:50.700
A trace is made up of many spans. Each piece of the story is a span.
00:10:57.240
For example, calling out to the order service and getting a response back is a single span.
00:11:03.130
Any discrete piece of work can be traced, including your database calls, and it doesn’t have to be network-related.
00:11:09.520
We want to map out the system and understand what's happening. We start with an ID, the easiest thing.
00:11:15.700
This ID says this is trace 1, 2, 3, and it went all the way through our system.
00:11:22.580
You can query your logs later to see that trace 1, 2, 3 did all of these things.
00:11:29.930
However, you’ll run into the same issue you've seen with black box tracing, which is that you can’t guarantee causality strictly based on your logs.
00:11:35.800
You need to explicitly create a relationship between each of the components using a parent-child relationship.
00:11:42.010
The first request that comes into our system doesn’t have a parent, so we know that's the top of the tree.
00:11:50.650
When the auth process talks to the e-commerce process, it modifies the request headers and passes a randomly generated ID.
00:11:56.950
This parent ID might be set to one, but it could be anything, and it keeps doing that for each request.
00:12:02.570
A trace is ultimately made up of many parent-child relationships.
00:12:09.050
This relationship forms a directed acyclic graph. With all of this information, you’re able to describe the system with a data structure.
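As a sketch, here's how you might rebuild that graph in Ruby from a set of collected spans. The span records and service names are hypothetical; the point is that the parent-child links alone are enough to recover the tree:

```ruby
# Hypothetical span records collected for one trace; each span points
# at its parent, and a nil parent marks the root of the tree
spans = [
  { id: "a1", parent_id: nil,  service: "auth" },
  { id: "b2", parent_id: "a1", service: "ecommerce" },
  { id: "c3", parent_id: "b2", service: "orders" },
  { id: "d4", parent_id: "b2", service: "recommendations" }
]

# Index the spans by their parent so we can look up children cheaply
children = spans.group_by { |s| s[:parent_id] }

# Recursively rebuild the call graph from the parent-child links
def build_tree(span, children)
  kids = (children[span[:id]] || []).map { |c| build_tree(c, children) }
  { service: span[:service], calls: kids }
end

root  = spans.find { |s| s[:parent_id].nil? }
graph = build_tree(root, children)
```

The resulting `graph` shows auth calling ecommerce, which in turn calls both orders and recommendations.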
00:12:16.200
You can visualize this graph to see how the components work together.
00:12:22.420
What you really want to know in your service world is what’s happening— what problems are occurring?
00:12:29.900
By passing along timestamp annotations, you can begin to understand latency in your system.
00:12:38.160
You shouldn’t see a full second of latency between each of these things.
00:12:45.320
Let’s look at how these annotations fit into our system.
00:12:52.070
The authentication process will annotate that it sent the request.
00:12:57.990
The e-commerce app will annotate that it received a request.
00:13:04.320
By comparing the timestamps, you can see the network latency between the two.
00:13:09.020
If you have many requests queuing up, you’ll see that time go up between them.
00:13:16.080
On the other hand, if you compare the time between the server receiving the information and sending it back,
00:13:22.300
this will show you how long it actually took to process the request.
00:13:29.330
If your app is really slow, that’s where you’d see your time start going up.
00:13:36.880
Finally, the client received annotation will close out that span, indicating that it’s complete.
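The arithmetic on those four annotations is simple. With made-up millisecond timestamps for one span (cs = client sent, sr = server received, ss = server sent, cr = client received), the differences break the latency into its parts:

```ruby
# Hypothetical timestamp annotations (in milliseconds) for a single span
span = { cs: 0, sr: 40, ss: 290, cr: 330 }

request_network_time   = span[:sr] - span[:cs] # time on the wire going out
server_processing_time = span[:ss] - span[:sr] # time spent handling the request
response_network_time  = span[:cr] - span[:ss] # time on the wire coming back
```

In this example the server did 250ms of work, with 40ms of network time in each direction.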
00:13:44.050
This data needs to be transported for you to look at it, sending it via a transport layer to a separate application.
00:13:50.960
It’s going to aggregate all that data and display it.
00:13:56.490
How does this not add latency to your system?
00:14:03.590
First, it only propagates IDs in-band; it passes along those IDs in the headers.
00:14:10.540
But it reports out-of-band, using an additional thread to communicate with the system, so it doesn't block your requests.
00:14:16.780
Zipkin's Ruby tracer uses the sucker_punch gem to make a threaded call out to the server.
00:14:24.230
Finally, it's ready to be viewed in a UI. What does that UI actually look like?
00:14:32.020
The graph we saw earlier is great for showing the flow of the request.
00:14:40.470
It effectively illustrates how our systems are connected, but it isn’t good for answering latency questions.
00:14:48.449
Distributed tracing systems use swim lanes or Gantt charts instead.
00:14:54.710
The OpenTracing documentation has really good examples of this.
00:15:02.180
In this format, you’ll see the different services, while the visualization helps you see how much time is spent in each sub-request.
00:15:09.170
Those gray bars represent things instrumented inside the same service.
00:15:15.450
The billing service in green is blocked by the authorization service in red.
00:15:23.210
If you see one of those bars widening—such as the red bar— it suggests that your service itself is getting slower.
00:15:29.620
If you see a widening gap between two adjacent blocking services, that could mean there’s queuing behind other network requests.
00:15:44.700
Now we talked about what we want; that's what we want to see; how is it done? We know we want to record information when a request comes into our systems.
00:15:51.920
How do we do that programmatically? Rack middleware is a pretty common way.
00:15:58.220
If you're running a Ruby web app, you're probably already using an interface called Rack.
00:16:04.170
It serves as a single entry and exit point for client requests that come in and out of your applications.
00:16:10.440
You add middleware that sits between your server and your application, allowing you to customize requests and responses.
00:16:17.870
Don’t worry too much if you're not familiar with what a Rack app looks like; this is a TL;DR.
00:16:24.760
Your middleware will mimic a Rack app—it’s simply an object that responds to the 'call' method.
00:16:31.590
It takes one parameter, the environment hash, and returns a response—the status, headers, and body.
00:16:38.360
Let’s write some pseudocode for the tracing middleware. We’ll have a method called trace that takes a block and does some tracing.
00:16:45.270
The trace method will start a new span on every request because we want to tell a new story for each service.
00:16:52.380
It will record that it received the request, yield to the Rack app to let it operate, then record that the server sent the response.
00:16:59.300
This is pseudocode, but if you want to see a great example of what this looks like, the Zipkin documentation is open source and super easy to read.
00:17:06.200
We would tell our Ruby app to use our middleware to instrument all incoming and outgoing requests.
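That pseudocode can be sketched as runnable Ruby. This is a minimal sketch, not a real tracing library's API: the span bookkeeping (`sr`/`ss` timestamps, random IDs) is hypothetical, and a real tracer would also report the span out-of-band:

```ruby
# Minimal tracing middleware sketch: wrap the downstream Rack app,
# recording "server received" and "server sent" timestamps per request
class TracingMiddleware
  def initialize(app, spans = [])
    @app = app
    @spans = spans # stand-in for a reporter; a real tracer would ship these out
  end

  def call(env)
    trace(env) { @app.call(env) }
  end

  private

  def trace(env)
    span = { id: rand(2**64).to_s(16), sr: Time.now } # "server received"
    response = yield                                  # let the app do its work
    span[:ss] = Time.now                              # "server sent"
    @spans << span
    response
  end
end
```

You'd then add it to the stack with `use TracingMiddleware` in your config.ru, just like any other Rack middleware.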
00:17:12.410
One common point of confusion: you never want to sample every request.
00:17:19.280
People want to see every single request and drill down, but it’s really not what distributed tracing was designed for.
00:17:26.030
You’ll slow down your application or take down your tracing infrastructure if you send every single request.
00:17:32.620
Start with a sample rate like 0.1, and always err on the side of a lower sample rate.
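One way to sketch that decision in Ruby: key the sampling choice off the trace ID rather than a fresh random number, so every service in the request path makes the same keep-or-drop decision. (The modulo scheme here is a simplification for illustration.)

```ruby
SAMPLE_RATE = 0.1 # trace roughly 1 request in 10

# Deterministic: the same trace ID always gets the same answer,
# so services agree on whether a given trace is sampled
def sampled?(trace_id, rate = SAMPLE_RATE)
  (trace_id % 100) < (rate * 100)
end
```

At a 0.1 rate, exactly 10 out of every 100 consecutive trace IDs get sampled.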
00:17:38.970
So now that our requests are coming in, we also need to record annotations on the requests leaving our system.
00:17:45.090
These can be requests to external APIs like Stripe or GitHub.
00:17:51.600
However, if your company controls the next service, you'll be able to explore deeper into that tree.
00:17:58.510
You’ll want to use a middleware that incorporates tracing.
00:18:05.010
If you're using an HTTP client that supports middleware, it’s pretty easy to incorporate tracing.
00:18:12.390
Faraday is a good example with a familiar structure to Rack middleware.
00:18:18.930
You'll pass in your HTTP client, do some tracing, and call down the chain in your middleware.
00:18:26.320
On the tracing side, you’ll manipulate headers to pass along the tracing IDs.
00:18:34.000
If you call out to an internal service that’s already been instrumented, that service will know it’s being traced.
00:18:41.610
If it's a third-party API like GitHub or Stripe, they will ignore your headers, and you'll only see the start and end of that span.
00:18:48.530
Each of these colors would represent an instrumented application.
00:18:55.910
With the incoming instrumentation, we record that the server sent and received requests.
00:19:02.370
We also want to record that we’re starting a client request and when we receive the response.
00:19:09.980
As before, we’re adding our middleware, so here we inject it into Faraday.
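Here's a sketch of what that client-side middleware might look like. It's written gem-free so it runs standalone, but it mirrors the shape of a Faraday middleware. The `X-B3-*` header names follow Zipkin's propagation convention, while the trace context values passed in are made up:

```ruby
require "securerandom"

# Faraday-style client middleware sketch: inject tracing IDs into the
# outgoing request headers, then call down the middleware chain
class ClientTracer
  def initialize(app, trace_id:, current_span_id:)
    @app = app
    @trace_id = trace_id
    @current_span_id = current_span_id
  end

  def call(env)
    headers = (env[:request_headers] ||= {})
    headers["X-B3-TraceId"]      = @trace_id
    headers["X-B3-ParentSpanId"] = @current_span_id       # we become the parent
    headers["X-B3-SpanId"]       = SecureRandom.hex(8)    # new span for this hop
    @app.call(env) # continue down the chain to the actual HTTP call
  end
end
```

An instrumented downstream service would read those headers and continue the same trace; an uninstrumented one just ignores them.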
00:19:17.200
These are the basics of how distributed tracing works.
00:19:24.080
Now let's talk about how you can choose the right tracing solution for you.
00:19:31.190
I don’t have any answers on this; spoiler alert! I wish I did.
00:19:38.000
This part is harder, but I hope to at least raise your awareness of some issues and trade-offs.
00:19:43.800
First major question: how do you get this working? The ecosystem is changing.
00:19:51.820
Even since I wrote this, new tracing infrastructure has been introduced.
00:19:58.770
Should you pay for an out-of-the-box system? Yes, if you can afford it and don't have a team to run it.
00:20:05.520
It’s always nicer to let others manage the software for you.
00:20:12.580
Companies like Traceview or Datadog may come to mind.
00:20:19.340
Your APM provider may actually offer a distributed tracing solution.
00:20:27.310
You may also want to adopt an open-source solution.
00:20:32.800
If your team controls the underlying framework and has capacity for managing infrastructure, this might work.
00:20:39.400
We ended up doing just that, partially because it’s really nice to run things on Heroku.
00:20:47.460
That made it easier for us to get systems up and running.
00:20:53.800
The main open-source options are Zipkin and OpenCensus, which came out of Google.
00:21:01.490
One thing that tripped me up early on was OpenTracing. Everyone was talking about it.
00:21:07.910
It took a while to understand that OpenTracing is not an implementation; it’s not something you install.
00:21:15.120
It’s actually an API to standardize instrumentation across providers.
00:21:22.290
If you want to switch from one open-source provider to another, or from a paid provider to an open-source one, it theoretically just works.
00:21:29.600
However, in practice, it hasn’t been 100% adopted—it’s a work in progress.
00:21:37.020
Each tracing provider has to change their API to match OpenTracing.
00:21:43.920
For example, only LightStep has built a bridge to the OpenTracing API, which makes things complicated.
00:21:51.130
And it doesn't standardize interoperability, so if you end up using different solutions, they won’t work smoothly together.
00:21:59.550
Another important factor is to check language support when picking a tracing solution.
00:22:06.930
The minimum support means starting and ending a trace inside of a Ruby app.
00:22:14.630
While a provider may support Ruby, you’ll usually need to write a custom Rack middleware.
00:22:21.360
You might need to repeat this process for every language your ecosystem has.
00:22:28.440
Some big companies build their own systems in-house.
00:22:35.520
However, most companies are trying to contribute back to the ecosystem.
00:22:43.970
If you’re curious about what companies use for their tracing tools, there’s a paper revealing what 26 companies are using.
00:22:51.050
But be wary— it could be outdated.
00:22:57.850
Another part of your checklist should cover infrastructure requirements.
00:23:04.880
Does your ops team have any concerns about maintaining new infrastructure?
00:23:12.380
Make sure the transport layers and the collection layer are compatible with every service you want to instrument.
00:23:18.990
For example, my team primarily programs in Ruby, but we also use a Java service.
00:23:26.250
You should be comfortable running something that may be quite different from your current setup.
00:23:33.280
Some tracing solutions require you to run a collection agent on the actual server.
00:23:38.230
For us, that’s a total deal-breaker, so check whether that’s a deal-breaker for you too.
00:23:45.400
You also need to determine how you can secure who can submit and view data in your system.
00:23:51.760
Zipkin doesn't have built-in authentication, so you'll need to implement something.
00:23:59.640
You might run it inside a VPN or use a reverse proxy for authentication.
00:24:06.120
We ended up using basic auth to secure our Zipkin instance.
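As a rough sketch of that idea, here's basic auth checked by hand in a tiny Rack-style app. The credentials and the env-hash shape are hypothetical; in practice you'd more likely put a reverse proxy or `Rack::Auth::Basic` in front of the Zipkin UI:

```ruby
# Hypothetical credentials for illustration only
VALID_USER = "zipkin"
VALID_PASS = "secret"

# Check an HTTP Basic Authorization header from a Rack-style env hash
def authorized?(env)
  header = env["HTTP_AUTHORIZATION"] or return false
  scheme, encoded = header.split(" ", 2)
  return false unless scheme == "Basic" && encoded
  user, pass = encoded.unpack1("m").split(":", 2) # "m" = base64 decode
  user == VALID_USER && pass == VALID_PASS
end

protected_ui = lambda do |env|
  if authorized?(env)
    [200, {}, ["trace data"]]
  else
    [401, { "WWW-Authenticate" => 'Basic realm="Zipkin"' }, []]
  end
end
```

This keeps casual visitors out, though for anything holding sensitive data you'd want TLS and real credential management on top.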
00:24:14.350
If your organization has PII or sensitive data, keep an eye on security issues.
00:24:20.770
One is custom instrumentation. Be careful not to send PII inadvertently while logging.
00:24:27.430
Understand what happens if data leaks. If you're using third-party services, know how they manage data.
00:24:35.240
If you’re running services yourself, make sure you can delete sensitive data as required.
00:24:42.510
Lastly, consider the cognitive load of distributed tracing.
00:24:49.640
Getting everyone on board can be challenging if your teams are distributed in time and space.
00:24:56.750
Set yourself up for success. Get tracing onto other teams’ roadmaps early, or you’ll end up opening PRs against services you don’t own.
00:25:04.170
This is a checklist I’ll tweet out later.
00:25:10.370
You might be thinking: this is too much information; where do I start?
00:25:17.990
Try running a local version of Zipkin.
00:25:24.700
It has a test version of Cassandra built-in, even though you need to get the Java app running.
00:25:32.660
It’s relatively easy, and you can deploy it on Heroku.
00:25:39.470
I encourage you to play around with it. Try to instrument an application that makes a single API call.
00:25:46.540
It can be an external API call, and just try instrumenting one application.
00:25:53.190
It will help you become more fluent with how all this works.
00:25:59.660
One final plug for Heroku: thank you for helping me get here.
00:26:05.430
We have other Heroku folks here besides Amy, including people from our Sydney office. If you’d like to talk or have any questions, feel free to ask!
00:26:12.510
Thank you!