Talks

Guardrails: Keeping Customer Data Separate in a Multi-Tenant System

by Miles McGuire

In the presentation titled "Guardrails: Keeping customer data separate in a multi-tenant system" at Rails World 2023, Miles McGuire, a Staff Engineer at Intercom, discusses a project aimed at enhancing data separation in their complex multitenant Rails application. This talk outlines a systematic approach to address potential security issues and improve observability within a 12-year-old codebase.

Key Points:
- Context of Multitenant Systems: McGuire explains that in a multitenant system, customer data is stored in a shared database despite being logically separate. This structure can lead to serious issues if the separation is not strictly maintained.
- Complex Data Model: With nearly 800 Active Record models, Intercom's data model has grown increasingly complex over the years, raising the risk of accidental data leakage between customers.
- Identifying Problems: Examples highlight how common practices led to security risks. For instance, mishandled URI parsing or misunderstood non-unique identifiers in a sharded database can cause the wrong customer's data to be retrieved.
- Need for Systematic Solutions: The existing reliance on best practices and manual QA was insufficient, prompting the necessity for a structured solution.
- Scope of the Solution: The team prioritized focusing on safe loading of Active Record objects and cache operations, aiming to define strict app-specific data access rules.
- Implementation Strategy: By establishing a "safe app" context for each request, they could regulate data access, checking app IDs on model instantiation to catch unauthorized reads.
- Results: The implementation led to a significant reduction in the instantiation of Active Record objects without a defined safe app, enhancing both security and performance.
- Observability Gains: By tracking data interactions through Honeycomb, the team discovered new insights into user activity, aiding in better resource management and decision-making.
- Conclusion: McGuire underscores the importance of formalizing assumptions in data modeling, which not only improved security but also facilitated broader enhancements in their system.

Ultimately, Intercom's structured approach serves as a case study in how complex systems can safeguard customer data while enhancing operational efficiency and observability.

00:00:15.280 Hey everyone! Hope you enjoyed the break. How are you enjoying Rails World? Great! Thanks! As you already heard, I'm Miles McGuire, a Staff Engineer on Intercom's data stores team.
00:00:22.400 The data stores team at Intercom is responsible for managing our Rails deployment, caching databases, and various other core technologies. Today, I'm going to talk about a project I worked on last year called 'Guardrails.' As the subtitle suggests, we're focusing on keeping data separate in a multitenant system.
00:00:40.239 So what does that mean, and what will we discuss? Firstly, I want to emphasize that what I present today reflects how we at Intercom solved our specific problems. More specifically, I will share how we applied this on a 12-year-old Rails codebase containing nearly 2 million lines of code. We did all of this while causing the least disruption possible for our engineers. However, I should caution that results may vary; this isn't a one-size-fits-all recipe. It's something to digest and think about in your own context.
00:01:17.760 For additional context, let me briefly describe Intercom. We launched in 2011 with the mission to make internet businesses personal. Our customers use Intercom Messenger to chat with their customers on their websites. We've been built on Rails since day one, and over time, we've sometimes diverged from standard Rails practices, although we've recently updated to Rails 7.
00:01:42.520 If you want to know more about Intercom's history and how we got here, you should check out my colleague Brian's talk tomorrow, which I believe is at 15:45. There's another term we need to define before continuing: what do I mean by multitenant? Simply put, in a multitenant system, if we have two customers—let's call them Customer A and Customer B—even though their data may logically be separate, all that information lives together in one shared database. This shared infrastructure serves all requests.
00:02:05.039 So, what problems have we been facing, and what are we trying to solve? Our data model at Intercom has grown quite complex over 12 years of development. For context, we currently have almost 800 Active Record models, with a large team of engineers working on many interconnected features. We want those engineers to move as quickly as possible, but moving fast in a complicated environment increases the risk of mistakes slipping through. Linting and specs can catch some of these, but they're only as effective as the mental model of whoever wrote them.
00:02:44.000 What do those mistakes look like in practice? In one example, we parse a referrer with Addressable::URI, then look up an Active Record model called Help Center Site by custom domain using uri.host. On the surface, that seems innocuous. In another instance, we read from the Rails cache using a key built from a conversation ID. Contextually, we might assume that key is uniquely tied to one Active Record object, but we're really just appending a string to an ID, which can lead to issues.
00:03:44.440 Taking that first example further, it turns out that the referrer can sometimes be 'about:blank.' That detail causes problems: if you pass 'about:blank' to Addressable::URI, the host field comes back nil. And since the Help Center Site's custom domain field in Intercom's data model is nullable (not every customer uses a custom domain), this poses a significant risk.
00:04:12.000 The issue isn't obvious while coding or writing specs, but when that lookup by custom domain runs with uri.host set to nil, from 'about:blank' as the referrer, it matches the first row in the database whose custom domain is NULL, regardless of which site you intended to look up. This can lead to significant problems.
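To make that concrete, here is a minimal sketch of the failure mode, assuming the model is named HelpCenterSite and custom_domain is the nullable column (the talk does not show Intercom's actual code):

```ruby
require "addressable/uri"

referrer = "about:blank"                 # a value browsers really send
uri = Addressable::URI.parse(referrer)
uri.host                                 # => nil for "about:blank"

# Because custom_domain is nullable, a nil host turns this query into
# `WHERE custom_domain IS NULL LIMIT 1`, which happily returns some
# other customer's row instead of returning nothing:
HelpCenterSite.find_by(custom_domain: uri.host)
```
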
00:04:34.400 In the second example, things get even trickier. Conversations are Active Record models, but their IDs are not unique: our architecture uses a sharded database, and primary key columns are only unique per customer. Compounding the problem, conversations used to be ordinary Active Record models whose IDs were globally unique, so older code can still assume uniqueness. To handle this correctly, you need to be aware of Intercom's complete context. Building a cache key from a non-unique conversation ID might produce a cache hit that returns data belonging to a different customer's object entirely, which can create serious issues.
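A hedged sketch of that cache hazard, with illustrative names; on a sharded database the conversation ID alone cannot distinguish customers, so the key needs the app ID too:

```ruby
# Unsafe: primary keys are only unique per app, so this key can
# collide between two different customers' conversations.
Rails.cache.fetch("conversation-#{conversation.id}") do
  build_conversation_summary(conversation)   # hypothetical expensive work
end

# Safer: namespacing the key by app ID prevents cross-customer hits.
Rails.cache.fetch("app-#{conversation.app_id}-conversation-#{conversation.id}") do
  build_conversation_summary(conversation)
end
```
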
00:05:05.000 This highlights the dangers of subverting expectations, as we’ve deviated from standard Rails conventions. Engineers are burdened with extra work to mitigate this issue, creating additional cognitive load to ensure they get things right. At the time, we relied on best practices and tribal knowledge to handle these problems, along with manual QA, while also depending on security researchers in our bug bounty program to identify such issues. This ad hoc approach ultimately slowed down teams across the board.
00:05:46.720 To improve the situation, we needed systems thinking. We took a step back and examined our data model at Intercom to identify the root of our problems. Internally, we refer to customer workspaces as 'apps.' Each database table has an app ID column pointing back to the apps table, with these tables distributed across numerous database clusters—in total, about 10 MySQL clusters.
00:06:03.200 Furthermore, Intercom users are termed 'admins,' and while admins can access multiple apps, they are not tied to any single app. This means requests are authenticated on a per-admin basis, and each session can involve switching between multiple apps, which complicates our approach. That leads us to the core problem outlined in our scoping document: an app should never be able to access another app's private data, whether through create, read, update, or delete operations. Unfortunately, we previously had no systematic protection against this and relied on engineering best practices to keep customer data separate.
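Putting that together, the shape of the data model as described is roughly the following (illustrative classes, not Intercom's real ones):

```ruby
class App < ApplicationRecord
  # One row per customer workspace; every other table points back here.
  has_many :conversations
end

class Conversation < ApplicationRecord
  belongs_to :app   # app_id column; IDs are unique only within one app
end

class Admin < ApplicationRecord
  # Deliberately not scoped to a single app: one authenticated session
  # can switch between several apps.
end
```
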
00:06:42.320 Private data can encompass a multitude of elements. To scope our solution effectively, we had to consider every data store: MySQL, Elasticsearch, Memcache, DynamoDB, S3, and others, each with different access patterns. For example, with Active Record one might call pluck to pull specific database columns without instantiating a model, or use select to instantiate a model with only certain columns loaded. So we reviewed the issues we had encountered over time to identify which types of problems actually occurred in practice.
00:07:15.520 That exercise yielded a prioritized list of issues. Priority one was loading Active Record objects from MySQL. Priority two was reading and writing data in Memcache. Priority three, writing to MySQL with Active Record, hadn't produced as many issues in practice as reading. Priority four involved reading only specific columns of a model, as in the pluck and select cases. All other data stores, including DynamoDB, Elasticsearch, and S3, became lower priority. Ordering the work this way covered the largest share of the problems we had actually seen.
00:07:56.000 Once all this was outlined, we chose to cut scope aggressively, realizing we could omit the bottom three priorities and instead focus on loading Active Record objects in MySQL and reading and writing data from the cache—effectively addressing most prior issues. Our guiding principle became simple: if we could identify something potentially dangerous, we could raise an exception. This would allow us to convert a high-risk security incident into a straightforward availability incident.
00:08:23.000 Now, what did our solution look like? The first step was to define the right app. As mentioned, our authentication process tells us the right admin, but not which app is right. We needed to establish the correct app for each request and save that context. We assumed a simple model in which every request has exactly one right app; that assumption is slightly optimistic, but it let us establish the context without immediately loading from the database.
00:09:11.000 To accomplish this, we take the app object and store it in a thread-local variable, creating a wrapper that looks something like this in code. Inside that block, the 'safe app' would be set. In principle, this solution is viable, but rolling it out across our codebase proved challenging, given the multitude of edge cases arising from a decade of development.
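The slide itself isn't reproduced here, but a minimal stand-in sketch, assuming a SafeApp module of our own naming, could look like this:

```ruby
module SafeApp
  # Runs the block with `app` recorded as the one right app for this
  # request, restoring whatever was set before when the block exits.
  def self.with(app)
    previous = Thread.current[:safe_app]
    Thread.current[:safe_app] = app
    yield
  ensure
    Thread.current[:safe_app] = previous
  end

  def self.current
    Thread.current[:safe_app]
  end
end

# e.g. from an around_action, where current_app stands for however
# your controllers resolve the request's app:
SafeApp.with(current_app) do
  # all data access inside the block is attributable to current_app
end
```
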
00:09:58.320 We tracked down those edge cases by adding a callback on initialize in Active Record. Whenever a model is instantiated, we check whether the safe app is set. If it is, we compare the safe app's ID to the model's app ID. If no safe app is set, we emit a metric; if a safe app exists and doesn't match, we emit a different metric. This strategy allowed for a progressive rollout while minimizing disruption for engineers.
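A sketch of that check, reusing the SafeApp module from the previous snippet; Metrics stands in for whatever StatsD-style client is at hand (both names are assumptions, not Intercom's actual code):

```ruby
module GuardrailsCheck
  extend ActiveSupport::Concern

  included do
    after_initialize :guardrails_check_app_id
  end

  private

  def guardrails_check_app_id
    return unless has_attribute?(:app_id)

    safe_app = SafeApp.current
    if safe_app.nil?
      Metrics.increment("guardrails.no_safe_app_set")
    elsif app_id && app_id != safe_app.id
      Metrics.increment("guardrails.app_id_mismatch")
      # Once both metrics reach zero, this branch can raise instead,
      # turning a potential data leak into an availability error.
    end
  end
end

ActiveRecord::Base.include(GuardrailsCheck)
```
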
00:10:35.440 As we gathered those metrics, our goal was to reduce them to zero. For several months, I shared a performance graph with my team showing the number of Active Record objects instantiated with no safe app set every five minutes. Previously, we instantiated around a billion objects every five minutes—12 billion per hour. After shipping a change around 2:30 PM that reduced that figure by nearly 90%, we sustained that momentum until we approached zero. Fixing these edge cases was a massive undertaking that required hundreds of pull requests and a comprehensive review of Intercom's Rails codebase.
00:11:24.120 Through this comprehensive exercise, we documented edge cases that previously went unaddressed. For example, certain controllers behaved differently, or we needed to retrieve the safe app via other models. This identification process revealed critical gaps in our approach, ultimately leading to enhanced resilience in our application. The initiative underscored the importance of formalizing the assumptions we've made regarding our data modeling, which had accumulated organically over time.
00:12:06.600 Once we had the safe app context attributing every request properly, we discovered intriguing possibilities with this implementation. For one, we began tagging all our traces in Honeycomb, letting us visualize the experience of an individual customer. That was not a primary goal initially, but it turned out to be genuinely useful for understanding customer interactions.
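As a rough illustration of that tagging, assuming the honeycomb-beeline gem and the SafeApp module from earlier (the field and dataset names are made up for the example):

```ruby
require "honeycomb-beeline"

Honeycomb.configure do |config|
  config.write_key = ENV["HONEYCOMB_WRITE_KEY"]
  config.dataset   = "rails"
end

# e.g. in a Rack middleware or controller filter, once the safe app is set:
if (app = SafeApp.current)
  # Propagates the field to every span in the current trace, so any
  # query can be filtered down to a single customer's experience.
  Honeycomb.add_field_to_trace("app.id", app.id)
end
```
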
00:13:00.520 By analyzing this data, we gained insights into resource usage across our system. We could determine whether certain customers were profitable based on their compute time utilization, allowing us to make informed business decisions. This also addressed challenges engineers faced when needing to work with different shards, simplifying routing queries and, in emergencies, allowing us to shut down web requests or async jobs for individual customers.
00:13:49.520 In conclusion, the most significant takeaway from this whole process was how critical it was to formalize our existing assumptions concerning data modeling. Documenting all those organic decisions that developed over time turned out to be invaluable. We not only achieved our original goals by creating a more secure infrastructure but also opened doors for numerous other improvements to our platform.
00:14:28.680 Thank you very much for your attention! I appreciate your interest in this topic.