Talks

Keynote: Playing Games in the Clouds

@nodunayo
What does haggling at a garage sale have to do with load balancing in distributed systems? How does bidding in an art auction relate to cloud service orchestration? Familiarity with the ideas and technologies involved in cloud computing is becoming ever more important for developers. This talk will demonstrate how you can use game theory - the study of strategic decision making - to design more efficient, and innovative, distributed systems.
"Talk given at GORUCO 2015: http://goruco.com"

GoRuCo 2015

00:00:13.910 I want to start with a story about two university students and their final chemistry exam. They had received A's in chemistry all semester, but on the night before their final exam, they went partying in another state and didn't get back to school until the exam was over. They had to think of a convincing excuse, so they came up with a story. They approached their professor and said, 'We're really sorry. We went to visit some friends last evening, and we would have made it back in time, except for the fact that we had a flat tire. Can we please sit a makeup test?' The professor thought about it, eventually agreed, and wrote out another test for them. When the time came for them to take it, she sent them to two separate rooms. They sat down and turned the page to the first question, which they thought was easy. Then they turned the page to the second question, which simply asked: which tire?
00:01:30.620 What would you write in this situation? It was stories like this that got me into game theory. I'm a software engineer at Pivotal Labs in London, and I've been mainly working on their Cloud Foundry platform, which is a Platform as a Service. I found the cloud domain challenging to grasp. When I don't understand something, I find it helpful to frame it into concepts that I'm familiar with; for me, that was game theory. I studied economics at university, and game theory was one of my favorite modules. Through my readings, I discovered how game theory can provide simplicity and clarity in what is a complicated space.
00:01:52.950 We'll explore this by discussing a simple problem. First, we'll have a brief introduction to game theory. Then, I have two games for us to play. The first one looks at bargaining, and the second one presents an auction scenario. After that, we'll look at how these simple examples have led to interesting real-world case studies. By the end, I hope to show you not only that game theory is a useful tool for working with distributed systems, but also that by using game theory, you can discover how to efficiently allocate resources in the systems you build or work with.
00:03:01.370 So let's begin with more about game theory. Game theory is the study of strategic interactions between rational agents, and its applications include economics, politics, biology, and computer science. It was developed by John von Neumann and Oskar Morgenstern in the 1940s, with many other scholars extending it significantly in the 1950s. One such notable scholar was John Nash, who was tragically killed with his wife, Alicia, in a car accident in New Jersey last month. Game theory applies to a range of games and interactions, and we typically see it applied to behavioral scenarios. However, despite its breadth, all games, no matter how complex, have some basic components.
00:03:43.290 So what is a game? There are at least two players, each player has a set of strategies based on the information they have and the actions they can take, and there are payoffs to each player for every possible outcome. We will look at games that help us solve a load balancing problem. Consider a distributed system of two computers, A and B. We have a scenario where jobs are being sent into a task distributor, which then has to decide how many jobs to send to each computer.
00:04:04.810 Let's look at our first game. One afternoon, you're walking along the street with your friend Gemma, and you come across a bag of sweets on the floor. You both want the sweets, and now you have to negotiate how to split them. Bargaining theory attempts to provide a precise answer to how you will ultimately divide those sweets. In other words, bargaining theory asks how surplus will be split between agents. In this case with the sweets, the surplus is simply the acquisition of some sweets, where the maximum surplus you can achieve is equivalent to you getting all the sweets and Gemma getting none. One way of answering this question is to look at something called a Nash bargaining solution. You can plot a chart showing all the different possible payoffs depending on how the sweets are divided up, and the outer edge of the payoff set represents all the possible combinations where the full surplus is achieved.
00:06:03.000 This means that all of the sweets are divided between you and Gemma, while anywhere in the purple area means that after you split up the sweets, you’ve left some behind on the floor. On the axis representing Gemma’s payoff, we have the case where she takes the entire bag of sweets, while on the axis representing your payoff, we see the case where you claim all the sweets. We find something called the Nash bargaining solution point on the frontier of the payoff set, where both you and Gemma receive a high utility. Once you are at that point, there is nowhere else in the payoff set where you can increase the payoff for one of you without decreasing the payoff for the other.
00:06:48.120 In this example, I'm assuming that your and Gemma's payoffs are linear and the same with respect to the sweets. This means that one additional sweet, or two, or three in your pocket represent identical payoffs or happiness. However, in many game theory examples, the situation becomes more complicated, and the frontier tends to be curved rather than straight. This represents the fact that agents typically have nonlinear and differing responses to marginal increases or decreases regarding whatever is being bargained for. For simplicity, we'll assume linear payoffs, meaning the bag of sweets will be split in half. But you're probably wondering how this sweets-splitting approach relates to load balancing in distributed systems.
00:08:15.610 We can use bargaining theory to think about how workload should be allocated within a distributed system. In a system where some computers are heavily loaded and others have lighter loads, cases of poor system performance can arise. We want to mitigate this performance degradation. Returning to our distributed system with computers A and B, we want to consider how to model them. The players are the computers, and each has a different processing rate, which we know from the beginning of the game. Each computer's strategy is to accept a pre-agreed number of incoming jobs, and the payoffs for each computer relate to their final load, assuming that computers with lighter loads are better off.
00:09:37.290 We have two computers: one can handle a maximum of five incoming jobs every second before crashing, while the other can handle two. Together, they need to process four jobs each second. Here's what's happening: jobs are fed in from a main computer, a certain number of jobs are sent to computer A, and a certain number are sent to computer B. We want to know how many jobs each computer will agree to accept in this bargaining game, and we want to distribute the jobs optimally, minimizing execution time. What I mean by optimal is that once we determine the job allocation, we cannot reallocate the jobs to improve performance for at least one computer without worsening performance for the other.
00:11:14.640 To break this down, we assume each computer has a payoff function, which we can view as happiness. In this example, I will say that happiness equals the logarithm of X minus Y, where X is the computer's processing rate, and Y is the rate of jobs arriving at the computer. Don't worry too much about the logarithm; it simply captures the fact that as more jobs arrive at a computer, the response time to complete those jobs grows sharply as the load approaches the computer's capacity. Take the computer that can handle five jobs every second: if I'm that computer, I am happiest when I can sit around and do little to no work. As the number of jobs you send me increases, I become unhappier; the fewer jobs I have to complete, the shorter the time it takes me to complete them, and therefore the happier I am.
00:13:16.900 So we want to find the maximum of a function that looks like this. This is what you get when you take one of those happiness functions for each of the computers, substitute in their processing rates, and add them together. We also have the constraint that the total number of jobs arriving across the two computers must equal four, as stated in the problem. If we plot this on a graph, we can see where happiness is greatest in the system. When we solve for that point, we find that the number of jobs to send to computer A should be three and a half per second, leaving half a job per second to be sent to computer B. Here's what that might look like on a bargaining diagram.
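Here is a minimal sketch of that calculation in Python. It is my own illustration, not code from the talk: it maximizes the sum of the two happiness functions, log(5 − jobs to A) + log(2 − jobs to B), over a fine grid of feasible splits of the four jobs.

```python
import math

RATE_A, RATE_B = 5.0, 2.0    # maximum jobs per second for each computer
TOTAL_JOBS = 4.0             # jobs per second the pair must handle together

def joint_happiness(jobs_to_a):
    """Sum of log(processing rate - arrival rate), the quantity the
    Nash bargaining solution maximises in this two-computer example."""
    jobs_to_b = TOTAL_JOBS - jobs_to_a
    return math.log(RATE_A - jobs_to_a) + math.log(RATE_B - jobs_to_b)

# Feasible allocations: B can take at most 2 jobs, so A must take more than 2
# and strictly fewer than 4 for both logarithms to be defined.
candidates = [2.0 + i * 0.001 for i in range(1, 2000)]
best = max(candidates, key=joint_happiness)

print(f"jobs to A: {best:.2f}, jobs to B: {TOTAL_JOBS - best:.2f}")
# -> jobs to A: 3.50, jobs to B: 0.50
```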
00:14:47.860 We can see the joint payoff set for the system with a curved frontier. This is the line where all four jobs are being completed, and it's curved because the processing rates of the computers mean that marginal increases and decreases in the number of jobs they handle result in different performance levels, depending on which computer we are looking at and its current capacity. With the allocation of three and a half jobs to computer A, the response time based on the numbers given is 1.33 seconds. But what happens if we were to give a bit more work to computer A? We look at this point over here: the response time rises to 1.39 seconds. Now, what if we shifted some work to computer B instead? In that case, the response time becomes 1.36 seconds.
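To see where those numbers could come from, here is a rough check, again my own sketch rather than the talk's: it assumes each computer's response time behaves like 1 / (processing rate − arrival rate), which is what the logarithmic happiness function above encodes, and the shifted allocations (3.8 and 3.3 jobs to A) are plausible guesses that reproduce the figures quoted.

```python
def total_response_time(jobs_to_a, rate_a=5.0, rate_b=2.0, total=4.0):
    """Sum of the two computers' response times, assuming each behaves
    like a simple queue with response time 1 / (rate - load)."""
    jobs_to_b = total - jobs_to_a
    return 1 / (rate_a - jobs_to_a) + 1 / (rate_b - jobs_to_b)

print(total_response_time(3.5))   # ~1.33 s  (the bargaining solution)
print(total_response_time(3.8))   # ~1.39 s  (shifting work towards A)
print(total_response_time(3.3))   # ~1.36 s  (shifting work towards B)
```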
00:16:17.570 What I'm trying to show is that when you calculate the job allocation based on the Nash bargaining solution, there's no way to reallocate those jobs without increasing the response time. By using game theoretic principles, we can find efficient distributions of resources within our computer systems. We've only looked at an example with two computers, but we can extrapolate the same calculation to any number of computers. This enables us to think about algorithms that can dynamically and efficiently distribute jobs as they queue within a system. There will be more on this later in the talk, but for now, let's examine another game: the auction.
00:19:01.520 In the end, you and Gemma decide that instead of splitting up the sweets, you'd be better off taking them to the school playground and auctioning them off among your friends. So how does this auction work? We’re going to use a second-price auction setup. Participants in this auction for your sweets will simultaneously submit their bids in an envelope. The person who makes the highest bid wins the sweets, but they'll only pay the value of the second-highest bid. This may seem like a strange setup, but there are auctions today, like eBay and Google’s online advertising programs, that utilize this concept. Let’s look at an example with three participants: Lucy bids $25, Mark bids $5, and Helen bids $10. In this second-price auction setup, Lucy wins because she made the highest bid, but she only pays $10, the second-highest bid.
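As a toy version of that rule (my illustration, using the numbers from the example):

```python
def second_price_auction(bids):
    """Highest bidder wins but pays only the second-highest bid."""
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]
    return winner, price

bids = {"Lucy": 25, "Mark": 5, "Helen": 10}
print(second_price_auction(bids))   # ('Lucy', 10): Lucy wins and pays Helen's $10 bid
```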
00:20:50.600 Auction theory asks how an agent should bid given varying intrinsic values for the object being auctioned. I want to introduce the concept of a dominant strategy. A strategy is dominant if it gives the player their highest payoff regardless of what the other players do. A dominant strategy is not the same as a winning strategy; it does not guarantee that you will win, and not every game has one. If a dominant strategy exists, it will guarantee the highest payoff, so you are always better off playing it. The reason I introduce that now is to show you that in a second-price auction game, it is the dominant strategy to bid truthfully. In the case of the sweets, it's in the best interest of the participants to bid the amount that the bag of sweets is worth to them.
00:21:43.140 Let's assume all three participants bid with the true values they hold for the sweets. For participant payoffs, it's zero if you don't win the sweets, or the true value you place on the sweets minus the second-highest bid. For Lucy, that amount is $15 because she values the sweets at $25 and only pays $10 for them. In a second-price auction format, there is no way for Lucy, Helen, or Mark to rebid such that their payoffs improve. Let’s focus on Helen. She came in second in this auction. There's no point in her bidding any lower than $10, since she'll still lose, and any bid up to $25 makes no difference to her. But consider if she had bid $26 to move into a winning position; she would end up with a payoff of -$15, because she values the sweets at $10 but now has to pay Lucy's bid of $25 for them.
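You can check that dominant-strategy claim mechanically. A small sketch (mine, not the speaker's), holding Lucy's and Mark's bids fixed and trying alternative bids for Helen:

```python
def payoff(true_value, my_bid, other_bids):
    """Second-price payoff: zero if you lose, otherwise your true value
    minus the highest competing bid (which is what you pay)."""
    highest_other = max(other_bids)
    if my_bid <= highest_other:      # lose (ties broken against us, for simplicity)
        return 0
    return true_value - highest_other

others = [25, 5]                     # Lucy's and Mark's bids
for bid in [1, 5, 10, 24, 26, 40]:
    print(bid, payoff(true_value=10, my_bid=bid, other_bids=others))
# Every losing bid pays 0; any winning bid above $25 yields 10 - 25 = -15.
# Bidding her true value of $10 is never worse than any alternative.
```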
00:23:55.880 Thus, there is no other bid other than her true value of $10 that makes sense for Helen. This result holds true for any participant in a second-price auction. We can use auction theory to reveal the resource capabilities of machines within distributed systems. Sometimes, computers are owned by self-interested agents, and load allocation algorithms can fall prey to manipulation. How can we model this? Again, we have the computers as players; each has different capabilities that are only known to them. Each computer's strategy is to announce a bid, which is a statement about its capabilities. For example, a computer may have 10 gigabytes of memory, so an honest bid would represent that number.
00:25:18.330 The payoff to each computer is related to the outcome of the auction. For instance, consider two computers: computer A has 15 gigabytes of memory and computer B has 10, so they will enter an auction to win the task of running a set of jobs. Computer A will submit its bid as Bid A, while computer B will represent its capabilities as Bid B. These bids will reflect how much memory each computer has. Ideally, we want the job to run on the machine with the most memory, but the computers may have other motivations. Now we can return to our diagram, which includes an auctioneer. The auctioneer will announce an auction to run the jobs, and each computer will submit their bids to the auctioneer.
00:26:59.330 Upon receiving the bids, the auctioneer selects the winner and communicates the results back to the task distributor, which sends the incoming jobs to the winning machine. Behind the scenes, the computers are programmed with payoff functions that mimic the second-price auction setup. This example is somewhat abstract and differs from the previous one. In this simplified model, we will ignore the burdens on the machine caused by the jobs it runs. Consider computer A's potential payoff, given that it has 15 gigabytes of memory. If computer A does not participate, its payoff is zero. If it loses, because it bids lower than computer B, its payoff is also zero. However, if computer A wins the auction, its payoff is 15 minus the amount that computer B bid.
00:27:50.840 As before, this example will show that the dominant strategy for the computers is to truthfully report their memory. For computer A, it is always in its best interest to bid 15 gigabytes. First, computer A never wants to bid higher than 15. For example, if it bids 15 while computer B bids 8, computer A wins and has a payoff of 7 (15 minus 8). If computer A were to bid higher than 15, say 20, there would be no gain in payoff, due to the structure of the auction. Now consider the case where computer A bids 50 while computer B bids 17. By bidding above its true capabilities, computer A wins an auction it should have lost and ends up with a negative payoff of -2 (15 minus 17).
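A minimal sketch of those payoffs (my illustration; the bid pairs are the ones from the example plus one made-up underbidding case):

```python
def a_payoff(a_bid, b_bid, a_true_memory=15):
    """Computer A's payoff under the second-price setup: zero if it loses,
    its true memory minus B's bid if it wins."""
    if a_bid <= b_bid:
        return 0
    return a_true_memory - b_bid

print(a_payoff(15, 8))    #  7: truthful bid, A wins and pays B's bid of 8
print(a_payoff(20, 8))    #  7: overbidding gains nothing, the price is still B's bid
print(a_payoff(50, 17))   # -2: overbidding wins an auction A should have lost
print(a_payoff(10, 12))   #  0: underbidding loses an auction a truthful bid would have won
```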
00:29:00.050 Similarly, computer A has no incentive to understate its capabilities. Underbidding never changes the price it pays when it wins, and it risks losing auctions that a truthful bid would have won with a positive payoff, meaning it is best for the computers to report their actual capabilities honestly. This framework helps create stability and predictability in load allocation. Using concepts like the second-price auction, we can design systems that incentivize machines to truthfully report their capabilities, maintaining efficient job distribution. Now, we’ve examined several examples and made numerous assumptions throughout our discussion.
00:30:36.400 For example, we've assumed that all jobs and tasks are the same. But how can we differentiate between tasks, such as those belonging to more compute-demanding processes? What about aspects like deadlines or the desire to run certain jobs on the same machine? In the auction example, we ignored the burden of running jobs and focused only on abstract second-price auction payoffs. There are more assumptions to consider, but the simple starting points we addressed are crucial because they help establish models with which to analyze these problems. These foundational concepts have already led to interesting real-world applications. To revisit that first bargaining scenario, in a paper titled 'Load Balancing: A Game Theoretic Approach,' Daniel Grosu shows how to develop an algorithm based on the Nash bargaining solution.
00:32:30.920 Imagine a distributed system with three computers, A, B, and C, requiring them to handle 7 jobs every second. After initial calculations to determine how many jobs to allocate to each computer, computer C returns a negative value in this framework, indicating that it is too slow to be effective. Consequently, we would remove it from the system before redoing the optimization. We repeat the calculation until all computers in the system yield positive values for the number of jobs they will accept. The final outcome is the unique bargaining point dictated by the Nash bargaining solution, which means we've achieved the desired optimal result: no reallocation of jobs can improve one computer's performance without worsening another's.
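A rough sketch of that loop in Python, based on my reading of the idea rather than the paper's own code: at the bargaining point every active computer is left with the same spare capacity, so any computer whose share comes out negative is dropped and the calculation repeats. The three processing rates here are made-up numbers chosen so that computer C drops out.

```python
def nash_bargaining_allocation(rates, total_jobs):
    """Split `total_jobs` per second across computers so every active
    computer keeps the same spare capacity (rate minus assigned jobs);
    computers whose share would be negative are removed and the
    calculation is repeated."""
    active = dict(rates)                                   # name -> processing rate
    while True:
        spare = (sum(active.values()) - total_jobs) / len(active)
        allocation = {name: rate - spare for name, rate in active.items()}
        too_slow = [name for name, jobs in allocation.items() if jobs <= 0]
        if not too_slow:
            # computers removed along the way get zero jobs
            return {name: allocation.get(name, 0.0) for name in rates}
        for name in too_slow:
            del active[name]

rates = {"A": 5.0, "B": 3.5, "C": 0.5}   # hypothetical jobs/second each can process
print(nash_bargaining_allocation(rates, total_jobs=7.0))
# {'A': 4.25, 'B': 2.75, 'C': 0.0} -- C is too slow to help and drops out
```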
00:34:59.180 During experiments set up using this algorithm, comparisons were made against other allocation methods that optimized for various parameters. In the tests, this algorithm was not only simpler to understand and compute, but it also offered each job an equal expected response time, regardless of which computer executed it. The slowest computers were only lightly used by the algorithm, whereas other frameworks tended to overload certain servers, resulting in inefficient performance. Now, on to auctions within the Cloud Foundry platform: the system I work on at Pivotal uses auction elements to orchestrate where applications are deployed.
00:36:45.350 At a basic level, the process works as follows: a user asks the platform for a specific number of application instances. An auctioneer then requests bids from all the virtual machines in the system, and the machines respond with their capabilities. A blog post by a Cloud Foundry engineering intern explains how these bids are built from various data points, such as available memory and disk space. Given certain constraints, for example that a machine's available memory must meet or exceed the memory required by the application instances, the auctioneer can work out where to place applications. The surface-level operation seems straightforward, yet this structure emerged from a comprehensive rewrite of Cloud Foundry, simplifying a previously convoluted setup.
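A deliberately simplified sketch of that flow, with hypothetical field names and a stand-in scoring rule rather than the real Cloud Foundry code: each machine reports its free memory and disk, the auctioneer keeps only the machines that satisfy the instance's requirements, and then favours the least-loaded one.

```python
from dataclasses import dataclass

@dataclass
class Bid:
    cell: str
    free_memory_mb: int
    free_disk_mb: int

def pick_winner(bids, required_memory_mb, required_disk_mb):
    """Keep bids that satisfy the instance's requirements, then favour the
    machine with the most free memory (a stand-in scoring rule)."""
    eligible = [b for b in bids
                if b.free_memory_mb >= required_memory_mb
                and b.free_disk_mb >= required_disk_mb]
    if not eligible:
        raise RuntimeError("no machine can run this instance")
    return max(eligible, key=lambda b: b.free_memory_mb).cell

bids = [Bid("vm-1", free_memory_mb=2048, free_disk_mb=8192),
        Bid("vm-2", free_memory_mb=512,  free_disk_mb=4096),
        Bid("vm-3", free_memory_mb=4096, free_disk_mb=1024)]
print(pick_winner(bids, required_memory_mb=1024, required_disk_mb=2048))  # 'vm-1'
```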
00:38:33.900 Game-theoretic concepts have enabled these engineers to find simplicity and more effective solutions. Earlier, I discussed an example demonstrating how agents can be incentivized to truthfully report their resources. The concern there is that when resources are owned by self-interested agents, performance can suffer if those agents manipulate the load allocation algorithm; the core difficulty is incomplete information about their true capabilities. To uncover this information, frameworks are built around concepts derived from the second-price auction. One of these frameworks is the Vickrey-Clarke-Groves mechanism, in which computers are programmed with a profit function equal to the payment received minus the cost of computation. The payment acts as an incentive to report capabilities truthfully, ensuring efficiency.
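The full Vickrey-Clarke-Groves machinery is broader than this, but for a single task it collapses to a reverse second-price auction, which is enough to show the incentive. A small sketch with made-up numbers, assuming each machine reports the cost of running the task: the cheapest machine wins and is paid the second-cheapest reported cost, so its profit is the payment minus its true cost.

```python
def run_procurement_auction(reported_costs):
    """Single-task VCG special case: the lowest reported cost wins,
    and the payment equals the second-lowest reported cost."""
    ranked = sorted(reported_costs.items(), key=lambda item: item[1])
    winner = ranked[0][0]
    payment = ranked[1][1]
    return winner, payment

true_costs = {"m1": 4.0, "m2": 6.0, "m3": 9.0}
winner, payment = run_procurement_auction(true_costs)   # truthful reports
profit = payment - true_costs[winner]
print(winner, payment, profit)   # m1 runs the job, is paid 6.0, and profits 2.0
```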
00:40:40.290 What do game-theoretic approaches mean more broadly for the types of systems we envision? Nassim Nicholas Taleb introduces the concept of antifragility in his book Antifragile, which describes systems that learn from feedback and gain strength in the face of failures. Systems in which individual machines optimize using game-theoretic algorithms can adapt to challenging scenarios: if a machine malfunctions or a network partition occurs, the algorithms can dynamically reallocate jobs as needed, and so such systems may exhibit some of these antifragile characteristics.
00:41:49.370 As we wrap up, let’s review the specific insights we gained today. When we know the capabilities of the machines within a distributed system, tools such as the Nash bargaining solution help guide the allocation of resources. When we operate without knowledge of those capabilities, frameworks centered on second-price auctions prove tremendously effective in revealing them. Like people, computers can behave in ways we cannot fully predict, and when we encounter that uncertainty, we need methods for resolving the ambiguity. This is where game theory proves invaluable: it helps us get a handle on complex problems by modeling the components as a game.
00:42:44.160 You do not need to be a mathematician to leverage game theory as a tool to understand intricate problems. As a starting point, simply consider how you might model parts of your system as a game and inquire about the contracts or competitions that need to occur among components. Despite the vast mathematical context, this approach can unlock a realm of intriguing, simple, yet potent algorithms that can yield superior system performance. So, what can you do now? You could start with this book; it may not look like much, but it serves as an introduction to game theory filled with narratives from multiple domains, from sports to military strategy. When you read, consider how these examples might translate into computer system scenarios.
00:43:59.750 There are numerous game types I didn’t explore, so if you're interested in cloud security, you might identify interesting parallels with non-cooperative game theory. If you find yourself involved with clustering or databases like Cassandra, you might investigate insights from evolutionary game theory. Or, if you are curious about the platform-as-a-service landscape, coalition theory could provide guidance. For those looking to delve deeper, I've included some references I used while preparing this talk. One final question remains: would you like to play again? Thank you.