Predicting Performance Changes of Distributed Applications

00:00:15.769 I was born in Poland, where I studied informatics, and now I hold a PhD. During my research career, I have always focused on practical applications, and perhaps that's why I'm here today. I hope my practical applications will resonate with you.

00:00:28.849 Currently, I’m involved in a university technology initiative in Poland, and my research centers around distributed systems. Today, everything seems to revolve around distributed systems, which adds to the relevance of our discussion. Additionally, I use Ruby as a key tool in my research, which is another interesting aspect, as many of our applications are written in Ruby.

00:00:49.600 I also teach at the university, covering various topics, including Ruby on Rails. If you’re in Russia, you’re welcome to visit our Ruby user group, where we’re friendly, and we can discuss various topics. We will also have time for discussions after the presentations.

00:01:20.720 On a personal note, I have a wife and three kids, and I’m quite happy about it.

00:01:34.360 Now, let’s do a quick test together. Who here has heard about a distributed system called ‘the grid’? Yes, there are a few people, but not many, which is part of the challenge. I’ll provide a brief explanation to align our understanding and share my motivation for this research.

00:01:58.220 Imagine a large, worldwide supercomputer to which you can submit any computation you need and on which you can use any resource you can find. You don't need vast data centers of your own; you simply use it and pay the bill. You might think this is just cloud computing, but it differs significantly. It's like one enormous cloud with many vendors and many users, linked by a complex distributed system that matches users with resource providers.

00:02:31.020 This concept has been researched by scientists worldwide for many years. For example, at CERN, researchers developed monitoring systems that bridge the gap between standard developer tools and applications running in such complex systems. Debugging on a local machine is straightforward, but debugging applications that run across the globe is much harder.

00:02:54.680 To connect developer tools with these applications, we developed a monitoring system that could support interactive applications in this context. We needed online monitoring, which was complicated by the fact that, because the system is distributed, there was no standard or central management.

00:03:12.800 My role was to ensure security protocols were in place within this environment. It was an exciting challenge, but not the primary topic of this talk. After completing my work, someone raised an interesting question: while the security solution worked, it introduced additional overhead. Would the system still function effectively or would it slow down due to this added burden?

00:03:40.900 This led me to run some basic experiments measuring the overhead of secured connections compared with unsecured ones. The results indicated that the overhead wouldn't significantly affect the system's overall performance.

00:04:03.660 However, I started to wonder what might happen if we faced a more complex architecture. For instance, with microservices, it can be challenging to predict the overall system’s performance when one or two microservices begin logging excessive data.

00:04:19.800 Our goal as researchers is often to devise solutions based on formal methods. But the reality is that clients typically need immediate reassurance that a new solution will not compromise performance. They aren’t inclined to pay for extensive research and validation of formal models.

00:04:36.300 I aimed to find ways for developers to utilize formal methods without having to master them, allowing them to receive precise results without the steep learning curve.

00:05:02.740 Today, I will share the approach we’ve developed, along with a basic example to illustrate how it works, followed by two case studies exploring its application.

00:05:23.950 For those who are curious about scientific literature on this approach, I can provide a list of relevant research papers. Now let’s discuss the basics of the solution.

00:05:37.100 Our approach relies on simulation to analyze various scenarios and deliver results. To simulate effectively, we must first create a model using a Ruby-based DSL (Domain Specific Language) that is simple yet flexible.

00:06:04.210 The simulation process is complex, involving scientific methodology to ensure precision in results. Once the simulations are complete, the resulting data is saved, and we can iterate over these results in Ruby to generate useful statistics.
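As a rough illustration of the last step, iterating over saved results in Ruby, here is a minimal sketch. The record layout (`id`, `start`, `stop`) and the sample values are assumptions made for this example; they are not the format used by the tool described in the talk.

```ruby
# Hypothetical probe records, one per request; times are simulated milliseconds.
records = [
  { id: 1, start: 0,  stop: 120 },
  { id: 2, start: 10, stop: 90 },
  { id: 3, start: 20, stop: 220 },
  { id: 4, start: 30, stop: 130 },
  { id: 5, start: 40, stop: 540 }
]

# Nearest-rank percentile of an already sorted array.
def percentile(sorted, pct)
  rank = (pct / 100.0 * sorted.size).ceil
  sorted[[rank, 1].max - 1]
end

# Per-request processing time, sorted so percentiles can be read off directly.
durations = records.map { |r| r[:stop] - r[:start] }.sort
mean = durations.sum.to_f / durations.size

puts "requests: #{durations.size}"
puts "mean:     #{mean} ms"
puts "median:   #{percentile(durations, 50)} ms"
puts "p95:      #{percentile(durations, 95)} ms"
```

Reporting a high percentile next to the mean is a deliberate choice here: a few slow requests can dominate user experience while barely moving the average.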

00:06:22.130 Let's go over a very basic example. We will create a model involving a web server, web clients, and resources to run everything. Generally speaking, a web server is quite straightforward; it does nothing until it receives a request.

00:06:43.160 When a request is received, the server processes it using the CPU and then sends a response. To understand how long it takes to process requests, we would need to include some probes to record the start and stop times for each request.
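The server behaviour described here can be sketched in a few lines of plain Ruby. This is not the DSL from the talk; `ToyServer`, its method names, and the 50 ms service time are hypothetical, and the sketch assumes a single CPU that serves requests one at a time in arrival order.

```ruby
# A toy web server model: idle until a request arrives, then busy on one CPU.
class ToyServer
  attr_reader :probes

  def initialize(service_ms)
    @service_ms = service_ms  # CPU time needed per request (assumed constant)
    @cpu_free_at = 0          # simulated time at which the CPU is next idle
    @probes = []              # start/stop records, one per request
  end

  # Handle a request that arrives at `arrival` (simulated ms).
  # Returns the time at which the response is sent.
  def handle(id, arrival)
    start = [arrival, @cpu_free_at].max  # wait if the CPU is still busy
    stop = start + @service_ms
    @cpu_free_at = stop
    @probes << { id: id, arrival: arrival, start: start, stop: stop }
    stop
  end
end

server = ToyServer.new(50)
[0, 10, 200].each_with_index { |arrival, i| server.handle(i + 1, arrival) }

server.probes.each do |probe|
  waited = probe[:start] - probe[:arrival]
  puts "request #{probe[:id]}: waited #{waited} ms, finished at #{probe[:stop]} ms"
end
```

The second request arrives while the first is still being processed, so its probe shows a non-zero wait; the third arrives when the server is idle and starts immediately.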

00:07:06.559 The model for the web server is simple, but modeling a web client is more complex since two events occur: requesting data and processing incoming responses.

00:07:34.590 To handle request sending, we need to perform some CPU processing to prepare the request before sending it to the target web server.

00:07:42.940 We also handle response reception, recording the moment each response arrives with an accurate timestamp.

00:08:02.250 We need to define resources such as nodes to run the programs along with networks connecting these nodes for data transmission.

00:08:24.620 Once we have the programs and resources defined, we can start the processes: one for the server and two for the clients.
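Putting the pieces together, the following sketch models one server process and two client processes, each on its own single-CPU node, with a fixed network delay between nodes. All names and timing constants are assumptions chosen for illustration, and the code is a plain-Ruby approximation rather than the modeling DSL from the talk.

```ruby
# A node has one CPU; cpu_free_at is the simulated time (ms) when it is next idle.
Node = Struct.new(:name, :cpu_free_at)

PREP_MS    = 2   # client CPU time to prepare a request (assumed)
SERVICE_MS = 20  # server CPU time per request (assumed)
NET_MS     = 5   # one-way network delay between any two nodes (assumed)

# Occupy the node's CPU for `work` ms, starting no earlier than `at`.
# Returns the time at which the work finishes.
def run_cpu(node, at, work)
  start = [at, node.cpu_free_at].max
  node.cpu_free_at = start + work
  node.cpu_free_at
end

server_node  = Node.new(:server_node, 0)
client_nodes = [Node.new(:client_node1, 0), Node.new(:client_node2, 0)]

# Client side: prepare each request on the client's CPU, then send it.
requests = client_nodes.flat_map do |node|
  [0, 10].map do |sent|
    prepared = run_cpu(node, sent, PREP_MS)
    { client: node.name, sent: sent, arrives: prepared + NET_MS }
  end
end

# Server side: serve requests in arrival order (ties keep their original order),
# then deliver each response back over the network.
ordered = requests.each_with_index.sort_by { |req, i| [req[:arrives], i] }
log = ordered.map do |req, _i|
  done = run_cpu(server_node, req[:arrives], SERVICE_MS)
  received = done + NET_MS
  req.merge(received: received, round_trip: received - req[:sent])
end

log.each do |entry|
  puts "#{entry[:client]} sent at #{entry[:sent]} ms, round trip #{entry[:round_trip]} ms"
end
```

Because both clients share one server CPU, round-trip times grow as requests queue up, even though each individual request needs the same amount of work.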

00:08:43.820 Next, we save the complete model in a file and utilize more Ruby code to create an instance of our experiment.

00:09:01.360 This model can then be loaded, followed by starting the simulation and saving the output statistics for further analysis.
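The workflow of saving a model to a file, loading it, running it, and saving statistics might look roughly like the sketch below. `ToyModel` and its `node` and `process` keywords are invented for this example and are not the API of the speaker's tool; the "simulation" step is a stand-in that only counts what the model declares.

```ruby
require "json"
require "tempfile"

# Hypothetical mini-DSL: a model file is plain Ruby evaluated against this object.
class ToyModel
  attr_reader :nodes, :processes

  def initialize
    @nodes = []
    @processes = []
  end

  # DSL keyword: declare a node with a number of CPUs.
  def node(name, cpus: 1)
    @nodes << { name: name, cpus: cpus }
  end

  # DSL keyword: start a program as a process on a given node.
  def process(name, on:, program:)
    @processes << { name: name, node: on, program: program }
  end

  # Read a model file and evaluate it in the context of a fresh model.
  def self.load(path)
    model = new
    model.instance_eval(File.read(path), path)
    model
  end
end

# Save the complete model in a file.
model_file = Tempfile.new(["model", ".rb"])
model_file.write(<<~MODEL)
  node :server_node
  node :client_node, cpus: 2
  process :server,  on: :server_node, program: :web_server
  process :client1, on: :client_node, program: :web_client
  process :client2, on: :client_node, program: :web_client
MODEL
model_file.close

# Load the model, "run" it, and save the statistics for later analysis.
model = ToyModel.load(model_file.path)
stats = { nodes: model.nodes.size, processes: model.processes.size }

stats_file = Tempfile.new(["stats", ".json"])
stats_file.write(JSON.generate(stats))
stats_file.close

loaded = JSON.parse(File.read(stats_file.path))
puts loaded.inspect
```

Evaluating the file with `instance_eval` is what lets the model read like a declarative description while still being ordinary Ruby.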

00:09:21.840 This basic example provides a framework for modeling web servers. However, the more interesting part lies in two case studies that demonstrate the application of this simulation process.

00:09:40.650 The first case study revolves around Heroku. Some of you might be familiar with the controversy surrounding Rap Genius (now Genius) and Heroku.

00:09:54.600 If you wish, you can read about it in detail through the blog post available online, which provides an insightful commentary.

00:10:06.680 In this study, we attempted to model the impact of intelligent routing versus random routing in the deployment architecture on Heroku.

00:10:21.810 Originally, Heroku had implemented intelligent routing, where each Dyno handled only one request at a time and the router kept track of which Dynos were idle. If every Dyno was busy, a new request waited at the router until one became available.

00:10:41.060 However, Heroku later switched to a random routing approach, where each request was assigned randomly to any Dyno, regardless of whether that Dyno was already busy.

00:10:54.740 This raises the question: for clients, what does this change in routing mean for performance? Is there much of a difference between the two approaches?

00:11:15.930 The blog post I mentioned earlier includes a detailed analysis of simulations done using Ruby, exploring the implications of these routing changes.

00:11:28.930 I decided to replicate these simulations because I wanted to learn from the best and to see whether our approach could offer a fresh angle on this problem.

00:11:47.760 In essence, the simulation aimed to model the HTTP client-server interactions using both random and intelligent routing methods.

00:12:05.640 To model the random router, we defined it as a simple mechanism that selects any server from a list when it receives a request.
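A random router of this kind can be sketched as follows. The class name, the seeded generator, and the assumption that each Dyno works through its own backlog one request at a time are choices made for this example, not details taken from the talk's model.

```ruby
# Random routing: pick any dyno, even a busy one. Each dyno serves its own
# backlog one request at a time.
class RandomRouter
  def initialize(dyno_count, rng: Random.new(42))
    @free_at = Array.new(dyno_count, 0)  # time (ms) at which each dyno's backlog ends
    @rng = rng
  end

  # Route a request arriving at `arrival` that needs `service_ms` of work.
  # Returns [dyno index, completion time].
  def route(arrival, service_ms)
    dyno = @rng.rand(@free_at.size)          # no knowledge of dyno load is used
    start = [arrival, @free_at[dyno]].max    # queue behind that dyno's backlog
    @free_at[dyno] = start + service_ms
    [dyno, @free_at[dyno]]
  end
end

router = RandomRouter.new(3)
[0, 5, 10, 15].each do |arrival|
  dyno, done = router.route(arrival, 100)
  puts "request at #{arrival} ms -> dyno #{dyno}, done at #{done} ms"
end
```

The important property is that a request can land behind a long-running one even while other Dynos sit idle.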

00:12:23.660 Conversely, the intelligent router is a bit more complex, maintaining a queue for incoming requests and efficiently managing responses.
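One compact way to sketch the intelligent router is to send each request to the Dyno that becomes free earliest. When requests are handled in arrival order, this gives the same start times as holding requests in a single router queue and releasing the head of the queue whenever a Dyno goes idle. The class below is an illustrative approximation under that assumption, with hypothetical names.

```ruby
# Intelligent routing: requests wait in one shared queue at the router and
# are handed to a dyno only when that dyno is idle.
class IntelligentRouter
  def initialize(dyno_count)
    @free_at = Array.new(dyno_count, 0)  # time (ms) at which each dyno goes idle
  end

  # Route a request arriving at `arrival` that needs `service_ms` of work.
  # Requests must be routed in arrival order. Returns [dyno index, completion time].
  def route(arrival, service_ms)
    dyno = @free_at.each_index.min_by { |i| @free_at[i] }  # next dyno to go idle
    start = [arrival, @free_at[dyno]].max  # waits in the router queue if all are busy
    @free_at[dyno] = start + service_ms
    [dyno, @free_at[dyno]]
  end
end

router = IntelligentRouter.new(2)
[0, 10, 20].each do |arrival|
  dyno, done = router.route(arrival, 100)
  puts "request at #{arrival} ms -> dyno #{dyno}, done at #{done} ms"
end
```

With two Dynos and three requests, the third request waits at the router and starts as soon as the first Dyno finishes, rather than being stuck behind an arbitrary one.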

00:12:37.170 In this simulation, we analyzed the request processing times for both routing strategies under varied load conditions.

00:12:59.080 As we observed, the intelligent router performed well, successfully balancing requests while the random router faced delays, especially during peak load.

00:13:22.350 The results highlighted that under significant load, the intelligent router consistently outperformed the random routing method, confirming that intelligent routing provides a better user experience.
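A comparison of this kind can be approximated with a small self-contained simulation. The sketch below assumes exponentially distributed gaps between requests and exponentially distributed service times, ten Dynos, and roughly 80% utilization; all of these parameters are illustrative and are not the settings used in the talk or in the original blog post.

```ruby
# Mean time a request waits before processing starts, for one routing strategy.
def mean_wait(strategy, dynos: 10, requests: 20_000,
              mean_gap: 12.5, mean_service: 100.0, seed: 1)
  load_rng  = Random.new(seed)      # drives arrivals and service times
  route_rng = Random.new(seed + 1)  # separate stream, so both strategies see the same load
  free_at = Array.new(dynos, 0.0)   # time (ms) at which each dyno is next idle
  clock = 0.0
  total_wait = 0.0

  requests.times do
    # Exponential gaps and service times (an assumption of this sketch).
    clock += -mean_gap * Math.log(1.0 - load_rng.rand)
    service = -mean_service * Math.log(1.0 - load_rng.rand)

    dyno = case strategy
           when :random      then route_rng.rand(dynos)
           when :intelligent then free_at.each_index.min_by { |i| free_at[i] }
           else raise ArgumentError, "unknown strategy: #{strategy}"
           end

    start = [clock, free_at[dyno]].max
    total_wait += start - clock
    free_at[dyno] = start + service
  end

  total_wait / requests
end

random_wait      = mean_wait(:random)
intelligent_wait = mean_wait(:intelligent)

puts format("random routing:      mean wait %.1f ms", random_wait)
puts format("intelligent routing: mean wait %.1f ms", intelligent_wait)
```

Using a separate random stream for routing decisions keeps the workload identical across the two runs, so any difference in waiting time comes from the routing strategy alone.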

00:13:40.650 This led us to brainstorm potential optimizations. We wondered whether we could combine aspects of both routing strategies to improve overall performance.

00:14:02.180 We explored the idea of deploying multiple intelligent routers, removing the bottlenecks associated with keeping track of idle servers.

00:14:27.650 Next, we looked into the potential impact of deploying combinations of intelligent and random routers in a more extensive architecture.

00:14:38.960 Eventually, our simulations indicated that while using multiple intelligent routers improved speed, it didn't significantly outperform the standard intelligent routing model.

00:14:59.650 The second case study closely follows the first, centered around the decision to scale applications on Heroku effectively.

00:15:23.260 In this scenario, we compared single Dynos against double Dynos in a Rails application that runs on Unicorn. We were curious about whether changing from six single Dynos to three double Dynos would yield performance benefits.

00:15:47.220 One advantage we considered was Unicorn’s master process, which effectively manages how requests are distributed among its worker processes.

00:16:03.320 Rather than running eight lower-capacity workers on single Dynos, we could run fewer but more capable double Dynos, maximizing our performance benefits.
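The potential benefit of pooling workers inside a Dyno can be illustrated with a small simulation. The sketch assumes random routing between Dynos, one worker per single Dyno and two workers per double Dyno, identical per-worker speed, and requests inside a Dyno going to whichever worker is free first (a simplification of how Unicorn workers share a listening socket). The worker counts and load parameters are assumptions for illustration; in particular, equal per-worker speed is exactly what was uncertain on the real platform.

```ruby
# Mean wait before processing starts, with random routing between dynos and
# a shared queue among the workers inside each dyno.
def mean_wait(dynos:, workers_per_dyno:, requests: 20_000,
              mean_gap: 20.0, mean_service: 100.0, seed: 7)
  load_rng  = Random.new(seed)      # arrivals and service times
  route_rng = Random.new(seed + 1)  # routing decisions
  # One array of worker "free at" times per dyno.
  pools = Array.new(dynos) { Array.new(workers_per_dyno, 0.0) }
  clock = 0.0
  total_wait = 0.0

  requests.times do
    # Exponential gaps and service times (an assumption of this sketch).
    clock += -mean_gap * Math.log(1.0 - load_rng.rand)
    service = -mean_service * Math.log(1.0 - load_rng.rand)

    pool = pools[route_rng.rand(dynos)]                # router picks a dyno at random
    worker = pool.each_index.min_by { |i| pool[i] }    # first worker to become free
    start = [clock, pool[worker]].max
    total_wait += start - clock
    pool[worker] = start + service
  end

  total_wait / requests
end

single_wait = mean_wait(dynos: 6, workers_per_dyno: 1)
double_wait = mean_wait(dynos: 3, workers_per_dyno: 2)

puts format("6 single dynos: mean wait %.1f ms", single_wait)
puts format("3 double dynos: mean wait %.1f ms", double_wait)
```

Both configurations have six workers in total, so any difference comes from how requests are pooled, not from extra capacity.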

00:16:19.780 The challenge, however, arose when we needed to simulate and analyze how Heroku scaled its Dynos, given the ambiguity in the documentation.

00:16:40.080 Despite the uncertainty, we decided to run a series of tests on Heroku's infrastructure to determine if switching to double Dynos would enhance our application's performance.

00:16:58.360 By running parallel tests on both single and double Dynos, executing simultaneous CPU-intensive tasks, we gathered comparative data on run times.
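A test of this shape can be sketched with Ruby's standard `Benchmark` module. The CPU task, the iteration count, and the use of forked processes (chosen so that tasks can run on separate cores) are assumptions made for this example; the actual test programs used on Heroku are not described in the talk. The sketch requires a platform that supports `fork`.

```ruby
require "benchmark"

# A purely CPU-bound task: sum of squares below `iterations`.
def cpu_task(iterations)
  sum = 0
  iterations.times { |i| sum += i * i }
  sum
end

# Run `task_count` copies of the task in parallel processes and
# return the wall-clock time (seconds) until all of them finish.
def time_parallel(task_count, iterations)
  Benchmark.realtime do
    pids = Array.new(task_count) do
      fork do
        cpu_task(iterations)
        exit!(0)  # leave the child immediately, skipping at_exit handlers
      end
    end
    pids.each { |pid| Process.wait(pid) }
  end
end

[1, 2, 4].each do |tasks|
  seconds = time_parallel(tasks, 2_000_000)
  puts format("%d parallel task(s): %.3f s", tasks, seconds)
end
```

If the elapsed time stays flat as the number of parallel tasks grows, the machine is providing more CPU capacity; if it grows roughly in proportion, the tasks are sharing the same capacity.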

00:17:17.040 Consequently, our evaluations showed that the performance between single and double Dynos was largely comparable, sparking a further series of tests for confirmation.

00:17:34.830 We scaled this approach, running multiple CPU tasks with varying loads, ultimately concluding that there are no significant performance gains from simply switching the Dynos without considering the request load.

00:17:56.170 So, we confirmed that the type of scaling, whether horizontal or vertical, greatly influences performance output.

00:18:20.650 In summary, our simulations and experiments revealed valuable insights into navigating Dyno configurations on Heroku.

00:18:39.390 These experiences underscore the potential of simulations to derive effective benchmarking, allowing us to optimize application performance without costly and complicated real-world testing.

00:19:02.570 Thus, we can surmise that using simulation methods can significantly expedite performance decision-making processes in application scaling, yielding substantial savings in time and resources.

00:19:23.320 Thanks to our simulation work, we were able to devise flexible solutions easily adaptable to various application scenarios.

00:19:45.470 For any further queries, I am available for discussions. I welcome you to approach me for more insights on our simulations and findings.

00:20:06.690 If you have any questions, please feel free to ask, as I am here to offer support and guidance in your related projects.

00:20:27.520 In closing, let me extend my gratitude for your attention. I sincerely hope our time together has provided valuable insights into using simulation for performance analysis.

00:21:01.610 If you have any follow-up questions regarding the project, I'm more than willing to discuss those as well.

00:21:14.540 The project is currently not open source, as rights are still tied to the university. However, I am in the process of obtaining ownership to eventually make it open source.