From Stubbies to Longnecks

Geoffrey Giesemann • February 20, 2013 • Earth

In the talk titled "From Stubbies to Longnecks" presented at RubyConf AU 2013, Geoffrey Giesemann discusses the critical aspects of monitoring and optimizing application performance and scalability, particularly at realestate.com.au. He emphasizes that every application is unique in how it performs and scales, reinforcing the importance of tailored diagnostics and monitoring tools. Giesemann identifies three main types of bottlenecks that applications may encounter:

  • I/O Bottlenecks: Illustrated through a graph displaying I/O wait times on a database server, indicating performance degradation due to excessive read/write operations. He suggests upgrading the hardware as a common approach to alleviate such issues.
  • Processing Bottlenecks: A case is presented involving the generation of PDFs in a massive loop. The overhead of rendering a web page into a PDF thousands of times, starting a browser engine for each page, can significantly slow down application performance if not managed properly.
  • Artificial Bottlenecks: He shares an example of limitations on login attempts to prevent overload on the system, leveraging controlled bottlenecks to enhance stability.

In terms of performance testing, Giesemann explains:

- Consistent measurement and an understanding of the application’s characteristics are what make it possible to engineer around bottlenecks.
- Siege and JMeter are highlighted for generating load and analyzing application performance.
- New Relic is used for monitoring, with CloudWatch as an option for AWS users, to extract meaningful data from vast arrays of metrics.
- Static assets are optimized with HTTP caching strategies that minimize server requests and enhance user experience.

Throughout the presentation, Giesemann emphasizes that recognizing and addressing these bottlenecks is an ongoing process. It's crucial to regularly test, monitor, and refine application performance to adapt to changing traffic patterns and user behaviors. Ultimately, the takeaway is clear: optimizing performance involves continuous measurement, appropriate caching strategies, and an understanding of both user needs and server loads, ensuring that applications consistently meet user expectations.

From Stubbies to Longnecks - Finding and curing scaling bottlenecks

RubyConf AU 2013: http://www.rubyconf.org.au

Every application is different. Every application performs and scales in a different manner. What stays the same are the tools we use to monitor and diagnose our applications when they get sick or have a big night out.
In this talk I'll cover how realestate.com.au monitors and troubleshoots the performance and scalability of our Ruby and non-Ruby apps. I'll look at the tools we use to infer when/where things go wrong and several cases where things have gone wrong and how we've dug our way out of the hole.
Examples of areas covered include:
- Vertical scaling our way out of I/O pain
- Horizontal scaling our apps for throughput and availability
- Using HTTP and CDNs to avoid the reddit-effect
- Tools we use for establishing performance 'baselines'

00:00:08.559 Thanks, everyone. It is a pleasure to be here. I'm going to start with two confessions.
00:00:14.120 The first confession is that if you are a keen observer, you'll notice I lied about the title of this talk.
00:00:21.519 If you look here, we've got a regular size bottle, but down here that's actually a pint-sized bottle. For the purposes of, you know, writing things off as a tax expense, I will buy beer. I do enjoy drinking.
00:00:31.920 The second thing is that because this is the Kiwi-free track, I will insert my own little fact about Australia. Do we have anyone from Queensland in the room? Queensland represent!
00:00:44.760 So, a little fact about people from Queensland is that in Australia we call them 'Banana Benders'. Why do we call them that, you might ask? Well, as it turns out, all of Australia's bananas come from Queensland.
00:00:58.199 However, when they grow on the tree, they are actually dead straight. It is only in the harvesting process that the Queenslanders bend them a little, hence we call them 'Banana Benders'. I honestly did not know that.
00:01:10.560 Anyway, my name is Geoffrey, and I do programming at a company up the road called realestate.com.au. We've been based over in Richmond, just up the tail end of Victoria Street for a bit over five years.
00:01:26.679 Not everyone is looking for a house all the time, so some of you may not have used our website recently. But, we have actually been around for quite a bit of time. I think the site first started in about '96, running out of someone's garage.
00:01:43.760 You can see here we predate modern web technologies such as rounded corners, white backgrounds, and nice fonts. I'm kind of hoping that maybe in about five years, the vintage web look will come back and we can revert to a nicer, older version of the site.
00:01:56.880 Because we've been around for so long, there are some skeletons in the closet. Over time, we've built a lot of little products that hang off the site. Some of those products wind down, but they still make money, so we have to keep running them.
00:02:18.760 We have to build and maintain a lot of different bits and pieces. The good news is that we're pretty skilled at keeping control of this monolithic lot, and we’ve learned how to tweak and tune all these parts to create a combined user experience.
00:02:36.239 What does a bottleneck look like? I know we've seen a picture of one, but bear with me. Up here, I've got a sample of code from a Ruby gem called PDFKit. This is actually from one of our reporting applications, which is a bit of Rack middleware that essentially gets tacked onto the end of a request.
00:03:03.959 It gets a request coming through, renders a bit of HTML, and then this bit of code sits at the end and processes it into a PDF. You can see here in the middle we have some advanced regex for HTML parsing. Right in the middle, we run this program called wkhtmltopdf.
00:03:27.360 It takes our HTML, grabs all the JavaScript, grabs all the images, renders it, converts it into a PDF, and spits it out in the response body. If you're also keen, you can see it messes around with its HTTP headers.
00:03:40.520 Running this process once or twice may not be a big deal, but if you're in a giant loop hitting a web server to generate PDFs, like in a mass email campaign, it can get quite painful. Essentially, you're spinning up a browser 10,000 times, rendering one page, and shutting it down again, which is pretty bad.
00:04:08.800 The good news is that this bottleneck is one we can get around easily because it's confined to one server or one application process. We can run a massive pool of these processes in parallel to mitigate the problem.
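The "pool in parallel" idea can be sketched as below. This is an illustration, not code from the talk: `render_in_pool` is a hypothetical helper, and the block stands in for the expensive step (for example, shelling out to wkhtmltopdf). It uses threads for brevity; the talk describes a pool of processes, which follows the same shape.

```ruby
# Hypothetical helper: push jobs through a fixed-size pool of workers.
# The block is a stand-in for the expensive rendering step.
def render_in_pool(jobs, pool_size: 4, &block)
  queue = Queue.new
  jobs.each_with_index { |job, index| queue << [job, index] }
  results = Array.new(jobs.size)

  workers = Array.new(pool_size) do
    Thread.new do
      loop do
        begin
          job, index = queue.pop(true) # non-blocking; raises ThreadError when empty
        rescue ThreadError
          break
        end
        results[index] = block.call(job) # each worker writes to its own slot
      end
    end
  end

  workers.each(&:join)
  results
end

pages = %w[a.html b.html c.html]
pdfs = render_in_pool(pages, pool_size: 2) { |page| "PDF for #{page}" }
puts pdfs.inspect
```

Results come back in the same order as the input, regardless of which worker finished first.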
00:04:22.560 Another kind of bottleneck we've seen previously is the classic I/O bottleneck. This is a graph generated by Munin, a program that can be thrown on a server to generate a pile of statistics, and you can create pretty graphs.
00:04:42.959 As you can see here, we are looking at a database server. On the left, we're seeing load going up to 800%, indicating we have something with eight cores. Normally, this server is mostly idle, and down the bottom, we see some blue blips when the server is doing something.
00:05:01.519 However, we have this horrible pink hue creeping into the graph, indicating I/O wait time. Instead of being idle, the server is waiting for the disk to do something. If this were a one-off spike, we wouldn't be too concerned.
00:05:14.560 But if you observe these pink spikes over a week-long period, we’ve effectively driven off an I/O cliff. The server is now hamstrung because we are putting enough reads and writes into it that it can’t sustain the load, and everything talking to this database server invariably slows down.
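For readers who want to see where an I/O wait figure comes from, here is a minimal sketch that computes it from two samples of the aggregate `cpu` line in Linux's `/proc/stat`. It is not how Munin is implemented; the helper names and the sample numbers are made up for illustration. Unlike the graph described above, which sums per-core percentages up to 800%, this reports a share of total CPU time.

```ruby
# The first eight counters on the "cpu" line of /proc/stat, in order.
# Assumes a kernel that reports at least these eight fields.
CPU_FIELDS = %i[user nice system idle iowait irq softirq steal].freeze

# Parse one "cpu ..." line into a hash of cumulative tick counters.
def parse_cpu_line(line)
  values = line.split.drop(1).first(CPU_FIELDS.size).map(&:to_i)
  CPU_FIELDS.zip(values).to_h
end

# Percentage of elapsed CPU time spent waiting on I/O between two samples.
def iowait_percent(before, after)
  deltas = CPU_FIELDS.map { |field| [field, after[field] - before[field]] }.to_h
  total = deltas.values.sum
  return 0.0 if total.zero?

  (100.0 * deltas[:iowait] / total).round(1)
end

# Invented sample data: two readings taken some seconds apart.
earlier = parse_cpu_line("cpu 1000 0 500 8000 500 0 0 0 0 0")
later   = parse_cpu_line("cpu 1100 0 550 8200 1150 0 0 0 0 0")
puts iowait_percent(earlier, later)
```

On a live Linux host the two lines would come from reading `/proc/stat` twice with a pause in between.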
00:05:29.480 Now, if we had loads of spare time and a group of really smart people, we could write a complex database-sharding program, but we typically avoid that due to resource constraints. Instead, we just upgrade the server the database runs on, and we still actively monitor it today.
00:05:55.679 It's worth noting that bottlenecks don't necessarily have to be a bad thing. Sometimes we can introduce an artificial bottleneck into a feature to make our lives easier. For example, the sign-in for the backend of realestate.com.au looks great.
00:06:08.720 However, if I attempt to log in too many times, I get locked out of my account for 15 minutes. You might think an Australian listing website would only be popular locally, but it turns out our login page is quite popular in Romania, where users frequently attempt to log in.
00:06:28.080 To handle this, we set it up so if someone incorrectly enters their password 3 to 5 times, they cannot log in again for another 15 minutes. This mechanism helps manage the load on our system.
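A lockout like this can be sketched in a few lines. This is an illustration under stated assumptions, not the site's implementation: `LoginThrottle` is a hypothetical class, state is kept in memory, and the limits are parameters (the talk mentions 3 to 5 attempts and 15 minutes). The clock is injectable so the behaviour can be checked without waiting.

```ruby
# Hypothetical in-memory lockout: after `max_attempts` consecutive failures
# an account is locked for `lockout_seconds`.
class LoginThrottle
  def initialize(max_attempts: 5, lockout_seconds: 15 * 60, clock: -> { Time.now })
    @max_attempts = max_attempts
    @lockout_seconds = lockout_seconds
    @clock = clock
    @failures = Hash.new(0)
    @locked_until = {}
  end

  # True while the lockout window for this account has not yet passed.
  def locked?(username)
    unlock_at = @locked_until[username]
    !unlock_at.nil? && @clock.call < unlock_at
  end

  # Count a failed attempt; start a lockout once the limit is reached.
  def record_failure(username)
    return if locked?(username)

    @failures[username] += 1
    return if @failures[username] < @max_attempts

    @locked_until[username] = @clock.call + @lockout_seconds
    @failures[username] = 0
  end

  # A successful login clears the account's failure history.
  def record_success(username)
    @failures.delete(username)
    @locked_until.delete(username)
  end
end

throttle = LoginThrottle.new(max_attempts: 3)
3.times { throttle.record_failure("agent@example.com") }
puts throttle.locked?("agent@example.com")
```

A real deployment would keep this state somewhere shared between application servers rather than in one process's memory.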
00:06:47.679 Those are three different kinds of bottlenecks, and one thing to note is that all applications have some sort of bottleneck somewhere. You may not be able to see it at all times, and sometimes they are not easy to identify. If you do find it and remove it, you're not creating a bottleneck-free application; you are merely exposing the next bottleneck that was hidden beneath the previous one.
00:07:03.560 However, as in the previous examples, if we understand our applications and their performance characteristics, we can engineer around these bottlenecks. So, how do we find bottlenecks? It would be fantastic if we had a magical testing wand to wave around our application and say, 'There's the problem,' but unfortunately, no such thing exists.
00:07:36.919 Now, I know what you're thinking—if I run Rails, I can just generate a performance test and it will identify my problems. While Rails performance tests are nice and easy to create, they mainly gather metrics based on your development environment. If your application is complex, with multiple databases and various services, you'll need a more rigorous approach to testing.
00:08:07.359 Performance testing is analogous to testing a car. Sometimes we may have a specific goal in mind, racing around the track while making adjustments, asking ourselves if we’ve achieved the right performance. Other times we simply kick the tires and see how it feels.
00:08:24.880 Performance is a relative measurement. It’s not about being absolutely fast; it’s more about whether something is fast enough for our needs. A great example is showing an iOS developer an app that you think performs well: they may see issues like dropped frames that you would never notice.
00:08:41.160 Performance testing is tough; it’s not something you do every day unless you’re making significant changes to your application. Alternatively, you can launch it and let your users find the issues for you. Good relationships with users can help, but be ready with a nice 500 error page!
00:09:12.559 A key part of performance testing is generating load. It’s challenging to assess a system when it’s idle. I’ll now share some tools we use at REA for generating load on our systems.
00:09:38.760 One of our favorites is Siege. It’s a simple tool you can set up on your machine. You call it on a list of URLs and specify a time window to run it with a set concurrency level to generate load, and it returns handy measurements.
00:09:53.680 The caveat is that it is not a browser, so you’re not testing your JavaScript or static assets. Because it lacks the concept of cookies or sessions, it doesn’t simulate natural user interactions. However, if you record an access log on a specific server, you can turn it into a usable load test.
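Turning an access log into a URL list for Siege might look like the sketch below. It assumes Common Log Format lines and keeps only GET requests; `siege_urls`, the staging host, and the sample log lines are invented for illustration. The resulting file can be passed to Siege with its `-f` option.

```ruby
# Matches the request part of a Common Log Format line, e.g. "GET /path HTTP/1.1".
REQUEST_PATTERN = %r{"GET (\S+) HTTP/[\d.]+"}

# Build full URLs for every GET request in the log, preserving order and
# duplicates so the replay keeps roughly the same traffic mix.
def siege_urls(log_lines, base_url)
  log_lines.map do |line|
    match = REQUEST_PATTERN.match(line)
    "#{base_url}#{match[1]}" if match
  end.compact
end

# Invented sample log lines.
sample_log = [
  '203.0.113.7 - - [20/Feb/2013:10:00:00 +1100] "GET /buy/in-richmond HTTP/1.1" 200 5120',
  '203.0.113.9 - - [20/Feb/2013:10:00:01 +1100] "POST /login HTTP/1.1" 302 0',
  '203.0.113.7 - - [20/Feb/2013:10:00:02 +1100] "GET /rent/in-richmond HTTP/1.1" 200 4870'
]

urls = siege_urls(sample_log, "http://staging.example.com")
puts urls
# Written to urls.txt, this could then be replayed with something like:
#   siege -f urls.txt -c 25 -t 1M
```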
00:10:06.120 We heavily utilize JMeter for load testing HTTP applications. It can also load test SOAP, LDAP, and mail systems. Similar to Siege, JMeter is easy to set up. You create and save test plans, though the tool can feel opaque, especially since the saved plans are mostly XML.
00:10:23.720 However, we find it useful because it generates readable reports with response times and latency, and it simulates user-like behavior better than Siege. If you do use JMeter, I recommend looking into ruby-jmeter (formerly gridinit-jmeter), a Ruby DSL for writing JMeter test plans that wraps that power in an easier-to-read format.
00:10:43.720 WebPageTest is another fantastic tool worth considering, particularly if you’re looking for performance testing across various browsers and conditions. It acts like a catalog of browsers with specific versions and locations for your tests. You can even run it locally if needed.
00:11:02.919 After generating load, the next step is analyzing the data produced and monitoring the results. You want to see whether you met your performance targets and, if not, drill down to identify the issues.
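Checking a load-test run against a target usually means looking at percentiles rather than averages, since one slow outlier can hide behind a good mean. The sketch below uses the nearest-rank method; the function names, the sample timings, and the 200 ms target are assumptions for illustration, not figures from the talk.

```ruby
# Nearest-rank percentile of a list of response times.
def percentile(samples, pct)
  raise ArgumentError, "no samples" if samples.empty?

  sorted = samples.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# True when the given percentile is at or under the target.
def meets_target?(samples_ms, pct:, target_ms:)
  percentile(samples_ms, pct) <= target_ms
end

# Invented response times in milliseconds, including one slow outlier.
timings = [120, 95, 110, 480, 105, 98, 130, 101, 99, 102]

puts percentile(timings, 50)                            # median
puts meets_target?(timings, pct: 90, target_ms: 200)
puts meets_target?(timings, pct: 95, target_ms: 200)
```

With this data the 90th percentile passes the target while the 95th does not, which is the kind of result that prompts drilling down into the slow request.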
00:11:18.120 At REA, we heavily rely on New Relic for monitoring. It's exceptionally valuable for tracking performance. You just need to install the gem, input your API key, and you get working monitoring for Ruby, Java, and PHP apps.
00:11:37.240 While the free version retains the last 30 minutes of data and is useful in pre-production environments, the full version may incur costs especially if you’re on Heroku. Alternatively, for AWS users, Amazon CloudWatch is free and can monitor various AWS resources.
00:11:52.560 However, CloudWatch can be opaque, presenting lots of stats but making it difficult to pull them together into actionable insights. For that, we rely on Graphite, a robust time-series database for numeric data that helps us generate useful graphs from our metrics.
00:12:12.160 Graphite doesn’t collect data by itself, so we use tools like Munin or CloudWatch to gather metrics. A local Melbourne shop, 99designs, wrote a tool called vacuumetrix, designed for vacuuming metrics from various systems into Graphite.
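Feeding Graphite is simple because its Carbon listener accepts a plaintext protocol: one line per data point, in the form `<metric path> <value> <unix timestamp>`. The sketch below shows that format; the helper names, the metric path, and the host are hypothetical, and this is not how the tools named above are implemented.

```ruby
require "socket"

# Format one data point for Graphite's plaintext protocol.
def graphite_line(path, value, time = Time.now)
  "#{path} #{value} #{time.to_i}\n"
end

# Send data points to a Carbon listener. 2003 is Carbon's default
# plaintext port; the host is whatever your own setup uses.
def send_to_graphite(lines, host:, port: 2003)
  TCPSocket.open(host, port) do |socket|
    lines.each { |line| socket.write(line) }
  end
end

# Hypothetical metric name and value.
line = graphite_line("servers.db01.cpu.iowait", 65.0, Time.at(1_361_318_400))
puts line
# send_to_graphite([line], host: "graphite.example.com")  # needs a reachable Carbon host
```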
00:12:38.809 Another critical tool in performance monitoring is log aggregation. If you generate errors on your servers, you want to log stack traces and errors locally on your servers. If you have multiple servers and need a timeline overview, you’ll want to use a log aggregation tool.
00:12:59.920 Splunk is a prominent option in this space which provides extensive capabilities for log aggregation. However, for those seeking free alternatives, software like Graylog2, Logstash, and Kibana can be easier to set up and may provide similar functionality.
00:13:23.120 Looking at our performance testing process, we've found that optimizing static assets is essential. Once we render HTML, the browser must fetch a variety of assets, leading to significant user wait times unless optimized.
00:13:43.760 Using Chrome's Dev Tools, I've compared the initial request and a subsequent one visually. The first download took around 7 seconds, while the second completed in under 2 seconds due to cached content.
00:14:03.320 Over 75% of requests on our site are for static assets, which makes optimizing their delivery significantly impactful. The HTTP caching strategy consists of expiration and validation to minimize server requests.
00:14:26.720 With expiration, we mark an HTTP response as public and set Cache-Control headers stating its lifetime. Shared caches can then store content that is not user-specific, so the origin serves just one request even when many users fetch the same asset.
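Expiration comes down to one response header. The sketch below uses the Rack calling convention (an object with `call(env)` returning status, headers, and body) without requiring the rack gem; `PublicCaching`, `ASSET_APP`, and the one-year lifetime are assumptions chosen for illustration.

```ruby
ONE_YEAR = 365 * 24 * 60 * 60

# Stand-in for a static asset server, following the Rack calling convention.
ASSET_APP = lambda do |_env|
  [200, { "content-type" => "text/css" }, ["body { margin: 0 }"]]
end

# Hypothetical middleware that marks successful responses as storable by
# shared caches (CDNs, proxies) for `max_age` seconds.
class PublicCaching
  def initialize(app, max_age:)
    @app = app
    @max_age = max_age
  end

  def call(env)
    status, headers, body = @app.call(env)
    headers = headers.merge("cache-control" => "public, max-age=#{@max_age}") if status == 200
    [status, headers, body]
  end
end

app = PublicCaching.new(ASSET_APP, max_age: ONE_YEAR)
_status, headers, _body = app.call({})
puts headers["cache-control"] # public, max-age=31536000
```

`public` is what allows a shared cache to reuse the response across users, so it should only be applied to content that is the same for everyone.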
00:14:49.160 HTTP validation lets a client check whether a cached resource is still valid. The client sends a conditional request, and the server can answer without resending the body, which reduces unnecessary data transfer.
00:15:10.840 Rails has built-in support for this kind of conditional response. If the content hasn't been modified, the server can skip the expensive rendering work entirely.
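The validation flow can be sketched in plain Ruby as below. In Rails the same job is done by controller helpers such as `stale?` and `fresh_when`; this standalone version is only an illustration. `conditional_response` is a hypothetical function, `version` is assumed to be any cheap fingerprint of the content, and only a single exact `If-None-Match` value is handled.

```ruby
require "digest/md5"

# Answer 304 Not Modified when the client's cached copy is current;
# otherwise do the expensive render and return the full response.
# `render` is a callable standing in for page rendering.
def conditional_response(env, version:, render:)
  etag = %("#{Digest::MD5.hexdigest(version.to_s)}")

  if env["HTTP_IF_NONE_MATCH"] == etag
    [304, { "etag" => etag }, []] # no body, no rendering
  else
    [200, { "etag" => etag, "content-type" => "text/html" }, [render.call]]
  end
end

render_count = 0
render = lambda do
  render_count += 1
  "<h1>Listing</h1>"
end

# First visit: full response, carrying an ETag.
status, headers, = conditional_response({}, version: "listing-42-v7", render: render)
puts status

# Repeat visit: the browser sends the ETag back and rendering is skipped.
env = { "HTTP_IF_NONE_MATCH" => headers["etag"] }
status, = conditional_response(env, version: "listing-42-v7", render: render)
puts status
puts render_count
```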
00:15:36.080 Notably, we heavily utilize image thumbnails, especially since a listing can contain anywhere between 1 and 26 images. To streamline, we generate these on the fly and cache them over HTTP.
00:15:58.520 We spread requests across several hosts to maximize performance. Our thumbnail generation workflow uses a content hash to uniquely identify each image, and the thumbnails are served with caching headers.
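One way to combine a content hash with several hosts is sketched below. It is a guess at the general shape, not the site's actual scheme: the host names, URL layout, and `thumbnail_url` helper are all hypothetical. Because the hash is derived from the image bytes, a changed image gets a new URL, which is what makes long cache lifetimes safe.

```ruby
require "digest/sha1"

# Hypothetical image hosts to spread requests across.
THUMBNAIL_HOSTS = %w[i1.example.com i2.example.com i3.example.com i4.example.com].freeze

# Build a thumbnail URL whose path is derived from the image content.
# The same bytes always map to the same host and path, so caches stay warm.
def thumbnail_url(image_bytes, width:, height:)
  content_hash = Digest::SHA1.hexdigest(image_bytes)
  host_index = content_hash[0, 8].to_i(16) % THUMBNAIL_HOSTS.size
  "http://#{THUMBNAIL_HOSTS[host_index]}/#{width}x#{height}/#{content_hash}.jpg"
end

puts thumbnail_url("fake image bytes", width: 300, height: 200)
```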
00:16:18.920 We store all generated images in Amazon S3. Initially, bandwidth limitations posed challenges, but now with the recent introduction of a Sydney datacenter, we efficiently upload images to nearby instances.
00:16:41.920 Each S3 bucket can load hundreds of images effortlessly. Alongside this, we leverage CDNs which allow our system to withstand high traffic spikes effectively.
00:17:04.600 While S3 is our primary storage, CDNs improve the delivery of assets to users, and during server maintenance, they allow for seamless content serving from backup datacenters.
00:17:28.920 Finally, for those running with frameworks like Rails, utilizing gems for S3 storage can significantly streamline asset management while maintaining an effective caching strategy that reduces server load.
00:17:51.560 The key takeaway is that performance relies heavily on measurement. You need to measure consistently and implement effective HTTP caching strategies to optimize performance, regardless of your application's architecture.
00:18:30.520 Any questions? Please feel free to ask.
00:18:38.760 We're often testing various services and combining them into a comprehensive view of performance.
00:18:54.920 Typically, after testing, we analyze from a single point of view for performance metrics. If we see abnormal spikes, we will drill down into detailed metrics.
00:19:07.480 Sometimes we rely on tools like New Relic for deeper performance diagnostics to help isolate specific requests.
00:19:28.080 We ensure our performance impact tracking is linked back to recent changes in our application.
00:19:38.520 Testing isn't always straightforward, especially concerning complex applications interacting with multiple services.
00:19:55.160 Therefore, we take measures to minimize performance impact during testing by using pre-production systems.
00:20:15.760 Our database strategy involves sharding by type rather than spreading our data over various instances.
00:20:37.120 Though there are complexities, relying on a single database for reads and writes can enable us to maximize performance.
00:20:54.240 Thank you very much for your attention, and feel free to reach out with additional questions or comments.