00:00:08.559
Thanks, everyone. It is a pleasure to be here. I'm going to start with two confessions.
00:00:14.120
The first confession is that if you are a keen observer, you'll notice I lied about the title of this talk.
00:00:21.519
If you look here, we've got a regular size bottle, but down here that's actually a pint-sized bottle. For the purposes of, you know, writing things off as a tax expense, I will buy beer. I do enjoy drinking.
00:00:31.920
The second thing is that because this is the Kiwi-free track, I will insert my own little fact about Australia. Do we have anyone from Queensland in the room? Queensland represent!
00:00:44.760
So, a little fact about people from Queensland is that in Australia we call them 'Banana Benders'. Why do we call them that, you might ask? Well, as it turns out, all of Australia's bananas come from Queensland.
00:00:58.199
However, when they grow on the tree, they are actually dead straight. It is only in the harvesting process that the Queenslanders bend them a little, hence we call them 'Banana Benders'. I honestly did not know that.
00:01:10.560
Anyway, my name is Geoffrey, and I do programming at a company up the road called realestate.com.au. We've been based over in Richmond, just up the tail end of Victoria Street for a bit over five years.
00:01:26.679
Not everyone is looking for a house all the time, so some of you may not have used our website recently. But we have actually been around for quite a while. I think the site first started in about '96, running out of someone's garage.
00:01:43.760
You can see here we predate modern web technologies such as rounded corners, white backgrounds, and nice fonts. I'm kind of hoping that maybe in about five years, the vintage web look will come back and we can revert to a nicer, older version of the site.
00:01:56.880
Because we've been around for so long, there are some skeletons in the closet. Over time, we've built a lot of little products that hang off the site. Some of those products wind down, but they still make money, so we have to keep running them.
00:02:18.760
We have to build and maintain a lot of different bits and pieces. The good news is that we're pretty skilled at keeping control of this monolithic lot, and we’ve learned how to tweak and tune all these parts to create a combined user experience.
00:02:36.239
What does a bottleneck look like? I know we've seen a picture of one, but bear with me. Up here, I've got a sample of code from a Ruby gem called PDFKit. This is actually from one of our reporting applications, which is a bit of Rack middleware that essentially gets tacked onto the end of a request.
00:03:03.959
It gets a request coming through, renders a bit of HTML, and then this bit of code sits at the end and processes it into a PDF. You can see here in the middle we have some advanced regex for HTML parsing. Right in the middle, we run this program called wkhtmltopdf (WebKit HTML to PDF).
00:03:27.360
It takes our HTML, grabs all the JavaScript, grabs all the images, renders it, converts it into a PDF, and spits it out in the response body. If you're also keen, you can see it messes around with its HTTP headers.
00:03:40.520
Running this process once or twice may not be a big deal, but if you're in a giant loop hitting a web server to generate PDFs, like in a mass email campaign, it can get quite painful. Essentially, you're spinning up a browser 10,000 times, rendering one page, and shutting it down again, which is pretty bad.
00:04:08.800
The good news is that this bottleneck is one we can get around easily because it's confined to one server or one application process. We can run a massive pool of these processes in parallel to mitigate the problem.
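As a rough illustration of running the slow step through a pool, here is a minimal Python sketch. It is not the PDFKit middleware from the slide: `render_pdf` is a hypothetical stand-in for the expensive browser-based conversion, and the pool size of 4 is an arbitrary assumption. Threads are used here because the real work would happen in an external process.

```python
from concurrent.futures import ThreadPoolExecutor


def render_pdf(html):
    # Hypothetical stand-in for the expensive step (spinning up a
    # renderer and converting one page); returns fake PDF bytes.
    return b"%PDF-stub:" + html.encode("utf-8")


def render_all(pages, workers=4):
    # Fan the pages out over a fixed-size pool so several renders
    # run at once instead of one after another.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input.
        return list(pool.map(render_pdf, pages))


pdfs = render_all(["<h1>Report %d</h1>" % i for i in range(10)])
```

The pool doesn't make any single render faster; it only keeps one slow render from holding up the rest.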
00:04:22.560
Another kind of bottleneck we've seen previously is the classic I/O bottleneck. This is a graph generated by Munin, a program that can be thrown on a server to generate a pile of statistics, and you can create pretty graphs.
00:04:42.959
As you can see here, we are looking at a database server. On the left, we're seeing load going up to 800%, indicating we have something with eight cores. Normally, this server is mostly idle, and down the bottom, we see some blue blips when the server is doing something.
00:05:01.519
However, we have this horrible pink hue creeping into the graph, indicating I/O wait time. Instead of being idle, the server is waiting for the disk to do something. If this were a one-off spike, we wouldn't be too concerned.
00:05:14.560
But if you observe these pink spikes over a week-long period, we’ve effectively driven off an I/O cliff. The server is now hamstrung because we are putting enough reads and writes into it that it can’t sustain the load, and everything talking to this database server invariably slows down.
00:05:29.480
Now, if we had loads of spare time and a group of really smart people, we could write a complex database-sharding program, but we typically avoid that due to resource constraints. Instead, we just upgrade the server the database runs on, and it's a problem we still actively monitor today.
00:05:55.679
It's worth noting that bottlenecks don't necessarily have to be a bad thing. Sometimes we can introduce an artificial bottleneck into a feature to make our lives easier. For example, the sign-in for the backend of realestate.com.au looks great.
00:06:08.720
However, imagine if I attempt to log in too many times; I get locked out of my account for 15 minutes. You might think an Australian listing website would be really popular locally, but it turns out our login page is quite popular in Romania, where users frequently attempt to log in.
00:06:28.080
To handle this, we set it up so if someone incorrectly enters their password 3 to 5 times, they cannot log in again for another 15 minutes. This mechanism helps manage the load on our system.
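A lockout like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual sign-in code: the threshold of 5 attempts, the in-memory dictionary, and the `LoginThrottle` name are all hypothetical, and a real system would keep this state somewhere shared between servers.

```python
import time

MAX_ATTEMPTS = 5            # assumed threshold ("3 to 5 times" in the talk)
LOCKOUT_SECONDS = 15 * 60   # 15-minute lockout


class LoginThrottle:
    def __init__(self, clock=time.time):
        self._clock = clock      # injectable clock, handy for testing
        self._failures = {}      # username -> (failure count, locked until)

    def is_locked(self, username):
        _, locked_until = self._failures.get(username, (0, 0.0))
        return self._clock() < locked_until

    def record_failure(self, username):
        count, locked_until = self._failures.get(username, (0, 0.0))
        count += 1
        if count >= MAX_ATTEMPTS:
            # Too many bad passwords: reset the count and start the lockout.
            self._failures[username] = (0, self._clock() + LOCKOUT_SECONDS)
        else:
            self._failures[username] = (count, locked_until)

    def record_success(self, username):
        # A good login clears any failure history.
        self._failures.pop(username, None)
```

The sign-in handler would check `is_locked` before verifying the password, so a locked account costs almost nothing to reject.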
00:06:47.679
Those are three different kinds of bottlenecks, and one thing to note is that all applications have some sort of bottleneck somewhere. You may not be able to see it at all times, and sometimes they are not easy to identify. If you do find it and remove it, you're not creating a bottleneck-free application; you are merely exposing the next bottleneck that was hidden beneath the previous one.
00:07:03.560
However, as in the previous examples, if we understand our applications and their performance characteristics, we can engineer around these bottlenecks. So, how do we find bottlenecks? It would be fantastic if we had a magical testing wand to wave around our application and say, 'There's the problem,' but unfortunately, no such thing exists.
00:07:36.919
Now, I know what you're thinking—if I run Rails, I can just generate a performance test and it will identify my problems. While Rails performance tests are nice and easy to create, they mainly gather metrics based on your development environment. If your application is complex, with multiple databases and various services, you'll need a more rigorous approach to testing.
00:08:07.359
Performance testing is analogous to testing a car. Sometimes we may have a specific goal in mind, racing around the track while making adjustments, asking ourselves if we’ve achieved the right performance. Other times we simply kick the tires and see how it feels.
00:08:24.880
Performance is a relative measurement. It’s not about being absolutely fast; it’s more about whether something is fast enough for our needs. A great example is showing an iOS developer an app that you think performs well: they may spot issues like dropped frames that you wouldn't notice.
00:08:41.160
Performance testing is tough; it’s not something you do every day unless you’re making significant changes to your application. Alternatively, you can launch it and let your users find the issues for you. Good relationships with users can help, but be ready with a nice 500 error page!
00:09:12.559
A key part of performance testing is generating load. It’s challenging to assess a system when it’s idle. I’ll now share some tools we use at REA for generating load on our systems.
00:09:38.760
One of our favorites is Siege. It’s a simple tool you can set up on your machine. You call it on a list of URLs and specify a time window to run it with a set concurrency level to generate load, and it returns handy measurements.
00:09:53.680
The caveat is that it does not serve as a browser; you’re not testing your JavaScript or static assets. Because it lacks the concept of cookies or sessions, it doesn’t simulate natural user interactions. However, if you record an access log on a specific server, you could modify it into a usable load test.
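Turning an access log into a URL list might look roughly like the Python sketch below. It assumes the log is in the common Apache/nginx format with a quoted request line, keeps only GET requests, and uses a made-up `staging.example.test` host as the target; all of those are assumptions for illustration.

```python
import re

# Matches the quoted request line of a GET, e.g. "GET /path HTTP/1.1"
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/[\d.]+"')


def urls_from_access_log(lines, base_url):
    # Keep only GET requests and point them at the server under test.
    urls = []
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            urls.append(base_url.rstrip("/") + match.group(1))
    return urls


sample = [
    '10.0.0.1 - - [01/Jan/2013:10:00:00 +1100] "GET /buy/richmond HTTP/1.1" 200 5120',
    '10.0.0.2 - - [01/Jan/2013:10:00:01 +1100] "POST /login HTTP/1.1" 302 0',
    '10.0.0.3 - - [01/Jan/2013:10:00:02 +1100] "GET /rent?page=2 HTTP/1.1" 200 4096',
]
urls = urls_from_access_log(sample, "http://staging.example.test")
```

Written out one URL per line, a list like this can be handed to Siege with its `-f` option, alongside `-c` for concurrency and `-t` for the time window.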
00:10:06.120
We heavily utilize JMeter for load testing HTTP applications. It can also load test SOAP, LDAP, and mail systems. Similar to Siege, JMeter is easy to set up. You create and save test plans, though the UI can feel opaque, and the saved plans are mostly XML.
00:10:23.720
However, we find it useful because it generates readable reports with response times and latency, better simulating user-like behavior. If you don't use JMeter, I recommend looking into Grape: a Ruby DSL for writing JMeter configurations that can encapsulate the power in an easier-to-read format.
00:10:43.720
WebPageTest is another fantastic tool worth considering, particularly if you’re looking for performance testing across various browsers and conditions. It acts like a catalog of browsers with specific versions and locations for your tests. You can even run it locally if needed.
00:11:02.919
Once you're generating load, the next step is analyzing the data produced and monitoring the results. You want to see whether you met your performance targets and, if you didn't, drill down to identify the issues.
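One common way to phrase a performance target is as a percentile of response times, for example "90% of requests finish within 200 ms". The Python sketch below checks a set of samples against such a target using the nearest-rank method; the sample numbers and the 200 ms figure are made up for illustration.

```python
import math


def percentile(samples, pct):
    # Nearest-rank percentile: the smallest sample such that at least
    # pct percent of all samples are less than or equal to it.
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_target(samples_ms, pct, target_ms):
    return percentile(samples_ms, pct) <= target_ms


# Hypothetical response times in milliseconds from one test run.
samples = [120, 80, 95, 400, 110, 105, 90, 130, 100, 115]
p90 = percentile(samples, 90)
ok = meets_target(samples, 90, 200)
```

A percentile is usually more telling than an average here, because a single 400 ms outlier barely moves the 90th percentile but would drag a mean upwards.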
00:11:18.120
At REA, we heavily rely on New Relic for monitoring. It's exceptionally valuable for tracking performance. You just need to install the gem, input your API key, and you get working monitoring for Ruby, Java, and PHP apps.
00:11:37.240
While the free version retains the last 30 minutes of data and is useful in pre-production environments, the full version may incur costs especially if you’re on Heroku. Alternatively, for AWS users, Amazon CloudWatch is free and can monitor various AWS resources.
00:11:52.560
However, CloudWatch can be opaque, presenting lots of stats but making it difficult to pull them together into actionable insights. For that, we rely on Graphite, a robust time-series database for numeric data that helps us generate useful graphs from our metrics.
00:12:12.160
Graphite doesn’t collect data by itself, so we use tools like Munin or CloudWatch to gather metrics. A local Melbourne shop, 99designs, wrote a tool called vacuumetrix, designed for vacuuming metrics from various systems into Graphite.
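Feeding Graphite is simple because its plaintext protocol is just one line per data point: a metric path, a value, and a Unix timestamp. Here is a minimal Python sketch of that; the metric name `rea.web01.response_ms` is a made-up example, and port 2003 is Carbon's default plaintext port, so adjust it if your setup differs.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    # Plaintext protocol: "<metric path> <value> <unix timestamp>\n"
    ts = int(time.time() if timestamp is None else timestamp)
    return "%s %s %d\n" % (path, value, ts)


def send_metric(host, line, port=2003):
    # Opens a TCP connection to the Carbon listener and sends one line.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


line = graphite_line("rea.web01.response_ms", 123.5, 1357000000)
```

Collectors that push into Graphite are, at heart, producing lines like this in bulk.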
00:12:38.809
Another critical tool in performance monitoring is log aggregation. When something goes wrong, each server logs its errors and stack traces locally. If you have multiple servers and need a timeline overview, you’ll want a log aggregation tool to pull those logs together.
00:12:59.920
Splunk is a prominent option in this space which provides extensive capabilities for log aggregation. However, for those seeking free alternatives, software like Graylog2, Logstash, and Kibana can be easier to set up and may provide similar functionality.
00:13:23.120
Looking at our performance testing process, we've found that optimizing static assets is essential. Once we render HTML, the browser must fetch a variety of assets, leading to significant user wait times unless optimized.
00:13:43.760
Using Chrome's Dev Tools, I've compared the initial request and a subsequent one visually. The first download took around 7 seconds, while the second completed in under 2 seconds due to cached content.
00:14:03.320
Over 75% of requests on our site are for static assets, which makes optimizing their delivery significantly impactful. The HTTP caching strategy consists of expiration and validation to minimize server requests.
00:14:26.720
With expiration, we mark an HTTP entity as public and set Cache-Control headers stating its lifetime. Content that isn't user-specific can then be held in shared caches, which means just one request reaches our servers even when many users fetch the same asset.
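In header terms, expiration comes down to a single `Cache-Control` value. The Python sketch below builds one; the one-day lifetime is an arbitrary example, and the `cache_headers` helper is hypothetical, not part of any framework.

```python
def cache_headers(max_age_seconds, shared=True):
    # "public" lets shared caches (proxies, CDNs) store the response;
    # "private" restricts it to the individual user's browser cache.
    scope = "public" if shared else "private"
    return {"Cache-Control": "%s, max-age=%d" % (scope, max_age_seconds)}


# A static asset that any user may share, cached for one day.
asset_headers = cache_headers(24 * 60 * 60)

# A user-specific page: cacheable only by that user's browser.
account_headers = cache_headers(60, shared=False)
```

Until `max-age` runs out, a cache can answer from its stored copy without contacting the origin at all.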
00:14:49.160
HTTP validation lets a client check whether a cached resource is still valid. The client sends a conditional request, and if nothing has changed the server replies without resending the body, which cuts unnecessary data transfer.
00:15:10.840
Rails has built-in support for rendering content with cache control. Its conditional responses let the server skip the expensive processing if the content hasn't been modified.
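The idea behind a conditional response can be sketched in Python, independent of any framework. This is a simplified illustration, not how Rails implements it: the `version_key` (something cheap to compute, such as a record id plus its last-modified time) is an assumption, and a real `If-None-Match` header can also carry several tags or weak validators, which this ignores.

```python
import hashlib


def conditional_response(request_headers, version_key, render):
    # Derive the validator from a cheap version key rather than from
    # the rendered body, so a match lets us skip rendering entirely.
    etag = '"' + hashlib.sha256(version_key.encode("utf-8")).hexdigest() + '"'
    if request_headers.get("If-None-Match") == etag:
        # The client's copy is still current: no body, no rendering.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, render()


calls = []


def render():
    calls.append(1)  # counts how often the expensive step runs
    return b"<html>listing</html>"


key = "listing-42:2013-01-01T10:00:00"
status, headers, body = conditional_response({}, key, render)
status2, _, body2 = conditional_response(
    {"If-None-Match": headers["ETag"]}, key, render
)
```

The second call returns a 304 and never calls `render`, which is where the saving comes from.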
00:15:36.080
Notably, we heavily utilize image thumbnails, especially since a listing can contain anywhere between 1 and 26 images. To streamline, we generate these on the fly and cache them over HTTP.
00:15:58.520
We ensure efficient load distribution of requests across various hosts to maximize performance. Our thumbnail generation workflow involves a content hash to uniquely identify images with caching headers.
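Here is one way content-hashed thumbnail URLs spread across hosts could look, as a Python sketch. The host names, the URL layout, and the choice of SHA-1 and CRC32 are all assumptions for illustration; the talk doesn't spell out the real scheme.

```python
import hashlib
import zlib

# Hypothetical image hosts; several hosts let the browser fetch more
# images in parallel.
HOSTS = ["i1.example.test", "i2.example.test", "i3.example.test", "i4.example.test"]


def thumbnail_url(image_bytes, width, height):
    # The content hash names the image: if the bytes change, so does
    # the URL, which makes it safe to cache each URL for a long time.
    digest = hashlib.sha1(image_bytes).hexdigest()
    path = "/%dx%d/%s.jpg" % (width, height, digest)
    # Pick the host deterministically from the path, so the same
    # thumbnail always maps to the same host and stays cacheable.
    host = HOSTS[zlib.crc32(path.encode("ascii")) % len(HOSTS)]
    return "http://%s%s" % (host, path)


url_a = thumbnail_url(b"front-of-house", 300, 200)
url_b = thumbnail_url(b"backyard", 300, 200)
```

Choosing the host from the path rather than at random matters: a random choice would give the same image several URLs and defeat the cache.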
00:16:18.920
We store all generated images in Amazon S3. Initially, bandwidth limitations posed challenges, but now with the recent introduction of a Sydney datacenter, we efficiently upload images to nearby instances.
00:16:41.920
Each S3 bucket can load hundreds of images effortlessly. Alongside this, we leverage CDNs which allow our system to withstand high traffic spikes effectively.
00:17:04.600
While S3 is our primary storage, CDNs improve the delivery of assets to users, and during server maintenance, they allow for seamless content serving from backup datacenters.
00:17:28.920
Finally, for those running with frameworks like Rails, utilizing gems for S3 storage can significantly streamline asset management while maintaining an effective caching strategy that reduces server load.
00:17:51.560
The key takeaway is that performance relies heavily on measurement. You need to measure consistently and implement effective HTTP caching strategies to optimize performance, regardless of your application's architecture.
00:18:30.520
Any questions? Please feel free to ask.
00:18:38.760
We're often testing various services and combining them into a comprehensive view of performance.
00:18:54.920
Typically, after testing, we analyze from a single point of view for performance metrics. If we see abnormal spikes, we will drill down into detailed metrics.
00:19:07.480
Sometimes we rely on tools like New Relic for deeper performance diagnostics to help isolate specific requests.
00:19:28.080
We ensure our performance impact tracking is linked back to recent changes in our application.
00:19:38.520
Testing isn't always straightforward, especially concerning complex applications interacting with multiple services.
00:19:55.160
Therefore, we take measures to minimize performance impact during testing by using pre-production systems.
00:20:15.760
Our database strategy involves sharding by type rather than spreading our data over various instances.
00:20:37.120
Though there are complexities, relying on a single database for reads and writes can enable us to maximize performance.
00:20:54.240
Thank you very much for your attention, and feel free to reach out with additional questions or comments.