Facepalm to Foolproof: Avoiding Common Production Pitfalls

by Jon McCartie

In the video "Facepalm to Foolproof: Avoiding Common Production Pitfalls," presenter Jon McCartie shares insights from his experience at Heroku to help Rails developers navigate common issues they face when deploying applications in production.

The talk focuses on several recurrent "facepalm" moments that can arise when transitioning from development to production environments. McCartie uses a fictional startup, Airless B&B, as a playful example to illustrate the various pitfalls developers might encounter. Throughout the presentation, he emphasizes the importance of understanding server configurations, application performance, and deployment practices.

Key Points Discussed:

  • Choosing the Right Web Server: McCartie highlights the common mistake of using Webrick in production, recommending Puma for better concurrency and performance.
  • Memory Management: He explains the significance of monitoring memory use, noting that memory leaks and running too many workers can both push an app into swap, which severely impacts performance.
  • SSL and Security: McCartie stresses the need for secure connections and how to effectively set up SSL in Rails applications.
  • Using the Asset Pipeline: He discusses the asset pipeline's role in managing file versions and the necessity of using Rails' built-in helpers to avoid common errors related to asset caching.
  • Proper Logging Practices: Highlighting the 12-factor application principles, he advocates for logging to standard output rather than local log files for better insight and management.
  • Deployment Practices: McCartie recommends using tools like Mina for fast and efficient deployments while avoiding issues associated with manual restart scripts.
  • Secrets Management: He explains how to securely manage credentials using environment variables or the secrets.yml file to prevent exposing sensitive data.
  • Database Indexing: The importance of proper database indexing is discussed, noting how it can drastically improve query performance.
  • N+1 Query Problems: He warns about performance issues caused by N+1 queries and offers solutions within Rails to mitigate this.
  • Background Jobs: McCartie emphasizes the use of background jobs for tasks that can cause delays in user experiences, such as sending SMS messages or processing uploads.

Conclusion:

The overarching message of McCartie's presentation is that debugging and optimizing Rails applications in production requires continuous learning and adapting to challenges. Developers should not be discouraged by failures but instead use them as learning experiences to architect better applications. By employing the tips shared in this talk, developers can enhance their app performance and user satisfaction as they ship their products.

The presentation was delivered at RailsConf 2016 and is a valuable resource for both novice and experienced Rails developers alike.

00:00:09.830 All right, we're going to get started. I have the unenviable job of speaking last today, which means you’re tired, you want to go to sleep, or you haven’t had enough cake, or all your coffee is crashing. I’m nervous, so let’s do this. Turn to the person next to you and say, "I promise you I will stay awake for this entire talk." I'll wait.
00:00:20.100 Okay, that’s a lot more conversation than I was expecting! Now turn to your second choice, the person you didn’t want to talk to, and say, "Did you know the guy on stage is homeless?" It’s not true, I don't know why you could say that about me, but we’ll get to that in a second. Hi, my name is Jon McCartie, and I work at Heroku. I help people fix things there, and I am not homeless. What kind of house do I live in? This is my house.
00:01:10.220 All right, the link is bit.ly/vitdly/railsconf16th, or something like that. Go ahead, thank you! So, where was I? Oh, the homeless joke... No, that’s my house. There’s a reason I’m telling you this. Last year, about a year and a bit ago, I sold my house and moved into a 180-square-foot trailer. I’ve been living on the road for about a year now. This sounds kind of scary to a lot of people, especially when you throw in three kids and a wife, and that’s when everyone goes, "Oh!" So yes, that’s my family.
00:01:38.510 It’s really cool; we get to travel around. I work remotely for Heroku. Sixty percent of engineering is remote, and about fifty percent of the entire company. We get to go to some really cool places. That’s up in Washington State last summer. This is my home office. There’s my desk chair, and my solar panels are on my roof.
00:02:05.120 I tell you all this because I love traveling, which sets us up for what we're going to see here. I’m so tempted to hit play! Do I dare? Oh wow, okay, I don’t know what slide is coming next because I’m mirrored, but we’ll work with it. I’m telling you this because I love to travel, and in light of today’s talk, we will cover some common issues we see at Heroku support.
00:02:30.260 All of these issues we are going to discuss are facepalm moments. You know, those moments when you think, "My app is so slow, what's going on? It’s probably Heroku's fault..." But I can’t tell you who made these mistakes. Some of them were mistakes I made early on while writing Rails, but many of these are common customer issues that come up. They write in saying things aren’t going right, and as you start to poke around and ask some questions, a lot of times it’s a really simple thing.
00:03:07.020 So I can’t tell you who they are, so we’re going to tell this entire story using a made-up company. I like to travel, so we’re going with a travel theme. I think Airbnb is a fantastic site, and I use it all the time, but they haven’t really looked beyond the horizon – literally, into space for rentals. So we’re going to take advantage of that, and today I’m launching Airless B&B, my new startup! If you’re a VC, please talk to me afterwards. If you’re not, you’re probably not a very good VC.
00:03:48.820 We’re launching now, thank you very much! Things are going okay. In the last ten seconds, we launched, and just poking around to see how things are going. It looks like my Rails app is capable of about 40 requests per second. I start to poke around, wondering what's going on. If we boot up the server, this is what’s going on – and if anyone knows what I’m talking about with the previous Rails versions, they're probably shocked.
00:05:01.240 There’s a shocking number of people running Webrick in production. Hopefully, that’s not you! If you don’t know what web server you’re running, take note: if it’s Webrick, you need to change it. That's a facepalm moment. Let’s talk about Puma real quick. Puma is the web server that we recommend at Heroku. We used to recommend Unicorn; it was great for a long time, but it’s susceptible to slow request attacks, so we kind of stay away from that.
00:05:41.490 There are a lot of great things that Puma will do. You can look up the benchmarks on their website; they’re fantastic. We’ll talk a little bit about the kind of knock that Puma gets, which is that it’s threaded. If you’re using MRI, which doesn’t really run threads in parallel, what's the point? In reality, most Rails apps are I/O-bound, so while one thread is waiting on the database, another can run, and your app can take advantage of threads even in MRI. Puma is fantastic and really easy to get started with.
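The point about I/O-bound apps and MRI threads can be seen in a small sketch. This is plain Ruby, not Puma itself; `fake_request` is a made-up stand-in for a request that spends its time waiting on a database, with `sleep` playing the part of the wait.

```ruby
# Simulate an I/O-bound request: sleep stands in for waiting on the database.
# MRI lets other threads run while one thread is sleeping or blocked on I/O.
def fake_request(wait = 0.1)
  sleep(wait)
  :ok
end

# Run a block and return how many seconds it took.
def elapsed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

request_count = 4

# One at a time, the way a server with no concurrency would handle them.
sequential = elapsed { request_count.times { fake_request } }

# All at once, one thread per request, so the waits overlap.
threaded = elapsed do
  request_count.times.map { Thread.new { fake_request } }.each(&:join)
end

puts format("sequential: %.2fs, threaded: %.2fs", sequential, threaded)
```

The threaded run should finish in roughly the time of one wait rather than four, which is the effect a threaded server relies on when requests are mostly waiting.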
00:06:37.350 Just throw it in your Gemfile, boot it up, and it’s simple. This is just a Home Controller, and I’m not doing any kind of context switching. Immediately, I get double the throughput. If you’re wondering why you shouldn’t use Webrick, it’s mainly because it has zero concurrency. Technically, it does, but in Rails, it has no concurrency.
00:07:06.720 What that means is it can only handle one request at a time, which is terrible because while one request is being processed, if another one comes in, it has to wait, and it’s blocking. With Unicorn or Puma, you are able to fire up multiple workers and handle multiple requests at a time. That is a fantastic thing.
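A minimal sketch of a Puma setup with multiple workers might look like the following. The worker and thread counts are assumptions for illustration, not numbers from the talk, and should be tuned against your own memory use. It assumes `gem "puma"` is already in the Gemfile.

```ruby
# config/puma.rb

# Number of worker processes. Each one is a full copy of the app,
# so each one costs memory. The default of 2 here is an assumption.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

# Minimum and maximum threads per worker (5 is an assumed value).
max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads max_threads, max_threads

# Load the app before forking so workers can share memory where possible.
preload_app!

port        ENV.fetch("PORT", 3000)
environment ENV.fetch("RACK_ENV", "development")
```

With a file like this in place, the server can be started with `bundle exec puma -C config/puma.rb`.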
00:07:53.020 Going back to the general issues we face, almost every single one starts with, "My app is running really slow." This is one of the hardest places to start helping someone debug what's going on because you have to start digging around to figure out the reason. In this case, I started poking around and saw a Splunk graph. The red shows how much memory has been used, and the blue shows something I don’t want to see, which is my app using swap.
00:08:37.840 Swap. The first time I learned about it, I got a Linux book about this thick, and I won’t cover that now, but let me give you a quick explanation. If you don’t know, your operating system has a certain amount of RAM. After that, once it fills up, it starts saving stuff to disk. Your disk is much slower than memory, so when your app starts going to swap, you will experience a big performance hit.
00:09:19.940 There are a couple of things we usually see. The first one is memory leaks, which are nasty and take a longer time to talk about. I’ll give you two links to check out later. The first is a library by Richard Schneeman at Heroku – it's a fantastic little tool to run against your app to see if you have a memory leak. If you are concerned about that, check it out. He also has a great blog post on how to debug memory leaks.
00:10:21.399 The second thing, which hopefully number one doesn’t apply to you, is the more common issue we see: people running too many workers. They have Puma and think, "How many workers should I run?" Usually, they say five because it's a cool number. But they don’t think about what that number actually means, especially when it applies to memory.
00:11:01.880 A typical Rails app runs out of the box around 180 to 200 MB. So if you’ve got a gig of RAM, let’s say a gig is 1024 MB. You say, "I’ll run five workers" – the problem now is that you're right up against that limit. If your app needs to allocate more objects (like if you're doing any kind of image processing), memory use balloons, and you start swapping. The better way to look at it is to cut back on your web processes, see how your memory looks, and adjust from there.
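The arithmetic above can be written down as a small helper. This is only a sketch: `max_workers` is a hypothetical function, and the 30% headroom default is an assumption chosen for illustration, not a figure from the talk.

```ruby
# Estimate how many worker processes fit in RAM while keeping some
# headroom for memory growth (image processing, large responses, etc.).
def max_workers(total_mb:, per_worker_mb:, headroom: 0.3)
  usable_mb = total_mb * (1.0 - headroom)
  [(usable_mb / per_worker_mb).floor, 1].max
end

# No headroom: five 200 MB workers "fit" in 1024 MB, right at the limit.
puts max_workers(total_mb: 1024, per_worker_mb: 200, headroom: 0.0) # prints 5

# With 30% headroom (an assumed safety margin), only three fit.
puts max_workers(total_mb: 1024, per_worker_mb: 200)                # prints 3
```

Measuring the real per-worker footprint under load, then adjusting, is still the better guide; the helper only shows why "five because it's a cool number" leaves no room to grow.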
00:11:58.480 When you start going into swap, you're going to have a bad day. Checking this out is easy; just go to your server and type "free -m" and you’ll get your total amount of memory, along with how much is used and how much is free. If you have eight gigs, your operating system will allocate memory out of that, and if you see the free number going down while swap usage goes up, you need to be concerned.
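The same check can be scripted. The sketch below assumes a Linux server, where `/proc/meminfo` reports figures similar to what `free -m` shows. The helper names are hypothetical and the sample numbers are made up for illustration.

```ruby
# Parse /proc/meminfo-style text ("Key:   12345 kB") into a hash of megabytes.
def parse_meminfo(text)
  text.each_line.with_object({}) do |line, info|
    key, value = line.split(":", 2)
    next if value.nil?
    info[key.strip] = value.to_i / 1024 # kB -> MB; to_i ignores the " kB" suffix
  end
end

# Swap in use is the difference between total and free swap.
def swap_used_mb(info)
  info.fetch("SwapTotal") - info.fetch("SwapFree")
end

# Made-up sample in the /proc/meminfo format, for illustration only.
sample = <<~MEMINFO
  MemTotal:        8388608 kB
  MemAvailable:     524288 kB
  SwapTotal:       2097152 kB
  SwapFree:        1048576 kB
MEMINFO

info = parse_meminfo(sample)
puts "total: #{info['MemTotal']} MB, swap in use: #{swap_used_mb(info)} MB"
```

On a real Linux box the input would come from `File.read("/proc/meminfo")` instead of the sample string.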
00:12:39.680 Moving on, I’ve got people wanting to pay me, which is awesome! I found a great place on Mars – one bedroom. The problem is we have a bad guy who discovered an unprotected part of my website and has taken the money, and now things are on fire. This happens more commonly than you think.
00:13:09.440 We put up an SSL certificate, and everybody sees the little 's' at the end. Why are people going here? Because they didn’t know otherwise, and they’re going to the checkout page, which is completely unprotected. It’s really easy to fix this in Rails. We’ll talk about how to set up Rails and Nginx.
00:13:43.500 The first one is to go into production.rb and set force_ssl to true. This will take any request that comes in over plain HTTP on port 80 and immediately redirect it to HTTPS. However, doing this in your app means every such request has to reach your Rails app, which adds lots of overhead for just a redirect. So you can do this higher up: in Nginx, listen on port 80, and when someone comes in on port 80, send them to port 443.
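On the Rails side, the setting described here is a single line. A minimal sketch, assuming the standard production environment file:

```ruby
# config/environments/production.rb
Rails.application.configure do
  # Redirect plain HTTP requests to HTTPS. Rails also marks cookies as
  # secure and sends an HSTS header when this is enabled.
  config.force_ssl = true
end
```

If the redirect is already handled by a proxy in front of the app, this setting still acts as a safety net for any request that slips through.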
00:14:44.260 We need to move on; I have a few more topics to cover, and I’ve got 25 minutes left. We are going to change our logo – the rocket ship didn’t test well; we’re going to change it to a comet. It’s super easy to do, so we change it, we push it, and now our logo is gone – nobody knows where it went.
00:15:01.260 It's an image, so it should be an IMG tag. As a developer and designer, I might have done that. But why did this happen? Our friend the asset pipeline gets a bad rap; it’s unfortunate because I love the asset pipeline. If you’ve ever had to deal with asset caching – like before the asset pipeline – it was so much worse.
00:15:41.620 So this image, my logo here, turns into this MD5-digested filename, which allows for easier cache busting. Basically, the asset pipeline looks inside your image, and if it’s changed, it will generate a new filename, which is how we do cache busting. But the problem is, someone wrote a raw IMG tag and linked directly to the logo file instead. It’s an easy fix – just use the image_tag helper. This isn’t as common as the CSS version, but you should make sure to use the asset pipeline wherever possible.
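The cache-busting idea can be sketched in plain Ruby. This is not the asset pipeline's actual code: `fingerprinted_name` is a hypothetical helper, and MD5 is used only because the talk mentions it (your Sprockets version may use a different digest).

```ruby
require "digest"

# Build a filename that changes whenever the file's contents change.
def fingerprinted_name(filename, contents)
  ext  = File.extname(filename)        # ".png"
  base = File.basename(filename, ext)  # "logo"
  "#{base}-#{Digest::MD5.hexdigest(contents)}#{ext}"
end

# Stand-in strings for the bytes of the old and new logo files.
rocket = fingerprinted_name("logo.png", "rocket ship bytes")
comet  = fingerprinted_name("logo.png", "comet bytes")

puts rocket
puts comet
```

Because the two names differ, a hard-coded path to the old file stops working after a deploy, while a helper such as `image_tag "logo.png"` looks up the current fingerprinted name for you.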
00:16:35.430 When you've got your CSS, stop using plain url() to link to assets, because this will not use the asset pipeline at all. Very few people seem to notice this, but Rails provides CSS helpers. For example, image-url will handle it all for you. If you’ve got a fonts folder, font-url handles font assets. There’s also a base asset-url. Make sure that you’re using these wherever you can. Teach the rest of your team to always use these; your designers need to use these helpers too.
00:17:39.170 The precompile list is important too. In recent versions of Rails, it will actually warn you if the necessary items aren’t in the precompile list. If you try to render another JS or CSS file not in this list, you’ll run into issues. By default, Rails will precompile application.js and application.css. If you want another, ensure you add it to the list so it gets precompiled. This is another pitfall we see: it works locally, but when pushed to production, it hasn’t been precompiled, resulting in errors.
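Adding to the precompile list is typically a one-line change. A sketch for the Rails versions current at the time of the talk; `admin.js` and `admin.css` are hypothetical file names used for illustration:

```ruby
# config/initializers/assets.rb

# application.js and application.css are precompiled by default.
# Any other top-level file you link to from a layout has to be listed.
Rails.application.config.assets.precompile += %w[admin.js admin.css]
```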
00:18:16.640 Now, we’ve got a bug. Some ask me if it happens a lot, and I say, "I don’t know; it happened once or twice." Let’s check the logs. When we check the production log, the problem is we have multiple application servers. Now I have multiple production logs, and on top of that, they get really large. If you've never read about 12-factor apps, I highly encourage it.
00:19:07.840 Let’s take a look at what it says about logging in a 12-factor app: a 12-factor app never concerns itself with routing or storage of its output stream. It should not attempt to manage log files. Instead, each running process writes its event stream unbuffered to standard output. This means your app should not be sending stuff to production.log. I’ve seen many apps – usually not on Heroku – that just decide to dump everything into production.log.
00:19:58.610 So, log to standard output! If you log to standard output, now you’re doing what a 12-factor app should do. Sending everything to standard output will allow you to consume it in a much better way – you can use tools like Hadoop, Hive, Splunk; or you can use an add-on like Logentries or Papertrail to capture those logs for better insight into your app.
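In plain Ruby, logging unbuffered to standard output looks roughly like this. `build_stdout_logger` is a hypothetical helper written for this sketch.

```ruby
require "logger"

# Build a logger that writes to the given IO (standard output by default).
def build_stdout_logger(io = $stdout)
  io.sync = true # flush each write immediately instead of buffering
  Logger.new(io)
end

logger = build_stdout_logger
logger.info("booking created")
```

Inside a Rails app the equivalent is usually a line such as `config.logger = ActiveSupport::Logger.new(STDOUT)` in `production.rb`, after which whatever runs the process decides where the stream goes.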
00:20:42.420 Alright, moving on to shipping. Ship, ship, ship! Everybody wants to ship. This is our deployment script. Go ahead and find it. This is an actual restart script someone gave me when I took over a friend's project. The former engineer said this is our restart script. What do you think happens when a user comes in, and we are sleeping? It shows them an Nginx 503 error.
00:21:04.270 The argument is that we shouldn’t do rolling restarts because you could have two different versions of your app running at once during the Unicorn restart. That’s a whole other conversation, but this is bad. What we need is an actual legitimate way to deploy our app without having to SSH into a server and run this manually.
00:21:52.270 You may have heard of Capistrano, and I hope you’ve heard of Mina. If you have not, I highly recommend Mina; it’s my favorite way to deploy a Rails app. I started deploying with Capistrano and found Mina. It’s really similar but works faster. Mina does the same thing where it does a shallow clone out of Git, makes a new version folder, symlinks, bundles, and runs migrations for you.
00:22:55.460 What’s even cooler is that, unlike Capistrano, which fires every command in its own SSH tunnel, Mina looks at the entire thing, bundles it up, opens one SSH connection to all your servers, and runs the whole thing at once on the server, making it much faster. This is what it looks like when you deploy. Oddly, it’s similar to something else I’m about to show you.
00:23:47.550 It will clone the Git repo and do all of the business, then symlink and restart the server after. Another way to do this is via Heroku. You create your app, then run "git push heroku master". Whether it’s this or using Mina, you need a good way to deploy if you want to ship frequently.
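A Mina deploy script covering the steps described (clone, link shared paths, bundle, migrate, restart) might look like the sketch below. It follows the template generated by Mina 1.x; the 0.3.x series that was current around the time of the talk used a slightly different syntax. The domain, path, and repository values are placeholders, and the restart command is an assumption that depends on your server.

```ruby
# config/deploy.rb
require 'mina/rails'
require 'mina/git'

# Placeholder values, replace with your own.
set :domain,     'example.com'
set :deploy_to,  '/var/www/example.com'
set :repository, 'git@example.com:me/airless.git'
set :branch,     'master'

desc 'Deploys the current version to the server.'
task :deploy do
  deploy do
    invoke :'git:clone'
    invoke :'deploy:link_shared_paths'
    invoke :'bundle:install'
    invoke :'rails:db_migrate'
    invoke :'rails:assets_precompile'
    invoke :'deploy:cleanup'

    on :launch do
      in_path(fetch(:current_path)) do
        # Restart step: swap in whatever restarts your app server.
        command %(mkdir -p tmp/)
        command %(touch tmp/restart.txt)
      end
    end
  end
end
```

With this in place a deploy is a single `mina deploy` from your machine.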
00:24:46.270 Teams that are usually unable to deploy quickly are those that struggle with these deployment practices. They think they’ll get to them at some point and they never do. They get stuck because everyone’s busy doing something else and by the time they need it, it’s too late to deploy quickly. So, have a way to deploy quickly so you can ship.
00:25:41.840 Next problem: someone got into all of our S3 images and swapped them out. Someone nefarious has taken my beautiful cosmos pictures and replaced them with something less appealing. How could this happen? A while ago, someone decided to put S3 credentials into the Git repo. Something happened – someone forked a part of the app, maybe someone took the app and put it on GitHub and deleted it, but it's still there.
00:26:20.360 A couple of years ago, you could actually search all of GitHub for S3 credentials. People have S3 credentials all over the place. Don't do it! There are much better ways of managing this. Inside Rails now, there’s something called secrets.yml. This version works pretty well; I have another way, but if you like this version, go this route.
00:27:04.200 Throw your keys inside of secrets.yml and make sure to add it to your .gitignore, then use it inside Rails like this. Personally, I much prefer environment variables because all your keys should be in environment variables. You don't commit them; they are different based on the environment you're running. It's easy to do in development with a gem called dotenv. You make a .env file, add that to .gitignore, and it has your key values.
00:27:38.620 On Heroku, it's easy to run 'heroku config:add' with your key value. Then in your app, just call ENV and the variable name as needed. This is the proper way to store credentials. Please do this! It’s tempting, especially when you're rapidly developing an app, just to say, "I’ll throw them in here and deal with it later." Start off on the right foot by using environment variables from the beginning.
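Reading credentials from the environment can be sketched as follows. The helper and the variable names are assumptions for illustration; use whatever names your app and hosting setup define.

```ruby
# Read S3 credentials from the environment. fetch raises KeyError when a
# variable is missing, so a misconfigured server fails loudly at boot
# instead of silently running with nil credentials.
def s3_credentials(env = ENV)
  {
    access_key_id:     env.fetch("AWS_ACCESS_KEY_ID"),
    secret_access_key: env.fetch("AWS_SECRET_ACCESS_KEY")
  }
end

# A stand-in environment with example values, for demonstration only.
demo_env = {
  "AWS_ACCESS_KEY_ID"     => "example-id",
  "AWS_SECRET_ACCESS_KEY" => "example-secret"
}

puts s3_credentials(demo_env).keys.inspect
```

In a real app you would call `s3_credentials` with no argument so it reads from `ENV`, which dotenv fills in development and the platform fills in production.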
00:28:45.930 I’m going to hurry up because I want to do questions. Here’s my database graph, or this is my app graph in New Relic. The orange chunk right there is my database. This does not look good. What on earth could be causing this? When we were a smaller app, looking up users by their username took 2.3 milliseconds.
00:29:09.520 But after twenty minutes of being up, with 1000 users, that call now takes 51 milliseconds! Who would like to tell me what’s wrong? Indexing! That little thing is now this big problem, and it’s only going to get worse because we don’t have an index on the username. Database indexes basically tell your database, "This column is something that’s important to me; I want you to index it because I am going to make calls against this column frequently." It’s so easy to do in Rails.
00:30:07.060 You just create a migration, add the index to the table, and all of a sudden, this 51 millisecond call goes down to one millisecond. Asterisk here: composite indexes – I didn’t know about this either. If I want to look up by username and something else, suddenly that original index no longer applies, and I need to create a composite index. This is also easy to do in Rails: pass an array of column names, and you’ll have a composite index.
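A migration along these lines would add both kinds of index. This is a sketch: the `:city` column is hypothetical, standing in for the "something else" mentioned above, and the `[5.0]` version tag should match your own Rails version (on Rails 4.2, inherit from plain `ActiveRecord::Migration`).

```ruby
# Generated with something like: rails generate migration AddIndexesToUsers
class AddIndexesToUsers < ActiveRecord::Migration[5.0]
  def change
    # Single-column index for lookups by username.
    add_index :users, :username

    # Composite index for lookups that filter on both columns.
    # :city is a hypothetical column used for illustration.
    add_index :users, [:username, :city]
  end
end
```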
00:30:36.620 Alright, two more problems to blow through. We are so close to a billion-dollar valuation – it’s going to be great! My first problem: the app is a little slow, and I’ve got this controller iterating over photos. In my view, I am rendering the photo and then the user's name. What does my log look like?
00:30:55.310 I’m getting all the images, and then I have all of these user loads! What’s going on? It’s an N+1 problem. You can easily overlook this, especially while writing an app. Your development environment isn’t under the load that your production environment will face, so what works locally gets thrashed by the number of users.
00:31:38.390 Basically, an N+1 happens when you load a set of objects that have a relationship and then iterate over them, loading the associated object for each one. It’s really easy to fix in Rails. Instead of going for 'Photo.all' or 'Photo.paginate' on its own, add 'includes'. This might reduce your performance issues drastically.
00:32:30.480 For development, use a gem called Bullet. This will show you when you have N+1 issues, indicating in your logs and prompting you to add the necessary includes. If you’ve got your photos, Bullet will tell you there’s an association called users – just add this to your finder and include 'users', and life will be magical!
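The query counts behind an N+1 can be shown without a database. The sketch below uses a made-up `FakeDb` class that simply counts calls; it is a model of the behavior, not ActiveRecord. In a Rails app the eager version would be something like `Photo.includes(:user)`, assuming `Photo` belongs to a `user`.

```ruby
# A tiny stand-in for a database that counts how many queries it receives.
class FakeDb
  attr_reader :query_count

  def initialize(photos, users)
    @photos = photos
    @users = users
    @query_count = 0
  end

  def all_photos
    @query_count += 1
    @photos
  end

  def find_user(id)
    @query_count += 1
    @users.fetch(id)
  end

  # One query that returns every requested user at once.
  def find_users(ids)
    @query_count += 1
    @users.select { |id, _name| ids.include?(id) }
  end
end

users  = { 1 => "ana", 2 => "ben" }
photos = [{ id: 10, user_id: 1 }, { id: 11, user_id: 2 }, { id: 12, user_id: 1 }]

# N+1: one query for the photos, then one more per photo for its user.
naive = FakeDb.new(photos, users)
naive.all_photos.each { |photo| naive.find_user(photo[:user_id]) }

# Eager loading: one query for the photos, one for all of their users.
eager  = FakeDb.new(photos, users)
loaded = eager.all_photos
owners = eager.find_users(loaded.map { |photo| photo[:user_id] }.uniq)
names  = loaded.map { |photo| owners.fetch(photo[:user_id]) }

puts "naive: #{naive.query_count} queries, eager: #{eager.query_count} queries"
```

With three photos the difference is small, but the naive count grows with every row while the eager count stays at two.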
00:33:06.040 Last one: users complain that certain actions, like adding photos, are slow. You’ll find evidence of cases where milliseconds affect users' happiness. People will leave websites if things take too long, so this is pretty important.
00:34:04.640 When I look into New Relic, I see that my app is snappy except for one thing: the 'photos_controller' takes almost five seconds. I look into it and see that I'm spending 66.1% of my time dealing with S3. The request hits the controller, then creates this photo, then it goes off to S3 and waits.
00:35:12.470 The user is now waiting for something that doesn’t really matter to them yet. They want to upload a photo and carry on. We’ll talk about background jobs. Here’s another use case: sending an SMS. The user comes in, and we need to wait for the SMS provider before returning a response.
00:36:05.080 This is a great example of when to leverage a background worker. A better example is through mailers – if you look at the Rails documentation, you’ll find examples using mailers. But replacing SMS with a background worker allows the user to continue using the application without delay.
00:36:54.450 You can replace 'user.send_sms' in the controller with 'perform_later'. Active Job can use Sidekiq or any other job queues. This allows users to get back to what they are doing while the SMS job takes care of processing.
00:37:43.050 This is what a basic job looks like in Rails – it inherits from ActiveJob::Base and you just need to create a perform method to execute the task. You can do tasks like uploading the photo, sending SMS, or whatever it is. This is small, bite-sized code that can save a ton of time, and what's even better on Heroku, you don't need as many app servers because they aren’t blocked waiting on resources.
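A basic job of the kind described might look like this sketch. `SendSmsJob` is a hypothetical name, and it assumes a `User` model with the `send_sms` method mentioned earlier.

```ruby
# app/jobs/send_sms_job.rb
class SendSmsJob < ActiveJob::Base
  queue_as :default

  # Runs on a background worker, outside the request/response cycle.
  def perform(user_id)
    User.find(user_id).send_sms
  end
end

# In the controller, enqueue the job instead of calling user.send_sms inline:
#   SendSmsJob.perform_later(user.id)
```

The controller returns as soon as the job is queued, and whichever backend Active Job is configured with (Sidekiq, for example) picks it up afterwards.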
00:38:35.020 All this allows you to utilize fewer servers, which is great for the bottom line. That’s it! We sold our app and made a billion dollars off a terrible idea. But mostly, it was because we fixed all the problems we've talked about.
00:39:52.530 We’re now pivoting toward building a Tinder app for cats – it’s going to be good. I don’t want you to steal that idea!
00:40:28.370 So, the bottom line here is: if you are new or have been around for a long time, when you’re running a Rails app in production, you’re going to break things. The important part is when you do fail or something breaks, learn quickly and keep learning.
00:40:45.160 There are many things changing in Rails, and so much changing regarding the state of the servers. Don’t get discouraged – keep learning and you’re going to be great. Thank you very much!