Eric Lindvall

Automation in Deployment on Hybrid Hosting and Private Cloud Environments

By Fletcher Nicol and Panel

In a world of public and private clouds, API-driven load balancers, and bare metal servers, there has never been more choice when building your next scalable killer application. As the complexity of your application's deployment environment increases, the economics of automation start to pay off. In this session we'll discuss the challenges facing complex application deployments, strategies to make development environments mirror production, and how you can manage architectural changes with your application over time. Automate all the things? Let's find out!

RailsConf 2013

00:00:12.259 Thank you. So, hey.
00:00:17.820 We're going to be talking about deployment and automation and, honestly, lots of buzzwords in this title.
00:00:25.619 All the buzzwords. All the buzzwords, yeah, we did that. So I'm Fletcher Nicol,
00:00:31.679 I'm a software engineer at Blue Box out of Seattle. So while my name's on here on this
00:00:39.600 talk, I'm hoping to actually not do a whole lot of talking, because there's an amazing panel down
00:00:45.059 here that can talk much more at length about these things than I can. So I thought what I'd do is
00:00:51.180 just kind of go down the row and have each panelist introduce themselves,
00:00:56.340 because they can do that far better than I can, I think. Great, thanks. I'm Matthew Kocher, I'm a
00:01:02.520 developer at Pivotal Labs. I've been working on Cloud Foundry. Now, you work at Pivotal now? I work at Pivotal Labs;
00:01:09.600 we're a division of Pivotal now. It's very confusing, frankly, but just
00:01:15.119 call it Pivotal and we're fine. We are now carrying forward the development of Cloud Foundry, which is an
00:01:21.360 open source platform as a service, and I'm leading the engineering effort there. So we're actually putting a huge
00:01:27.360 amount of engineering resources behind it to make it the deployment platform
00:01:32.460 that anyone uses for deploying apps. My name is Dr Nic. I used to be a VP
00:01:38.579 of engineering and technology at Engine Yard, now doing consultancy around Cloud Foundry, so I am in charge of harassing
00:01:44.700 Matt, pretty much. It's an awesome job, and we have a lot of fun, though not nearly as much when
00:01:50.460 we're actually harassing each other, because I lose every time.
00:01:55.680 So, really looking forward to winning. Not actually true; I look forward to winning a couple of points today in front
00:02:00.720 of the audience, so we keep track of it. Yeah, so, but there was no end to that, I'll just keep talking. So anyway.
00:02:07.079 Yes, right, now. My name is James Casey, I'm a development lead at Opscode, where I
00:02:13.319 work on Chef, mostly on the backend servers. Before that I worked at BankSimple
00:02:18.840 here in Portland, deploying an online bank with Ruby, JRuby, Scala, and Java into
00:02:25.739 EC2. My name is Andy Camp, and I'm a development manager at King of the Web,
00:02:31.520 which is, uh... That is an awesome name. That's like one bit of the webmaster.
00:02:38.160 Which is a startup around online video, where we give a cash prize to whoever gets the most votes, but
00:02:43.620 centered around independent web video creators. Before that I was at hark.com, which is a site centered on audio, and I
00:02:51.959 started at Hark with hardware on rented rack space, and I've deployed on raw EC2,
00:02:59.060 Engine Yard, Blue Box, and personal sites on DreamHost. So, a variety of different
00:03:05.340 deployment strategies. My name is Eric Lindvall, I'm co-founder of
00:03:10.500 Papertrail. We help you see what's going on in your server and application logs
00:03:15.959 and help do something about it. I don't have anything else to say.
00:03:24.180 So I think what we're trying to do is get sort of a mix of
00:03:29.760 people involved with businesses that were deploying, like, their business, as well as some more, like, platform and
00:03:36.060 tool makers and that kind of thing, and sort of see where things are good, where things are
00:03:41.159 not so good. The, you know, best case scenario is
00:03:46.200 everybody's business gets bigger and better, but then, you know, all these
00:03:51.239 complications. Yeah, let's talk about city start, let's talk about automating failing businesses.
00:03:56.540 Failing businesses they want to admit to. Well, let's talk about that. So I guess I'm just curious to see sort
00:04:02.700 of what everybody's, I guess, into these days, for lack of a better word,
00:04:08.640 around sort of solving this problem, at least for themselves. So I guess speaking for myself, from Blue
00:04:15.060 Box, as a hosting provider we get to see lots of different
00:04:20.699 customers and different ways that they kind of do deployment. We're also sort of
00:04:25.919 a hybrid model of virtualized resources and bare metal servers, so we have this
00:04:33.060 additional bonus and challenge of sometimes helping customers provision
00:04:39.000 their application on what is traditionally resources you could provision over an API,
00:04:45.240 and then these crazy, like, servers, like real actual servers.
00:04:51.500 And that can be kind of a challenge as you grow your business and you get more of these crazy real servers.
00:04:59.280 What I would like to do is put APIs around that and just treat that hardware
00:05:04.440 like software. So we've been kind of looking into how we make our lives better internally on
00:05:10.680 our platform, as well as, you know, extending that out to customers
00:05:18.120 in a way that, you know, if you sort of know the needs of your application,
00:05:23.280 you just sort of want to check boxes or make API calls saying, you know, I want six of these and four of those. What's
00:05:28.620 actually underneath it, it would be really nice if that was much more just opaque and, you know,
00:05:34.800 checkbox. The reality is that it's a very hard thing to do. When you're sort of doing one or the other you can kind
00:05:41.940 of specialize, but, um, yeah, that's me.
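
[Editor's sketch] The "I want six of these and four of those" idea can be illustrated as a desired-state comparison: the caller declares counts per server role, and a planner works out what to add or remove without caring whether the capacity underneath is virtual or bare metal. This is a minimal illustration, not Blue Box's API; the role names and the `plan` function are hypothetical.

```python
def plan(current, desired):
    """Compare running capacity with the declared target.

    Both arguments map a (hypothetical) server role to a count.
    Returns a list of (role, delta) pairs: a positive delta means
    "provision this many", a negative delta means "retire this many".
    Roles that already match the target are omitted.
    """
    actions = []
    for role in sorted(set(current) | set(desired)):
        delta = desired.get(role, 0) - current.get(role, 0)
        if delta != 0:
            actions.append((role, delta))
    return actions


if __name__ == "__main__":
    running = {"app": 4, "worker": 4}
    wanted = {"app": 6, "worker": 4, "cache": 2}
    for role, delta in plan(running, wanted):
        verb = "provision" if delta > 0 else "retire"
        print(f"{verb} {abs(delta)} x {role}")
```

Keeping the target as plain data means the same declaration can be checked against a staging environment before it is applied to production.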
00:05:47.039 Cool. I think I'd sort of echo the sentiment that what developers really
00:05:52.919 want is APIs around things they're going to have to interact with. So I think over the last few years Heroku has really
00:05:59.400 proven, at least to a lot of people, certainly in the Ruby community, that
00:06:06.000 a PaaS can actually solve these needs of how they're going to deploy their application for the vast majority
00:06:11.580 of them. And unfortunately we've also learned that us-east isn't a great place to host your application.
00:06:18.720 Why? That's where everyone else is. It's obviously awesome. It's where the cloud is. I know.
00:06:25.020 And obviously Heroku is doing work around being in multiple availability, or multiple regions. But what I love about their
00:06:31.740 first place: like, we're not just going to go outside of Virginia, we're gonna get the hell out of this continent. That's
00:06:37.259 right, going to Europe. And so I think that what
00:06:45.060 that's shown is that's really the interface that developers want: basically, here's my code, here's the resources it needs. It needs a
00:06:51.840 database, it needs a message bus, it needs, you know, 10 instances of it that
00:06:57.180 can serve n requests per minute. And sort of they want to specify that and just give that to the system and have it
00:07:04.139 run it. And so we, Pivotal, came across Cloud Foundry and really saw it as the
00:07:10.740 thing that would allow us to do that for customers that weren't happy with being in Amazon, that weren't happy with having
00:07:17.460 a closed-source solution to that. And so, sort of where Fletcher's working on
00:07:23.460 sort of these lower-level APIs that are "give me a load balancer, give me an instance", we're working on much higher-
00:07:29.280 level APIs of "here's an application, here's the code, run it 100 times on
00:07:34.319 different servers and distribute the load equally". So I think it's really the same issue, but we're sort of attacking
00:07:40.080 different levels of it. So, I work full-time; my customer is a large
00:07:48.000 enterprise company who is in the music business, and, uh, big
00:07:53.400 interest, big investment at the moment in redoing how we do it, and Cloud Foundry is
00:07:59.220 a large chunk of that. But so there's kind of that part of Cloud Foundry, which is, you know, why does an IT group stand
00:08:04.319 up Cloud Foundry? Because they don't want to hear from app developers: just do it yourself. No, no. So part of it is
00:08:09.780 providing the automation as a service. But nonetheless we then put Jenkins on top of that,
00:08:15.780 so we try to, both in a purely automated sense, when CI gets run, when
00:08:21.599 you can run CI, it'll deploy the app into a Cloud Foundry environment for the purpose of testing, then it will
00:08:27.360 run Selenium against that. With that same mechanism in place we can move apps from environment to environment as they sort
00:08:33.360 of go through from development, QA, through integration testing, staging, and, you know, production. Which, if
00:08:41.279 you're from the more agile sense or, you know, modern, if you've worked in a smaller place where you probably
00:08:46.620 couldn't even be bothered having that many, it's like, just keep changing production until customers complain. Larger organizations have done that
00:08:53.640 already and decided, no, let's not ever change production, we'll just keep failing in staging instead.
00:08:59.899 So, but we have all the automation around that. So that's one level of automation. The other automation we have to worry about, which we're not that good at, but why
00:09:07.019 we're pretty excited about the BOSH stuff that's coming through, is the automation of upgrading Cloud Foundry itself. It's all very nice to have nice
00:09:14.220 ways to deploy your apps, but what about all the, you know, changing... The infrastructure under the app itself
00:09:20.580 doesn't change as fast as the app, but still, eventually you're going to
00:09:25.860 want to get off MySQL 5.0, hopefully, Ruby 1.8.6, the gold standard, and these
00:09:33.000 sorts of things. You know, so you still want to be able to do staging and production of entire environment changes.
00:09:38.040 So yeah, the BOSH stuff that the Cloud Foundry group came up with is certainly something we're interested in.
00:09:44.760 I mean, I think the first thing for me, as the configuration management person, you know, is to say
00:09:51.480 that you all should use configuration management, whatever your problem is.
00:09:56.519 Exactly right. And I realize, at home, children: configuration management.
00:10:03.120 And it doesn't really matter what you do, but, you know, what we're talking about here is really agility. We're
00:10:08.279 talking about business agility. We're talking about the situation where your company is growing fast and what you want to be able to do is react quickly
00:10:13.980 to that. And some of those reactions you have to do might be very small, might be an app upgrade; it might be, as
00:10:19.680 Nic said, you know, I need to get off MySQL, I need to change out, I need to use a completely different infrastructure underneath. And, you know,
00:10:27.300 the real key of configuration management, and I guess how we always put it, is infrastructure as code. We're all
00:10:32.940 developers, we're all engineers. If we treat our infrastructure... it's sort of a level past "we want APIs": we want our
00:10:38.640 infrastructure to be code. Which is really cool, because then you can test it in one place and rerun that same
00:10:45.480 infrastructure. It's like, imagine just, you know... anyway. So like, yeah, I mean, how do I construct... you know, for me
00:10:52.500 one of the really interesting things about the cloud, about AWS, was having, you know, networking as code. And, you know,
00:10:58.620 that's a big shift for a system administrator, to say AWS networking is code. Very slow code. Very slow code, but
00:11:05.640 the fact that you can provision a second network interface as code without touching hardware, it means that comes into the realms of developers. So imagine
00:11:12.360 putting all of your infrastructure into that realm, where you can change it and you can spin it up. You can build your
00:11:17.519 Jenkins cloud, you can spin it up, you can test it and you can see what happens, and it means you can start to experiment
00:11:22.560 with your infrastructure just the way you experiment with your code. Yeah, and it's not this... you can bring
00:11:28.320 it back to the realm of developers, right, as opposed to these magical people with beards. Sometimes they become developers as well.
00:11:38.940 So, from a customer standpoint, for someone using it, one thing that has worked really well for us
00:11:44.100 at King of the Web: we had the challenge that we're a contest that gives away cash prizes, and from the beginning we wanted
00:11:49.380 to have a public vote count that people could see, and we'd know the winner the minute it was over.
00:11:55.320 And we're giving away the prize for what people voted on for the best content, and then we found that content
00:12:00.540 didn't actually win, but personalities did. And these personalities are people on YouTube who have enormous social power,
00:12:07.079 and they'd make a video, they'd do a tweet, and they'd deliver 10 to 50 times their baseline traffic in about 15
00:12:12.959 minutes. So what, is that cheating? No, and we
00:12:18.899 would also detect cheating. People would try and create multiple accounts, game our voting system, and we'd do that as well.
00:12:24.899 And so we planned for scaling from the beginning, but we now had this new challenge: people would run, and
00:12:30.420 they're in Singapore and they do this at two in the morning, and we need to bring up 20 servers in about 15 minutes, or as quick as
00:12:36.959 possible. And that actually was a process that went very well for us: using New Relic's
00:12:42.959 API to detect traffic, using Blue Box's API to bring up servers, and using Chef to then configure those servers once
00:12:48.959 they're brought up. You can image your existing servers and scale whatever service you need relatively quickly. And
00:12:56.579 I think that was a service... we wrote a custom Rails app that did this through rake tasks, because that's what we knew
00:13:01.800 how to write. And that was something that works very well, I think, in today's cloud
00:13:07.440 world. And again, with what they've said before, what was a challenge for us was the configuration changes, and
00:13:15.120 managing the changes and testing that well. So bringing up something that's static, that you already know works,
00:13:20.579 you can copy what you have. Making a change to your nginx config, or to your Resque workers, or to where they're going
00:13:28.560 to run or how many are running, and testing that successfully in an environment that replicates production,
00:13:33.680 has been a challenge. And I think that's really the next step that I'm looking for. I love what's coming out
00:13:40.200 in the Chef world, like librarian-chef, or there's a new one that I haven't used, Berkshelf.
00:13:46.440 So Berkshelf, I love that idea and what that's going to do for
00:13:52.200 managing our cookbooks. So anyway, there's some great progress being made. So, something that I wanted to actually
00:13:58.320 come back to, about your comment about an API for deploying physical servers, that I wanted to kind of drill into: what is
00:14:04.019 it that makes people say they want that? And I think that there's a couple of things. One of them is the visibility into the
00:14:09.899 amount of time that it will take before you can use that thing, and then the other is the repeatability of the process. And I think that
00:14:17.160 that is the fundamental nugget that is what makes people say "I want an API". And
00:14:23.339 so it doesn't even necessarily have to be a programmable API per se for people to get a lot of value out of it, as much
00:14:29.459 as just something that they're able to predict, and know that I have this much time
00:14:36.300 before I will be able to use this resource. And so that's been, to date, the thing that people like so much about the
00:14:42.660 virtual servers: in general, unless you have us-east getting overloaded, you
00:14:50.279 know you have that capacity that you can assume will be there. And that really is
00:14:56.040 missing right now in the world of deploying physical servers. That's actually a good point, and
00:15:01.500 that's something that, as we're doing this, now that you have that consistency
00:15:06.540 and the amount of steps that you need to go through is so much lower, you can start looking at, yeah, what's the
00:15:12.779 turnaround time, what's the deviation, is it consistent, is it not consistent, where are the bottlenecks.
00:15:19.620 Because you're going for, like, the shortest spin-up time that's as consistent as you possibly can. So yeah, I
00:15:26.220 agree: even if it's not necessarily programmatic, it's more just turnaround time for delivery that,
00:15:32.820 for me right now, I care about. Because even if I'm deploying my own gear,
00:15:37.860 I want that to be deterministic as much as possible. Well,
00:15:43.199 and I think that an API without those deterministic aspects of it isn't
00:15:49.920 necessarily going to solve the underlying problems that people hope it will. Yeah. Has anyone used the Amazon API
00:15:58.740 a lot? I mean, not just "I've done it", but, like, automated it? It's not extremely deterministic.
00:16:05.880 It lacks determinism, unfortunately. So I guess, are there any strategies for
00:16:11.279 kind of mitigating that? Because, yeah, we do live in the cloud, but I mean... So, by the way, if you don't
00:16:17.160 understand what... there's a thing where they have multiple endpoints for the API, and it's abstracted away, you don't
00:16:22.860 know which one you're getting. So you'll say "can I have a server?" and it'll say... "is it done yet, is it done yet, is it?", because
00:16:28.260 callbacks are not heard of, apparently, at Amazon. So anyway, eventually
00:16:33.360 you hit an endpoint, and as soon as it says "yes, it's done", you go "that's awesome, high five, Amazon", and you say
00:16:39.120 "awesome, can you please bind it to that IP address?" and it says "what server? Never heard of it." And, um,
00:16:45.360 sorry, they abstracted away the ability to sort of say to three different endpoints, "do you all agree,
00:16:51.360 as a team, that that server is there?" High five everyone, it doesn't exist. So yeah, and
00:16:57.839 you might be able to deal with that if you're doing stuff manually, but when you start automating the creation of stuff on Amazon,
00:17:04.260 your build starts breaking just because, you know, that happens. So the determinism, the ability to do
00:17:11.339 it, is great. Once you start doing it, the inconsistency starts to want to poke you in the eye. So here we are in
00:17:16.380 Portland, where they have a data center, and I say we'll go down and visit our local Amazon personnel.
00:17:21.540 Some of them are here? Come on up.
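
[Editor's sketch] The "is it done yet?" problem described above, where one endpoint reports a server as ready and the next has never heard of it, is commonly worked around by polling until the answer has been positive several times in a row before acting on it. The sketch below assumes a caller-supplied `check` callable that asks the provider whether the resource is visible; that callable, the function name, and the defaults are all hypothetical and are not part of any Amazon SDK.

```python
import time


def wait_until_visible(check, required_in_a_row=3, max_attempts=30,
                       delay=1.0, sleep=time.sleep):
    """Poll `check()` until it returns True `required_in_a_row` times running.

    A single True is not trusted, because a later call may be answered by
    an endpoint that has not yet heard of the resource. Any False resets
    the streak. Returns the number of attempts used, or raises
    TimeoutError once `max_attempts` polls have been made.
    """
    streak = 0
    for attempt in range(1, max_attempts + 1):
        if check():
            streak += 1
            if streak >= required_in_a_row:
                return attempt
        else:
            streak = 0
        if attempt < max_attempts:
            sleep(delay)
    raise TimeoutError(f"resource not consistently visible after {max_attempts} polls")


if __name__ == "__main__":
    # Simulated provider: says yes once, then "what server?", then settles.
    answers = iter([True, False, True, True, True])
    used = wait_until_visible(lambda: next(answers), sleep=lambda _: None)
    print(f"safe to bind the IP address after {used} polls")
```

Requiring consecutive agreement does not make the API deterministic; it only lowers the odds that the next automated step runs against an endpoint that disagrees.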
00:17:27.360 So I guess on that note, wearing sort of the business application developer
00:17:33.660 hats, what do we do, knowing that, you know, the cloud's great and we have 100% success
00:17:40.140 every time, not really? Automate all the things. Yeah, we automate all the things, but then what happens when things absolutely go wrong?
00:17:47.640 Because they never go 100% right. Well, I mean, I don't know, I guess the metaphor is these modern airline pilots. You know,
00:17:53.400 they've automated the entire cockpit except for when it goes bad. So, you
00:17:58.559 know, if you want to know where your profession goes as a sysadmin: all you get is phone calls. You don't go to work ever,
00:18:04.200 it just starts bitching at you, and every now and again you go "I don't know; if the automation doesn't know, I don't know, I'm going back to watching TV", and
00:18:11.820 you crash. No, so, you know, I mean, the aim is just more automation, and the more you automate, the
00:18:18.480 more you hit the edge cases eventually, and you've got to really nail those because they will turn up in production.
00:18:24.360 And for me, what I found is, without some kind of visibility into what's
00:18:30.840 going on, whether that's just straight-up logging, even if it's not accessible
00:18:36.660 anywhere, or some kind of metrics or graphing, without any visibility into that you're effectively flying
00:18:43.080 blind, and unless you're ready to, like, manually cope with that...
00:18:49.260 Be friends with the instruments; things will tell you eventually.
00:18:56.820 Well, yeah, I guess I'm kind of curious if anybody has anything to add around, like, what do you do to handle
00:19:03.660 that case, that, you know, this cloud thing, whatever that is, or the application
00:19:08.940 deployment even, it's never perfect 100%?
00:19:15.000 Well, for us it's never perfect 100%, so we definitely take strategies. So if we know we need to scale up two servers,
00:19:21.780 we'd scale up six, and then half an hour later, if we got all six up, then we'll
00:19:27.059 bring the extra ones down. Because we know that there's going to be times when you bring up servers, or
00:19:33.539 when you have bugs in your script that are adding it to monitoring and it doesn't get monitored correctly, when, you know,
00:19:39.120 something's going to go haywire. So you'd build some redundancy in there. So we've
00:19:45.299 definitely taken that strategy, which I think is applicable to all
00:19:51.960 the hosting providers. I mean, I think there's a certain amount of... what comes through is not
00:19:58.500 trusting the automation at a certain level. You know, again taking the airline pilot example, I mean,
00:20:04.140 checklists. You know, checklists are still a great thing. You have to automate everything, but you can't just hit
00:20:09.179 the button and walk away. You still need to know. And I think this is often a good use for some
00:20:16.200 sort of lightweight smoke test monitoring. So, you know, you want to... You don't want to smoke in a plane. Exactly,
00:20:21.440 no smoke tests. So, you know, how can you deploy and then five minutes later sample your
00:20:27.539 infrastructure: is it working? Then roll it in when you want. I mean, this was something we found very useful with Chef. When we wrote our new version of Chef, we
00:20:34.020 packaged our smoke tests, our full test suite, with the product. So we already use this internally a lot: we deploy into our
00:20:40.679 hosted environment, and somebody's going to run a fraction or all of the test suite before those nodes are really
00:20:45.720 pulled into production. So you're automating, but still you're doing that sort of human
00:20:50.880 control. A human has to be in the loop so that you can make a sensible decision.
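
[Editor's sketch] The smoke-test gate described above (deploy, run some or all of the test suite against the new nodes, and only then pull them into production) might look roughly like the following. It is an illustration under stated assumptions, not how Chef packages its tests: nodes are plain strings, each check is a hypothetical callable taking a node, and `pool` stands in for whatever puts a node into rotation.

```python
def smoke_test(node, checks):
    """Run each (name, check) pair against a node.

    A check passes only if it returns a truthy value; a check that
    raises is recorded as a failure rather than aborting the run.
    Returns the names of the failed checks, in order.
    """
    failures = []
    for name, check in checks:
        try:
            passed = bool(check(node))
        except Exception:
            passed = False
        if not passed:
            failures.append(name)
    return failures


def promote(nodes, checks, pool):
    """Add nodes that pass every check to `pool`.

    Nodes with failures are not promoted; they are returned as a
    {node: [failed check names]} report for a person to look at.
    """
    held_back = {}
    for node in nodes:
        failures = smoke_test(node, checks)
        if failures:
            held_back[node] = failures
        else:
            pool.append(node)
    return held_back


if __name__ == "__main__":
    checks = [
        ("responds", lambda node: node != "web2"),  # pretend web2 is down
        ("has_disk", lambda node: True),
    ]
    in_rotation = []
    report = promote(["web1", "web2", "web3"], checks, in_rotation)
    print("in rotation:", in_rotation)
    print("held back:", report)
```

Returning a report instead of retrying automatically keeps a human in the loop for the nodes that failed, which is the balance the speaker describes.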
00:20:55.980 But you also have a way to kind of help script or automate that, like, acceptance, right? You're running six
00:21:02.340 thousand tests, you're not doing it by hand. The natural evolution of everything we're doing is, like I said with
00:21:08.460 planes, it's all going to get too complex for human beings. I mean, think of every system you've ever
00:21:13.620 known, and what do we... you know, I trust airline pilots. So I just think it's our
00:21:21.000 responsibility to automate it all, and, to bring it back to the smoke tests, you automate the testing of it.
00:21:27.660 It's not... you didn't add a human. Look, "here's manual Chef", that's not a
00:21:32.760 thing you sell. You don't sell manual Chef, I don't think. So, you know, I think it's a constant
00:21:38.400 responsibility, and so the more tools you can use in your tool chain that are a part of everyone else's community, the more
00:21:43.860 you can trust it and live with it. The benefit of, you know, some of the things we're talking about, like Chef, etc.,
00:21:49.860 you know, is... it's just your last bit, but, I mean, you've
00:21:55.080 got to assume that it's going to get out of control. We're going to be building bigger and more complex systems, and your
00:22:01.020 ability to manage it in your head... I mean, this is one of... I think at the last ChefConf Adam Jacob was on stage talking about
00:22:06.480 the idea of, you know, the dangerous part of the system. It's like, when you've got five machines you can kind of give
00:22:11.880 them names, all right, you know, and you're a nerd, so it's probably going to be a Star Trek theme,
00:22:17.220 Smurfs, something like that. And if you've got 5,000 nodes then, you know,
00:22:22.919 you've got to automate that, right? But he said the dangerous part's when you've got 500 and you kind of think you
00:22:28.200 can do it. I want to tell you this: whatever job you've got is going to get out of
00:22:33.419 control. You're going to have lots of machines, lots of apps, lots of services, lots of things to manage, and you can't.
00:22:40.080 So, you know, the responsibility of not just deployment but running your company, running your team, is to automate all the
00:22:46.380 things. I think there was a quote from this ChefConf where somebody asked "how many servers do I need before I start
00:22:51.659 automating?" Yes. For values of n, yeah.
00:22:58.260 I mean, start now, right? You can have bad automation, and hopefully as the number of things grows, not just nodes or
00:23:03.360 servers but apps and whatever grows, repos, I mean, you know, yeah, start
00:23:08.520 practicing your automation. Well, I mean, if we're talking about the small scale, the number one thing to do before you start automating is
00:23:14.640 document the hell out of everything. You need to know what it is that you're doing, and once you've done that is the
00:23:19.980 great time to actually be able to understand what it is that you're automating. So, you know, it is very
00:23:25.980 much a process of understanding what is involved and then removing the human
00:23:31.380 from that for the things that you are doing more than once. You know, for all of
00:23:36.900 the automation that I see, adding the first IP address to a system is often
00:23:42.600 not something that is managed by Chef or Puppet or anything else, because, well, you couldn't be running it on it unless you
00:23:48.600 had that part solved. So, you know, there are levels in the automation, and
00:23:54.299 as you get more things solved, the amount that has to be
00:24:01.020 done by the human is shrunk. But, you know, remember that documentation is
00:24:06.780 the first stage of all that. Or a very sweet install script.
00:24:12.500 But isn't that, like, isn't that a form of documentation? Absolutely. No, it's documentation, but you can test a
00:24:18.299 bootstrap script. I mean, that's how I started lots of procedures and stuff in previous jobs. I
00:24:25.140 didn't realize what I was doing was documenting it, and you break out into these, like, little "then you do this
00:24:30.360 and that", and then you're like "I'm repeating this, I'll make that a variable and I'll throw it to the top". My wiki, yeah. I mean, I'll admit a mistake: I
00:24:36.720 read a bunch of documentation around Cloud Foundry BOSH last year, and I came about to rewriting it, and I
00:24:42.840 thought "nah, nah, I'll script it, this would be awesome". But because the project in my head was "convert documentation to
00:24:49.799 scripting", it was only after the thing got completely out of control that I thought "yeah, yeah, tests, that would have been awesome".
00:24:59.700 So, one for Matt. So much cleverer than I am. Speaking of tests, one of the techniques
00:25:06.240 we use is to actually have integration test suites or acceptance test suites that run against the production system
00:25:12.120 as well as in test. So if you write your Selenium tests correctly, they can drive
00:25:17.400 load to your production system, your test, your staging system, and you can actually
00:25:22.440 verify the behavior is still there. So you can use your tests as monitoring. It
00:25:28.380 requires some discipline. We have things that actually send email and check it through Gmail to make sure they get
00:25:34.200 through, but it lets you know that things are actually working, and you sort of, you don't really care what the
00:25:40.200 infrastructure is doing. Maybe you only got two of your six servers, but if you're getting the behavior that you want, it doesn't really matter. Yeah, and
00:25:47.580 if that's a thing you want, there's a company called StillAlive that do that sort of, you know, next-level Pingdom-type
00:25:53.820 service of actually not just checking that the endpoint's there but actually going through real-life examples, just to
00:25:59.220 determine that your app is not just up but functioning. So the general concept of having a
00:26:06.419 canary, or multiple of them, that are actually going through and confirming that the important aspects of your app
00:26:13.260 are actually up and doing what they should be doing, I think, is something that ties very much into these sorts of
00:26:20.700 automated deployment setups when you have a complicated system, to do more than just verify that, you know, some
00:26:29.159 button worked or something else, or, you know, assuming that you're doing something that isn't just web related, that the other parts of the
00:26:35.279 infrastructure involved are actually doing what they need to. Yeah, this is the stuff: if you're running, like, a retail store, you might send in a mystery
00:26:41.880 shopper, right, and they'd make a purchase and they'd check that they got their stuff and then feel good about it. Well, to send a mystery shopper in
00:26:49.200 once is to do it once. Put it on TaskRabbit. It's like, "can you
00:26:54.960 please go to my shop and buy something and see what happens?" You know, so it's not just our...
00:27:00.900 you know, the ideas we're talking about here, I think, are applicable to most systematic enterprises. I once
00:27:06.720 worked on an API for a large national retailer, and we had our Pingdom check actually check that there were AA
00:27:13.200 batteries in stock near Richfield, Minnesota. And at one point there were no
00:27:19.440 batteries in Richfield, Minnesota, and it was a problem. And what had happened was someone had, for the
00:27:26.580 latitude... It wasn't that you needed it for a failover situation? If our system goes down, can we get to the shop to buy
00:27:33.179 AA batteries? That wasn't it. And where they had latitude and longitude, they'd actually put the area code for the phone
00:27:39.480 number, so it was, I don't know, 512-512 or something, which meant that they were
00:27:44.520 nowhere near Richfield, Minnesota anymore. And
00:27:50.279 so you want to test actual real-world things and see that your app is actually functioning as you think it is,
00:27:57.059 and at that point it doesn't matter how it's getting there as long as it's doing the right thing.
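
[Editor's sketch] The battery anecdote above suggests a canary that asserts on real-world behavior rather than on "the endpoint answered". A minimal version, assuming a hypothetical search result with `in_stock`, `lat`, and `lon` fields and using approximate coordinates for Richfield, Minnesota, checks both that the item is in stock and that the store's location is plausible. The field names, the 50 km radius, and the coordinates are illustrative assumptions, not details from the retailer's API.

```python
import math

# Approximate coordinates for Richfield, Minnesota (editor's assumption).
RICHFIELD_MN = (44.88, -93.28)


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def canary_problems(result, expected=RICHFIELD_MN, max_km=50.0):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    if not result.get("in_stock"):
        problems.append("item not in stock")
    lat, lon = result.get("lat"), result.get("lon")
    if lat is None or lon is None:
        problems.append("store has no coordinates")
    elif not (-90 <= lat <= 90 and -180 <= lon <= 180):
        problems.append("coordinates out of range")
    elif haversine_km((lat, lon), expected) > max_km:
        problems.append("store is too far from the expected location")
    return problems


if __name__ == "__main__":
    print(canary_problems({"in_stock": True, "lat": 44.86, "lon": -93.30}))
    # A phone area code stored where the coordinates should be:
    print(canary_problems({"in_stock": True, "lat": 512.0, "lon": 512.0}))
```

A check like this fails for the same reason a customer would notice, regardless of which layer of the infrastructure caused it.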
00:28:03.720 so and is the strategy that you want to be doing this all the time on each deploy
00:28:09.659 once in a while only when something's non-trivial every five minutes
00:28:15.539 which is how expensive the tests are but I mean um so I mean yeah I mean once you have
00:28:20.640 it automated I mean you should be running something constantly just to make sure that let's let's get back to what we're trying to do here we're
00:28:25.919 trying to keep our production systems running right so that we can sell stuff or provide the service that we promised
00:28:31.320 I mean everything else is you know you make choices based on on that goal I
00:28:36.360 guess I mean that's is that the goal anyone else want a different goal so how often do you run your tests well I don't
00:28:41.400 know if it stops people being able to buy stuff because you're just hammering the crap out of it sorry sorry you humans can't you know shop at the moment we've
00:28:48.120 got bots busy you know destroying it um you know I mean you know if you're not making enough money to be able to
00:28:53.820 afford it if you just got started yesterday I don't know about booting up a thousand VMs just to check your Chef's good
00:29:01.080 um so you know I guess sorry I'm just saying there's real choices to make but to always be more automated
00:29:08.700 so this becomes like your continuous integration for production like your I don't know your business CI of sorts
00:29:15.299 yeah I mean you want to just keep getting a little bit better and I mean I used to talk about BOSH a lot I just
00:29:20.399 thought BOSH was the be-all of everything but nothing ever is and I came to realize that BOSH is on the continuum of your business getting
00:29:27.480 harder and tougher and you know harder like me too buff I mean like solid yeah
00:29:34.080 like me so um you know when you get started you know you can do things manually but you just want to write more
00:29:40.260 features and as your business gets more solid you continue to write more automation to do the things so
00:29:45.899 don't stop work for six months while you automate just because we said so I mean it'd be awesome in terms of my ego
00:29:50.940 but nonetheless it's invested engineering time that's part of it is there is no
00:29:57.240 other group that's going to do it there's no magical automation team that they're going to hire right it's your responsibility as
00:30:03.600 engineers to do this stuff this is devops so you've got your name in it and everything
00:30:09.299 if they do hire that separate team you should run for the exits yeah I think that's a really important
00:30:14.880 point I mean the best people to know how to automate your system is you you know there is no one button solution you
00:30:21.899 can't just load up a bunch of Chef cookbooks or Puppet manifests or whatever put them on and expect it to run your
00:30:28.020 business because your business will have something unique about it and you guys are the people you know who
00:30:35.039 know what that is because you've built it you're running it already and so beware of people saying this solution
00:30:41.159 will solve all your business needs this will automate your system because it's just not going to happen you know
00:30:47.159 pick things that are toolkits pick things that are technologies that you feel happy extending building on top of
00:30:52.799 building um things alongside of and putting it all together for a solution for you
00:31:00.600 so I guess um it's you know a little bit on that on that note I'm just wondering
00:31:05.820 how um knowing that these things kind of these things change over time like the even
00:31:12.539 the low level like architecture like how how do we help manage that that change
00:31:19.500 over time when we're swapping out Services changing providers like
00:31:25.140 changing technology Stacks upgrading databases um
00:31:30.960 how do we how do we help make that somewhat better assuming like everything's automated
00:31:36.899 um you know in a nice perfect world but uh those are things that can also go very
00:31:43.380 bad very fast in a very automated situation like I don't know are there certain
00:31:49.260 situations where you kind of want to like dial that back or where you put sort of the human back in the I mean
00:31:55.440 decision drivers you know I think well I think the key is always things you're
00:32:00.480 not going to do you know are you going to let your automated system do a schema upgrade on your live production database
00:32:06.539 with nobody nearby at some random time probably not because you do it so rarely
00:32:11.820 but bringing in new servers into an auto scaling system did you test the schema migration on a
00:32:19.140 staging environment with the same data and with the same you know layout of bits on the disk and the same indexes I
00:32:25.380 mean like like I think there are so many variables around there that it can go bad so quickly well in the same
00:32:30.480 production load the same production fundamental impact on that you know I you know I'm a I I think that can be
00:32:37.380 done I think one of the engineering uh groups or at least their blog that I
00:32:42.419 read is Etsy um who's been a big proponent of continuous deployment and I love their mantra is when we hire a new
00:32:48.840 engineer the first day on the job they're going to commit to production um and every single commit they do goes
00:32:54.360 through CI and every single commit pushes out to production so and they have a staging environment and
00:33:00.720 um you know I think it's a new move I've heard Facebook's VP say as well that um every feature that they're going to put
00:33:06.840 on the site has been on the site for six months so and they have a very robust system that manages which features get
00:33:13.380 rolled out to which users and they do percentages and test things that way and you know really monitor their production
00:33:19.260 stack um and if this idea isn't new I mean it was just the other day and it's a chef boy here
00:33:25.080 um Chef mint was always acceptable boy Chef
00:33:31.620 boy and uh sorry uh
00:33:38.360 I think it does a pretty good high level uh summary sorry that's a discussion of
00:33:44.340 why devops and continuous deployment are how all the winners will be doing it when you look back 20 years from now and
00:33:51.240 look at you know what is everyone doing well the people who are still in business the ones that succeed in their industries are the ones that were doing
00:33:56.279 continuous deployment because it's the only way to give what humans instinctively want now
00:34:02.100 and uh I won't you know there's a lot of answers to that question in terms of in
00:34:07.860 terms of being serviced by providers so um I thought it was an excellent uh summary you know if there's if you're
00:34:15.060 not a Chef person please go and watch Adam's sort of introduction to the future of all of our industries he
00:34:21.419 doesn't mention chef keynote so if they're already making sure right no no I mean he's yeah so
00:34:28.919 something fundamental it's a great it's a great keynote for our industry to this industry of automation of not just
00:34:34.139 automation of the low-level stuff of whatever level you're thinking about but being able to con as an organization
00:34:39.899 having the goal of being able to continuously get you know new things to the customer and better and uh
00:34:47.280 all the things you want so something fundamental that Etsy does that I believe they got from Flickr is the fact
00:34:53.879 that there's no such thing as a big change and so that's one of the reasons why the automation that they're doing
00:34:59.339 and the continuous deployment works is that they do not have such a thing as a feature branch they don't have any of
00:35:06.000 that sort of stuff everything is all conditional in the code which allows them to make it much smaller that you always
00:35:13.500 know that there is no massive roll everything back moment where you have to change what code is running and
00:35:18.660 essentially because I think at some point someone's going to write an actual programming language that has that concept as a first class feature in that
00:35:25.020 feature flipper kind of sorry my point sorry is the idea of when do you get to remove that code right that idea
00:35:31.560 of what you know I wrote it what's this if statement under what circumstances do we ever for God's sake
00:35:36.780 get to remove this it's also the idea of we look at code as
00:35:42.420 such a static thing how many of us love to pick the theme for our editor and we pick an editor and we feel good about
00:35:48.359 and talk about Ruby we're all in love with Ruby God bless the various limited number of
00:35:53.520 people that have come here to talk about running production systems which is the point of writing a rails app right
00:36:00.839 um and so you know perhaps Ruby's not the best language for running production systems that has that particular feature
00:36:06.900 of permanent you know of all the features are in the code PHP is far better
00:36:18.119 code lighting production features and that come back in your idea I mean honestly I think that you know the the
00:36:24.359 big answer how do you make a big change is that you don't you try to turn every change into as small of a change as you
00:36:30.420 can doing any sort of forklift upgrade be it you know moving from one data store to another changing technology
00:36:36.900 stacks that if you can do that in a way that you are running the old stack and
00:36:42.000 the new stack side by side and making that change just be changing some configuration setting and even making
00:36:49.260 that more granular that you can just start to move traffic over and not do everything that those are the sorts of
00:36:55.079 things that help to make that not catastrophically bad yep
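The idea of running the old and new stack side by side and moving traffic over gradually can be sketched as a percentage rollout. This is a minimal illustration, not anything the panel described in detail: the function names `bucket` and `choose_stack` and the `new_stack_percent` setting are hypothetical, and hashing the user id is just one assumed way to keep each user on a stable side of the split.

```python
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user id to a stable bucket in the range [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def choose_stack(user_id: str, new_stack_percent: int) -> str:
    """Send roughly new_stack_percent percent of users to the new stack.

    new_stack_percent is the single configuration value to change:
    0 keeps everyone on the old stack, 100 moves everyone to the new one.
    """
    return "new" if bucket(user_id) < new_stack_percent else "old"


if __name__ == "__main__":
    for percent in (0, 10, 50, 100):
        users = [f"user-{i}" for i in range(1000)]
        on_new = sum(choose_stack(u, percent) == "new" for u in users)
        print(f"{percent}% setting: {on_new} of {len(users)} users on the new stack")
```

Because the bucket depends only on the user id, raising the percentage only ever adds users to the new stack, and setting it back to 0 is the rollback.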
00:37:00.900 it's hard because it's not necessarily in the code you're not looking at this problem in the code it's like well should I make a schema change
00:37:07.619 you have no idea what that's going to do in production at the time you do it so so I guess at this point there's like
00:37:14.339 there's a few minutes left um I'm curious if there's any yeah questions and I apologize actually for
00:37:20.160 leaving this so late
00:37:30.119 how long does that scale take like and then you know like how do you
00:37:35.220 determine like if we need the resources right away how fast can we get those resources you know before like what sort
00:37:41.160 of metrics are you looking at so the question was I guess
00:37:47.880 about auto scaling and um sort of the I guess success criteria or the metrics which you're sort of looking at for that
00:37:55.800 um yeah so at King of the Web we use New Relic's API to monitor our app and the primary metrics
00:38:02.040 we looked at were um throughput on an absolute level and then it also gives you a an application capacity which is
00:38:08.280 the number of your Ruby threads that are busy at a given moment and we actually just use a combination of those
00:38:15.180 um plus we ended up even monitoring what our most popular people were doing
00:38:22.020 so if they posted a video on YouTube that contained keywords about King of the Web we'd scale as well in
00:38:27.480 anticipation um that's pretty awesome bring them on
00:38:33.660 um so those are the primary things we looked at um and then our scale up time we used an Imaging solution so we
00:38:40.980 actually Chef was used to add the new servers into services like monitoring and things like that but for just
00:38:47.940 bringing up a new server we just would use our image of our existing servers which would take anywhere from five to
00:38:53.520 ten minutes so you call the API you bring it up and then once it comes up you add it into your load balancers
00:38:59.220 so there's a small bootstrap script to make sure all the all the services are running and to start your rails app and
00:39:04.260 everything so one of the distinctions I like to draw is that you can auto scale down if you want to save money but you can't auto
00:39:11.160 scale above load you've previously seen because you're going to hit things that you haven't seen before and it's not
00:39:16.920 just about how many servers are running in your application tier there are going to be other weak points in your infrastructure so if it's
00:39:24.119 if it's about I have a bursting load and I know I have a bursting load yes sure go ahead and spend your time figuring
00:39:30.540 out how to save money but it's always about how to save money and many people don't actually have that need their
00:39:37.140 hosting costs are such a fraction of their engineering costs that it's not worth the time so really look at your
00:39:42.839 actual load pattern and see what you might be looking at saving before you spend any time on it absolutely our auto scaling solution
00:39:50.040 involves humans and it's for exactly that reason that it's just it's not a big enough cost
00:39:56.099 savings to make it worth being fully automated it is automated such that a human can
00:40:01.380 decide it's time that we need this more and add it
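The scaling approach discussed above (throughput plus application capacity as inputs, never scaling past a size that has actually been seen under load, and a human making the final call) could look roughly like the sketch below. All names and thresholds here (`Metrics`, `recommend_servers`, `max_rpm_per_server`, `capacity_high`, `max_tested_servers`) are assumptions for illustration; they are not New Relic API fields or values given by the panel.

```python
import math
from dataclasses import dataclass


@dataclass
class Metrics:
    throughput_rpm: float  # requests per minute across the app tier
    capacity_used: float   # fraction of app workers busy, 0.0 to 1.0


def recommend_servers(current: int, m: Metrics,
                      max_rpm_per_server: float = 1000.0,
                      capacity_high: float = 0.8,
                      max_tested_servers: int = 10) -> int:
    """Suggest an app-tier size from throughput and capacity.

    The result is capped at max_tested_servers, an assumed record of the
    largest size the rest of the infrastructure has been seen to survive.
    """
    needed = max(1, math.ceil(m.throughput_rpm / max_rpm_per_server))
    if m.capacity_used >= capacity_high:
        # Workers are saturated, so ask for at least one more server.
        needed = max(needed, current + 1)
    return min(needed, max_tested_servers)


def apply_decision(current: int, recommended: int, approved: bool) -> int:
    """Keep a human in the loop: only change size when someone approves."""
    return recommended if approved else current


if __name__ == "__main__":
    busy = Metrics(throughput_rpm=2500, capacity_used=0.9)
    suggestion = recommend_servers(current=4, m=busy)
    print("suggested size:", suggestion)
    print("size after human says no:", apply_decision(4, suggestion, approved=False))
```

Keeping the recommendation separate from the decision means the calculation is automated while the act of adding capacity stays a deliberate human step.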
00:40:07.880 yeah so you talk about sending canaries through the system and making sure they get all the way through the stack
00:40:13.260 um is that worthy of writing a separate app to do that sort of sending and monitoring or is that code that's
00:40:19.200 running in production as well as part of like a cron task such that it all gets bundled and deployed at the same time
00:40:24.599 the easiest thing if you're running a rails app which presumably you are since you're here uh is you can create a
00:40:31.619 separate controller that has one view that pings all your various subsystems
00:40:36.660 so if you've got Solr or Elasticsearch you can have it do a search it can hit your database it can
00:40:42.060 hit Redis and it just returns 200 if it can do all these things it returns 200 you can then sign up for it I don't know
00:40:48.359 what Pingdom costs per month but it's essentially free and have Pingdom hit that every
00:40:53.880 minute and Pingdom will then email you or send you some sort of alert if any
00:40:59.339 one of those things isn't you know actually able to be contacted and that's the simplest thing you can do to sort of
00:41:04.680 test that all the sort of aspects of your system are working doesn't require you running anything it just exercises
00:41:10.980 all the parts that you know your rails app actually exercises it's not nearly as good as having a like full full-on
00:41:18.660 monitoring Suite that you set up that you're looking at behind the scenes of what your servers are doing but it'll get you most of the way there pretty
00:41:25.380 easily you can sit down and do it in an hour or two and test a lot of
00:41:30.960 your application and sort of know that most of the systems are working we end up deploying it as part of the
00:41:36.480 Rails app and then it also will go and connect to a third-party service and hit a URL that if it doesn't get received in
00:41:43.560 an hour will email us and set off all sorts of alerts that something's obviously wrong
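The two checks described above, one endpoint that exercises every subsystem and returns 200 only if all of them respond, and an hourly heartbeat whose absence triggers an alert, can be sketched as follows. This is framework-free Python rather than a Rails controller; the function names are hypothetical, the 503 failure status is an assumption (the panel only mentions the 200 case), and the check callables stand in for real calls to a database, search service, or Redis.

```python
from typing import Callable, Dict, Tuple


def run_health_checks(
    checks: Dict[str, Callable[[], object]]
) -> Tuple[int, Dict[str, str]]:
    """Run every subsystem check; 200 if all succeed, else 503 (assumed)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()  # any exception counts as that subsystem being down
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results


def heartbeat_overdue(last_seen_s: float, now_s: float,
                      max_silence_s: float = 3600.0) -> bool:
    """True when no heartbeat has arrived within the allowed silence."""
    return (now_s - last_seen_s) > max_silence_s


if __name__ == "__main__":
    def database() -> bool:
        return True  # placeholder for a trivial query

    def search() -> bool:
        raise ConnectionError("search service unreachable")  # simulated outage

    print(run_health_checks({"database": database, "search": search}))
    print("alert:", heartbeat_overdue(last_seen_s=0.0, now_s=4000.0))
```

An external monitor only needs to look at the status code, while the per-subsystem results tell whoever gets paged which dependency failed.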
00:41:50.940 well maybe one more um
00:42:07.980 so this was I guess uh like deployment and maintaining performance integrity yeah it's a good one I mean
00:42:15.480 I mean whatever you're using for monitoring I mean so you know for someone like us we would
00:42:21.720 use Graphite we'll be putting all of our things in there and then using some type of uh change detection Holt-Winters
00:42:27.780 prediction type things to say you know let's look at the metric over time and I've now got an event let's look at what
00:42:33.240 happened afterwards and yeah and have something that's going to flag on that well a lot of metrics systems now let you
00:42:38.700 tag things that happened yeah like I deployed I scaled or whatever so it'd
00:42:44.460 be interesting to put that in there and at least figure out what might have caused the thing which then gets you to
00:42:49.980 who did it and then you can you know use the old fisticuffs to yeah I mean I still think the best metrics correlation
00:42:56.579 system is still eyeballs so having something where you can plop events at
00:43:03.119 discrete time points like your builds your deploys whatever and then look at what happened yeah for all my my yeah
00:43:10.500 airline pilot metaphors obviously the distinction is flying a plane from San
00:43:16.319 Francisco to New York um looks the same
00:43:21.540 within a pretty good standard deviation most times whereas each of us run very different
00:43:28.079 systems that the systems themselves change over time so yeah there's obviously you want to automate as much
00:43:34.619 as possible but um yeah that's a little bit of a dream to think that you can get rid of the humans and you should definitely be
00:43:40.920 alerting on the worst case scenarios that like you're starting to see queue depths in your HTTP request handlers and
00:43:49.020 things like that that you know things have gone bad if you've gone past the threshold of what your system can handle
00:43:55.140 if everything is terrible if you have graphs of your key
00:44:00.720 metrics and you look at them after some sort of catastrophic event if you've been looking at them every day for the
00:44:06.960 past month you will know that something's wrong fun
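Marking discrete events such as deploys on a metric and then looking at what happened afterwards can be approximated with a simple before-and-after comparison. This is a deliberately crude sketch, not Holt-Winters and not any Graphite API: `compare_around_event` is a hypothetical helper, and the samples are assumed to be `(timestamp, value)` pairs such as request times.

```python
from statistics import mean
from typing import List, Optional, Tuple


def compare_around_event(
    samples: List[Tuple[float, float]],
    event_time: float,
    window: float,
) -> Optional[Tuple[float, float, float]]:
    """Compare the metric's mean in the window before and after an event.

    Returns (mean_before, mean_after, change), or None when either side
    of the event has no samples to compare.
    """
    before = [v for t, v in samples if event_time - window <= t < event_time]
    after = [v for t, v in samples if event_time <= t < event_time + window]
    if not before or not after:
        return None
    mean_before, mean_after = mean(before), mean(after)
    return mean_before, mean_after, mean_after - mean_before


if __name__ == "__main__":
    # Request time in milliseconds, one sample per minute; deploy at minute 3.
    request_ms = [(0, 100.0), (1, 102.0), (2, 98.0),
                  (3, 150.0), (4, 148.0), (5, 152.0)]
    print(compare_around_event(request_ms, event_time=3, window=3))
```

A number like this is only a prompt to go and look at the graph; as noted above, people who watch the graphs daily are still the best judge of whether the change matters.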
00:44:12.720 about this this
00:44:18.839 okay if you're always deploying is there a large heavy heart it's not
00:44:26.520 just like an old graph sure like like we switched from Rails
00:44:32.579 2.3 to Rails 3 and look it got slower that's what it does it could possibly be
00:44:38.280 but yeah I definitely think that's where you definitely want to know if you do have some kind of a graphing or some
00:44:44.460 solution where you can correlate the like the deployment event of that change set with sort of you know the
00:44:51.000 performance um you know like your request time or something so you can sort of see like you know as of like Monday it was all
00:44:58.380 right but like yesterday something really like went weird um and hopefully be able to try and
00:45:03.540 track that back um but this is where I guess having like the ability to spin up a production-like
00:45:10.680 environment that can have similar performance characteristics really helps to test those
00:45:16.380 um regressions I guess we also will write performance specs if we have a
00:45:21.900 performance degradation that we notice enough to go in and fix and so you can actually oftentimes write a spec that
00:45:27.900 says this thing happens in this amount of time and we'll often put a multiplier in there that we can set on a given
00:45:34.260 machine so if you're running on a a Mac Mini that's your CI box you set your
00:45:39.480 multiplier up to four if you're running on your iMac it's a modern new thing you set it to one but most of the time a
00:45:46.740 given thing should finish in that amount of time and this will work if you have sort of errors in your code where you
00:45:52.920 really do have N squared problems where people that don't know what they're looking at can sort of say oh I can fix
00:45:59.880 this and I've done this before myself I could do this way cleaner like I'll just use Ruby's Set class and it'll just do
00:46:06.599 it you know set difference and it'll be beautiful and it was beautiful but it was extremely slow it was
00:46:14.400 okay for five example items but unfortunately the data set had about a million items in it
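A performance spec with a per-machine multiplier, as described above, can be sketched like this. It is written in Python rather than as an RSpec example, and the names are assumptions for illustration: `assert_finishes_within` is a hypothetical helper, `PERF_MULTIPLIER` is a made-up environment variable for the machine setting (4 on a slow CI box, 1 on a fast workstation), and `difference` is just a stand-in for the code under test.

```python
import os
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def assert_finishes_within(fn: Callable[[], T], budget_seconds: float,
                           multiplier: Optional[float] = None) -> T:
    """Run fn and fail if it takes longer than budget_seconds * multiplier."""
    if multiplier is None:
        # Hypothetical per-machine setting; defaults to 1 when unset.
        multiplier = float(os.environ.get("PERF_MULTIPLIER", "1"))
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    limit = budget_seconds * multiplier
    assert elapsed <= limit, f"took {elapsed:.3f}s, limit was {limit:.3f}s"
    return result


def difference(items: List[int], excluded: List[int]) -> List[int]:
    """Stand-in for the code under test: items not present in excluded."""
    excluded_set = set(excluded)
    return [x for x in items if x not in excluded_set]


if __name__ == "__main__":
    # Use a realistically large input so the spec exercises the scale that
    # hurts in production rather than five example items.
    items = list(range(200_000))
    excluded = list(range(0, 200_000, 2))
    kept = assert_finishes_within(lambda: difference(items, excluded), 5.0)
    print(len(kept), "items kept within the time budget")
```

The budget is intentionally generous; the aim is to catch an order-of-magnitude regression on realistic data sizes, not to benchmark.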
00:46:20.420 so and it worked fine in test but if you then write a performance spec with
00:46:25.500 enough elements in it it was very easy to pull that back out and say okay that's not going to happen again but it
00:46:31.800 is just something you have to watch and deal with as a developer I'd like to see more of the things like Heroku Cloud
00:46:37.440 Foundry and other deployment you know the Chef recipes so however people are doing deployment of things apps or
00:46:42.599 whatever to be allowed to do more A/B testing that's not the right word and also testing for the ability to introduce
00:46:47.819 things gradually a bit like the feature flag idea but can you introduce some code so let's just let some traffic flow
00:46:56.040 through it and then incrementally more and I think you play with this idea in and around Cloud Foundry I'd like to see
00:47:01.680 that as a feature that all these systems have so it wasn't like a well yeah we should do that but then we'd have to
00:47:07.319 give up Heroku and build it and uh no one wants to do that so we'd like to see
00:47:12.780 those sorts of things so we could behave in this fashion so we wouldn't have over you know terrible terrible things
00:47:18.420 happening you can see that's the All or Nothing deploy yeah which is what we ordered so I think it looks like we're totally
00:47:25.440 over on our time so I apologize for that um we may as well wrap it up you've got tremendous value and you're lucky that
00:47:31.079 we decided to spend a little extra time talking to you so yeah if anybody's interested in
00:47:36.420 talking more you can find anybody here if you see them blue box has a booth
00:47:42.180 down in the Pavilion where you eat I'm not sure what that's called tomorrow and I'm sure
00:47:48.420 there'll be lots of people around there who are interested in this as much as I am so thank you guys very much
00:48:18.240 thank you