00:00:21.039
hey hi um so i'm going to talk today about tackling a large ruby refactoring with confidence and science
00:00:29.279
so i am jesse toth uh i'm jesseplusplus on the internet um github twitter
00:00:34.960
everywhere else um i am the head of web technology for github which means i work on the web
00:00:40.000
team behind github.com i work a lot on our ruby on rails application
00:00:45.440
especially focusing on database and performance things and this talk especially we'll talk about one of the projects i did
00:00:52.320
around those sort of things so tackling a large ruby refactoring what
00:00:57.360
do i mean by large ruby refactorings here well first let's break it down large
00:01:02.879
either it could be the size of the system that you're trying to change or especially this is the case that i'm
00:01:10.159
talking about how risky it is to make this particular change um but then what about refactoring
00:01:16.400
traditionally refactoring shouldn't be risky right it's supposed to be just changing the structure of the code without changing
00:01:22.720
your behavior usually you can do it in small pieces and if you have good test coverage
00:01:28.320
you should be able to catch any accidental behavior changes so it shouldn't be risky but really i'm talking about in this
00:01:34.960
case sometimes really risky changes or maybe going a little bit further than your typical
00:01:40.640
refactoring not just changing the structure of the code but maybe changing some of the underpinnings of the code
00:01:46.560
but still without changing the behavior of the code such as maybe switching out a service or
00:01:51.680
a gem you want it to behave the same way but you want the code that you're using underneath to be different or maybe
00:01:56.799
you're optimizing a query moving it from talking to your model to like raw sql because you need it to be really fast
00:02:03.680
something like that so any code interacting with this code at its boundaries shouldn't notice a difference but things are going to
00:02:09.679
change underneath and hopefully be better for whatever your metric you're trying to improve which is usually
00:02:15.360
performance so maybe a better word for this is like reworking or replacing um
00:02:21.920
in some cases this might work but i didn't i didn't want to use that because i don't think it emphasizes enough the fact that your behavior really shouldn't
00:02:28.319
change when you're doing things like this rewriting is really the best word for this but if you've ever done a rewrite
00:02:35.360
you know that that word has lots of baggage most people have some some ptsd around
00:02:40.400
the fact that rewrites usually don't work they're very risky if they are of any size
00:02:47.599
larger than just a little rewrite they tend to drag on and on and more often than not
00:02:52.640
they fail because you you failed to duplicate the behavior of the original system and then you're left
00:02:57.680
with either something that's not matching the old system or you just have to throw it away and go and try
00:03:02.959
again so that's very risky but anyway i'm going to talk about some
00:03:08.400
of the tools and techniques that we use at github to tackle these sort of problems
00:03:13.519
and i'm going to do that talking through a case study which was a rewrite that we did of the permission system behind
00:03:19.120
github.com so what we had behind github.com to control
00:03:25.440
all the permissions was a huge legacy system and the thing that we wanted to do with
00:03:30.640
this rewrite was to create a more flexible system to grant and revoke access to any of the things you could
00:03:36.560
have access to on github so repositories forks issues pull requests teams organizations basically anything
00:03:44.560
as you can imagine from that description this would be a pretty far-reaching change and so
00:03:50.000
if you make this change it's going to affect every single page load on github.com every single api request
00:03:56.080
it's a huge surface area and it's extremely risky so if you're going to make a change like this you have to have absolute
00:04:02.319
confidence in the correctness of the new system when it's time to switch it out
00:04:07.519
so first of all the first question to ask is why would we want to do such a thing especially with such risk so to give
00:04:14.480
some background on that it requires a little bit of a history lesson on github itself so
00:04:20.160
in the beginning there was collaboration so one of the very first features of github.com
00:04:26.080
was that you could add other users to your repository as collaborators and allow them to
00:04:31.840
have access to your code and to make pull requests and make changes to it
00:04:36.880
and then you know that worked very well for quite a while but eventually got to a point where we needed a little more
00:04:42.720
than that so then there were organizations as github grew we needed a way to give repository access to groups of
00:04:50.240
people and to kind of control those for different groups different teams in different ways
00:04:56.320
the problem was that these two different ways of granting access
00:05:02.000
were actually written in separate ways they were developed at different times by different people
00:05:08.560
and as most legacy code goes it wasn't you know obvious at the time that these
00:05:13.840
were kind of the same concepts so they were developed in very different ways and represented in different ways behind
00:05:19.360
the scenes so this is what permissions look like to begin with this was
00:05:24.800
collaboration on repositories so it's a super simple join table basically
00:05:30.320
you say here's a user here's a repository they have access that's about it
00:05:36.400
and then this is team members so the way that you give access to repositories and organizations is through teams
00:05:42.800
and um this one's a little weird uh and every time i look at it i kind of shudder now
00:05:48.560
because this is a strange three-way join table between well you have the team first of all but then you have
00:05:54.560
users and repositories and so to be a member of the team if you're a user you have a row in here
00:06:01.600
but also if you want to give users on the team access to a repository the repository has a row in the same table
00:06:09.199
where the repository is like a member of the team and this is pretty pretty terrible but
00:06:14.560
it's it's what we had and hey it worked for a really really long time
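To make the two legacy shapes concrete, here is a minimal in-memory sketch in Ruby. It is only an illustration under stated assumptions: the struct names, the `member_type` column, and `repository_ids_for` are hypothetical and not taken from the real schema. It shows why listing what a user can reach means consulting both tables, with repositories stored as "members" of a team alongside users.

```ruby
# Hypothetical stand-ins for the two legacy tables described in the talk.
# Column names are assumptions made for this sketch.
Permission = Struct.new(:user_id, :repository_id)            # simple collaborator join table
TeamMember = Struct.new(:team_id, :member_type, :member_id)  # users AND repositories are rows here

PERMISSIONS = [
  Permission.new(1, 100) # user 1 is a direct collaborator on repository 100
].freeze

TEAM_MEMBERS = [
  TeamMember.new(7, "User", 2),        # user 2 is a member of team 7
  TeamMember.new(7, "Repository", 200) # repository 200 is also a "member" of team 7
].freeze

# Every repository a user can reach, combining direct and team-based access.
def repository_ids_for(user_id)
  direct = PERMISSIONS.select { |p| p.user_id == user_id }.map(&:repository_id)

  team_ids = TEAM_MEMBERS
    .select { |m| m.member_type == "User" && m.member_id == user_id }
    .map(&:team_id)

  via_teams = TEAM_MEMBERS
    .select { |m| m.member_type == "Repository" && team_ids.include?(m.team_id) }
    .map(&:member_id)

  (direct + via_teams).uniq.sort
end

p repository_ids_for(1)
p repository_ids_for(2)
```

Each caller that needs a slightly different list ends up with its own variant of this lookup, which is where the inconsistencies described next come from.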
00:06:20.560
so with that in mind let's talk about how we would use these these things to access permissions so there were tons of places in the code
00:06:27.199
base that needed to get lists of things that people had access to so repositories pull requests teams those
00:06:32.800
sort of things and they all kind of access permissions in slightly different ways based on which kind of data they
00:06:38.880
wanted and in true legacy code fashion even different places like the api and the
00:06:44.800
web that wanted to access the very same list they did it in slightly different ways because they were written by different people at different times
00:06:52.240
so here's an example this is what you would have to do to look at all the repositories
00:06:57.440
that an organization controls and probably this is only going to look at the team members side it's not going to
00:07:02.960
touch the collaborators or permissions table whereas on the other hand if you want to get every pull request that a user has
00:07:10.400
access to you've got to look at both the permissions and the team members table and see you know what repositories do
00:07:16.479
they have direct access to which ones do they have access to via organizations and teams
00:07:21.599
and kind of put that all together and what we saw with this is a lot of
00:07:27.759
bugs and edge cases especially around transitional states and because
00:07:33.360
like i said each each place that was calling this was different the bugs were slightly different each time
00:07:38.639
but the scariest thing was like i said around transitional states and unfortunately we have a lot of
00:07:44.400
transitional states on github so you can you can transfer a repository from one user to another
00:07:51.039
you can also transfer it from a user to an org or from an org to a user and that would change how the permissions should be represented or you can do something
00:07:58.160
fun which is you can transform your user account into an organization which should also change how how the
00:08:03.199
permissions are granted from you know the the permissions table to the team members table
00:08:09.039
or you know when you remove somebody from either a team or from your repository you want that
00:08:14.720
also to take away their access to the repo and that's that's considered a transitional state
00:08:20.000
and we started seeing problems with these lots of different bugs all over the place and and this is this was pretty scary to
00:08:25.599
us and to users because when you say you're giving access or you're removing access you want to be absolutely sure
00:08:31.360
that that's doing the right thing and that you're not you know you you thought you removed somebody from a repository but actually they
00:08:37.919
still have some access somehow beyond that we started seeing
00:08:44.080
performance degradation with this so over time as github grew and depending on the api that was being
00:08:50.880
called different places started seeing different performance problems especially in our api where we had to grab large
00:08:56.959
lists of things like give me all the repositories that anyone can have access to and
00:09:02.399
what started happening is that places that grabbed access to repositories started looking like this or grabbed
00:09:07.680
access to lists of anything they had each time a performance problem came up
00:09:13.600
somebody would go in and optimize it make it a little bit better and eventually they just looked like huge chunks of raw sql like
00:09:20.839
this but of course they had all they were all accessing slightly different things and they were all optimized at
00:09:26.880
different times by different people so these queries were slightly different in each case
00:09:32.560
which of course means that it's difficult to build anything on top of this because everything's just this huge
00:09:38.399
query you really can't i mean if you want to build something on top of that you're adding another line to that query and potentially breaking
00:09:44.480
the performance so uh when our ceo chris wanstrath suggested
00:09:49.519
hey let's maybe try to find a way to make orgs better that sounded pretty hard especially
00:09:55.200
because his suggestion was well let's change permissions in this way in this way in this way
00:10:00.399
but with all the problems that we had been seeing with the existing permissions it seemed pretty impossible
00:10:05.519
to be able to make any changes like this without first fixing the existing system
00:10:11.519
so that's where we put on our hard hat and said all right let's rewrite this
00:10:16.800
we need to be we want to be able to rewrite the system uh in parallel to the legacy system so we wanted to not touch
00:10:23.839
this brittle old system as much as possible and to be able to run experiments and do performance testing
00:10:29.600
kind of side by side and use that in a way to test it without you know
00:10:36.000
touching the other system so we came up with these pretty simple goals for what we wanted
00:10:41.760
a super simple flexible interface to grant and revoke a general permission so not specific to repository permission or
00:10:49.120
team permission we wanted to be super fast since we were already seeing performance problems with
00:10:54.399
the old way anything new we built needed to scale much further into the future
00:10:59.519
and we wanted to be easy to integrate and operate with the existing technologies we had
00:11:05.760
we looked at things like graph databases and things like that but it didn't make sense for us to go and try out a new
00:11:11.440
technology so we wanted to build it right on top of mysql the way the old system was built
00:11:18.320
so first thing we did was kind of spike something out it was important to us to be able to
00:11:24.079
test how the new system would potentially perform with production load as quickly as possible so we didn't go
00:11:29.760
down the wrong road and go way too far in building this new thing and find out that it just wasn't going to scale at
00:11:35.279
all so the first spike was called capabilities and it was something that
00:11:41.440
john barnette worked on a bit and you know it was a rewrite but there
00:11:46.480
was also a bit of refactoring that we had to do in order to even test this this spike
00:11:52.959
like i said earlier there were you know multiple points where even if the api and the website were
00:11:58.639
getting the same list of repositories they were doing it in different ways so in order to be able to test this new system out a little bit we needed to at
00:12:05.040
least refactor those points so they were going through one single point and that way we could grab at that point and test
00:12:11.360
the new system and the old system but this is where we ran into a problem
00:12:17.600
so i think we tried to tackle the api and the organizations
00:12:23.519
to see all the repositories that an organization has access to but there was a problem with just doing
00:12:29.680
the refactoring because it was this huge chunk of sql we didn't have test coverage for all the
00:12:35.839
crazy complicated cases that we had in production and
00:12:41.519
even if we were able to duplicate all of those those cases in tests that would slow our test suite down a lot
00:12:47.920
it would still be hard for us to have confidence that we had tested every single case
00:12:53.440
there's also something that someone found where maybe there are bugs in the current implementation but people are
00:12:58.720
relying on those bugs so whatever we do we have to duplicate the bugs in the system and make sure that we really know
00:13:05.279
what the changes we're making are so at this point a bunch of engineers at
00:13:10.399
github kind of put their heads together and came up with this little hack to try to do this refactoring of this
00:13:16.000
one path and basically it was a way to test the
00:13:21.519
behavior of the legacy system and the behavior of a refactored path together basically it amounted to dark shipping
00:13:29.440
the refactored path so running it behind the scenes in production and then
00:13:36.000
using our instrumentation library to subscribe to an event basically compare the results of the two
00:13:41.519
paths and if they didn't match throw the results of that into redis and we could go look at that later and see
00:13:47.360
what was the difference why was there a difference and this actually worked really well it allowed us to gain confidence in these
00:13:54.240
initial refactorings and it also actually uncovered some bugs in the in the legacy implementation
00:14:01.120
so like i said this worked so well that we wanted to keep using it so we pulled it out into a science library that
00:14:07.199
anybody at github could use if they wanted to test a new code path and they they wanted to be really confident as
00:14:13.040
they were rolling it out so to give you an idea of what this looks like we're going to look at the
00:14:18.560
pullable_by? method on the repository class so to do science you
00:14:24.079
you take your old method and you make a little science block and you give it a name we tend to namespace
00:14:32.800
the names of our experiments just so we could group them together because we ended up doing a lot of experiments
00:14:38.079
so then the first thing you do is you say take the original code that was in that
00:14:43.199
method so the old pullable_by? method and we tended to pull it out and put it into a new method that we just named with like
00:14:49.279
underscore legacy at the end and then there's the use
00:14:54.720
block basically you say whatever is in the use block you want the method
00:15:00.720
to continue returning that that is the old code and then you say
00:15:06.079
but try this other thing as well which is whatever the new code path is that you want to refactor to
00:15:12.320
you know we tend to pull it into a new method we call this abilities so
00:15:17.440
we named it with abs at the end but we said okay but also at the same time try this new method compare the results
00:15:24.720
and tell us what the result of this experiment is also for um usefulness in this we added
00:15:30.880
a context so if things didn't match we wanted to know who was the user what was the repository
00:15:36.560
that we're trying to test against each other
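The shape of a science block described above can be sketched in plain Ruby. This is a toy runner written for illustration, not GitHub's library: the `Experiment` class, the `PUBLISHED` array, the experiment name, and the `pullable_by_legacy?` / `pullable_by_abilities?` method names are all assumptions. What it demonstrates is the contract from the talk: the `use` block is the control and its result is always returned, the `try` block is the candidate and is run only for comparison, and a context is attached so a mismatch can be investigated later.

```ruby
# Toy experiment runner: an illustration of the pattern, not GitHub's library.
PUBLISHED = [] # stand-in for wherever results get published

class Experiment
  def initialize(name)
    @name = name
    @context = {}
  end

  # Control: the legacy code whose result is always returned to the caller.
  def use(&block)
    @control = block
  end

  # Candidate: the new code path, run only so it can be compared.
  def try(&block)
    @candidate = block
  end

  # Extra data that helps explain a mismatch later.
  def context(data)
    @context.merge!(data)
  end

  def run
    control = observe(@control)
    candidate = observe(@candidate)
    matched = control[:value] == candidate[:value] &&
              control[:exception].class == candidate[:exception].class
    PUBLISHED << { experiment: @name, context: @context, matched: matched,
                   control: control, candidate: candidate }
    # The caller only ever sees the control's behavior.
    raise control[:exception] if control[:exception]
    control[:value]
  end

  private

  # Runs a block, capturing its value, any exception, and how long it took.
  def observe(block)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    value = block.call
    { value: value, exception: nil, duration: elapsed_since(started) }
  rescue StandardError => e
    { value: nil, exception: e, duration: elapsed_since(started) }
  end

  def elapsed_since(started)
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
end

def science(name)
  experiment = Experiment.new(name)
  yield experiment
  experiment.run
end

class Repository
  attr_reader :id

  # The two id lists fake the answers of the old and new systems.
  def initialize(id, legacy_user_ids:, abilities_user_ids:)
    @id = id
    @legacy_user_ids = legacy_user_ids
    @abilities_user_ids = abilities_user_ids
  end

  def pullable_by?(user_id)
    science "repository.pullable-by" do |e|
      e.context user_id: user_id, repository_id: id
      e.use { pullable_by_legacy?(user_id) }    # old code, extracted unchanged
      e.try { pullable_by_abilities?(user_id) } # new code path under test
    end
  end

  private

  def pullable_by_legacy?(user_id)
    @legacy_user_ids.include?(user_id)
  end

  def pullable_by_abilities?(user_id)
    @abilities_user_ids.include?(user_id)
  end
end
```

Because the candidate's exceptions are rescued and its value is discarded, a broken new code path cannot change what callers observe; it only shows up in the published results.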
00:15:41.680
so then in a case where things don't match you want to publish the results basically
00:15:48.079
and so our publish method just uses activesupport notifications to instrument these events
00:15:53.360
which you can then subscribe to and it looks kind of like this subscribing to a science event
00:15:59.600
you can subscribe to it and publish the interesting parts to like statsd
00:16:05.360
graphite something like that so we keep track of how many total times an experiment has been run and then do
00:16:12.240
some timings on each part of the experiment so the use and the try block how long did they take to run just so we can compare
00:16:19.680
that as well as the actual result and then you can also subscribe to the mismatch event so this is when things
00:16:26.560
don't match this event will be thrown and you want to keep track how many times things
00:16:32.880
go wrong how many times do we have a mismatch and then we just simply push the result of the mismatch into redis so that we
00:16:39.440
could look at it later one caveat about this technique is you can only use this on code paths that
00:16:46.000
have no side effects so if you're doing something that's writing you can't you can't use this because especially if
00:16:51.839
they were to change the same thing then it's not going to work
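The publish-and-subscribe flow described above can be sketched without any dependencies. In the talk the events go through ActiveSupport::Notifications, counts and timings go to statsd and graphite, and mismatches are pushed into redis; in this sketch a tiny `Notifier` class and in-memory hashes stand in for all of those, and the event names and payload keys are assumptions made for illustration.

```ruby
# Tiny in-process pub/sub standing in for ActiveSupport::Notifications.
class Notifier
  def initialize
    @subscribers = Hash.new { |hash, key| hash[key] = [] }
  end

  def subscribe(event, &handler)
    @subscribers[event] << handler
  end

  def instrument(event, payload)
    @subscribers[event].each { |handler| handler.call(payload) }
  end
end

NOTIFIER   = Notifier.new
COUNTERS   = Hash.new(0)                             # stands in for statsd counters
TIMINGS    = Hash.new { |hash, key| hash[key] = [] } # stands in for statsd timers
MISMATCHES = Hash.new { |hash, key| hash[key] = [] } # stands in for per-experiment redis lists

# Every run: count it and record how long each side took.
NOTIFIER.subscribe("science.experiment") do |payload|
  name = payload[:experiment]
  COUNTERS["science.#{name}.total"] += 1
  TIMINGS["science.#{name}.control"] << payload[:control][:duration]
  TIMINGS["science.#{name}.candidate"] << payload[:candidate][:duration]
end

# Only mismatches: count them and keep the full payload for later inspection.
NOTIFIER.subscribe("science.mismatch") do |payload|
  name = payload[:experiment]
  COUNTERS["science.#{name}.mismatched"] += 1
  MISMATCHES[name] << payload
end

# Hypothetical publish step: always emit the experiment event, and emit the
# mismatch event only when control and candidate values differ.
def publish(payload)
  NOTIFIER.instrument("science.experiment", payload)
  return if payload[:control][:value] == payload[:candidate][:value]

  NOTIFIER.instrument("science.mismatch", payload)
end

publish(experiment: "pullable-by", context: { user_id: 1 },
        control: { value: true, duration: 0.002 },
        candidate: { value: true, duration: 0.001 })
publish(experiment: "pullable-by", context: { user_id: 2 },
        control: { value: true, duration: 0.002 },
        candidate: { value: false, duration: 0.001 })
```

Keeping the full payload only for mismatches keeps the stored data small even when an experiment runs many times per minute.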
00:16:57.040
so using that we were able to do a little bit of spike and figure out what we wanted to do with the real
00:17:03.199
system for permissions so in getting that using that science we were able to get the refactoring into
00:17:09.520
like the little points that we needed to attach the new system when we did build it so taking those lessons
00:17:16.319
we put together a team and we started building a new system that we called abilities and we called ourselves the
00:17:21.760
abilibuddies so uh this is what we came up with for our system it's super
00:17:28.640
simple i think and i really like it so you have actors and subjects and you ask an actor whether they can
00:17:35.760
perform an action on a subject so can this user read this repository
00:17:41.840
then a subject would grant a permission to an actor so you
00:17:47.280
say if i'm a repository i want to grant this user a write permission and then to remove a permission you just
00:17:53.360
revoke any permission granted to the actor the only additional constraint we had on this was
00:17:59.440
that we wanted a cascading functionality for things that were both subjects and actors so
00:18:05.600
if i have an ability that was granted to a team on a repository and then i granted a user
00:18:12.880
an ability on the team we want an implicit ability here between the user and repository to cascade up
00:18:19.919
basically so the user has access to whatever the team has given access to
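The interface described above (ask whether an actor can perform an action on a subject, have a subject grant a permission to an actor, revoke it, and cascade through things that are both subjects and actors) can be sketched in memory as follows. This is a minimal illustration, not the real implementation: the `Abilities` class, the `LEVELS` ordering, and the rule that any ability on a team counts as membership are assumptions made for the sketch.

```ruby
require "set"

# In-memory sketch of the actor/subject interface; not the real system.
class Abilities
  # Assumed ordering: a higher level implies the lower ones.
  LEVELS = { read: 1, write: 2, admin: 3 }.freeze

  def initialize
    @grants = {} # [actor, subject] => action
  end

  # A subject grants an actor a permission.
  def grant(subject, actor, action)
    @grants[[actor, subject]] = action
  end

  # Removes whatever the subject had granted to the actor.
  def revoke(subject, actor)
    @grants.delete([actor, subject])
  end

  # Direct grants are checked first-class; otherwise the question cascades
  # through anything the actor has an ability on (for example a team).
  def can?(actor, action, subject, seen = Set.new)
    return false unless seen.add?(actor) # guards against cycles

    @grants.any? do |key, granted_action|
      granted_actor, granted_subject = key
      next false unless granted_actor == actor

      if granted_subject == subject
        LEVELS.fetch(granted_action) >= LEVELS.fetch(action)
      else
        can?(granted_subject, action, subject, seen)
      end
    end
  end
end

abilities = Abilities.new
abilities.grant(:repository, :team, :write) # the repository grants the team write
abilities.grant(:team, :user, :read)        # the team grants the user an ability
p abilities.can?(:user, :write, :repository) # implicit, via the team
abilities.revoke(:team, :user)
p abilities.can?(:user, :read, :repository)
```

This sketch computes the implicit ability at query time. Whether to infer it on read like this or store it explicitly is a design choice with performance consequences, which the talk returns to later when it mentions exploding fewer rows out into the database.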
00:18:25.120
but that was about it that was that was the interface and once we figured that out it was it was time to start
00:18:30.480
implementing it so we wrote the the code that did the core of abilities in just a few months
00:18:36.320
it was it was pretty quick so once it was written though we needed to generate all the data to
00:18:42.880
actually have the permissions in the new system so we wrote some migrators that would
00:18:48.640
iterate through all of the permissions in the legacy system and try to generate new abilities
00:18:54.400
entries for each one so the generated data if you're if you're generating it it's only a
00:19:00.160
snapshot in time so then the other thing we had to do in order to be able to compare these was to any time that the legacy system
00:19:06.960
changed we also needed to change the abilities data so any new permissions needed to be written to abilities at the
00:19:12.559
same time and any changes needed to be updated as well but once we did that we were able to
00:19:19.840
start doing some science so we were able to add science to
00:19:25.440
all of the read points that touch permissions and there are a lot of those all throughout the code base
00:19:31.919
and because there were so many different places that touched permission we kind of had to split this up and parallelize
00:19:37.200
it so we had kind of one half of the team doing the refactorings to to find all the points where we needed to add
00:19:43.679
the science and then the other half was going through and using the things that already had science
00:19:49.200
analyzing the results of the experiment and seeing you know what was happening were things matching if they weren't matching
00:19:56.720
why and how can we fix that so let's dive into what the what the data
00:20:02.400
looks like that science collects and and how we can use that to determine correctness
00:20:08.320
so this is a dashboard that we built to kind of summarize and visualize the data
00:20:14.240
from graphite that we instrumented before the graphs on the left give you kind of
00:20:20.240
a day in the life of each experiment so what does it look like in the past 24 hours is it mostly green meaning no
00:20:27.039
mismatches is there a little bit of yellow in there meaning a few mismatches and the ones that are really red are just you
00:20:32.640
know everything's kind of going wrong with those and the right side gives a little bit of statistics on how often something's
00:20:38.720
running you can see how many times per minute it's run what percentage of the time are we running this experiment
00:20:44.720
for things that are not very performant you don't want to be running this 100% of the time maybe you only want to run 1% or 10% just to start
00:20:51.360
collecting some data until you get less mismatches so if you click into any of those
00:20:58.880
summaries you can see a more detailed uh view of each experiment and inside here are the graphite graphs
00:21:05.600
of what's happening so those totals that we saw earlier and how many wrong is in there
00:21:12.159
as well as on the bottom there's the performance data so you can see how is the new code that i'm testing out
00:21:18.640
performing versus the old code so you can see in this case um you know we're running a lot of
00:21:24.799
experiments like 20,000 per minute and there aren't very many mismatches but they're not zero so
00:21:31.520
if you see this graph in the middle that has some mismatches then you want to go and look at those mismatches and see what is the
00:21:38.320
difference and what's going wrong so in order to do that
00:21:43.919
like i showed earlier we just throw these mismatches into redis so you go take a look at redis and for the
00:21:50.159
particular key that we've named it see how many mismatches are there so for this pullable_by?
00:21:56.240
we have you know 000 mismatches and you can just grab the first result by you know popping it off and saying
00:22:01.600
okay show me the top mismatch and then there's a bunch of data in there um the context that we push on but
00:22:08.480
then you can see the most important part is tells you what the candidate did and
00:22:13.760
what the control did so it gives you the timing information it tells you whether an exception was raised and then it
00:22:19.600
gives you the return value so for this method the pullable_by? method it's just a simple true or false value so the
00:22:26.159
candidate which is the experimental refactored code it returned false but if you look at the control here it
00:22:32.960
returned true so this was the difference and we can take a look at the user id because we push that onto
00:22:38.960
the context and repository id and try to investigate why for this particular case did it go wrong
00:22:45.280
and the results that we see from that especially at the beginning it was just
00:22:50.640
bugs in abilities we didn't get it totally right the first time we didn't account for all of the cases we didn't
00:22:56.640
realize that the old system did this particular thing so we just needed to fix that in these cases we would generally fix the bug
00:23:03.919
completely blow away the abilities database and rerun the migrators to regenerate all the data in there
00:23:10.960
um we thought once we fixed those bugs that everything would be good and we'd be ready to ship abilities
00:23:17.520
but then we started running into other issues and that was with the data in the database
00:23:23.039
so if you look here i can show you a
00:23:31.120
sampling of the data quality issues that we saw
00:23:36.400
as we were trying to do this abilities stuff so what we found in investigating some of
00:23:43.200
these mismatches is that there were issues in the legacy permissions tables that were actually being masked by
00:23:49.360
conditions in the code so there was bad data in the database we were pulling it out but then something in the code was
00:23:55.600
filtering it out and saying oh forget that you know the database says that this is pullable by this user but
00:24:01.840
under this condition we'll just forget about that and we'll say no it's not and so we decided that if if we wanted
00:24:09.279
to be generating the abilities data from the legacy data we needed
00:24:14.320
quality data it needed to be correct so this is the case where we needed to go back and actually fix the old data and
00:24:20.080
you can see how often this happened we had this is just a sampling of a few data quality issues we had quite a few
00:24:25.440
of these and we ended up adding a lot of process around doing migrations or what we call data
00:24:31.039
transitions to to fix data like this we ended up running them quite a bit we actually ended up writing some
00:24:37.679
throttling code to do these sort of migrations so that when we were deleting millions of rows we weren't hammering
00:24:44.000
our database during this time so then once we
00:24:49.120
fixed the data quality issues we thought we were going to be done except there was one more thing
00:24:55.440
that we found that we didn't expect coming into this there were still some mismatches and
00:25:01.760
it turned out that the reason things were mismatching is that the legacy system handled things inconsistently
00:25:09.039
there is literally no way to duplicate the functionality in the new system because the old system was order
00:25:14.960
dependent or time dependent based on when things happen and when the code changed there were different things in
00:25:20.159
the database and a lot of these centered around problems with forks and problems with forks that had different visibility
00:25:27.360
so public public repos with private forks and things like that and so we actually had to
00:25:32.880
like stop working on abilities take a break here and fix the problem in the code and and come
00:25:39.440
up with a consistent system this actually took a couple months because we had to fix the code we had to
00:25:44.880
fix the data because these were privacy issues we actually and it was really sensitive stuff we actually had to spend
00:25:50.799
time contacting users and emailing them warning them about the changes that we were going to make so that was a
00:25:57.919
that was something that we did not expect um and i i think if we hadn't had something like
00:26:03.919
science to test out this behavior we wouldn't have known it had happened and then we would have rolled
00:26:09.440
out the new system and people would have started sending requests like you know what's happening what's going on things have changed
00:26:15.520
so we were really lucky to have found that and uh definitely thanks to these tools
00:26:21.840
the next thing we started to find was performance problems i showed you the graph earlier at the
00:26:27.520
bottom that showed the difference in performance between the old code path and the new code path and for these
00:26:32.799
little things we optimized things at the beginning and there wasn't a big difference in
00:26:37.840
performance there the places we actually started to see performance problems were because we had had this dark shipped for a very long
00:26:44.559
period of time we started seeing some pathological cases uh people were doing interesting
00:26:50.159
stuff with our api they were like removing huge amounts of permissions and re-adding them and there was one
00:26:55.679
customer who was basically doing this every night and we saw something like
00:27:01.120
this so every day at 5:00 pm a certain customer would drop all of their permissions and re-add them and do this
00:27:07.679
over and over again and we would see the spike in replication lag in our database
00:27:12.960
which is not good for us but because we had this kind of dark ship we
00:27:18.720
were able to see this and that gave us an opportunity to optimize abilities even more
00:27:24.240
so of course in order to fix this we we wanted to use our favorite tools science
00:27:29.760
and in this case we actually did some science within science so we had
00:27:35.039
we already had science around using abilities itself but then we wanted to test two different code paths within
00:27:40.960
abilities we wanted a way to do some inference of some things and to not explode things
00:27:46.000
out into the database quite as much so we tested that out
00:27:51.440
and you know we tried it out it worked and then we shipped it and we stopped having
00:27:56.880
these problems so that was a really great way to find that out
00:28:02.240
so uh let's take a step back and look at you know what at this point in the story what have we really done so far well
00:28:08.960
we've done a rewrite and a refactor we've done some science we've done some data quality repair
00:28:15.360
we've done some fixes for performance all right what's our progress
00:28:20.480
none of these things are using abilities we haven't shipped anything yet but that's okay once we got to this
00:28:26.880
point we were able to pretty quickly start shipping so once we were able to fix the last bit of
00:28:33.200
data quality issues we could start rolling this out so we rolled it out for organizations first
00:28:39.200
and then for teams so check check we've got we've got two things we've got a little bit of progress here
00:28:45.440
we had to do a lot of work up front to get that but once we got to this point we were able to to do uh
00:28:51.360
to get things out and using abilities much more quickly at this point the only thing left is repositories but repositories are the
00:28:59.279
largest and riskiest piece of this it's the most used part it's the most complicated
00:29:04.720
and we've been slowly chipping away at the bugs and the data quality issues surrounding it and we're almost to the
00:29:10.399
finish line in fact we get to a point where within a week we've only seen one kind
00:29:16.799
of mismatch around a particular data quality issue and so we said all right this is going
00:29:21.840
to be the last thing we have to fix and then we can ship it so i code up a pull request
00:29:27.840
to to fix this the problem was that when users were being deleted they weren't being removed from the
00:29:33.520
teams so the old record was just sitting around so i said all right let's change the code so that
00:29:39.600
these are always removed from teams upon deletion and then we'll clean up the old data with some of our transitions
00:29:46.000
so i wrote up this transition i meant to clean up all the deleted users from
00:29:51.440
teams instead i wrote a bad query and i accidentally removed every single
00:29:56.960
repository from every team on github oops
00:30:03.840
yeah you can see the outage there but we were we were right at the end uh
00:30:09.760
unfortunately science couldn't save me from my own mistakes but uh we were just about ready to ship
00:30:15.919
things were pretty much green so and also we have database backups it's not not everything is lost
00:30:23.039
but we couldn't instantly restore the backups so what we decided to do was to turn abilities on at this point to
00:30:29.360
restore access to people to use the new system a little sooner than i had expected to use it but it was an emergency test run
00:30:37.120
and we turned it on we were able to restore access probably an hour or two quicker than we would have before
00:30:42.799
restoring the backups once we got the backups restored i reverted that and turned it back off
00:30:49.440
given my previous mistakes i wasn't quite confident enough to really push that into production yet
00:30:57.200
but this is where we were everything was green it was it was about time
00:31:04.000
to really ship this but i wasn't totally confident i wanted to
00:31:09.039
to make sure one more time that everything was okay that nothing was going to go wrong when we when we rolled
00:31:14.320
this out so i came up with the idea to basically run through every user on github and
00:31:20.320
have a transition that just called the associated repositories so
00:31:26.000
said for each user give me a list of all the repositories they have access to and this is a way to basically exercise
00:31:32.320
every single permission in the database and i figured if we do this we should be able to tease out any of the last
00:31:38.399
mismatches that we'll find so i ran this and there are actually no mismatches at
00:31:44.159
all well you can see there are three mismatches there but we had a there was a particular issue with timing
00:31:49.600
on one of our jobs when you removed a member from a team it was a common thing that we had seen
00:31:54.640
throughout the time but there were there were no mismatches so it was ready to go it was time to
00:32:00.799
to give it to the world and say you can have abilities uh so what we did actually instead of
00:32:06.880
just removing all the science code is that we flipped the use block and the try block to
00:32:12.960
begin with so we could continue having the science around it just in case any mismatches came up we could still look
00:32:18.799
at the performance so you can see where those graphs kind of flip the candidate and control
00:32:25.360
changing positions basically and we left that on for a few days and then once we were totally confident
00:32:31.279
that there were no more mismatches nothing was going wrong we removed the science completely so at this point
00:32:37.600
everything's shipped we've got abilities backing everything and this was
00:32:43.120
almost a year ago so this has been in production for a year we've had no problems with it
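the last two steps of the rollout described above, sweeping every user to exercise every permission and then flipping the use and try blocks so abilities becomes the control, might be sketched like this. it is a minimal sketch under assumptions: the `science` helper is hand-rolled and is not the scientist gem's api, and the data and method names are hypothetical.

```ruby
# Hypothetical data: the old permission system and the new abilities system.
LEGACY    = { "alice" => ["docs", "web"], "bob" => ["web"] }
ABILITIES = { "alice" => ["docs", "web"], "bob" => ["web"] }

def legacy_repos(user)
  LEGACY.fetch(user, []).sort
end

def abilities_repos(user)
  ABILITIES.fetch(user, []).sort
end

# Hand-rolled stand-in for an experiment: runs both paths, logs a mismatch
# under the given name, and always returns the value of the `use` path.
def science(name, mismatches, use:, try:)
  control = use.call
  candidate = try.call
  mismatches << name unless control == candidate
  control
end

# Ask for every user's repositories so every permission gets exercised.
# With flipped: true the new system is the control and the old one the candidate.
def sweep(users, flipped:)
  mismatches = []
  users.each do |user|
    old_path = -> { legacy_repos(user) }
    new_path = -> { abilities_repos(user) }
    if flipped
      science(user, mismatches, use: new_path, try: old_path)
    else
      science(user, mismatches, use: old_path, try: new_path)
    end
  end
  mismatches
end

before_flip = sweep(LEGACY.keys, flipped: false)
after_flip  = sweep(LEGACY.keys, flipped: true)
puts "mismatches before flip: #{before_flip.size}, after flip: #{after_flip.size}"
```

keeping the comparison in place after the flip means the old code still runs as the candidate, so mismatches and timings stay visible until the experiment is removed.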
00:32:49.120
we've even had people starting to build new things on top of abilities because we made something that was general and flexible
00:32:55.360
so that was also successful so uh we have the code it's open source on github it's called scientist there's
00:33:02.640
a rubygem you can use it as well called scientist so i encourage you to go out and use
00:33:08.720
this for your refactorings rewritings or anything where you really want to gain confidence
00:33:14.159
thank you
00:33:24.960
hi i'm back um do we have any questions for jesse um and while you're thinking of a
00:33:30.720
question i just wanted to point out that um a little known fact about her uh she
00:33:36.159
has two cats and they're both named after disney characters she's a huge fan of the lion king apparently because one of them is named nala and the other one
00:33:42.559
eva after the girl in wall-e okay we have a question up on the left
00:33:49.760
you mentioned that with scientist one of the caveats was that you can't really test
00:33:55.519
things that write to data it's read only right how does science or does it handle
00:34:02.720
model changes if you're if you're using say the same tables the same model names but you're
00:34:08.879
slightly adjusting the columns in those models or the fields in those models can science handle that and then secondly
00:34:16.560
the problem that's coming to mind for me is we're attempting to do uh migration from rails three to rails four right now
00:34:22.480
and this would be incredibly useful for me to say as i'm moving these things over let me
00:34:27.839
test it as i go along to see the cases that i've not yet covered that i still need to work on is science
00:34:34.399
applicable in those situations so for the first question with the model
00:34:39.599
changes i think if you were
00:34:44.720
just accessing the different columns you could probably you could probably use it to test different columns
00:34:50.800
otherwise the way we did it was two completely separate models just because if if either one are modifying things
00:34:57.599
you don't want the overlap there you really can't test the separateness of that so i'm not sure
00:35:04.000
i might need a little bit more detail to understand the case but you may not be able to use it for that
00:35:10.560
for the the second part of the question doing rails upgrades i've never considered
00:35:16.480
using it for that do you mean um i mean what parts of the upgrade would you want to to test between well it sort
00:35:22.320
of sounds like um so we suck at testing
00:35:27.520
so uh this sort of seems like a kind of a decent way for me to say okay i'm i'm
00:35:32.640
now in the process of migrating let me turn on science to i know everything's going to go wrong with my rails 4
00:35:40.000
implementation at the moment because it doesn't exist let me start seeing what these results should be and as i build
00:35:45.920
up my rails 4 implementation let me ensure that they're matching what my rails 3 implementation does as opposed
00:35:53.200
to just so something at a much larger scale and not as granular which kind of
00:35:58.400
seems like the opposite of what science was intended to do but maybe it fits as well yeah i i guess you'd have to hook
00:36:05.040
it in somewhere way far down which right now it's it kind of goes inside of your rails
00:36:11.359
application i don't the way the code works i'm not sure how you would basically have to load up
00:36:17.520
two different rails applications and have it have it outside of both of those and running both which it definitely wasn't designed to
00:36:24.079
do at the moment you might you could test out something like that thank you
00:36:31.119
are there any other questions yes great
00:36:41.119
so by the time you ship abilities i assume that you have written a lot a lot of experiments
00:36:48.000
and when the old system goes away i imagine that there will
00:36:53.599
still be some uh potentially some usage of um from all the experiments that have been written as
00:37:00.160
some sort of acceptance test do you see something like this or uh or did the uh already written experiments um
00:37:07.599
go away after shipping so the experiments do go away after shipping
00:37:13.680
we try to if there are cases where there aren't enough tests around it we can use science to compare the results
00:37:19.520
and add more tests in our test suite but it's not something that you want to
00:37:24.640
continue running in production forever because you're always basically doing double the work every time you're calling something
00:37:30.880
so we do it for as much as we need to to gather the data and have confidence but then the experiments tend to go away as
00:37:36.640
soon as you're finished that's okay