Tackling Large Ruby Refactorings with Confidence (and Science!)

Summarized using AI

Jesse Toth • June 04, 2015 • Singapore • Talk

In the talk "Tackling Large Ruby Refactorings with Confidence (and Science!)" presented by Jesse Toth at the Red Dot Ruby Conference 2015, the focus is on the methodologies and tools utilized to undertake a significant restructuring of a critical system within GitHub's Ruby on Rails application. Key points discussed throughout the video include:

- Definition of Large Refactorings: Toth defines large Ruby refactorings as procedures that either modify extensive systems or introduce considerable risk, involving more than just syntactical changes or structural reorganizations.

- The Need for Change: The motivation behind GitHub's action was an outdated permissions system responsible for access management across various application features like repositories and teams, which had become cumbersome and error-prone over time due to separate implementations developed at different points.

- Case Study on Permissions System Rewrite: The speaker shares that the legacy permissions system was complex and led to bugs, especially during transitional states when permissions changed. To address this, a new flexible system called "abilities" was constructed to handle permissions more efficiently and maintain performance.

- Utilization of Science for Confidence in Refactoring: Toth introduces a method known as "dark shipping" which involves running new code paths in production alongside the legacy system, comparing results without affecting users. This innovative approach helped Toth’s team gain confidence in the new implementation while revealing bugs in the existing system.

- Data Quality and Performance: Throughout the development process, the team encountered various performance and data quality issues, often linked to inconsistencies in the old permissions handling logic. Fixing these problems required iterative testing and adjustments to both the code and the underlying data structure.

- Final Steps and Deployment: The conclusion of the discussion emphasizes that testing and quality checks continued until the new permissions system was ready for deployment. The speaker shares a personal anecdote about an unexpected deletion error that led to a temporary outage, underscoring the importance of thorough testing. The refactored abilities system is now in production after extensive iteration and has proven successful with no significant issues reported since launch.

The video concludes by promoting the use of a tool called "Scientist" created at GitHub to facilitate safe code refactorings by providing a framework for testing against old implementations along with valuable performance insights and behavior comparisons.

Tackling Large Ruby Refactorings with Confidence (and Science!) by Jesse Toth

At GitHub, we recently replaced a large subsystem of our application – the permissions code – with a faster and more flexible version. In this talk, I’ll share our approach to this large-scale rewrite of a critical piece of our Rails application, and how we accomplished this feat while both preserving the performance of our app and proving the new technology over the course of the project.

Red Dot Ruby Conference 2015

00:00:21.039 hey hi um so i'm going to talk today about tackling a large ruby refactoring with confidence and science
00:00:29.279 so i am jesse toth uh i'm jesse plus plus on the internet um github twitter
00:00:34.960 everywhere else um i am the head of web technology for github which means i work on the web
00:00:40.000 team behind github.com i work a lot on our ruby on rails application
00:00:45.440 especially focusing on database and performance things and this talk especially we'll talk about one of the projects i did
00:00:52.320 around those sort of things so tackling a large ruby refactoring what
00:00:57.360 do i mean by large ruby refactorings here well first let's break it down large
00:01:02.879 either it could be the size of the system that you're trying to change or especially this is the the case that i'm
00:01:10.159 talking about how risky it is to make this particular change um but then what about refactoring
00:01:16.400 traditionally refactoring shouldn't be risky right it's supposed to be just changing the structure of the code without changing
00:01:22.720 your behavior usually you can do it in small pieces and if you have good test coverage
00:01:28.320 you should be able to catch any accidental behavior changes so it shouldn't be risky but really i'm talking about in this
00:01:34.960 case sometimes really risky changes or maybe going a little bit further than your typical
00:01:40.640 refactoring not just changing the structure of the code but maybe changing some of the underpinnings of the code
00:01:46.560 but still without changing the behavior of the code such as maybe switching out a service or
00:01:51.680 a gem you want it to behave the same way but you want the code that you're using underneath to be different or maybe
00:01:56.799 you're optimizing a query moving it from talking to your model to like raw sql because you need it to be really fast
00:02:03.680 something like that so any code interacting with this code at its boundaries shouldn't notice a difference but things are going to
00:02:09.679 change underneath and hopefully be better for whatever your metric you're trying to improve which is usually
00:02:15.360 performance so maybe a better word for this is like reworking or replacing um
00:02:21.920 in some cases this might work but i didn't i didn't want to use that because i don't think it emphasizes enough the fact that your behavior really shouldn't
00:02:28.319 change when you're doing things like this rewriting is really the best word for this but if you've ever done a rewrite
00:02:35.360 you know that that word has lots of baggage most people have some some ptsd around
00:02:40.400 the fact that rewrites usually don't work they're very risky if they are of any size
00:02:47.599 larger than just a little rewrite they tend to drag on and on and more often than not
00:02:52.640 they fail because you you failed to duplicate the behavior of the original system and then you're left
00:02:57.680 with either something that's that's not matching the old system or you just have to throw it away and try
00:03:02.959 again so that's very risky but anyway i'm going to talk about some
00:03:08.400 of the tools and techniques that we use at github to tackle these sort of problems
00:03:13.519 and i'm going to do that talking through a case study which was a rewrite that we did of the permission system behind
00:03:19.120 github.com so what we had behind github.com to control
00:03:25.440 all the permissions was a huge legacy system and the thing that we wanted to do with
00:03:30.640 this rewrite was to create a more flexible system to grant and revoke access to any of the things you could
00:03:36.560 have access to on github so repositories forks issues pull requests teams organizations basically anything
00:03:44.560 as you can imagine from that description this would be a pretty far-reaching change and so
00:03:50.000 if you make this change it's going to affect every single page load on github.com every single api request
00:03:56.080 it's a huge surface area and it's extremely risky so if you're going to make a change like this you have to have absolute
00:04:02.319 confidence in the correctness of the new system when it's time to to switch it out
00:04:07.519 so first of all the first question to ask is why would we want to do such a thing especially with such risk so to give
00:04:14.480 some background on that it requires a little bit of a history lesson on github itself so
00:04:20.160 in the beginning there was collaboration so one of the very first features of github.com
00:04:26.080 was that you could add other users to your repository as collaborators and allow them to
00:04:31.840 have access to your code and to make pull requests and make changes to it
00:04:36.880 and then you know that worked very well for quite a while but eventually got to a point where we needed a little more
00:04:42.720 than that so then there were organizations as github grew we needed a way to give repository access to groups of
00:04:50.240 people and to kind of control those for different groups different teams in different ways
00:04:56.320 the problem was that these two different ways of granting access
00:05:02.000 were actually written in separate ways they were developed at different times by different people
00:05:08.560 and as most legacy code goes it wasn't you know obvious at the time that these
00:05:13.840 were kind of the same concepts so they were developed in very different ways and represented in different ways behind
00:05:19.360 the scenes so this is what permissions look like to begin with this was
00:05:24.800 collaboration on repositories so it's a super simple join table basically
00:05:30.320 you say here's a user here's a repository they have access that's about it
00:05:36.400 and then this is team members so the way that you give access to repositories and organizations is through teams
00:05:42.800 and um this one's a little weird uh and every time i look at it i kind of shudder now
00:05:48.560 because this is a strange three-way join table between well you have the team first of all but then you have
00:05:54.560 users and repositories and so to be a member of the team if you're a user you have a row in here
00:06:01.600 but also if you want to give users on the team access to a repository the repository has a row in the same table
00:06:09.199 where the repository is like a member of the team and this is pretty pretty terrible but
00:06:14.560 it's it's what we had and hey it worked for a really really long time
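[Editor's sketch] The two legacy tables described above could be modelled roughly as follows. This is only an illustration under assumptions: the struct names, field names, and IDs are hypothetical and are not GitHub's actual schema. It shows why the team table is awkward: users and repositories both appear as "members" of a team, so a team-based lookup has to scan the same table twice.

```ruby
# Hypothetical models of the two legacy tables (names are assumptions).
Permission = Struct.new(:user_id, :repository_id)            # collaborators: plain join table
TeamMember = Struct.new(:team_id, :member_type, :member_id)  # users AND repositories are rows here

# Repositories a user reaches directly as a collaborator.
def direct_repository_ids(permissions, user_id)
  permissions.select { |row| row.user_id == user_id }.map(&:repository_id)
end

# Repositories a user reaches through teams: first find the user's teams,
# then find repositories that are "members" of those same teams.
def team_repository_ids(team_members, user_id)
  team_ids = team_members
    .select { |row| row.member_type == :user && row.member_id == user_id }
    .map(&:team_id)
  team_members
    .select { |row| row.member_type == :repository && team_ids.include?(row.team_id) }
    .map(&:member_id)
end

permissions  = [Permission.new(1, 100)]
team_members = [
  TeamMember.new(7, :user, 2),         # user 2 is on team 7
  TeamMember.new(7, :repository, 200)  # repository 200 is also a "member" of team 7
]

direct_repository_ids(permissions, 1)  # collaborator access
team_repository_ids(team_members, 2)   # access that flows through the team
```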
00:06:20.560 so with that in mind let's talk about how we would use these these things to access permissions so there were tons of places in the code
00:06:27.199 base that needed to get lists of things that people had access to so repositories pull requests teams those
00:06:32.800 sort of things and they all kind of access permissions in slightly different ways based on which kind of data they
00:06:38.880 wanted and in true legacy code fashion even different places like the api and the
00:06:44.800 web that wanted to access the very same list they did it in slightly different ways because they were written by different people at different times
00:06:52.240 so here's an example of this is what you would have to look at all the repositories
00:06:57.440 that an organization controls and probably this is only going to look at the team members side it's not going to
00:07:02.960 touch the collaborators or permissions table whereas on the other hand if you want to get every pull request that a user has
00:07:10.400 access to you've got to look at both the permissions and the team members table and see you know what repositories do
00:07:16.479 they have direct access to which ones do they have access to via organizations and teams
00:07:21.599 and kind of put that all together and what we saw with this is a lot of
00:07:27.759 bugs and edge cases especially around transitional states and because
00:07:33.360 like i said each each place that was calling this was different the bugs were slightly different each time
00:07:38.639 but the scariest thing was like i said around transitional states and unfortunately we have a lot of
00:07:44.400 transitional states on github so you can you can transfer a repository from one user to another
00:07:51.039 you can also transfer it from a user to an org or from an org to a user and that would change how the permissions should be represented or you can do something
00:07:58.160 fun which is you can transform your user account into an organization which should also change how how the
00:08:03.199 permissions are granted from you know the the permissions table to the team members table
00:08:09.039 or you know when you remove somebody from either a team or from your repository you want that
00:08:14.720 also to take away their access to the repo and that's that's considered a transitional state
00:08:20.000 and we started seeing problems with these lots of different bugs all over the place and and this is this was pretty scary to
00:08:25.599 us and to users because when you say you're giving access or you're removing access you want to be absolutely sure
00:08:31.360 that that's doing the right thing and that you're not you know you you thought you removed somebody from a repository but actually they
00:08:37.919 still have some access somehow beyond that we started seeing
00:08:44.080 performance degradation with this so over time as github grew and depending on the api that was being
00:08:50.880 called different places started seeing different performance problems especially in our api where we had to grab large
00:08:56.959 lists of things like give me all the repositories that anyone can have access to and
00:09:02.399 what started happening is that places that grabbed access to repositories started looking like this or grabbed
00:09:07.680 access to lists of anything they had each time a performance problem came up
00:09:13.600 somebody would go in and optimize it make it a little bit better and eventually they just looked like huge chunks of raw sql like
00:09:20.839 this but of course they had all they were all accessing slightly different things and they were all optimized at
00:09:26.880 different times by different people so these queries were slightly different in each case
00:09:32.560 which of course means that it's difficult to build anything on top of this because everything's just this huge
00:09:38.399 query you really can't i mean if you want to build something on top of that you're adding another line to that query and potentially breaking
00:09:44.480 the performance so uh when our ceo chris once suggested
00:09:49.519 hey let's maybe try to find a way to make orgs better that sounded pretty hard especially
00:09:55.200 because his suggestion was well let's change permissions in this way in this way in this way
00:10:00.399 but with all the problems that we had been seeing with the existing permissions it seemed pretty impossible
00:10:05.519 to be able to make any changes like this without first fixing the existing system
00:10:11.519 so that's where we put on our hard hat and said all right let's rewrite this
00:10:16.800 we need to be we want to be able to rewrite the system uh in parallel to the legacy system so we wanted to not touch
00:10:23.839 this brittle old system as much as possible and to be able to run experiments and do performance testing
00:10:29.600 kind of side by side and use that in a way to test it without you know
00:10:36.000 touching the other system so we came up with these pretty simple goals for what we wanted
00:10:41.760 a super simple flexible interface to grant and revoke a general permission so not specific to repository permission or
00:10:49.120 team permission we wanted it to be super fast since we were already seeing performance problems with
00:10:54.399 the old way anything new we built needed to scale much further into the future
00:10:59.519 and we wanted it to be easy to integrate and operate with the existing technologies we had
00:11:05.760 we looked at things like graph databases and things like that but it didn't make sense for us to go and try out a new
00:11:11.440 technology so we wanted to build it right on top of mysql the way the old system was built
00:11:18.320 so first thing we did was kind of spike something out it was important to us to be able to
00:11:24.079 test how the new system would potentially perform with production load as quickly as possible so we didn't go
00:11:29.760 down the wrong road and go way too far in building this new thing and find out that it just wasn't going to scale at
00:11:35.279 all so the first spike was called capabilities and it was something that
00:11:41.440 john barnett worked on a bit and you know it was a rewrite but there
00:11:46.480 was also a bit of refactoring that we had to do in order to even test this this spike
00:11:52.959 like i said earlier there were you know multiple points where even if the api and the website were
00:11:58.639 getting the same list of repositories they were doing it in different ways so in order to be able to test this new system out a little bit we needed to at
00:12:05.040 least refactor those points so they were going through one single point and that way we could grab at that point and test
00:12:11.360 the new system and the old system but this is where we ran into a problem
00:12:17.600 so i think we try to tackle the api and the organizations
00:12:23.519 to see all the repositories that an organization has access to but there was a problem with just doing
00:12:29.680 the refactoring because it was this huge chunk of sql we didn't have test coverage for all the
00:12:35.839 crazy complicated cases that we had in production and
00:12:41.519 even if we were able to duplicate all of those those cases in tests that would slow our test suite down a lot
00:12:47.920 it would still be hard for us to have confidence that we had tested every single case
00:12:53.440 there's also something that someone found where maybe there are bugs in the current implementation but people are
00:12:58.720 relying on those bugs so whatever we do we have to duplicate the bugs in the system and make sure that we really know
00:13:05.279 what the changes we're making are so at this point a bunch of engineers at
00:13:10.399 github kind of put their heads together and came up with this little hack to to try to do this this refactoring of this
00:13:16.000 one path and basically it was a way to to test the
00:13:21.519 behavior of the legacy system and the behavior of a refactored path together basically it amounted to dark shipping
00:13:29.440 the refactored path so running it behind the scenes in production and then
00:13:36.000 using our instrumentation library to subscribe to an event basically compare the results of the two
00:13:41.519 paths and if they didn't match throw the results of that into redis and we could go look at that later and see
00:13:47.360 what was the difference why was there a difference and this actually worked really well it allowed us to gain confidence in these
00:13:54.240 initial refactorings and it also actually uncovered some bugs in the in the legacy implementation
00:14:01.120 so like i said this worked so well that we wanted to keep using it so we pulled it out into a science library that
00:14:07.199 anybody at github could use if they wanted to test a new code path and they they wanted to be really confident as
00:14:13.040 they were rolling it out so to give you an idea of what this looks like we're going to look at the
00:14:18.560 pullable_by method on the repository class so to do science you
00:14:24.079 take your old method and you make a little science block and you give it a name we tend to namespace
00:14:32.800 the names of our experiments just so we could group them together because we ended up doing a lot of experiments
00:14:38.079 so then the first thing you do is you say take the original code that was in that
00:14:43.199 method so the old pullable_by method and we tended to pull it out and put it into a new method that we just named with like
00:14:49.279 underscore legacy at the end and then there's the use
00:14:54.720 use block basically you say whatever is in the use block you want the method
00:15:00.720 to continue returning that that is the old code and then you say
00:15:06.079 but try this other thing as well which whatever the new code path that you want to refactor
00:15:12.320 you know we tend to pull it into a new method we call this abilities so
00:15:17.440 we named it with abs at the end but we said okay but also at the same time try this new method compare the results
00:15:24.720 and tell us what the result of this experiment is also for um usefulness in this we added
00:15:30.880 a context so if things didn't match we wanted to know who was the user who is the repository
00:15:36.560 that we're trying to to test between each other
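[Editor's sketch] The science block described above might look something like the following. This is a minimal stand-in written for illustration, not the actual library: `MiniExperiment`, the `science` helper, the `_legacy` and `_abilities` method names, and the sample data are all assumptions. The key property it demonstrates is that the caller always gets the result of the `use` block, while the `try` block runs alongside it and is only compared.

```ruby
# Minimal stand-in for a science experiment helper (NOT the real library API).
class MiniExperiment
  Result = Struct.new(:name, :context, :control, :candidate) do
    def mismatch?
      control != candidate
    end
  end

  # Collected results, kept in memory for this sketch.
  def self.results
    @results ||= []
  end

  def initialize(name)
    @name = name
    @context = {}
  end

  # Extra data recorded with the result (who was the user, which repository).
  def context(data)
    @context.merge!(data)
  end

  def use(&block)   # the old code; its value is what the method keeps returning
    @use = block
  end

  def try(&block)   # the new code path being tested
    @try = block
  end

  def run
    control = @use.call
    candidate = begin
      @try.call
    rescue StandardError => error
      error # a failing candidate must never break the caller
    end
    Result.new(@name, @context, control, candidate)
  end
end

def science(name)
  experiment = MiniExperiment.new(name)
  yield experiment
  result = experiment.run
  MiniExperiment.results << result
  result.control
end

class Repository
  def initialize(id, legacy_user_ids, abilities_user_ids)
    @id = id
    @legacy_user_ids = legacy_user_ids
    @abilities_user_ids = abilities_user_ids
  end

  def pullable_by?(user_id)
    science "repository.pullable_by" do |e|
      e.context(user: user_id, repository: @id)
      e.use { pullable_by_legacy?(user_id) }
      e.try { pullable_by_abilities?(user_id) }
    end
  end

  private

  def pullable_by_legacy?(user_id)
    @legacy_user_ids.include?(user_id)
  end

  def pullable_by_abilities?(user_id)
    @abilities_user_ids.include?(user_id)
  end
end
```

In this sketch a candidate that raises is recorded as a mismatch instead of propagating, which is one plausible way to keep the experimental path from affecting users.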
00:15:41.680 so then in a case where things don't match you want to publish the results basically
00:15:48.079 and so our published method just uses active support notifications to instrument these events
00:15:53.360 which you can then subscribe to and it looks kind of like this subscribing to a science event
00:15:59.600 you can subscribe to it and publish the interesting parts to like statsd
00:16:05.360 graphite something like that so we keep track of how many total times an experiment has been run and then do
00:16:12.240 some timings on each part of the experiment so the use and the try block how long did they take to run just so we can compare
00:16:19.680 that as well as the actual result and then you can also subscribe to the mismatch event so this is when things
00:16:26.560 don't match this event will be thrown and you want to keep track how many times things
00:16:32.880 go wrong how many times do we have a mismatch and then we just simply push the result of the mismatch into redis so that we
00:16:39.440 could look at it later one caveat about this technique is you can only use this on code paths that
00:16:46.000 have no side effects so if you're doing something that's writing you can't you can't use this because especially if
00:16:51.839 they were to change the same thing then it's not going to work
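[Editor's sketch] The publish and subscribe flow described above could be approximated like this. The talk says the real code used ActiveSupport notifications, statsd/graphite, and redis; to stay self-contained, this sketch replaces all three with in-memory stand-ins, so `ScienceBus`, the event names, and the metric key names are assumptions rather than GitHub's code.

```ruby
# In-memory stand-ins for the notification bus, a metrics client, and a redis list.
class ScienceBus
  attr_reader :counters, :timings, :mismatches

  def initialize
    @subscribers = Hash.new { |hash, event| hash[event] = [] }
    @counters    = Hash.new(0)
    @timings     = Hash.new { |hash, key| hash[key] = [] }
    @mismatches  = Hash.new { |hash, key| hash[key] = [] }
    subscribe("science") { |payload| record_run(payload) }
    subscribe("science.mismatch") { |payload| record_mismatch(payload) }
  end

  def subscribe(event, &handler)
    @subscribers[event] << handler
  end

  def publish(event, payload)
    @subscribers[event].each { |handler| handler.call(payload) }
  end

  # Runs both paths, publishes the outcome, and always returns the control value.
  def run(name, context, control:, candidate:)
    payload = { name: name, context: context,
                control: timed(&control), candidate: timed(&candidate) }
    publish("science", payload)
    if payload[:control][:value] != payload[:candidate][:value]
      publish("science.mismatch", payload)
    end
    payload[:control][:value]
  end

  private

  def timed
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    value = yield
    { value: value, duration: Process.clock_gettime(Process::CLOCK_MONOTONIC) - started }
  end

  # Every run: count it and keep timings for both blocks.
  def record_run(payload)
    @counters["#{payload[:name]}.total"] += 1
    @timings["#{payload[:name]}.control"] << payload[:control][:duration]
    @timings["#{payload[:name]}.candidate"] << payload[:candidate][:duration]
  end

  # Mismatches only: count them and keep the full payload for later inspection.
  def record_mismatch(payload)
    @counters["#{payload[:name]}.mismatched"] += 1
    @mismatches[payload[:name]] << payload
  end
end
```

Both blocks here are read-only lambdas, in line with the caveat above that the technique only suits code paths without side effects.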
00:16:57.040 so using that we were able to do a little bit of spike and figure out what we wanted to do with the real
00:17:03.199 system for permissions so in getting that using that science we were able to get the refactoring into
00:17:09.520 like the little points that we needed to attach the new system when we did build it so taking those lessons
00:17:16.319 we put together a team and we started building a new system that we called abilities and we called ourselves the
00:17:21.760 abili buddies so uh this is this is what we came up with for our system it's it's super
00:17:28.640 simple i think and i really like it so you have actors and subjects and you ask an actor whether they can
00:17:35.760 perform an action on a subject so can this user read this repository
00:17:41.840 then a subject would grant a permission to an actor so you
00:17:47.280 say if i'm a repository i want to grant this user a right permission and then to remove a permission you just
00:17:53.360 revoke any permission granted to the actor the only additional constraint we had on this was
00:17:59.440 that we wanted a cascading functionality for things that were both subjects and actors so
00:18:05.600 if i have an ability that was granted to a team on a repository and then i granted a user
00:18:12.880 an ability on the team we want an implicit ability here between the user and repository to cascade up
00:18:19.919 basically so the user has access to whatever the team has given access to
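[Editor's sketch] The actor/subject interface described above can be illustrated with a small in-memory store. Everything here is hypothetical: `AbilityStore`, its method signatures, and the symbols used as actors and subjects are assumptions, and the real system was built on mysql. The sketch compares actions by exact match and cascades only one level (user to team to repository), which is enough to show the implicit ability the talk describes.

```ruby
# Hypothetical in-memory sketch of grant / revoke / can? with one-level cascading.
class AbilityStore
  def initialize
    @grants = {} # subject => { actor => action }
  end

  # A subject grants an action to an actor.
  def grant(subject:, actor:, action:)
    (@grants[subject] ||= {})[actor] = action
  end

  # Remove whatever the subject granted directly to the actor.
  def revoke(subject:, actor:)
    @grants.fetch(subject, {}).delete(actor)
  end

  # Can this actor perform this action on this subject?
  def can?(actor, action, subject)
    granted = @grants.fetch(subject, {})
    return true if granted[actor] == action

    # Cascade: an actor holding any ability on an intermediate (such as a team)
    # implicitly gets what the subject granted to that intermediate.
    granted.any? do |intermediate, granted_action|
      granted_action == action && @grants.fetch(intermediate, {}).key?(actor)
    end
  end
end

store = AbilityStore.new
store.grant(subject: :repository, actor: :team, action: :read)
store.grant(subject: :team, actor: :alice, action: :read)
store.can?(:alice, :read, :repository) # implicit ability via the team
```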
00:18:25.120 but that was about it that was that was the interface and once we figured that out it was it was time to start
00:18:30.480 implementing it so we wrote the the code that did the core of abilities in just a few months
00:18:36.320 it was it was pretty quick so once it was written though we needed to generate all the data to
00:18:42.880 actually have the permissions in the new system so we wrote some migrators that would
00:18:48.640 iterate through all of the permissions in the legacy system and try to generate new abilities
00:18:54.400 entries for each one so the generated data if you're if you're generating it it's only a
00:19:00.160 snapshot in time so then the other thing we had to do in order to be able to compare these was to any time that the legacy system
00:19:06.960 changed we also needed to change the abilities data so any new permissions needed to be written to abilities at the
00:19:12.559 same time and any changes needed to be updated as well but once we did that we were able to
00:19:19.840 start doing some science so we were able to add science to
00:19:25.440 all of the read points that touch permissions and there are a lot of those all throughout the code base
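[Editor's sketch] The two data steps described above, a one-off migrator that backfills abilities from legacy rows and dual writes that keep both stores in step afterwards, might be sketched like this. The row shapes, method names, and the `:read` action are assumptions chosen for illustration; plain arrays stand in for the database tables.

```ruby
# Hypothetical translation of one legacy collaborator row into an ability row.
def to_ability(row)
  { actor: [:user, row[:user]], subject: [:repository, row[:repo]], action: :read }
end

# Backfill: walk every legacy row and create the matching ability.
# Skipping existing rows keeps it safe to re-run after a bug fix.
def migrate!(legacy, abilities)
  legacy.each do |row|
    ability = to_ability(row)
    abilities << ability unless abilities.include?(ability)
  end
end

# Dual write: any change to the legacy data is mirrored into abilities,
# because the backfill alone is only a snapshot in time.
def add_collaborator!(legacy, abilities, user:, repo:)
  row = { user: user, repo: repo }
  legacy << row
  abilities << to_ability(row)
end

def remove_collaborator!(legacy, abilities, user:, repo:)
  row = { user: user, repo: repo }
  legacy.delete(row)
  abilities.delete(to_ability(row))
end

legacy    = [{ user: 1, repo: 100 }]
abilities = []
migrate!(legacy, abilities)
add_collaborator!(legacy, abilities, user: 2, repo: 200)
```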
00:19:31.919 and because there were so many different places that touched permission we kind of had to split this up and parallelize
00:19:37.200 it so we had kind of one half of the team doing the refactorings to to find all the points where we needed to add
00:19:43.679 the science and then the other half was going through and using the things that already had science
00:19:49.200 analyzing the results of the experiment and seeing you know what was happening were things matching if they weren't matching
00:19:56.720 why and how can we fix that so let's dive into what the what the data
00:20:02.400 looks like that science collects and and how we can use that to determine correctness
00:20:08.320 so this is a dashboard that we built to kind of summarize and visualize the data
00:20:14.240 from from graphite that we that we instrumented before the graphs on the left give you kind of
00:20:20.240 a day in the life of each experiment so what does it look like in the past 24 hours is it mostly green meaning no
00:20:27.039 mismatches is there a little bit of yellow in there meaning a few mismatches and the ones that are really red are just you
00:20:32.640 know everything's kind of going wrong with those and the right side gives a little bit of statistics on how often something's
00:20:38.720 running you can see how many times per minute it's run what percentage of the time are we running this experiment
00:20:44.720 for things that are not very performant you don't want to be running this 100% of the time maybe you only want to run 1% or 10% just to start
00:20:51.360 collecting some data until you get less mismatches so if you click into any of those
00:20:58.880 summaries you can see a more detailed uh view of each experiment and inside here are the graphite graphs
00:21:05.600 of what's happening so those totals that we saw earlier and how many wrong is in there
00:21:12.159 as well as on the bottom there's the performance data so you can see how is the new code that i'm testing out
00:21:18.640 performing versus the old code so you can see in this case um you know we're running a lot of
00:21:24.799 experiments like 20,000 per minute and there aren't very many mismatches but they're not zero so
00:21:31.520 if you see this graph in the middle that's that has some mismatches then you then you want to go and look at those mismatches and see what is the
00:21:38.320 difference and what's going wrong so in order to do that
00:21:43.919 like i showed earlier we just throw these mismatches into redis so you go take a look at redis and for the
00:21:50.159 particular key that we've named it see how many mismatches are there so for this pullable_by
00:21:56.240 we have you know 000 mismatches and you can just grab the first result by you know popping it off and saying
00:22:01.600 okay show me the top mismatch and then there's a bunch of data in there um the context that we push on but
00:22:08.480 then you can see the most important part is tells you what the candidate did and
00:22:13.760 what the control did so it gives you the timing information it tells you whether an exception was raised and then it
00:22:19.600 gives you the return value so this for this method the pullable_by method is just a simple true or false value so the
00:22:26.159 candidate which is the experimental refactored code it returned false but if you look at the control here it
00:22:32.960 returned true so this was the difference and we can take a look at the user id because we push that onto
00:22:38.960 the context and repository id and try to investigate why for this particular case did it go wrong
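[Editor's sketch] A recorded mismatch like the one described above might have roughly this shape. The field names and all values (IDs, durations) are made up for illustration; only the kinds of information (context, timing, exception, return value for both control and candidate) come from the talk. An array stands in for the redis list.

```ruby
# Hypothetical mismatch entries; an array stands in for the redis list.
mismatches = [
  {
    context:   { user: 42, repository: 1001 },
    candidate: { duration: 0.004, exception: nil, value: false }, # new, refactored code
    control:   { duration: 0.009, exception: nil, value: true }   # legacy code
  }
]

# One-line summary of what differed and for whom.
def describe_mismatch(entry)
  "user=#{entry[:context][:user]} repository=#{entry[:context][:repository]} " \
    "control=#{entry[:control][:value].inspect} candidate=#{entry[:candidate][:value].inspect}"
end

top = mismatches.shift # take the first recorded mismatch off the list
puts describe_mismatch(top)
```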
00:22:45.280 and the results that we see from that especially at the beginning it was just
00:22:50.640 bugs in abilities we didn't get it totally right the first time we didn't account for all of the cases we didn't
00:22:56.640 realize that the old system did this particular thing so we just needed to fix that in these cases we would generally fix the bug
00:23:03.919 completely blow away the abilities database and rerun the migrators to regenerate all of the data in there
00:23:10.960 um we thought once we fixed those bugs that everything would be good and we'd be ready to ship abilities
00:23:17.520 but then we started running into other issues and that was with the data in the database
00:23:23.039 so if you look here i can show you a
00:23:31.120 sampling of the data quality issues that we saw
00:23:36.400 as we were trying to do this ability stuff so what we found in investigating some of
00:23:43.200 these mismatches is that there were issues in the legacy permissions tables that were actually being masked by
00:23:49.360 conditions in the code so there was bad data in the database we were pulling it out but then something in the code was
00:23:55.600 filtering it out and saying oh forget that you know the database says that this is pullable by this user but
00:24:01.840 under this condition we'll just forget about that and we'll say no it's not and so we decided that if if we wanted
00:24:09.279 to be generating the abilities data from the legacy data we needed
00:24:14.320 quality data it needed to be correct so this is the case where we needed to go back and actually fix the old data and
00:24:20.080 you can see how often this happened we had this is just a sampling of a few data quality issues we had quite a few
00:24:25.440 of these and we ended up adding a lot of process around doing migrations or what we call data
00:24:31.039 transitions to to fix data like this we ended up running them quite a bit we actually ended up writing some
00:24:37.679 throttling code to do these sort of migrations so that when we were deleting millions of rows we weren't hammering
00:24:44.000 our database during this time so then once we
00:24:49.120 fixed the data quality issues we thought we were going to be done except there was one more thing
00:24:55.440 that we found that we didn't expect coming into this there were still some mismatches and
00:25:01.760 it turned out that the reason things were mismatching is that the legacy system handled things inconsistently
00:25:09.039 there is literally no way to duplicate the functionality in the new system because the old system was order
00:25:14.960 dependent or time dependent based on when things happen and when the code changed there were different things in
00:25:20.159 the database and a lot of these centered around problems with forks and problems with forks that had different visibility
00:25:27.360 so public public repos with private forks and things like that and so we actually had to
00:25:32.880 like stop working on abilities take a break here and fix the problem in the code and and come
00:25:39.440 up with a consistent system this actually took a couple months because we had to fix the code we had to
00:25:44.880 fix the data because these were privacy issues we actually and it was really sensitive stuff we actually had to spend
00:25:50.799 time contacting users and emailing them warning them about the changes that we were going to make so that was a
00:25:57.919 that was something that we did not expect um and i i think if we hadn't had something like
00:26:03.919 science to test out this behavior we wouldn't have known it had happened and then we would have rolled
00:26:09.440 out the new system and people would have started sending requests like you know what's happening what's going on things have changed
00:26:15.520 so we were really lucky to have found that and uh definitely thanks to these tools
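The throttled data transitions mentioned above could be sketched roughly as follows. This is a minimal illustration, not GitHub's code: the `ThrottledTransition` class, its `batch_size` and `pause` parameters, and the in-memory `rows` array standing in for a table are all assumptions made for the example.

```ruby
# Hypothetical sketch: ThrottledTransition and its parameters are illustrative
# names, not GitHub's actual transition code.
class ThrottledTransition
  attr_reader :batches_run

  # batch_size: how many rows to hand to the block per batch
  # pause:      seconds to wait after each batch so the database can catch up
  # sleeper:    injectable so the pause can be stubbed out in tests
  def initialize(batch_size: 1_000, pause: 0.5, sleeper: ->(seconds) { sleep(seconds) })
    @batch_size = batch_size
    @pause = pause
    @sleeper = sleeper
    @batches_run = 0
  end

  # ids: primary keys of the rows to remove.
  # The block receives one batch of ids and performs the actual delete.
  def run(ids)
    ids.each_slice(@batch_size) do |batch|
      yield batch
      @batches_run += 1
      @sleeper.call(@pause)
    end
  end
end

# Usage with an in-memory stand-in for a table.
rows = (1..10).to_a
stale = rows.select(&:even?)
pauses = []
transition = ThrottledTransition.new(batch_size: 2, pause: 0.1, sleeper: ->(s) { pauses << s })
transition.run(stale) { |batch| rows -= batch }
```

Deleting in small slices with a pause between them trades total run time for a steadier load on the database, which is the point of throttling when millions of rows are involved.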
00:26:21.840 the next thing we started to find was performance problems i showed you the graph earlier at the
00:26:27.520 bottom that showed the difference in performance between the old code path and the new code path and for these
00:26:32.799 little things we optimized things at the beginning and there wasn't a big difference in
00:26:37.840 performance there the places we actually started to see performance problems were because we had had this dark ship for a very long
00:26:44.559 period of time we started seeing some pathological cases uh people were doing interesting
00:26:50.159 stuff with our api they were like removing huge amounts of permissions and re-adding them and there was one
00:26:55.679 customer who was basically doing this every night and we saw something like
00:27:01.120 this so every day at 5:00 pm a certain customer would drop all of their permissions and re-add them and do this
00:27:07.679 over and over again and we would see the spike in replication lag in our database
00:27:12.960 which is not good for us but because we had this kind of dark ship we
00:27:18.720 were able to see this and that gave us an opportunity to optimize abilities even more
00:27:24.240 so of course in order to fix this we we wanted to use our favorite tools science
00:27:29.760 and in this case we actually did some science within science so we had
00:27:35.039 we already had abilities itself under science but then we wanted to test two different code paths within
00:27:40.960 abilities we wanted a way to do some inference of some things and to not explode things
00:27:46.000 out into the database quite as much so we tested that out
00:27:51.440 and you know we tried it out it worked and when we shipped it and we stopped having
00:27:56.880 these problems so that was a really great way to find that out
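The "science within science" idea can be sketched as one experiment nested inside the candidate of another. The scientist gem exposes `use` (control) and `try` (candidate) blocks; the `MiniExperiment` class below is a hypothetical stand-in that mimics that shape so the example runs without the gem, and the user name, repository names, and three lookup lambdas are invented for illustration.

```ruby
# MiniExperiment is a hypothetical stand-in for illustration; it is not the
# scientist gem's implementation.
class MiniExperiment
  attr_reader :name, :mismatches

  def initialize(name)
    @name = name
    @mismatches = []
  end

  # Control: the trusted code path whose result is returned to callers.
  def use(&block)
    @control = block
  end

  # Candidate: the new code path, run only for comparison.
  def try(&block)
    @candidate = block
  end

  # Runs both paths, records a mismatch if they differ, and always
  # returns the control's result so callers never see candidate output.
  def run
    control_value = @control.call
    candidate_value = begin
      @candidate.call
    rescue StandardError => e
      e
    end
    @mismatches << [control_value, candidate_value] unless control_value == candidate_value
    control_value
  end
end

# Invented code paths. The inferring path deliberately disagrees so the
# inner experiment has something to report.
legacy_lookup    = ->(_user) { ["repo-a", "repo-b"] } # old permissions system
abilities_stored = ->(_user) { ["repo-a", "repo-b"] } # abilities, rows written out
abilities_infer  = ->(_user) { ["repo-a"] }           # abilities, inferring grants

outer = MiniExperiment.new("repo-access")
inner = MiniExperiment.new("abilities-inference")

outer.use { legacy_lookup.call("alice") }
outer.try do
  # The candidate of the outer experiment is itself an experiment.
  inner.use { abilities_stored.call("alice") }
  inner.try { abilities_infer.call("alice") }
  inner.run
end

result = outer.run
```

Because the inner experiment returns its own control, the outer comparison stays clean while the inner one records the disagreement, so the two questions ("does abilities match legacy?" and "does inference match stored abilities?") are answered separately.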
00:28:02.240 so uh let's take a step back and look at you know what at this point in the story what have we really done so far well
00:28:08.960 we've done a rewrite and a refactor we've done some science we've done some data quality repair
00:28:15.360 we've done some fixes for performance all right what's our progress
00:28:20.480 none of these things are using abilities we haven't shipped anything yet but that's okay once we got to this
00:28:26.880 point we were able to pretty quickly start shipping so once we were able to fix the last bit of
00:28:33.200 data quality issues we could start rolling this out so we rolled it out for organizations first
00:28:39.200 and then for teams so check check we've got we've got two things we've got a little bit of progress here
00:28:45.440 we had to do a lot of work up front to get that but once we got to this point we were able to to do uh
00:28:51.360 to get things out and using abilities much more quickly at this point the only thing left is repositories but repositories are the
00:28:59.279 largest and riskiest piece of this it's the most used part it's the most complicated
00:29:04.720 and we've been slowly chipping away at the bugs and the data quality issues surrounding it and we're almost to the
00:29:10.399 finish line in fact we get to a point where within a week we've only seen one kind
00:29:16.799 of mismatch around a particular data quality issue and so we said all right this is going
00:29:21.840 to be the last thing we have to fix and then we can ship it so i code up a pull request
00:29:27.840 to to fix this the problem was that when users were being deleted they weren't being removed from the
00:29:33.520 teams so the old record was just sitting around so i said all right let's change the code so that
00:29:39.600 these are always removed from teams upon deletion and then we'll clean up the old data with some of our transitions
00:29:46.000 so i wrote up this transition i meant to clean up all the deleted users from
00:29:51.440 teams instead i wrote a bad query and i accidentally removed every single
00:29:56.960 repository from every team on github oops
00:30:03.840 yeah you can see the outage there but we were we were right at the end uh
00:30:09.760 unfortunately science couldn't save me from my own mistakes but uh we were just about ready to ship
00:30:15.919 things were pretty much green so and also we have database backups it's not not everything is lost
00:30:23.039 but we couldn't instantly restore the backups so what we decided to do was to turn abilities on at this point to
00:30:29.360 restore access to people to use the new system a little sooner than i had expected to use it but it was an emergency test run
00:30:37.120 and we turned it on we were able to restore access probably an hour or two quicker than we would have before
00:30:42.799 restoring the backups once we got the backups restored i reverted that and turned it back off
00:30:49.440 given my previous mistakes i wasn't quite confident enough to really push that into production yet
00:30:57.200 but this is where we were everything was green it was it was about time
00:31:04.000 to really ship this but i wasn't totally confident i wanted to
00:31:09.039 to make sure one more time that everything was okay that nothing was going to go wrong when we when we rolled
00:31:14.320 this out so i came up with the idea to basically run through every user on github and
00:31:20.320 have a transition that just called the associated repositories so
00:31:26.000 said for each user give me a list of all the repositories they have access to and this is a way to basically exercise
00:31:32.320 every single permission in the database and i figured if we do this we should be able to tease out any of the last
00:31:38.399 mismatches that we'll find so i ran this and there are actually no mismatches at
00:31:44.159 all well you can see there are three mismatches there but we had a there was a particular issue with timing
00:31:49.600 on one of our jobs when you removed a member from a team it was a common thing that we had seen
00:31:54.640 throughout the time but there were there were no mismatches so it was ready to go it was time to
00:32:00.799 to give it to the world and say you can have abilities uh so what we did actually instead of
00:32:06.880 just removing all the science code is that we flipped the use block and the try block to
00:32:12.960 begin with so we could continue having the science around it just in case any mismatches came up we could still look
00:32:18.799 at the performance so you can see where those graphs kind of flip the candidate and control
00:32:25.360 changing positions basically and we left that on for a few days and then once we were totally confident
00:32:31.279 that there were no more mismatches nothing was going wrong we removed the science completely so at this point
00:32:37.600 everything's shipped we've got abilities backing everything and this was
00:32:43.120 almost a year ago so this has been in production for a year we've had no problems with it
00:32:49.120 we've even had people starting to build new things on top of abilities because we made something that was general and flexible
00:32:55.360 so that was also successful so uh we have the code it's open source on github it's called scientist there's
00:33:02.640 a rubygem you can use it as well called scientist so i encourage you to go out and use
00:33:08.720 this for your refactorings rewritings or anything where you really want to gain confidence
00:33:14.159 thank you
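The final rollout step described in the talk, flipping the use and try blocks so that abilities serves users while the legacy path keeps running for comparison, might look like this in outline. `run_experiment` is a hypothetical helper written for this sketch, not the scientist gem's API, and the two lambdas return deliberately different symbols only so it is visible which path's result is served.

```ruby
# run_experiment is a hypothetical helper for illustration only.
# It runs both paths, logs any mismatch, and returns the control's value.
def run_experiment(control:, candidate:, mismatches:)
  control_value = control.call
  candidate_value = candidate.call
  mismatches << [control_value, candidate_value] unless control_value == candidate_value
  control_value
end

legacy    = -> { :legacy_result }
abilities = -> { :abilities_result }
mismatches = []

# During the dark ship: legacy is the control, so users get the legacy answer.
before_flip = run_experiment(control: legacy, candidate: abilities, mismatches: mismatches)

# After the flip: abilities is the control and serves users, while the
# legacy path keeps running as the candidate for comparison only.
after_flip = run_experiment(control: abilities, candidate: legacy, mismatches: mismatches)
```

Keeping the comparison in place after the flip means mismatches and timing are still observable for a few days, and removing the experiment afterwards is a small, separate change.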
00:33:24.960 hi i'm back um do we have any questions for jesse um and while you're thinking of a
00:33:30.720 question i just wanted to point out that um a little known fact about her uh she
00:33:36.159 has two cats and they're both named after disney characters she's a huge fan of the lion king apparently because one of them is named nala and the other one
00:33:42.559 eva after the girl in wall-e okay we have a question up on the left
00:33:49.760 you mentioned that with scientist one of the caveats was that you can't really test
00:33:55.519 things that write data it's read only right how does science or does it handle
00:34:02.720 model changes if you're if you're using say the same tables the same model names but you're
00:34:08.879 slightly adjusting the columns in those models or the fields in those models can science handle that and then secondly
00:34:16.560 the problem that's coming to mind for me is we're attempting to do uh migration from rails three to rails four right now
00:34:22.480 and this would be incredibly useful for me to say as i'm moving these things over let me
00:34:27.839 test it as i go along to see the cases that i've not yet covered that i still need to work on is science
00:34:34.399 applicable in those situations so for the first question with the model
00:34:39.599 changes i think if you were
00:34:44.720 just accessing the different columns you could probably you could probably use it to test different columns
00:34:50.800 otherwise the way we did it was two completely separate models just because if if either one are modifying things
00:34:57.599 you don't want the overlap there you really can't test the separateness of that so i'm not sure
00:35:04.000 i might need a little bit more detail to understand the case but you may not be able to use it for that
00:35:10.560 for the the second part of the question doing rails upgrades i've never considered
00:35:16.480 using it for that do you mean um i mean what parts of the upgrade would you want to to test between well it sort
00:35:22.320 of sounds like um so we suck at testing
00:35:27.520 so uh this sort of seems like a kind of a decent way for me to say okay i'm i'm
00:35:32.640 now in the process of migrating let me turn on science to i know everything's going to go wrong with my rails 4
00:35:40.000 implementation at the moment because it doesn't exist let me start seeing what these results should be and as i build
00:35:45.920 up my rails 4 implementation let me ensure that they're matching what my rails 3 implementation does as opposed
00:35:53.200 to just so something at a much larger scale and not as granular which kind of
00:35:58.400 seems like the opposite of what science was intended to do but maybe it fits as well yeah i i guess you'd have to hook
00:36:05.040 it in somewhere way far down which right now it's it kind of goes inside of your rails
00:36:11.359 application i don't the way the code works i'm not sure how you would basically have to load up
00:36:17.520 two different rails applications and have it have it outside of both of those and running both which it definitely wasn't designed to
00:36:24.079 do at the moment you might you could test out something like that thank you
00:36:31.119 are there any other questions yes great
00:36:41.119 so by the time you ship abilities i assume that you have written a lot a lot of experiments
00:36:48.000 and when the old system goes away i imagine that there will
00:36:53.599 still be some uh potentially some usage of um all the experiments that have been written as
00:37:00.160 some sort of acceptance test do you see something like this or uh or did the uh already written experiments um
00:37:07.599 go away after shipping so the experiments do go away after shipping
00:37:13.680 we try to if there are cases where there aren't enough tests around it we can use science to compare the results
00:37:19.520 and add more tests in our test suite but it's not something that you want to
00:37:24.640 continue running in production forever because you're always basically doing double the work every time you're calling something
00:37:30.880 so we do it for as much as we need to to gather the data and have confidence but then the experiments tend to go away as
00:37:36.640 soon as you're finished that's okay