Ruby Video | Building a ChatOps framework

Building a ChatOps framework

Red Dot Ruby Conference 2016

Kir Shatrov • June 24, 2016 • Singapore • Talk

In this presentation at the Red Dot Ruby Conference 2016, Kir Shatrov describes building a ChatOps framework in the context of Shopify's extensive ChatOps deployment. ChatOps, a term that originated at GitHub, lets teams move technical and business operations into chat platforms for better communication and efficiency.

The presentation underscores the significant role that ChatOps can play in modern software development practices, particularly within large teams, by fostering better communication and automating routine processes.

Speaker: Kir Shatrov, Production Engineer, Shopify

At Shopify, we run a massive ChatOps deployment that ties our internal tools together. We’re developing a platform that makes the useful scripts written by developers around the company discoverable. The platform makes it simple for any employee to automate a workflow by writing a script. I will talk about the history of ChatOps and its culture at Shopify, about the reasons behind creating our own chat framework, building the DSL and grammar rules parser, scaling ChatOps, and providing a better chat experience than other frameworks have.

Speaker's Bio
Kir Shatrov is a Production Engineer at Shopify, a current maintainer of Capistrano and a Rails contributor. He coaches RailsGirls and hosts the RubyNoName Podcast.

Event Page: http://www.reddotrubyconf.com

Produced by Engineers.SG

00:00:12.799 yeah

00:00:15.150 So my name is Kir Shatrov and my talk

00:00:17.609 today is called Building a ChatOps

00:00:19.439 framework. A bit about myself: my name

00:00:24.090 is Kir, I work at the developer

00:00:26.880 acceleration team at Shopify. I'll talk

00:00:29.939 more about this team in my

00:00:33.870 talk. I live in Canada,

00:00:36.570 it's where Shopify is based, and we may

00:00:40.500 probably have worked together on some

00:00:42.270 open source projects like Rails,

00:00:44.700 Capistrano, RubyBench, and that's me

00:00:48.090 with that cat. So let's start with ChatOps.

00:00:52.170 Please raise your hand if you heard

00:00:55.320 about that. Cool. So with ChatOps you can

00:01:01.350 move your technical and business

00:01:03.510 operations into chat, into conversation

00:01:08.010 with your team. And this term was first

00:01:11.760 introduced by GitHub, and they first

00:01:15.780 started to talk about that at

00:01:17.159 conferences, they first made a ChatOps

00:01:20.880 framework, and yeah, it's also connected to

00:01:27.600 a term, conversation-driven development.

00:01:30.259 As you probably heard, there is

00:01:32.610 test-driven development, behavior-driven

00:01:34.500 development and many other kinds of

00:01:36.390 driven developments, and with ChatOps

00:01:40.380 and conversation-driven development you

00:01:43.229 can bring all of that into a chat with

00:01:45.630 your team.

00:01:48.020 And a bit about Shopify: we have quite a

00:01:51.890 lot of developers, more than 300. Yeah, so

00:01:57.110 if you don't know about Shopify, it's an

00:01:59.360 e-commerce platform for small and medium

00:02:02.540 business, and when you have so many

00:02:06.590 developers you need to build tools for

00:02:10.520 those developers so the developers could

00:02:13.010 be productive. And my team where I work

00:02:16.520 is called developer acceleration, and we

00:02:19.220 build internal tools for our

00:02:21.920 developers to make their productivity

00:02:26.630 better, and ChatOps and all that kind of

00:02:32.630 automation is one of the things that

00:02:35.390 the developer acceleration team is working

00:02:38.239 on.

00:02:43.650 So for you to have a better idea how all

00:02:48.670 of that looks, let's start with an

00:02:51.640 example. So at Shopify every developer is

00:02:56.530 responsible for shipping his or her own

00:03:00.040 features. That means we don't have any

00:03:03.910 kind of release engineers who push

00:03:06.930 commits of other people, so if you made

00:03:10.959 a feature, you're responsible to deploy

00:03:13.420 this feature, to see that it works, and

00:03:17.349 if it doesn't work, to roll it

00:03:19.510 back or do something about that. So

00:03:23.079 imagine you made a pull request,

00:03:25.140 you're about to merge it, you merge it if

00:03:29.049 everything is OK with the CI, and in a

00:03:32.799 few minutes you get a message from a

00:03:36.819 chat bot that your feature, your

00:03:41.500 commit in the master branch, is ready to

00:03:45.010 be shipped, and you tell the bot "OK, let's

00:03:49.060 ship it", and in the group channel

00:03:53.380 in Slack (we use Slack) everyone will see

00:03:57.669 that you are deploying something, what

00:04:00.310 commits you deploy, and also the result

00:04:03.639 of this deployment. So it

00:04:06.940 usually succeeds, but it can also fail

00:04:09.280 like on this slide.

00:04:12.800 And this is how the deploy works. So right

00:04:16.280 after the deploy, or after you committed

00:04:20.780 something, sometimes it happens that we

00:04:24.050 have an incident, for instance if sign

00:04:28.370 up is down, for example. Someone comes to

00:04:32.180 the same chat and starts an

00:04:36.050 incident. An incident is a special

00:04:39.170 procedure to manage some kind of bad

00:04:45.740 thing that happens in production, and

00:04:49.580 it includes actions like updating the status

00:04:52.880 web page and investigating what's wrong.

00:04:56.920 We also have a chat command for all of

00:05:00.740 that.

00:05:01.880 Another example is monitoring the

00:05:06.430 most heavy SQL queries or the most

00:05:10.790 heavy customers who bring a lot of

00:05:15.950 load on our service. And another example

00:05:19.670 of automation is creating new

00:05:22.700 repositories. So if you work in a small

00:05:26.540 company and small team, you probably have a

00:05:30.340 CTO or someone who is admin in your

00:05:34.130 GitHub organization who can create a new

00:05:37.190 repo for you, but if you have hundreds of

00:05:40.220 people there can be no

00:05:43.130 special person who has the

00:05:47.450 ability to create a new repo for

00:05:49.280 someone, and another aspect is that as a

00:05:53.600 developer you don't even know who to ask

00:05:55.640 to create a repo for you. Now, we have

00:05:58.370 special events called Hack Days, where we

00:06:02.750 have a lot of internal hackathons

00:06:04.700 at Shopify,

00:06:06.220 and on these days we create a hundred new

00:06:09.220 repositories during a couple of days, so

00:06:12.190 this is an action that should be

00:06:14.350 automated as well. And speaking about

00:06:19.300 ChatOps, it's also about

00:06:23.020 the interface. If we would

00:06:28.030 take another path, we'd probably

00:06:32.140 create a web interface in Bootstrap or

00:06:37.260 something else to give developers

00:06:41.850 all actions, to give developers the

00:06:45.460 ability to trigger all those actions and

00:06:48.730 scripts that are automated, but with ChatOps

00:06:52.690 it's just another interface, which is chat,

00:06:55.320 which has a lot of advantages, for

00:06:58.330 example your team will see what's

00:07:02.860 happening and what actions you are

00:07:05.320 taking to do something.

00:07:11.580 Now we come to the next part of my talk,

00:07:15.419 which is about frameworks, about ChatOps

00:07:18.819 frameworks that exist and about our own

00:07:21.819 kind of framework that we wrote and the

00:07:25.810 reasons why we wrote it.

00:07:28.139 The first framework is called Hubot. It's

00:07:31.840 the framework invented by GitHub that I

00:07:34.479 mentioned. Hubot is written in

00:07:38.650 CoffeeScript, which means it's in

00:07:41.620 JavaScript, in Node.js, and as Ruby

00:07:45.280 developers, probably some of

00:07:48.940 you don't like JavaScript, and there are

00:07:53.110 many Ruby developers who don't like

00:07:54.759 JavaScript, but in case of a ChatOps

00:07:57.430 framework JavaScript may be a good thing

00:08:00.699 because it brings a lot of asynchronous

00:08:06.130 support to your code, which is important

00:08:09.340 in case of a ChatOps framework because

00:08:11.500 commands have to be asynchronous, and one

00:08:14.710 heavy command shouldn't block commands

00:08:17.229 from other people. Another framework

00:08:23.349 is called Lita. It's written in Ruby and

00:08:26.289 it's very well extendable, it's a few years

00:08:31.240 old, a very good framework, and it's fair

00:08:34.390 to mention that both of these frameworks

00:08:38.320 have different adapters to every chat

00:08:42.310 provider. We use Slack, so Slack is

00:08:46.870 the only adapter we use, but if you

00:08:49.300 use some very rare chat solution you can

00:08:56.190 find an existing adapter or write

00:08:58.800 your own adapter. So let's see how

00:09:03.890 chat scripts and how the DSL look. So

00:09:09.720 this is the Lita DSL. You just define a

00:09:14.120 small Ruby class which has a macro called

00:09:19.740 route. In this macro you describe a

00:09:22.860 regular expression with the command that

00:09:26.130 you would like to trigger,

00:09:27.930 so with this handler, if I go to Slack

00:09:31.200 and write "echo something", the bot

00:09:34.140 will catch this phrase and reply with

00:09:38.730 the second word that comes after echo.

00:09:42.240 And the Hubot syntax is

00:09:48.600 very similar to Lita. We also define

00:09:52.190 the regular expression that the bot

00:09:55.470 should wait for and send a reply.
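
The slide code is not part of the transcript, so the following is only a minimal plain-Ruby sketch of the regex-routed handler style described here. It is not Lita's or Hubot's actual code: the `RegexHandler` base class, its `route` macro, and `dispatch` are hypothetical names chosen to mimic the idea of registering a regular expression together with the method it triggers.

```ruby
# Hypothetical stand-in for a regex-routed chat handler (not Lita's real API).
class RegexHandler
  # Each subclass keeps its own list of [pattern, method_name] routes.
  def self.routes
    @routes ||= []
  end

  # Class-level macro: register a regular expression and the method it triggers.
  def self.route(pattern, method_name)
    routes << [pattern, method_name]
  end

  # Try every route in order; return the reply of the first one that matches.
  def self.dispatch(message)
    routes.each do |pattern, method_name|
      match = pattern.match(message)
      return new.public_send(method_name, match) if match
    end
    nil # no route matched: the bot stays silent
  end
end

class EchoHandler < RegexHandler
  route(/\Aecho\s+(\S+)/, :echo)

  # Reply with the word captured after "echo".
  def echo(match)
    match[1]
  end
end

puts EchoHandler.dispatch("echo hello")          # prints "hello"
puts EchoHandler.dispatch("ecoh hello").inspect  # prints "nil"
```

Note that the mistyped `ecoh` simply matches nothing, so the user gets no feedback at all.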

00:10:05.640 If we take a closer look, we will see

00:10:09.640 that both of these DSLs are based on

00:10:14.170 regular expressions, and you should write a

00:10:18.910 regular expression to tell the bot what

00:10:22.540 commands to detect. And why regular

00:10:26.260 expressions? It is the easiest way to

00:10:30.930 tell the bot what command to watch, and

00:10:36.480 this approach has a few disadvantages,

00:10:40.570 like it cannot detect typos, it cannot

00:10:49.530 reply with "this command was not found,

00:10:53.320 maybe you meant something else", it

00:10:57.310 also cannot do input validation, like

00:11:01.450 if the command was right but the

00:11:05.920 argument was wrong, then that argument

00:11:09.100 may not be matched by the regular

00:11:11.830 expression and this command won't be

00:11:13.390 found. And having regular expressions in

00:11:17.710 your chat bots means that all developers

00:11:21.820 should be really good at regular

00:11:26.230 expressions, and it's always easy to make

00:11:29.980 a mistake and define a regular expression

00:11:33.040 that will conflict with a different

00:11:37.030 script's regular expression. So we thought

00:11:43.660 that maybe we could do something else

00:11:46.060 without regular expressions, and yeah,

00:11:49.810 here is an example.

00:11:51.850 The first option is to write the command

00:11:56.320 syntax with a regular expression and the

00:11:58.149 second one is to write it with some kind

00:12:00.970 of pattern language. And with echo the

00:12:07.329 difference is not that big, but with a

00:12:09.639 bigger command like "github add user name

00:12:12.790 to team name" the regular expression

00:12:15.820 becomes quite long and it's quite easy

00:12:19.420 to make a mistake there, as I said. So

00:12:23.889 we thought that maybe we can improve

00:12:25.660 that experience of writing chat

00:12:30.420 handlers. And what we wanted to have

00:12:36.730 from that solution: we wanted it to be

00:12:39.970 friendly for both the developer and the

00:12:42.070 user. Being friendly for the developer

00:12:46.540 means that the developer wouldn't need to

00:12:48.190 write a regular expression, and friendly

00:12:50.740 for the user means that we would suggest the

00:12:52.839 right command if the user made a mistake.

00:12:55.709 We also have a lot of

00:13:02.069 infrastructure code written in Ruby at

00:13:04.480 Shopify, so we decided that we want to

00:13:07.510 stick with Ruby, after we tried both

00:13:11.220 Lita and Hubot in production, and we

00:13:15.760 wanted a simpler and more powerful DSL

00:13:19.350 that would provide better argument

00:13:22.209 support. So our solution: we decided to

00:13:27.579 make it on top of Lita

00:13:29.260 with a custom command router and custom

00:13:32.620 DSL, and this is how this DSL looks.

00:13:37.050 First of all, it's very similar to Lita, but

00:13:40.510 instead of defining the regular

00:13:42.310 expression,

00:13:46.540 here you define a special pattern and

00:13:51.860 you also define a help, and right after

00:13:55.520 this pattern is matched it's dispatched

00:13:57.620 to a Ruby method with a keyword argument,

00:14:01.480 and in this case it's a very simple

00:14:03.980 handler, it will reply with the same

00:14:06.680 command. So let's take a look at a bit

00:14:10.280 more complex handler. It has two

00:14:12.980 arguments, one of them is, yeah, this

00:14:16.700 handler is for displaying some chart

00:14:19.520 from New Relic. The first variable is

00:14:23.120 application name and the second is

00:14:25.250 format. Format is an enum field, it can be

00:14:30.410 a daily or hourly value, and

00:14:34.940 it should be converted into a call of a

00:14:39.110 Ruby method, which is kind of simple.

00:14:42.460 So this pattern would match all the

00:14:47.600 following user inputs. It can be my app, so

00:14:52.220 hourly is the default value for the

00:14:54.950 format variable, you can override it here

00:14:59.570 and here, and you can also define it in

00:15:02.600 the explicit way, which is useful when

00:15:06.320 you have more arguments and maybe you

00:15:11.150 don't remember the order of them, so we

00:15:15.200 also wanted to have the explicit format.

00:15:19.790 And to be able to work without

00:15:24.380 regular expressions, we tokenize

00:15:27.680 this pattern with different kinds of

00:15:31.340 tokens. The first two are static tokens, so

00:15:35.770 the user input should start with "new

00:15:38.780 relic" and "chart", and then there is a

00:15:41.890 simple variable and then there is a

00:15:43.970 variable with default, it looks like this.

00:15:48.800 So this command consists of four tokens.
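
The exact pattern syntax from the slide is not in the transcript. As an illustration only, the sketch below assumes a made-up notation where `<name>` is a required variable and `<name=value>` is a variable with a default, and splits such a pattern into the four tokens mentioned here. The `Token` struct and `tokenize_pattern` are hypothetical names.

```ruby
# A token is either a literal word (:static) or a named :variable,
# optionally carrying a default value.
Token = Struct.new(:kind, :name, :default)

# Split a pattern such as "newrelic chart <app> <format=hourly>" into tokens.
# The angle-bracket syntax is an assumption made for this sketch.
def tokenize_pattern(pattern)
  pattern.split.map do |word|
    if (m = word.match(/\A<(\w+)(?:=(\w+))?>\z/))
      Token.new(:variable, m[1].to_sym, m[2])
    else
      Token.new(:static, word, nil)
    end
  end
end

tokens = tokenize_pattern("newrelic chart <app> <format=hourly>")
tokens.each { |token| puts token.to_a.inspect }
```

Keeping tokens as plain data makes it possible to generate help text or typo suggestions from the same structure, which a raw regular expression does not allow.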

00:15:53.920 Our next goal is to convert the user

00:15:59.510 input of "new relic chart my app daily"

00:16:02.120 into calling a Ruby method. Actually, yeah,

00:16:07.040 instantiating the New Relic handler and

00:16:09.200 calling that method with those keyword

00:16:14.060 arguments, and

00:16:19.450 this may seem like a

00:16:27.220 difficult task, until we discovered the

00:16:31.090 class in the Ruby standard library which is

00:16:33.820 called StringScanner. Yeah, it's a class

00:16:36.940 in the Ruby standard library. Please raise

00:16:38.980 your hand if you heard about the class.

00:16:41.430 Yes, not too many people. The

00:16:47.230 StringScanner works as a scanner, I'll have an

00:16:50.020 example now. So you instantiate an object

00:16:54.630 with a string, in this case the string

00:16:58.900 is the user input, and there is a method

00:17:04.420 called scan, and you give just a

00:17:10.240 token to that scan command, and yeah, so

00:17:18.240 it scans, and if this user input would

00:17:22.300 start from something else it wouldn't

00:17:24.840 scan the string at all. So if it would

00:17:29.170 start with "github" or some other

00:17:33.330 command it wouldn't scan. Then we

00:17:37.150 have the next token, which is "chart", a

00:17:38.830 static token. It is also scanned, so we

00:17:43.030 can go further. Then we scan for a variable,

00:17:49.140 so that

00:17:51.510 is scanned, and then the next variable,

00:17:55.810 and we get the values for these

00:18:02.230 variables. So it wouldn't be honest, so

00:18:06.640 it's not very honest to say that we

00:18:10.000 completely got rid of regular

00:18:13.000 expressions, but the end developer of

00:18:17.620 a handler doesn't have to write a

00:18:19.570 regular expression, but we use

00:18:22.180 regular expressions under the hood.
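
The scanning steps just described can be sketched with `StringScanner` from the standard library. This is an illustration under assumptions, not the framework's code: the token list is written by hand, the command name `newrelic` and the `match_input` helper are made up for the example, and the explicit `name=value` form mentioned earlier is left out.

```ruby
require "strscan"

# Tokens written out by hand as [kind, name, default]; in a real framework
# they would come from parsing the handler's pattern.
TOKENS = [
  [:static, "newrelic", nil],
  [:static, "chart", nil],
  [:variable, :app, nil],
  [:variable, :format, "hourly"]
].freeze

# Returns a hash of keyword arguments, or nil when the input does not match.
def match_input(input, tokens = TOKENS)
  scanner = StringScanner.new(input.strip)
  args = {}
  tokens.each do |kind, name, default|
    scanner.skip(/\s+/)
    if kind == :static
      # A static token must appear literally, as a whole word.
      return nil unless scanner.scan(/#{Regexp.escape(name)}(?=\s|\z)/)
    else
      # A variable takes the next word, or falls back to its default.
      value = scanner.scan(/\S+/) || default
      return nil if value.nil? # a required variable is missing
      args[name] = value
    end
  end
  scanner.skip(/\s+/)
  scanner.eos? ? args : nil # leftover words mean the input did not match
end

p match_input("newrelic chart myapp daily") # app and format taken from the input
p match_input("newrelic chart myapp")       # format falls back to "hourly"
p match_input("github chart myapp")         # nil: the first static token fails
```

The resulting hash can be passed to a handler method as keyword arguments, for example `handler.chart(**args)`. As the speaker notes, regular expressions are still used internally, only hidden from the handler author.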

00:18:29.050 Yes, so more than that, we have type

00:18:34.070 coercion,

00:18:35.750 so when defining a handler you can

00:18:41.540 declare the type of a variable. For

00:18:44.330 instance, the target. This is the command

00:18:46.970 used to tell the infrastructure that

00:18:50.600 some server will go to downtime, that

00:18:55.820 means that maybe we are going to restart

00:18:57.890 the server or repair it somehow, and

00:19:01.930 there are two arguments. One of them is

00:19:05.960 target, which is a Chef node,

00:19:08.870 it should be a valid Chef node

00:19:11.270 address, and then duration. Duration can

00:19:14.750 be one minute or one second or one hour

00:19:20.740 or just any duration, and we also declare

00:19:24.740 that, so the first one's type is Chef node, the

00:19:27.680 second's type is duration, and when the

00:19:31.040 message is received, so when we call this

00:19:35.540 Ruby method, inside this method you'll

00:19:37.760 have duration as an ActiveSupport duration

00:19:41.600 and target will be a valid Chef node, so we

00:19:45.890 can be sure in this method that both of

00:19:49.010 these arguments are valid. So this

00:19:52.550 command will be valid for the first

00:19:56.420 input, but for the second input it won't

00:19:59.720 be valid and it will return an error, and

00:20:03.280 this method won't be called at all,

00:20:06.710 so when writing code in that method you

00:20:09.740 will be sure that you get the right

00:20:13.880 input.
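
A rough sketch of this validate-then-call idea follows. Several things are assumptions for the sake of a self-contained example: the duration syntax (`10m`, `2h`), the hard-coded node list standing in for a real Chef lookup, and the `call_handler` helper. The talk mentions `ActiveSupport::Duration`; to stay within the standard library this sketch coerces durations to an integer number of seconds instead.

```ruby
# Raised when an argument cannot be converted to its declared type.
class InvalidArgument < StandardError; end

KNOWN_NODES = ["web1.example.com", "db1.example.com"].freeze # hypothetical inventory
UNIT_SECONDS = { "s" => 1, "m" => 60, "h" => 3600 }.freeze

# One coercion per declared type; each returns the converted value or raises.
COERCIONS = {
  chef_node: ->(raw) {
    raise InvalidArgument, "unknown node: #{raw}" unless KNOWN_NODES.include?(raw)
    raw
  },
  duration: ->(raw) {
    m = raw.match(/\A(\d+)([smh])\z/)
    raise InvalidArgument, "invalid duration: #{raw}" unless m
    m[1].to_i * UNIT_SECONDS.fetch(m[2])
  }
}.freeze

# Coerce the raw string arguments according to the declared types, then call
# the handler block. The block only runs if every argument is valid.
def call_handler(raw_args, types)
  coerced = raw_args.to_h do |name, raw|
    [name, COERCIONS.fetch(types.fetch(name)).call(raw)]
  end
  yield(**coerced)
rescue InvalidArgument => e
  "Error: #{e.message}"
end

types = { target: :chef_node, duration: :duration }

ok = call_handler({ target: "web1.example.com", duration: "10m" }, types) do |target:, duration:|
  "#{target} is down for #{duration} seconds"
end

bad = call_handler({ target: "web1.example.com", duration: "soon" }, types) do |target:, duration:|
  raise "never reached for #{target} and #{duration}"
end

puts ok  # prints "web1.example.com is down for 600 seconds"
puts bad # prints "Error: invalid duration: soon"
```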

00:20:21.100 After shipping this DSL to our

00:20:24.610 developers, many developers could

00:20:29.679 write their own chat handlers to

00:20:32.529 automate their workflows, and we

00:20:35.799 got to the number of more than 200 bot

00:20:39.039 scripts, handlers, and

00:20:44.039 more than 600 command invocations on a

00:20:48.580 busy day in Slack, so this became a

00:20:53.580 part of infrastructure that we had to

00:20:56.200 scale. As I mentioned, we based our

00:21:01.059 framework on Lita, so it was just the

00:21:06.460 extension on top of Lita, and that means, so

00:21:12.820 Lita is written in Ruby, that means

00:21:14.649 that it doesn't have any support for an

00:21:19.330 asynchronous workflow, which meant that if

00:21:22.529 you ask a bot for some command that takes

00:21:27.580 a minute,

00:21:32.340 it would stop working on commands from other users, so

00:21:35.620 the bot was blocked for that minute and

00:21:37.570 it couldn't accept commands from other

00:21:40.630 users, which was super bad, especially at

00:21:44.410 scale. So we decided that, as

00:21:49.060 one option, we can make a new thread on

00:21:52.960 every command and do all the operation

00:21:55.330 in that thread to not block receiving

00:21:58.930 new commands from Slack. And Ruby threads

00:22:04.270 are not so good in

00:22:07.750 most kinds of operations, but in our

00:22:10.780 case, when most of the handlers for chat

00:22:14.440 bots were only making HTTP queries or

00:22:17.850 they were invoking some other kinds of

00:22:21.370 systems, they didn't do any

00:22:24.790 calculations on the bot side, so they

00:22:28.240 just requested data from other systems,

00:22:30.790 so in this case Ruby threads were

00:22:34.360 quite efficient, and this approach helped,

00:22:39.090 but we thought that maybe there is

00:22:43.390 some other approach, and we went

00:22:49.470 with the master process and Redis, and

00:22:52.210 when the master process received a

00:22:55.060 command from Slack it pushed that

00:22:57.670 command to Redis, and we had a pool of

00:22:59.620 workers, and we could have multiple

00:23:03.070 machines that work as workers,

00:23:06.030 so we could scale that

00:23:08.580 horizontally, and it works the same as a

00:23:13.930 Sidekiq or Delayed Job worker queue.
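
The queue-plus-workers shape described here can be illustrated inside a single process with Ruby's built-in `Queue` and threads. This is only an analogy: in the setup from the talk the queue lives in Redis so that workers can run on other machines, whereas the hypothetical `CommandPool` below never leaves one process.

```ruby
# In-process sketch of "master pushes, workers pop". Queue stands in for Redis.
class CommandPool
  def initialize(worker_count)
    @queue = Queue.new
    @results = Queue.new
    @workers = Array.new(worker_count) do
      Thread.new do
        # pop blocks until a command arrives; it returns nil once the queue
        # is closed and drained, which ends the loop.
        while (job = @queue.pop)
          name, handler = job
          @results << [name, handler.call]
        end
      end
    end
  end

  # Called by the receiving side: enqueue the command and return immediately.
  def push(name, &handler)
    @queue << [name, handler]
  end

  # Stop accepting commands, wait for the workers, and collect the results.
  def shutdown
    @queue.close
    @workers.each(&:join)
    Array.new(@results.size) { @results.pop }
  end
end

pool = CommandPool.new(2)
pool.push("slow") { sleep 0.2; "slow done" }
pool.push("fast") { "fast done" }
results = pool.shutdown
p results.map(&:first) # the fast command does not wait behind the slow one
```

Because the handlers mostly wait on I/O, as the speaker points out, threads spend their time blocked rather than competing for the interpreter, which is why even this simple pool keeps the bot responsive.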

00:23:22.970 Having that approach, we could have

00:23:25.070 active and passive instances of the

00:23:29.540 bot server running, and Slack would

00:23:32.390 make a callback to a load balancer with

00:23:35.870 the message for a bot, and then the load

00:23:39.020 balancer could determine to

00:23:43.220 which machine to route the message. And

00:23:47.380 that comes to the availability problem.

00:23:51.550 So if you remember, at the beginning

00:23:55.400 of this year GitHub was down for like

00:23:59.030 three or four hours, that was a

00:24:04.520 pretty big downtime, and one of the

00:24:08.720 reasons for such a long

00:24:11.750 downtime was that GitHub is heavily

00:24:15.770 based on their ChatOps scripts,

00:24:20.140 but ChatOps was down as well because

00:24:23.000 of some network failure, so they couldn't

00:24:25.370 use any of the ChatOps scripts to

00:24:27.650 recover the system because the ChatOps setup

00:24:30.950 was down as well. And we know the problem

00:24:34.160 at Shopify, we also had this problem when

00:24:37.640 our bot was unavailable or Slack was

00:24:40.490 down, and in this case we couldn't do

00:24:43.430 anything, so we decided that we'll build

00:24:47.480 a special offline or rescue mode in our

00:24:51.080 bot, so if you have this

00:24:55.340 bot locally on your laptop, you just

00:24:58.960 launch it with a special bin command

00:25:01.400 and you will have exactly the same

00:25:03.980 interface in your command line as you

00:25:07.010 would have the interface for a bot in chat,

00:25:09.790 but it works even if Slack or something

00:25:14.810 else is down.
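
One way to picture such a rescue mode is to keep command handling independent of the chat adapter, so that a console front end can reuse it. The sketch below is an assumption about the general shape, not Shopify's implementation: the command table, `handle_line`, and `run_console` are all made up for the example.

```ruby
require "stringio"

# A tiny command table shared by every front end; the commands are made up.
COMMANDS = {
  "ping" => ->(_args) { "pong" },
  "echo" => ->(args) { args.join(" ") }
}.freeze

# Front-end independent: turn one line of text into a reply.
def handle_line(line)
  name, *args = line.split
  return nil if name.nil? # blank line
  command = COMMANDS[name]
  command ? command.call(args) : "Unknown command: #{name}"
end

# Offline front end: read commands from any IO (STDIN in a terminal)
# instead of receiving them from chat.
def run_console(input: $stdin, output: $stdout)
  input.each_line do |line|
    reply = handle_line(line)
    output.puts(reply) if reply
  end
end

# Simulate a terminal session with in-memory IO objects.
out = StringIO.new
run_console(input: StringIO.new("ping\necho still works\ndeploy\n"), output: out)
puts out.string
```

In a real executable the last lines would be replaced by a bare `run_console` call, so commands typed into the terminal go through exactly the same `handle_line` path as chat messages.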

00:25:18.100 Summary, this is very important. So

00:25:22.570 probably you learned a bit about ChatOps

00:25:26.320 and how it can automate things, and

00:25:28.710 you thought "OK, cool, I'm going to try

00:25:32.410 that in my team, in my company", but I

00:25:36.600 would like to say that it only

00:25:39.820 makes sense if you have a very big

00:25:42.340 team, because when I worked in smaller

00:25:47.530 teams and smaller companies I would say

00:25:50.920 that we didn't need all of that, just

00:25:53.590 because it wasn't on such a scale

00:25:57.790 that we needed to automate things with

00:26:00.610 ChatOps, and in this case it's really

00:26:03.400 easier to come to your CTO and ask to

00:26:07.060 create a new repo for you instead of

00:26:10.080 bringing more code and more

00:26:12.190 infrastructure to keep the ChatOps

00:26:15.730 running. So if you're interested in

00:26:19.510 working on such big scale systems, you're

00:26:23.140 welcome to check Shopify careers, and I

00:26:25.900 have mentioned a lot of projects

00:26:29.800 in Ruby, a lot of gems and some other

00:26:33.670 things, so you can go to my Twitter, and

00:26:36.600 the last tweet is a gist with all the

00:26:40.180 links that I mentioned today, which

00:26:42.820 you're welcome to check. Thank you.

00:26:50.509 Thank you very much, Kir. Any questions

00:26:53.909 for

00:27:00.050 how many of you use some form of bots on

00:27:04.340 your favorite too

00:27:08.920 For me, my favorite Slack bot is the flip

00:27:12.400 table what so what am i typing it above

00:27:15.340 it with a region guide so if there are

00:27:20.920 no questions back here

00:27:22.420 roundup father yet what it means
