Summarized using AI

Building a ChatOps framework

Kir Shatrov • June 24, 2016 • Singapore • Talk

In this presentation at the Red Dot Ruby Conference 2016, Kir Shatrov describes how Shopify built a ChatOps framework to support its extensive ChatOps deployment. ChatOps, a term that originated at GitHub, lets teams move technical and business operations into chat platforms for improved communication and efficiency. Here are the key points discussed throughout the video:

  • Introduction to ChatOps: Kir introduces the concept of ChatOps and its relevance to team workflows. The importance of moving operations into conversations is emphasized, where developers are responsible for deploying their features directly through chat interfaces.

  • Developer Acceleration Team at Shopify: Shopify's Developer Acceleration team focuses on creating internal tools to enhance developer productivity. Kir elaborates on how ChatOps is part of these efforts, facilitating automation and communication among developers.

  • Examples of Automation: Several examples illustrate how ChatOps is implemented:

    • Developers receive chatbot notifications in Slack when their features are ready to deploy.
    • Incident management can be initiated directly in chat for quick responses to production issues.
    • Automation of repository creation during internal hackathons.
  • ChatOps Frameworks: Kir discusses frameworks available in the ChatOps ecosystem, notably Hubot from GitHub (written in CoffeeScript) and Lita (in Ruby). He explains the advantages and potential downsides of each framework and discusses the need for a Ruby-based solution, emphasizing user-friendliness and simplicity.

  • Custom DSL Development: Building on Lita, Kir presents a custom domain-specific language (DSL) designed to simplify the creation of chat commands. This DSL enables developers to define commands without needing to write complex regular expressions, making it more accessible to a broader audience.

  • Handling Workflow Concerns: Kir addresses how to manage response times and server loads with the use of background processing via Redis, allowing for asynchronous handling of chat commands and scaling across multiple machines to prevent bottlenecks.

  • Resilience Features: To enhance reliability, a rescue mode allows the bot to function offline, providing a command-line interface for users even when the primary ChatOps system is down.

  • Conclusion: Kir emphasizes that while ChatOps can significantly benefit large teams, it may not be necessary for smaller companies. Developers are encouraged to explore larger scale systems, hinting at career opportunities at Shopify.

The presentation underscores the significant role that ChatOps can play in modern software development practices, particularly within large teams, by fostering better communication and automating routine processes.


Speaker: Kir Shatrov, Production Engineer, Shopify

At Shopify, we run a massive ChatOps deployment that ties our internal tools together. We’re developing a platform that makes the useful scripts written by developers around the company discoverable. The platform makes it simple for any employee to automate a workflow by writing a script. I will talk about the history of ChatOps and its culture at Shopify, about the reasons behind creating our own chat framework, building the DSL and grammar rules parser, scaling ChatOps and providing a better chat experience than other frameworks offer.

Speaker's Bio
Kir Shatrov is a Production Engineer at Shopify, a current maintainer of Capistrano and a Rails contributor. He coaches RailsGirls and hosts the RubyNoName Podcast.

Event Page: http://www.reddotrubyconf.com

Produced by Engineers.SG


Red Dot Ruby Conference 2016

00:00:12.799 yeah
00:00:15.150 so my name is Kir Shatrov and my talk
00:00:17.609 today is called building a ChatOps
00:00:19.439 framework a bit about myself so my name
00:00:24.090 is Kir I work on the developer
00:00:26.880 acceleration team at Shopify I'll talk
00:00:29.939 more about this team in my
00:00:33.870 talk I live in Canada
00:00:36.570 it's where Shopify is based and we may
00:00:40.500 probably have worked together on some
00:00:42.270 open source projects like Rails
00:00:44.700 Capistrano RubyBench and that's me
00:00:48.090 with that cat so let's start with ChatOps
00:00:52.170 please raise your hand if you heard
00:00:55.320 about that cool so with ChatOps you can
00:01:01.350 move your technical and business
00:01:03.510 operations into chat into conversation
00:01:08.010 with your team and this term was first
00:01:11.760 introduced by GitHub and they first
00:01:15.780 started to talk about that at
00:01:17.159 conferences they first made a ChatOps
00:01:20.880 framework and yeah it's also connected to
00:01:27.600 the term conversation-driven development
00:01:30.259 as you probably heard there is
00:01:32.610 test-driven development behavior-driven
00:01:34.500 development and many other kinds of
00:01:36.390 driven development and with ChatOps
00:01:40.380 and conversation-driven development you
00:01:43.229 can bring all of that into a chat with
00:01:45.630 your team
00:01:48.020 and a bit about Shopify we have quite a
00:01:51.890 lot of developers more than 300 yeah so
00:01:57.110 if you don't know about Shopify it's an
00:01:59.360 e-commerce platform for small and medium
00:02:02.540 business and when you have so many
00:02:06.590 developers you need to build tools for
00:02:10.520 those developers so the developers could
00:02:13.010 be productive and my team where I work
00:02:16.520 is called developer acceleration and we
00:02:19.220 build internal tools for our
00:02:21.920 developers to make their productivity
00:02:26.630 better and ChatOps and all that kind of
00:02:32.630 automation is one of the things that
00:02:35.390 the developer acceleration team is working
00:02:38.239 on
00:02:43.650 so for you to have a better idea how all
00:02:48.670 of that looks let's start with an
00:02:51.640 example so at Shopify every developer is
00:02:56.530 responsible for shipping his or her own
00:03:00.040 features that means we don't have any
00:03:03.910 kind of release engineers who push
00:03:06.930 commits of other people so if you made
00:03:10.959 a feature you're responsible to deploy
00:03:13.420 this feature to see that it works and
00:03:17.349 if it doesn't work to roll it
00:03:19.510 back or do something about that so
00:03:23.079 imagine you made a pull request
00:03:25.140 you're about to merge it you merge it if
00:03:29.049 everything is ok with the CI and in a
00:03:32.799 few minutes you get a message from a
00:03:36.819 chat bot that your feature your
00:03:41.500 commit in the master branch is ready to
00:03:45.010 be shipped and you tell the bot ok let's
00:03:49.060 ship it and in the group channel
00:03:53.380 in Slack we use Slack everyone will see
00:03:57.669 that you are deploying something what
00:04:00.310 commits you deploy and also the result
00:04:03.639 of this deployment so it
00:04:06.940 usually succeeds but it can also fail
00:04:09.280 like on this slide
00:04:12.800 and this is how deploys work so right
00:04:16.280 after the deploy or after you committed
00:04:20.780 something sometimes it happens that we
00:04:24.050 have an incident for instance if sign
00:04:28.370 up is down for example someone comes to
00:04:32.180 the same chat and starts an
00:04:36.050 incident an incident is a special
00:04:39.170 procedure to manage some kind of bad
00:04:45.740 thing that happens in production and
00:04:49.580 it includes actions like updating the status
00:04:52.880 web page and investigating what's wrong
00:04:56.920 we also have a chat command for all of
00:05:00.740 that
00:05:01.880 another example is monitoring
00:05:06.430 the most heavy SQL queries or the most
00:05:10.790 heavy customers who bring a lot of
00:05:15.950 load on our service and another example
00:05:19.670 of automation is creating new
00:05:22.700 repositories so if you work in a small
00:05:26.540 company and small team you probably have a
00:05:30.340 CTO or someone who is admin in your
00:05:34.130 GitHub organization who can create a new
00:05:37.190 repo for you but if you have hundreds of
00:05:40.220 people there can be no
00:05:43.130 special person who has the
00:05:47.450 ability to create new repos for
00:05:49.280 someone and another aspect is that as a
00:05:53.600 developer you don't even know who to ask
00:05:55.640 to create a repo for you now we have
00:05:58.370 special events called hack days where we
00:06:02.750 have a lot of internal hackathons
00:06:04.700 at Shopify
00:06:06.220 and on these days we create a hundred new
00:06:09.220 repositories during a couple of days so
00:06:12.190 this is an action that should be
00:06:14.350 automated as well and speaking about
00:06:19.300 ChatOps it's also about
00:06:23.020 the interface if we would
00:06:28.030 take another path we'd probably
00:06:32.140 create a web interface in Bootstrap or
00:06:37.260 something else to give developers
00:06:41.850 all actions or to give developers the
00:06:45.460 ability to trigger all those actions and
00:06:48.730 scripts that are automated but with ChatOps
00:06:52.690 it's just another interface which is chat
00:06:55.320 which has a lot of advantages for
00:06:58.330 example your team will see what's
00:07:02.860 happening and what actions you are
00:07:05.320 taking to do something
00:07:11.580 now we come to the next part of my talk
00:07:15.419 which is about frameworks about ChatOps
00:07:18.819 frameworks that exist and about our own
00:07:21.819 kind of framework that we wrote and
00:07:25.810 reasons why we wrote it
00:07:28.139 the first framework is called Hubot it's
00:07:31.840 the framework invented by GitHub that I
00:07:34.479 mentioned Hubot is written in
00:07:38.650 CoffeeScript which means it's in
00:07:41.620 JavaScript in Node.js and as
00:07:45.280 Ruby developers you're probably maybe some of
00:07:48.940 you don't like JavaScript and
00:07:53.110 there are many Ruby developers who don't like
00:07:54.759 JavaScript but in case of a ChatOps
00:07:57.430 framework JavaScript may be a good thing
00:08:00.699 because it brings a lot of asynchronous
00:08:06.130 support to your code which is important
00:08:09.340 in case of a ChatOps framework because
00:08:11.500 commands have to be asynchronous and one
00:08:14.710 heavy command shouldn't block commands
00:08:17.229 from other people another framework
00:08:23.349 is called Lita it's written in Ruby and
00:08:26.289 it's very well extendable it's a few years
00:08:31.240 old a very good framework and it's fair
00:08:34.390 to mention that both of these frameworks
00:08:38.320 have different adapters to every chat
00:08:42.310 provider we use Slack so Slack is
00:08:46.870 the only adapter we use but if you
00:08:49.300 use some very rare chat solution you can
00:08:56.190 find an existing adapter or write
00:08:58.800 your own adapter so let's see how
00:09:03.890 chat scripts and how the DSL looks so
00:09:09.720 this is the Lita DSL you just define a
00:09:14.120 small Ruby class which has a macro called
00:09:19.740 route in this macro you describe a
00:09:22.860 regular expression with the command that
00:09:26.130 you would like to trigger
00:09:27.930 so with this handler if I go to Slack
00:09:31.200 and write echo something the bot will
00:09:34.140 catch this phrase and reply with
00:09:38.730 the second word that comes after echo
00:09:42.240 and the Hubot syntax is
00:09:48.600 very similar to Lita we also define
00:09:52.190 the regular expression that the bot
00:09:55.470 should wait for and send a reply
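The slides with the Lita and Hubot handlers are not part of this transcript. As a rough illustration of the idea described above, here is a minimal sketch of regex-based routing in plain Ruby. It is not Lita's or Hubot's actual API: the `RegexRouter` class and its `route` and `dispatch` methods are hypothetical names chosen for this example.

```ruby
# A tiny regex-based command router (illustrative only, not Lita's real API).
class RegexRouter
  Route = Struct.new(:pattern, :handler)

  def initialize
    @routes = []
  end

  # Register a regular expression and the block to run when it matches.
  def route(pattern, &handler)
    @routes << Route.new(pattern, handler)
  end

  # Returns the reply of the first matching route, or nil if nothing matches.
  def dispatch(message)
    @routes.each do |route|
      match = route.pattern.match(message)
      return route.handler.call(match) if match
    end
    nil
  end
end

router = RegexRouter.new
# Reply with the word that follows "echo".
router.route(/\Aecho\s+(\S+)/) { |match| match[1] }

router.dispatch("echo hello") # the handler replies with the captured word
router.dispatch("ecoh hello") # no route matches, so the bot stays silent
```

Note that a mistyped command simply matches nothing, which is one of the drawbacks of this approach that the talk turns to next.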
00:10:05.640 if we take a closer look we will see
00:10:09.640 that both of these DSLs are based on
00:10:14.170 regular expressions and you should write a
00:10:18.910 regular expression to tell the bot what
00:10:22.540 commands to detect and why regular
00:10:26.260 expressions it is the easiest way to
00:10:30.930 tell the bot what command to watch and
00:10:36.480 this approach has a few disadvantages
00:10:40.570 like it cannot detect typos it cannot
00:10:49.530 reply with this command was not found
00:10:53.320 maybe you meant something else
00:10:57.310 it also cannot do input validation like
00:11:01.450 if the command was right but the
00:11:05.920 argument was wrong then that argument
00:11:09.100 may not have been matched by the regular
00:11:11.830 expression and this command won't be
00:11:13.390 found and having regular expressions in
00:11:17.710 your chat bots means that all developers
00:11:21.820 should be really good at regular
00:11:26.230 expressions and it's always easy to make
00:11:29.980 a mistake and define a regular expression
00:11:33.040 that will conflict with a different
00:11:37.030 script's regular expression so we thought
00:11:43.660 that maybe we could do something else
00:11:46.060 without regular expressions and yeah
00:11:49.810 here is an example
00:11:51.850 the first option is to write the command
00:11:56.320 syntax with a regular expression and the
00:11:58.149 second one is to write it with some kind
00:12:00.970 of pattern language and with echo the
00:12:07.329 difference is not that big but with a
00:12:09.639 bigger command like github add username
00:12:12.790 to team name the regular expression
00:12:15.820 becomes quite long and it's quite easy
00:12:19.420 to make a mistake there as I said so
00:12:23.889 we thought that maybe we can improve
00:12:25.660 that experience of writing chat
00:12:30.420 handlers and what we wanted to have
00:12:36.730 from that solution we wanted it to be
00:12:39.970 friendly for both developers and the
00:12:42.070 user by being friendly for developers it
00:12:46.540 means that a developer wouldn't need to
00:12:48.190 write a regular expression and friendly
00:12:50.740 for users means that we would suggest the
00:12:52.839 right command if the user made a mistake
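The talk does not say how the suggestions were computed. One common way to suggest the right command is to compare the mistyped word against the known command names by edit distance, sketched below. The command list, the `suggest` helper and the distance threshold of 2 are assumptions made for this example, not details from the talk.

```ruby
# Levenshtein edit distance, computed one row at a time.
def edit_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# Hypothetical list of commands the bot knows about.
KNOWN_COMMANDS = %w[echo deploy incident].freeze

# Returns a suggestion string for a mistyped command, or nil when the
# command is already known or nothing is close enough.
def suggest(input, commands = KNOWN_COMMANDS, max_distance: 2)
  word = input.split.first.to_s
  return nil if commands.include?(word)

  best = commands.min_by { |command| edit_distance(word, command) }
  edit_distance(word, best) <= max_distance ? "Did you mean `#{best}`?" : nil
end

suggest("ecoh hello") # suggests the closest known command
suggest("xyzzy")      # nothing is close enough, so no suggestion
```

A regex-only router cannot do this, because it has no list of command names to compare against, only opaque patterns.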
00:12:55.709 we also have a lot of Ruby
00:13:02.069 infrastructure code written in Ruby at
00:13:04.480 Shopify so we decided that we want to
00:13:07.510 stick with Ruby after we tried both
00:13:11.220 Lita and Hubot in production and we
00:13:15.760 wanted a simpler and more powerful DSL
00:13:19.350 that would provide better argument
00:13:22.209 support so our solution we decided to
00:13:27.579 make it on top of Lita
00:13:29.260 with a custom command router and custom
00:13:32.620 DSL and this is how this DSL looks
00:13:37.050 first of all it's very similar to Lita but
00:13:40.510 instead of defining the regular
00:13:42.310 expression
00:13:46.540 here you define a special pattern and
00:13:51.860 you also define a help and right after
00:13:55.520 this pattern is matched it's dispatched
00:13:57.620 to a Ruby method with a keyword argument
00:14:01.480 and in this case it's a very simple
00:14:03.980 handler it will reply with the same
00:14:06.680 command so let's take a look at a bit
00:14:10.280 more complex handler it has two
00:14:12.980 arguments one of them is yeah this
00:14:16.700 handler is for displaying some chart
00:14:19.520 from New Relic the first variable is
00:14:23.120 application name and the second is
00:14:25.250 format format is an enum field it can be
00:14:30.410 a daily or hourly value and
00:14:34.940 it should be converted into a call of a
00:14:39.110 Ruby method which is kind of simple
00:14:42.460 so this pattern would match all the
00:14:47.600 following user inputs it can be my app so
00:14:52.220 hourly is the default value for the
00:14:54.950 format variable you can override it here
00:14:59.570 and here and you can also define it in
00:15:02.600 the explicit way which is useful when
00:15:06.320 you have more arguments and maybe you
00:15:11.150 don't remember the order of them so we
00:15:15.200 also wanted to have the explicit formats
00:15:19.790 and to be able to work without
00:15:24.380 regular expressions we tokenize
00:15:27.680 this pattern with different kinds of
00:15:31.340 tokens first are two static tokens so
00:15:35.770 the user input should start with new
00:15:38.780 relic and chart and then there is a
00:15:41.890 simple variable and then there is a
00:15:43.970 variable with default it looks like this
00:15:48.800 so this command consists of four tokens
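The exact pattern syntax shown on the slide is not captured in the transcript. As a sketch of the tokenizing step, assume a hypothetical syntax where `<name>` is a required variable and `[name=default]` is a variable with a default value. The `StaticToken` and `VariableToken` structs are likewise invented for this example.

```ruby
# Two kinds of tokens: fixed words, and variables (optionally with a default).
StaticToken   = Struct.new(:text)
VariableToken = Struct.new(:name, :default)

# Splits a pattern on whitespace and classifies each part.
# Assumed syntax: <name> is a variable, [name=default] a variable with default.
def tokenize_pattern(pattern)
  pattern.split.map do |part|
    case part
    when /\A<(\w+)>\z/         then VariableToken.new($1.to_sym, nil)
    when /\A\[(\w+)=(\w+)\]\z/ then VariableToken.new($1.to_sym, $2)
    else StaticToken.new(part)
    end
  end
end

tokens = tokenize_pattern("newrelic chart <app> [format=hourly]")
# Four tokens: two static words, a simple variable, and a variable
# whose default value is "hourly".
```

Handler authors only write the pattern string; the regular expressions live inside the framework, which matches the speaker's later remark that regular expressions are still used under the hood.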
00:15:53.920 our next goal is to convert the user
00:15:59.510 input of newrelic chart my app daily
00:16:02.120 into a call of a Ruby method actually yeah
00:16:07.040 it's matching the newrelic chart and
00:16:09.200 calling that method with those keyword
00:16:14.060 arguments and
00:16:19.450 this may seem like a
00:16:27.220 difficult task until we discovered the
00:16:31.090 class in the Ruby standard library which is
00:16:33.820 called StringScanner yeah it's a class
00:16:36.940 in the Ruby standard library please raise
00:16:38.980 your hand if you heard about the class
00:16:41.430 yes not too many people the StringScanner
00:16:47.230 works as a scanner I'll have an
00:16:50.020 example now so you initiate an object
00:16:54.630 with a string in this case the string
00:16:58.900 is the user input and there is a method
00:17:04.420 called scan and you give just a
00:17:10.240 token to that scan command and yeah so
00:17:18.240 it scans and if this user input would
00:17:22.300 start from something else it wouldn't
00:17:24.840 scan the string at all so if it would
00:17:29.170 start with github or some other
00:17:33.330 command it wouldn't be scanned then we
00:17:37.150 have the next token which is chart a
00:17:38.830 static token it is also scanned so we
00:17:43.030 can go further then we scan for a variable
00:17:49.140 so that
00:17:51.510 it's scanned and then the next variable
00:17:55.810 and we get the values for these
00:18:02.230 variables so it wouldn't be honest so
00:18:06.640 it's not very honest to say that we
00:18:10.000 completely got rid of regular
00:18:13.000 expressions the end developer of
00:18:17.620 a handler doesn't have to write a
00:18:19.570 regular expression but we use some
00:18:22.180 regular expressions under the hood
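The scanning steps described above can be sketched with `StringScanner` from Ruby's standard library. The token list, the `match_input` helper and the `newrelic_chart` method are hypothetical stand-ins for the framework's internals, and the explicit `format=daily` style of input mentioned earlier is left out to keep the sketch short.

```ruby
require "strscan"

# Hypothetical token list for a "newrelic chart <app> [format=hourly]" pattern.
TOKENS = [
  { type: :static,   text: "newrelic" },
  { type: :static,   text: "chart" },
  { type: :variable, name: :app },
  { type: :variable, name: :format, default: "hourly" }
].freeze

# Walks the input token by token. Returns a hash of keyword arguments,
# or nil as soon as the input stops matching the pattern.
def match_input(input, tokens = TOKENS)
  scanner = StringScanner.new(input.strip)
  args = {}
  tokens.each do |token|
    scanner.skip(/\s+/)
    case token[:type]
    when :static
      # The static word must appear at the current position as a whole word.
      return nil unless scanner.scan(/#{Regexp.escape(token[:text])}(?=\s|\z)/)
    when :variable
      # Take the next word, or fall back to the default if input has run out.
      value = scanner.scan(/\S+/) || token[:default]
      return nil if value.nil?
      args[token[:name]] = value
    end
  end
  scanner.skip(/\s+/)
  scanner.eos? ? args : nil # leftover input means the command did not match
end

# Stand-in for the handler method the framework would dispatch to.
def newrelic_chart(app:, format:)
  "chart for #{app} (#{format})"
end

args = match_input("newrelic chart myapp daily")
reply = newrelic_chart(**args) if args
```

`StringScanner#scan` only matches at the current scan position and returns `nil` otherwise, which is why an input starting with `github` is rejected at the first token.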
00:18:29.050 yes so more than that we have type
00:18:34.070 coercion
00:18:35.750 so when defining a handler you can
00:18:41.540 declare the type of variable for
00:18:44.330 instance the target this is the command
00:18:46.970 used to tell the infrastructure that
00:18:50.600 some server will go to downtime that
00:18:55.820 means that maybe we are going to restart
00:18:57.890 the server or repair it somehow and
00:19:01.930 there are two arguments one of them is
00:19:05.960 target which is a Chef node
00:19:08.870 it should be a valid Chef node
00:19:11.270 address and then duration duration can
00:19:14.750 be one minute or one second or one hour
00:19:20.740 or just any duration and we also declare
00:19:24.740 that so the first one's type is Chef node
00:19:27.680 the second's type is duration and when the
00:19:31.040 message is received so when we call this
00:19:35.540 Ruby method inside this method you'll
00:19:37.760 have duration as ActiveSupport::Duration
00:19:41.600 and target will be a valid Chef node so we
00:19:45.890 can be sure in this method that both of
00:19:49.010 these arguments are valid so this
00:19:52.550 command will validate for the first
00:19:56.420 input but for the second input it won't
00:19:59.720 validate and it will return an error and
00:20:03.280 this method won't be called at all
00:20:06.710 so when writing code in that method you
00:20:09.740 will be sure that you get the right
00:20:13.880 input
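A minimal sketch of this kind of type coercion follows, with simplifying assumptions: a duration is coerced to an Integer number of seconds rather than an `ActiveSupport::Duration`, and a "node" is only checked against a hostname-like shape rather than looked up in Chef. The `COERCERS` table, `CoercionError` and the `downtime` handler are hypothetical names for this example.

```ruby
class CoercionError < StandardError; end

UNIT_SECONDS = { "second" => 1, "minute" => 60, "hour" => 3600 }.freeze

# One coercer per declared type. Each either returns a typed value or raises.
COERCERS = {
  duration: ->(raw) {
    match = /\A(\d+)\s*(second|minute|hour)s?\z/.match(raw)
    raise CoercionError, "invalid duration: #{raw}" unless match
    match[1].to_i * UNIT_SECONDS.fetch(match[2])
  },
  node: ->(raw) {
    valid = raw.match?(/\A[a-z0-9-]+(\.[a-z0-9-]+)+\z/)
    raise CoercionError, "invalid node: #{raw}" unless valid
    raw
  }
}.freeze

# Declared argument types for the hypothetical downtime command.
DOWNTIME_TYPES = { target: :node, duration: :duration }.freeze

# Converts every raw string argument using the coercer for its declared type.
def coerce_arguments(raw_args, types)
  raw_args.to_h { |name, raw| [name, COERCERS.fetch(types.fetch(name)).call(raw)] }
end

# The handler body can rely on both arguments already being valid.
def downtime(target:, duration:)
  "#{target} goes down for #{duration} seconds"
end

# Coerce first; the handler is never called when coercion fails.
def run_downtime(raw_args)
  downtime(**coerce_arguments(raw_args, DOWNTIME_TYPES))
rescue CoercionError => e
  "Error: #{e.message}"
end

run_downtime({ target: "web1.example.com", duration: "10 minutes" })
run_downtime({ target: "web1.example.com", duration: "soon" })
```

Keeping validation in the framework means each handler stays free of argument-checking code, which is the benefit the speaker describes.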
00:20:21.100 after shipping this DSL to our
00:20:24.610 developers many developers could
00:20:29.679 write their own chat handlers to
00:20:32.529 automate their workflows and we
00:20:35.799 got to the number of more than 200 bot
00:20:39.039 scripts handlers and
00:20:44.039 more than 600 command invocations on a
00:20:48.580 busy day in Slack so this became a
00:20:53.580 part of infrastructure that we had to
00:20:56.200 scale as I mentioned we based our
00:21:01.059 framework on Lita so it was just
00:21:06.460 an extension on top of Lita and that means so
00:21:12.820 Lita is written in Ruby that means
00:21:14.649 that it doesn't have any support for
00:21:19.330 asynchronous workflow which meant that if
00:21:22.529 you ask a bot for some command that takes
00:21:27.580 a minute
00:21:32.340 working on commands from other users so
00:21:35.620 the bot was blocked for that minute and
00:21:37.570 it couldn't accept commands from other
00:21:40.630 users which was super bad especially at
00:21:44.410 scale so we decided that
00:21:49.060 as one option we can make a new thread on
00:21:52.960 every command and do all the operation
00:21:55.330 in that thread so it does not block receiving
00:21:58.930 new commands from Slack and Ruby threads
00:22:04.270 are not so good in some kind of
00:22:07.750 in most kinds of operations but in our
00:22:10.780 case when most of the handlers for chat
00:22:14.440 bots were only making HTTP queries or
00:22:17.850 they were invoking some other kinds of
00:22:21.370 systems they didn't do any
00:22:24.790 calculations on the bot side so they
00:22:28.240 just requested data from other systems
00:22:30.790 so in this case Ruby threads were
00:22:34.360 quite efficient and this approach helped
00:22:39.090 but we thought that maybe there is
00:22:43.390 some other approach and we went with
00:22:49.470 the master process and Redis and
00:22:52.210 when the master process received a
00:22:55.060 command from Slack it pushed that
00:22:57.670 command to Redis and we had a pool of
00:22:59.620 workers and we could have multiple
00:23:03.070 machines that work as workers
00:23:06.030 so we could scale that
00:23:08.580 horizontally and it works the same as a
00:23:13.930 Sidekiq or Delayed Job worker queue
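The shape of this master-and-workers design can be sketched inside a single Ruby process. This is a simplification, not the setup from the talk: Ruby's built-in `Queue` stands in for the Redis list, threads stand in for worker processes on separate machines, and `run_worker_pool` is a hypothetical helper name.

```ruby
# The "master" only enqueues commands; a pool of workers executes them,
# so one slow command cannot block the rest.
def run_worker_pool(commands, worker_count: 3, &handler)
  jobs = Queue.new    # stands in for the Redis list of pending commands
  results = Queue.new # collects [command, reply] pairs from the workers

  workers = Array.new(worker_count) do
    Thread.new do
      # pop blocks until a job arrives; it returns nil once the queue
      # is closed and empty, which ends the worker loop.
      while (command = jobs.pop)
        results << [command, handler.call(command)]
      end
    end
  end

  commands.each { |command| jobs << command }
  jobs.close
  workers.each(&:join)

  Array.new(results.size) { results.pop }.to_h
end

replies = run_worker_pool(["deploy shopify", "echo hi"]) do |command|
  "done: #{command}"
end
```

With Redis in the middle, the same pattern works across machines, because any worker that can reach the Redis server can pop the next command.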
00:23:22.970 having that approach we could have
00:23:25.070 active and passive instances of the
00:23:29.540 bot server running and Slack would
00:23:32.390 make a callback to a load balancer with
00:23:35.870 the message for a bot and then the load
00:23:39.020 balancer could determine to
00:23:43.220 which machine to route the message and
00:23:47.380 that comes to the availability problem
00:23:51.550 so if you remember at the beginning
00:23:55.400 of this year GitHub was down for like
00:23:59.030 three or four hours that was a
00:24:04.520 pretty big downtime and one of the
00:24:08.720 reasons for such a long
00:24:11.750 downtime was that GitHub is heavily
00:24:15.770 based on their ChatOps scripts
00:24:20.140 but ChatOps was down as well because
00:24:23.000 of some network failure so they couldn't
00:24:25.370 use any of the ChatOps scripts to
00:24:27.650 recover the system because the ChatOps setup
00:24:30.950 was down as well and we know the problem
00:24:34.160 at Shopify we also had this problem when
00:24:37.640 our bot was unavailable or Slack was
00:24:40.490 down and in this case we couldn't do
00:24:43.430 anything so we decided that we'll build
00:24:47.480 a special offline or rescue mode in our
00:24:51.080 bot so if you have this
00:24:55.340 bot locally on your laptop you just
00:24:58.960 launch it with a special bin command
00:25:01.400 and you will have exactly the same
00:25:03.980 interface in your command line as you
00:25:07.010 would have interface for a bot in chat
00:25:09.790 but it works even if Slack or something
00:25:14.810 else is down
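One way to picture such a rescue mode is a console loop that feeds typed lines into the same dispatcher the chat adapter would use. The `dispatch` and `rescue_console` names below are hypothetical, and the single `echo` command is only a placeholder for the real handlers.

```ruby
require "stringio"

# Hypothetical dispatcher shared by the chat adapter and the offline console.
def dispatch(message)
  case message
  when /\Aecho\s+(.+)\z/ then $1
  else "Unknown command: #{message}"
  end
end

# Reads commands line by line and prints the replies, without any network.
# Input and output are injectable so the loop can be exercised in tests.
def rescue_console(input: $stdin, output: $stdout)
  input.each_line do |line|
    message = line.strip
    next if message.empty?
    break if message == "exit"

    output.puts dispatch(message)
  end
end

# Example run against in-memory IO instead of a real terminal.
out = StringIO.new
rescue_console(input: StringIO.new("echo hi\nexit\necho ignored\n"), output: out)
```

Calling `rescue_console` with no arguments would read from the terminal instead. Because the console reuses `dispatch`, every handler is available offline without extra work.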
00:25:18.100 summary this is very important so
00:25:22.570 probably you learned a bit about ChatOps
00:25:26.320 and how it can automate things and
00:25:28.710 you thought okay cool I'm going to try
00:25:32.410 that in my team in my company no but I
00:25:36.600 would like to say that it only
00:25:39.820 makes sense if you have a very big
00:25:42.340 team because when I worked in smaller
00:25:47.530 teams and smaller companies I would say
00:25:50.920 that we didn't need all of that just
00:25:53.590 because it wasn't on such scale that
00:25:57.790 we needed to automate things with
00:26:00.610 ChatOps and in this case it's really
00:26:03.400 easier to come to your CTO and ask to
00:26:07.060 create a new repo for you instead of
00:26:10.080 bringing more code and more
00:26:12.190 infrastructure to keep the ChatOps
00:26:15.730 running so if you're interested in
00:26:19.510 working on such big scale systems
00:26:23.140 you're welcome to check Shopify careers and I
00:26:25.900 have mentioned a lot of projects
00:26:29.800 in Ruby a lot of gems and some other
00:26:33.670 things so you can go to my Twitter and
00:26:36.600 the last tweet is a gist with all
00:26:40.180 links that I mentioned today which
00:26:42.820 you're welcome to check thank you
00:26:50.509 thank you very much Kir any questions
00:26:53.909 for
00:27:00.050 how many of you use some form of bots on
00:27:04.340 your favorite too
00:27:08.920 for my favorite slack bot is the flip
00:27:12.400 table what so what am i typing it above
00:27:15.340 it with a region guide so if there are
00:27:20.920 no questions back here
00:27:22.420 roundup father yet what it means