Everything We Learned While Implementing ActiveRecord::Encryption

Kylie Stradley • October 05, 2023 • Amsterdam, Netherlands • Talk

‪@GitHub‬ encrypts your data at rest, as well as specific sensitive database columns. What you may not know is that they recently replaced their in-house db column encryption strategy with ActiveRecord::Encryption, in place. While they were able to complete this transition seamlessly for GitHub’s developers, the process was not quite seamless for our team and some of our customers, and mistakes were made along the way.

Senior Product Security Engineer Kylie Stradley shares why despite the mistakes they feel the migration was worth it for their team, GitHub developers and most importantly, GitHub customers.

Slides available here: https://speakerdeck.com/kyfast/railsw...

Links:
https://rubyonrails.org/
https://github.blog/changelog/2022-08-18-false-alert-flags-will-appear-in-users-security-log-due-to-a-bug-in-2fa-recovery-events/

#RailsWorld #RubyonRails #Rails7 #opensource #OSS #Rails #ActiveRecord #encryption #GitHub

Thank you Dell APEX for sponsoring the editing and post-production of these videos. Visit them at: https://dell.com/APEX

Rails World 2023

00:00:15.080 yes so again good morning uh this is
00:00:17.760 everything that we learned the hard way
00:00:19.520 implementing active record encryption um
00:00:22.119 thank you so much Miriam for the
00:00:23.760 introduction um my name is Kylie
00:00:25.800 Stradley you may recognize me oops um
00:00:30.000 from a presentation I gave earlier this
00:00:31.759 year at RailsConf in Atlanta Georgia
00:00:33.640 with my coworker Matt um this was
00:00:36.399 called active record encryption stop
00:00:38.600 Hackers from reading your data and it
00:00:40.840 was kind of about you know sort of just
00:00:42.559 convincing people the value of using
00:00:44.360 active record encryption and we spoke a
00:00:46.480 little bit about some of the changes
00:00:47.920 that we made um if you are jet-lagged like I
00:00:51.320 am or uh aren't sure which talk you are
00:00:53.800 in we can just do a quick refresh of
00:00:55.559 active record encryption so you can be
00:00:57.160 certain you want to be here for you know
00:00:58.719 the next 30 minutes
00:01:00.760 um it's this really lovely API that we
00:01:03.199 get for free from rails right it
00:01:05.560 provides automatic encryption of
00:01:07.080 database records and plaintext access when
00:01:09.400 you need
00:01:10.560 it we upgraded our previously encrypted
00:01:14.360 columns with an internal strategy that
00:01:16.280 the team wrote in about 2020 and um
00:01:19.400 upgraded some plain text columns to
00:01:21.360 active record encryption so before we
00:01:25.600 had this bespoke internal strategy which
00:01:28.040 was a very easy to use API
00:01:30.360 um if you look at it it looks very
00:01:32.040 similar to active record encryption
00:01:33.799 right but we had a couple of problems uh
00:01:37.040 the main one uh from my perspective was
00:01:39.479 this key generation bottleneck um to
00:01:42.200 start encrypting records you needed an
00:01:44.520 encryption key we found that product
00:01:46.600 Engineers were not comfortable
00:01:47.880 generating their own encryption keys so
00:01:49.920 it was on my team to generate the keys
00:01:52.680 for the product engineers and it just
00:01:55.159 kind of took a while to get things going
00:01:57.719 uh also once active record encryption
00:01:59.799 was introduced we were now in
00:02:02.200 the position of maintaining Divergent
00:02:04.000 column encryption code so um this was a
00:02:07.520 little harder to do and you saw the
00:02:09.360 active record encryption API just
00:02:11.160 a second ago it was a bit easier to do
00:02:13.520 and um keys are actually derived
00:02:15.319 basically at the time of writing code or
00:02:17.120 you could even consider it runtime uh if
00:02:19.440 you really want to get down to it um so
00:02:22.040 active record encryption was very
00:02:23.400 tempting to our product engineers and if
00:02:26.000 they started using it without us going
00:02:28.040 through and securing everything
00:02:29.800 including like um it relied on this
00:02:32.000 encrypted rails secrets.yaml file which
00:02:35.040 is uh not somewhere that we can actually
00:02:36.680 store our encryption Keys that's like a
00:02:38.319 violation of our our service level
00:02:40.040 objective uh it could be a big mess so
00:02:43.920 um after we upgraded we still had a
00:02:47.440 really easy to use API right the lovely
00:02:49.720 API I showed you before uh we reduced
00:02:52.239 some of our bespoke code that we had to
00:02:54.239 maintain and now we have a couple of
00:02:57.319 benefits that we just couldn't provide
00:02:58.800 before keys are derived at the
00:03:00.840 time of running code or or writing code
00:03:03.040 or like I said if you like to get really
00:03:04.640 picky about it at runtime really um we
00:03:08.400 have a strategy now to easily upgrade
00:03:11.360 columns from plain text to encrypted
00:03:13.599 this could be done with our previous
00:03:14.799 system but it was difficult and uh
00:03:17.040 frankly intimidating so people didn't
00:03:18.680 choose it and finally we centralized key
00:03:21.400 rotation so we took the responsibility
00:03:23.920 of rotating encryption keys from product
00:03:26.040 teams and put it back on our team a
00:03:27.840 security team and um that made them much
00:03:30.480 more comfortable and
00:03:34.519 happier so in the previous presentation
00:03:37.120 I mentioned we wanted to show how
00:03:38.439 straightforward and easy column
00:03:40.000 encryption can be um but we weren't
00:03:42.680 entirely honest right deploying the new
00:03:45.319 column encryption strategy wasn't always
00:03:47.439 straightforward and
00:03:49.599 easy um so this talk is more of a real
00:03:52.400 world case study and um it's for anyone who
00:03:55.680 might be converting from an existing
00:03:57.120 column encryption strategy so this might
00:03:59.239 be some of you
00:04:00.400 um those who are working in
00:04:02.840 distributed systems uh given that Rails
00:04:05.000 is 20 years old I expect that this will
00:04:06.879 be possibly some more of
00:04:09.560 you um and those who will rotate their
00:04:11.920 encryption Keys uh this should be all of
00:04:14.360 you uh we can choose to rotate our
00:04:17.639 encryption keys on a you know a
00:04:19.280 maintenance Cadence as scheduled but
00:04:21.519 actually a fun thing about working in
00:04:23.240 security or um working at a very uh a
00:04:27.160 company that has desirable data is you
00:04:29.039 may not Choose You may not always get to
00:04:30.960 choose when you rotate your encryption
00:04:32.560 keys right um there may come a time when
00:04:35.280 you just need to rotate your encryption
00:04:36.800 keys and you need to have extreme
00:04:39.160 confidence that you can do it really
00:04:40.800 well and you can do it perfectly
00:04:43.199 because um encryption must be perfect
00:04:46.199 there's simply no there's no other way
00:04:47.880 around it really uh so we learned a lot
00:04:51.160 about deploying and maintaining
00:04:52.720 resilient security
00:04:54.880 software we made a lot of assumptions
00:04:57.120 about what we thought we knew about
00:04:58.400 encryption and what we thought our
00:05:00.080 hardest problems would
00:05:01.800 be um and not to brag but we did do this
00:05:04.600 with no
00:05:07.000 downtime we learned it the hard way but
00:05:09.759 I'm hoping that with this presentation
00:05:11.880 you don't have
00:05:13.680 to so we challenged and disproved our
00:05:17.199 own assumptions about what we thought
00:05:18.800 the hardest part of this project would
00:05:20.240 be and these are some of the the
00:05:22.319 assumptions right the hardest part of
00:05:24.400 encryption is Key Management maybe this
00:05:26.199 is something that you've read I learned
00:05:27.520 this in school uh one byte in one byte
00:05:30.560 out which is kind of a sound bite for a
00:05:32.600 larger problem we um interacted
00:05:35.639 with and then uh the price of it just
00:05:39.400 works is performance this is what we
00:05:42.240 believed so let's get right into it the
00:05:45.319 hardest part of uh encryption is key rot
00:05:49.080 is Key
00:05:50.120 Management so what we thought this meant
00:05:52.479 was things like FIPS compliance right um
00:05:54.800 FIPS is this standard that you have to
00:05:56.800 adhere to if you work with US government
00:05:59.080 um entities at GitHub we're not
00:06:01.280 required to work with this standard but
00:06:03.280 we try to meet it as closely as
00:06:06.759 possible um and we weren't worried about
00:06:09.080 this because we were meeting it using
00:06:10.599 AES AES is considered the best in class for
00:06:13.319 symmetric
00:06:15.599 encryption uh we were worried about non
00:06:17.919 free use non in cryptography means
00:06:20.599 number used once right um we were
00:06:23.840 concerned about nons foruse because of
00:06:25.280 the sheer number of encryptions we do
00:06:26.919 every single year at GitHub and you know
00:06:28.919 we had this information because we had a
00:06:30.720 previously uh we had a previous column
00:06:33.039 encryption strategy in
00:06:35.080 place the more encryption operations
00:06:37.319 that you do the more likely you are to
00:06:39.400 generate a duplicate nonce this is
00:06:41.720 called nonce exhaustion so encrypting two
00:06:44.560 things with the same nonce is just
00:06:47.039 absolutely fatal to the confidentiality
00:06:49.160 and integrity of AES so it's very very
00:06:52.120 important that you do not generate a
00:06:53.639 duplicate nonce and use it to encrypt
00:06:55.360 two
00:06:56.400 values um but we weren't worried about
00:06:59.080 this right right uh we thought we were
00:07:00.199 being very clever we added the current
00:07:02.560 year as anti-nonce-exhaustion data which
00:07:05.479 effectively derived a new key for each
00:07:08.319 column every single
00:07:10.319 year um another thing that we were
00:07:12.919 concerned about that you might read
00:07:14.039 about is secure key storage right um and
00:07:17.720 we weren't worried about this though
00:07:19.039 because we were using HashiCorp Vault
00:07:21.000 to store our keys and we had been using
00:07:23.000 that in part of our previous column
00:07:24.840 encryption strategy transition for some
00:07:28.319 time
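The anti-nonce-exhaustion idea mentioned above (mixing the current year into key derivation so that each column effectively gets a fresh key every year) could be sketched roughly as follows. This is only an illustration under stated assumptions, not GitHub's implementation and not how ActiveRecord::Encryption derives keys: the helper names `derive_column_key` and `encrypt_value`, the HMAC-based derivation, and the table and column names are all hypothetical.

```ruby
require "openssl"

# Hypothetical helper: derive a 256-bit key that is specific to one column
# and one calendar year. Changing the year yields an unrelated key, which
# bounds how many random nonces are ever drawn under any single key.
def derive_column_key(master_secret, table:, column:, year:)
  context = "#{table}.#{column}:#{year}"
  OpenSSL::HMAC.digest("SHA256", master_secret, context)
end

# Hypothetical helper: AES-256-GCM encryption with a fresh random 96-bit
# nonce per call. Reusing a nonce under the same key breaks GCM, which is
# why limiting the lifetime of each key matters.
def encrypt_value(key, plaintext)
  cipher = OpenSSL::Cipher.new("aes-256-gcm").encrypt
  cipher.key = key
  nonce = cipher.random_iv
  ciphertext = cipher.update(plaintext) + cipher.final
  { nonce: nonce, ciphertext: ciphertext, auth_tag: cipher.auth_tag }
end

master   = OpenSSL::Random.random_bytes(32)
key_2023 = derive_column_key(master, table: "users", column: "recovery_codes", year: 2023)
key_2024 = derive_column_key(master, table: "users", column: "recovery_codes", year: 2024)

message = encrypt_value(key_2023, "recovery-code")
puts key_2023 == key_2024          # the two yearly keys differ
puts message[:ciphertext].bytesize # same length as the plaintext
```

Note that the GCM ciphertext has the same length as the plaintext, while the nonce and authentication tag are extra bytes that also have to be stored somewhere.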
00:07:30.120 so that's everything we thought we knew
00:07:32.680 right um you go into a project thinking
00:07:34.720 oh we know what's going to go wrong
00:07:36.039 we're not worried we're smart people um
00:07:38.960 key deployment is actually an extremely
00:07:41.759 difficult part of
00:07:43.319 encryption let's say you have one Rails
00:07:45.520 server right and you uh have active
00:07:48.240 record encryption enabled and you
00:07:50.360 encrypt records with your key and you
00:07:51.840 save them to the
00:07:54.159 database um and now let's say you add a
00:07:57.280 key as part of your key rotation process
00:08:00.759 so going forward all your records will
00:08:03.319 now be encrypted with your new teal
00:08:06.960 key when you rotate your key before
00:08:10.039 you've re-encrypted every single record
00:08:12.240 to use your new key your latest key this
00:08:15.360 teal key cannot decrypt purple records
00:08:18.759 right um some of you might be
00:08:21.039 thinking uh but this will not happen
00:08:23.120 right active record encryption considers
00:08:25.360 this and um you would be right in this
00:08:27.599 specific scenario um rails will just
00:08:31.080 move right along to the next key in the
00:08:32.800 list and you'll be able to decrypt the
00:08:34.680 record um this is wonderful for you
00:08:37.240 truly I love this for you the rails
00:08:38.959 guard will take you far in life you
00:08:40.760 probably don't need the rest of the
00:08:42.479 presentation um but maybe you like me
00:08:45.240 work with more than one Rails server or
00:08:47.440 maybe if today you work with one Rails
00:08:49.040 server you'd like to work with
00:08:51.320 more so let's say your app is a bit
00:08:54.120 bigger than just one Rails server right
00:08:56.320 maybe your app is more in the range of
00:08:58.040 like a fleet of Rails servers and uh
00:09:01.320 maybe you serve like thousands of
00:09:02.760 requests per second and you're doing
00:09:04.600 quite a few encryptions and decryptions
00:09:06.519 every
00:09:07.959 year um here's a really simplified
00:09:10.800 version of the scenario to describe we
00:09:12.959 have um a couple of servers that are
00:09:15.760 writing encrypted records to the
00:09:17.560 database uh very nice now when you
00:09:21.279 rotate your key you have to propagate
00:09:23.760 the change to all of your servers it is
00:09:26.839 difficult to propagate to coordinate
00:09:28.839 such a propagation all at once um can
00:09:31.600 you atomically deploy a key to all of
00:09:33.519 your servers at the same time I
00:09:38.800 cannot what does this mean for you right
00:09:42.200 until you propagated your keys to all of
00:09:45.399 your servers any processes that may be
00:09:48.079 holding on to uh references to Old keys
00:09:50.680 or any servers that haven't received the
00:09:52.200 new keys yet you might find yourself in
00:09:55.079 a situation where record is encrypted on
00:09:57.279 one server so see we have this new teal
00:10:00.200 key coming out it's encrypted with this
00:10:02.040 teal key but it cannot be decrypted on
00:10:04.399 another server because that server has
00:10:05.920 not yet received the decryption
00:10:09.120 key so what we learned is just
00:10:12.320 appending a new key didn't work in our
00:10:14.240 distributed
00:10:16.680 system in a distributed system to rotate
00:10:19.720 Keys we also had to distribute keys so
00:10:23.000 before we started this project our
00:10:24.600 previous encryption service was actually
00:10:26.600 networked which is Maybe not a choice
00:10:29.760 you would want to make we decided local
00:10:31.720 encryption would be sufficient um but we
00:10:35.079 distributed Keys through a database
00:10:36.560 backed API in that situation so we
00:10:38.320 didn't have to worry about this type of
00:10:39.680 key rotation and distribution until
00:10:43.040 now we needed a solution that ensured
00:10:45.800 encryption would happen with new keys
00:10:47.839 only once they were propagated to all
00:10:50.160 servers so we decided to distribute our
00:10:52.720 key by using a two key strategy first
00:10:55.880 Distributing the new decryption key then
00:10:59.040 Distributing the same value as an
00:11:00.560 encryption
00:11:02.120 key so we can distribute the decryption
00:11:04.920 key wait for the process to signal that
00:11:07.680 that key has been propagated to all of
00:11:09.399 our servers and can ensure that it's
00:11:11.920 present in all
00:11:13.600 servers if we attempt to decrypt with
00:11:16.720 the new decryption key rails will do
00:11:19.240 what it does well and it will just move
00:11:20.800 along in the list and attempt to use the
00:11:22.519 next key which should be the correct key
00:11:25.279 uh in this situation in the system that
00:11:27.000 we've set up and this works
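The failure mode and the two-key rollout described here can be sketched in plain Ruby. This is a simplified model under stated assumptions, not ActiveRecord::Encryption itself: `ServerKeys` and `Message` are hypothetical stand-ins for one server's view of the key list and for a stored record, and the try-the-next-key loop only imitates the fallback behaviour described in the talk.

```ruby
require "openssl"

Message = Struct.new(:nonce, :ciphertext, :auth_tag)

# Hypothetical model of the keys one server currently holds: a single key
# used for encrypting, and an ordered list of keys tried when decrypting.
class ServerKeys
  def initialize(encrypt_with:, decrypt_with:)
    @encrypt_with = encrypt_with
    @decrypt_with = decrypt_with
  end

  def encrypt(plaintext)
    cipher = OpenSSL::Cipher.new("aes-256-gcm").encrypt
    cipher.key = @encrypt_with
    nonce = cipher.random_iv
    ciphertext = cipher.update(plaintext) + cipher.final
    Message.new(nonce, ciphertext, cipher.auth_tag)
  end

  # Try each known key in order and move on when authentication fails.
  def decrypt(message)
    @decrypt_with.each do |key|
      begin
        return decrypt_with_key(key, message)
      rescue OpenSSL::Cipher::CipherError
        next
      end
    end
    raise OpenSSL::Cipher::CipherError, "no key on this server can decrypt the record"
  end

  private

  def decrypt_with_key(key, message)
    cipher = OpenSSL::Cipher.new("aes-256-gcm").decrypt
    cipher.key = key
    cipher.iv = message.nonce
    cipher.auth_tag = message.auth_tag
    cipher.update(message.ciphertext) + cipher.final
  end
end

purple = OpenSSL::Random.random_bytes(32) # existing key
teal   = OpenSSL::Random.random_bytes(32) # new key being rotated in

# Naive rollout: server A already encrypts with teal, server B has not
# received teal yet, so B cannot read what A just wrote.
server_a = ServerKeys.new(encrypt_with: teal, decrypt_with: [teal, purple])
stale_b  = ServerKeys.new(encrypt_with: purple, decrypt_with: [purple])
record   = server_a.encrypt("recovery-code")
naive_rollout_failed =
  begin
    stale_b.decrypt(record)
    false
  rescue OpenSSL::Cipher::CipherError
    true
  end

# Two-key rollout, step 1: every server learns teal for decryption only.
phase1_a = ServerKeys.new(encrypt_with: purple, decrypt_with: [teal, purple])
phase1_b = ServerKeys.new(encrypt_with: purple, decrypt_with: [teal, purple])
# Step 2: once step 1 has reached all servers, start encrypting with teal.
phase2_a = ServerKeys.new(encrypt_with: teal, decrypt_with: [teal, purple])

puts naive_rollout_failed
puts phase1_b.decrypt(phase1_a.encrypt("recovery-code")) # old key, via fallback
puts phase1_b.decrypt(phase2_a.encrypt("recovery-code")) # new key, already known
```

The point of the ordering is that a server is never asked to decrypt with a key it has not yet received, because no server encrypts with the new key until every server can decrypt with it.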
00:11:29.720 once we know that the process is
00:11:31.399 complete to distribute the decryption
00:11:33.240 key now we can distribute the encryption
00:11:36.040 key with both keys in place we ran our
00:11:38.839 specialized migration to reencrypt all
00:11:41.519 of the records in place with again no
00:11:43.800 downtime and without using a plaintext
00:11:47.279 mode so using two keys enabled us to
00:11:50.480 maintain our encryption SLO um we didn't
00:11:53.440 want to use plaintext mode it's very handy
00:11:55.560 but for us we had records that were
00:11:56.880 previously encrypted and we couldn't
00:11:58.320 allow them to to be stored in plain text
00:12:01.240 also because we have so many records our
00:12:03.920 re-encryption process was fairly lengthy
00:12:06.320 right so maybe if you have just a couple
00:12:08.519 records your re-encryption process will
00:12:10.320 not take so long and it's okay but it
00:12:12.839 would not be acceptable for that length
00:12:14.519 of time for our records to be available
00:12:16.120 in plain
00:12:17.519 text so what can you learn from our
00:12:22.199 experience well we learned this during a
00:12:24.600 planned test of our key rotation strategy
00:12:27.480 um and this is certainly one way that
00:12:29.480 you can find out about faults in your
00:12:32.040 system um I do not think that this is
00:12:35.279 the best way to find out about potential
00:12:36.959 faults in your system or your deployment
00:12:39.120 strategy uh really your best bet
00:12:41.480 for detecting this type of potential
00:12:43.639 failure is knowledge of your system um
00:12:46.720 so I would encourage you to do some
00:12:48.560 research and understand your
00:12:50.519 capabilities for updating keys for your
00:12:52.279 production
00:12:54.079 servers can you orchestrate updating all
00:12:56.880 of your keys at once if not how can you
00:12:59.959 roll out your keys to prevent this
00:13:02.040 potential decryption
00:13:04.040 failure we have a centralized key
00:13:06.360 management system that we push out
00:13:08.600 updates do you update Keys via push or a
00:13:11.800 pull what triggers a push um if you use
00:13:15.040 a pull system how do you trigger a pull
00:13:17.800 do you poll for updates how frequently
00:13:19.839 do you
00:13:21.399 poll how and when is your data migrated
00:13:24.639 um is your database sharded will data
00:13:27.360 ever move between shards
00:13:29.519 how will you re-encrypt records for key
00:13:31.560 rotation how long does re-encryption
00:13:33.959 take these are the kinds of things that
00:13:35.760 you should think about when you're
00:13:36.839 thinking about deploying your key
00:13:39.680 rotation
00:13:41.760 system so key rotation and re-encryption
00:13:45.360 of all records was always in our road
00:13:47.079 map but looking back we all agreed that
00:13:49.720 it probably should have been the first
00:13:50.920 thing that we looked at and looked at
00:13:52.480 really hard active record encryption
00:13:55.320 with the backwards compatibility key
00:13:57.120 list makes key rotation really really
00:13:59.680 easy so take advantage of
00:14:02.279 that all right what we thought we knew
00:14:06.399 um one byte in one byte out all of our
00:14:08.759 ciphertexts are the same size
00:14:11.480 sorry just double-checking that the
00:14:13.560 slide is showing everything I wanted
00:14:15.360 to um so AES-256-GCM works like a stream
00:14:19.480 cipher um some cryptographers in the
00:14:21.600 audience might be saying oh but it also
00:14:23.040 works like a block cipher this is true
00:14:25.000 but we really don't have time for that
00:14:26.160 today I'm happy to talk to you about it
00:14:27.880 later um and all ciphertext will be 128
00:14:32.440 bits right so that sounds good um some
00:14:35.920 columns are migrating from plain text
00:14:38.240 but they just need to be resized to hold
00:14:39.759 about 128 bits right I think see some of
00:14:42.639 you see where I'm going with this um our
00:14:45.519 previous encryption scheme stored some
00:14:48.120 metadata but it didn't store quite as
00:14:50.639 much metadata as rails does and not
00:14:53.000 quite in the same
00:14:54.440 way so what did we
00:14:57.120 learn ciphertext is not a one-to-one
00:15:00.240 mapping to encrypted record or what
00:15:02.399 active record encryption's cipher calls
00:15:04.120 the encrypted message and while we
00:15:07.000 accounted for some overhead and
00:15:08.680 migration to the new scheme we didn't
00:15:11.000 fully think this
00:15:13.120 through um all of our ciphertexts are
00:15:15.680 the same size that is true
00:15:17.759 one byte in one byte out there yes however
00:15:20.120 ciphertext is not all that is stored in
00:15:21.880 the
00:15:22.639 database um rails embeds a key ID as
00:15:27.320 part of a simple envelope encryption
00:15:29.120 strategy um it stores all of this in a
00:15:31.120 JSON object and it includes like a
00:15:33.800 couple of other headers if you want to
00:15:36.160 add values to this envelope you need to
00:15:38.160 account for the size any of these
00:15:39.920 metadata bits may be adding to the total
00:15:42.079 size of your
00:15:44.120 record this one uh we thought we were
00:15:46.800 being very clever uh I mentioned before
00:15:49.519 we have our anti-nonce-exhaustion data um
00:15:52.240 that is quite a bit of text I think that
00:15:53.880 might be 26 characters long uh we wanted
00:15:56.800 to uh write for readability right we
00:15:59.519 were writing this and our previous
00:16:01.079 encryption scheme was really good but
00:16:02.880 developers didn't totally get how it
00:16:04.959 worked so we wanted to just be so
00:16:06.440 explicit and clear with everything you
00:16:08.120 know in case they wanted to look at the
00:16:09.519 internals and see the changes that we
00:16:11.160 had made um this is the actual name of
00:16:14.440 the tag that we used if you are familiar
00:16:17.160 with envelope encryption or the types of
00:16:19.079 metadata headers that get appended to
00:16:21.120 Cipher text in encrypted messages you
00:16:24.000 know that usually the names of these
00:16:26.199 tags are just one or two characters
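To make the size cost concrete, here is a rough sketch of how a header name contributes to the stored record. It is an approximation under stated assumptions rather than the exact ActiveRecord::Encryption wire format: the `p` (payload) and `h` (headers) keys come from the talk, while the `iv` and `at` header names, the one-character `y` tag, the `envelope` helper, and the exact spelling `anti_nonce_exhaustion_data` are illustrative guesses.

```ruby
require "json"
require "securerandom"

# Base64 without line breaks, using only core Ruby.
def b64(bytes)
  [bytes].pack("m0")
end

# Hypothetical envelope builder: a JSON object holding the Base64 payload
# under "p" and metadata headers under "h".
def envelope(payload_bytes, extra_headers = {})
  headers = {
    "iv" => b64(SecureRandom.bytes(12)), # nonce (assumed header name)
    "at" => b64(SecureRandom.bytes(16))  # auth tag (assumed header name)
  }.merge(extra_headers)
  JSON.generate("p" => b64(payload_bytes), "h" => headers)
end

payload = SecureRandom.bytes(16) # stands in for a short ciphertext

short_tag = envelope(payload, "y" => "2023")
long_tag  = envelope(payload, "anti_nonce_exhaustion_data" => "2023")

puts payload.bytesize   # raw ciphertext bytes
puts short_tag.bytesize # stored size with a one-character header name
puts long_tag.bytesize  # stored size with the 26-character header name
```

Every record pays for the header name again, so a descriptive 26-character tag costs 25 more bytes per row than a one-character tag, on top of the Base64 and JSON overhead that already makes the stored value much larger than the ciphertext.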
00:16:29.920 so just for a bit of comparison you can
00:16:32.639 see our previous strategy um in red we
00:16:35.800 have a key ID and then in Gray is the
00:16:38.360 encrypted message and these are both
00:16:40.560 packed as binary strings so this comes
00:16:42.440 out to about 42 characters which is
00:16:45.120 quite small right and quite nice active
00:16:47.680 record encryption uses a JSON object
00:16:50.160 which I think is also a nice way to
00:16:51.880 store an encrypted message and we did
00:16:54.480 consider you know that there are there
00:16:56.079 are headers added differently and a
00:16:57.800 little bit larger because they are an
00:16:59.440 object and not you know a packed a
00:17:01.399 packed binary stream but you can see
00:17:03.600 active record encryption the payload
00:17:05.799 with the key P um is quite small and
00:17:09.480 then uh the message headers with the key
00:17:11.760 H is a bit bigger I think this whole
00:17:13.919 thing comes out to about 208 characters
00:17:16.000 only 60 of which are the payload um and
00:17:18.919 you'll see at the bottom someone has
00:17:20.400 added a very long uh message tag with
00:17:23.400 the name anti-nonce-exhaustion
00:17:27.400 data um so we ended up resizing our
00:17:31.880 existing columns and recommending that
00:17:34.000 all of our um new encrypted columns use
00:17:37.520 the type MySQL text uh we made this
00:17:41.080 recommendation based on the fact that we
00:17:43.840 are not allowing deterministic
00:17:45.360 encryption at GitHub um You probably
00:17:48.880 don't want to use can't use text if you
00:17:51.280 need to index on your encrypted
00:17:53.919 columns I personally feel that encrypted
00:17:57.720 records should not be indexed on and you
00:17:59.360 should not use deterministic encryption
00:18:02.080 um but as happened at RailsConf and as
00:18:04.799 I'm sure will happen here someone will
00:18:06.039 tell me a good use case that they have
00:18:08.120 but um for my or for our use case we've
00:18:10.799 decided not to enable deterministic
00:18:12.919 encryption so if you need deterministic
00:18:15.120 encryption just make sure that you use a
00:18:17.400 large enough size uh column but probably
00:18:19.640 not
00:18:21.600 text so what can you learn from our
00:18:25.200 experience um really truly if you are making
00:18:28.640 any changes to the message headers or
00:18:31.240 the metadata understand the size of
00:18:32.919 those bits that you're storing alongside the
00:18:34.240 ciphertext
00:18:35.440 um in our case we really did not
00:18:38.320 need to name the tag anti-nonce-exhaustion
00:18:41.480 data um product Engineers were not quite
00:18:44.159 chomping at the bit to like dig into the
00:18:46.200 internal changes that we made um and
00:18:48.440 really wanting to understand the API at
00:18:50.200 that level so we maybe over-optimized
00:18:52.400 for thinking people would be as excited
00:18:54.159 about this as we were um although they
00:18:56.400 are quite excited about what it buys
00:18:59.520 them um next is the actual value for any
00:19:04.840 um any header values that you may add
00:19:07.600 right our anti-nonce-exhaustion data is
00:19:09.840 the year we determined that with the number of
00:19:11.679 encryptions that we do every year
00:19:13.400 rotating the key yearly automatically in
00:19:15.880 this way would be sufficient for us the
00:19:18.760 most important thing about this value I
00:19:20.840 think is that it is of a fixed length
00:19:24.320 right so um you wouldn't want to use
00:19:26.960 something like um a model and a column
00:19:29.799 name because those are not fixed length
00:19:31.679 right they could be different lengths
00:19:32.880 depending on the model and column um and
00:19:35.159 you might be thinking uh Kylie year is
00:19:37.840 not guaranteed fixed length and that's
00:19:40.400 fair but we do have about 8,000 years
00:19:43.039 before that value will become longer and
00:19:45.480 I think that this is probably enough
00:19:46.880 time for us to figure out a solution if
00:19:48.720 we need to make a change however because
00:19:51.120 we use
00:19:52.080 text I think we will be okay if we add
00:19:54.600 one more
00:19:56.039 character um just to highlight once again
00:20:00.080 all all of these other message headers
00:20:01.960 which are implemented by the rails team
00:20:04.159 are just one or two uh characters and
00:20:06.600 our message header's name is quite
00:20:09.840 long so make it easy for your engineers
00:20:12.880 don't even let them find out about the
00:20:14.480 size of message headers um that will be
00:20:17.480 added to the encrypted message right um
00:20:21.840 so consider longevity like I said um
00:20:25.000 using text and using year for anti-nonce-exhaustion
00:20:27.600 data does buy us about 8,000
00:20:29.880 years and we have a lot of work in the
00:20:31.799 backlog but I think you know if it comes
00:20:33.640 to it and we have to make a change we
00:20:35.159 have the
00:20:36.200 time that's a a joke so you can't laugh
00:20:40.520 or it's your morning
00:20:43.360 too
00:20:45.000 so uh what we thought we knew the price
00:20:47.960 of it just works is performance right
00:20:51.880 when something just works you pay a
00:20:54.799 price right uh and we assumed that this
00:20:57.720 would be performance right encryption
00:20:59.679 can take some time but we were moving
00:21:02.200 from one encryption scheme to another so
00:21:05.159 our Engineers were familiar and
00:21:07.400 understanding of how much time would
00:21:09.760 conceivably be added right and so we
00:21:12.200 figured that this was negligible and for
00:21:14.200 those upgrading a plain text column to
00:21:16.520 encrypted again it will just be such a
00:21:18.720 small amount of time and it's acceptable
00:21:20.400 to our engineering
00:21:23.360 team so what we weren't thinking about
00:21:27.320 was the price of the price of
00:21:29.440 idempotency right sometimes you pay in how
00:21:31.960 much time and sometimes you pay in how
00:21:34.200 many times if it just works how did we
00:21:37.720 find out um with monitoring um and
00:21:41.799 unfortunately with our customer audit
00:21:43.520 log which may some of you may have
00:21:46.159 noticed
00:21:47.720 so we had a bit of a red herring um and
00:21:50.919 we found that some of our custom code
00:21:53.120 was somewhat of the problem but what we
00:21:55.919 ultimately
00:21:57.039 learned some data is just extra special
00:22:01.200 and we had one such extra special column
00:22:03.720 two-factor recovery codes we had
00:22:06.000 monitoring in place to measure things
00:22:07.840 like encryption and decryption failures
00:22:10.559 but we didn't have anything internal in
00:22:12.600 place to measure side effects of our
00:22:14.760 upgrade
00:22:17.120 strategy our upgrade strategy relied on
00:22:19.679 a type tied to a feature flag right the flag
00:22:22.840 would determine if a record should be
00:22:24.320 encrypted or not and it would be set to
00:22:27.120 encrypt before we ran our upgrade
00:22:29.520 migration um and this seemed like a good
00:22:31.840 system and we upgraded a couple columns
00:22:34.440 from our previously encrypted strategy
00:22:36.600 to encrypted and you know we saved what
00:22:39.200 we felt was like a a special column
00:22:41.279 for a little bit later making sure we
00:22:42.880 had really battle tested it before we
00:22:44.880 tried this
00:22:45.919 one um but we overlooked that this
00:22:48.559 column relied on changed in place right
00:22:52.799 so when we were migrating this column
00:22:54.720 which was previously encrypted we found
00:22:57.200 that changed in place would compare
00:22:59.120 decrypted plain text to the encrypted
00:23:02.159 ciphertext um and this will always show
00:23:05.039 as changed right this is uh the virtue of
00:23:08.159 encryption this is the main value of
00:23:09.840 encryption knowing the cipher text
00:23:11.880 should tell you absolutely nothing about
00:23:13.600 the value of the plain text um so the
00:23:15.919 encryption Works quite well uh but we
00:23:18.400 neglected to delegate that changed in
00:23:21.400 place method to the active record
00:23:24.000 encrypted type attribute right um so
00:23:27.679 when we migrated this column related to
00:23:30.120 two- Factor
00:23:31.720 authentication this had the unexpected
00:23:33.799 side effect of causing all of these
00:23:35.440 records to appear to have been changed
00:23:37.840 when they were in fact not um and this
00:23:40.799 generated audit audit logs for our
00:23:44.880 customers um fortunately though there
00:23:46.960 was no actual change to the data the
00:23:50.080 data itself is fine um and our
00:23:52.400 authentication team was really great and
00:23:54.080 very understanding and they annotated
00:23:55.960 the false alerts to indicate this to to
00:23:57.919 the affected customers we fix this by
00:24:00.400 delegating the changed in place to the
00:24:02.039 encrypted attribute type um and this now
00:24:05.039 compares the decrypted cipher text to
00:24:07.440 the plain text which again if encryption
00:24:09.919 was done correctly and with active
00:24:11.440 record encryption it is uh will always
00:24:16.200 match um so we also noticed and this was
00:24:19.760 the red herring we thought perhaps our
00:24:21.440 issue is idempotency right the idea
00:24:23.840 that maybe to get something done a
00:24:26.200 method has to be called a couple
00:24:27.559 different times before it can really
00:24:30.200 fully take effect but the encrypt method
00:24:32.880 is idempotent but we noticed with our
00:24:35.000 monitoring dashboard that encrypt was
00:24:36.600 being called twice um this did not
00:24:40.399 directly contribute to the erroneous
00:24:42.279 audit logs but because we generated
00:24:44.960 these erroneous audit logs we did notice
00:24:47.799 this however luckily like I said encrypt
00:24:50.440 is idempotent um there was a 35 second
00:24:54.799 period where I was extremely ill
00:24:56.480 thinking that we had double encrypted
00:24:58.120 records and thinking oh my goodness how
00:25:00.440 are we going to find out which ones are
00:25:01.440 double encrypted and how are we going to
00:25:03.200 decrypt them and set them back to you
00:25:05.080 know the standard single encryption
00:25:06.799 strategy that we have um but like I
00:25:10.000 mentioned encrypt is idempotent and the
00:25:12.799 bug was fixed if you are for some reason
00:25:15.279 like us relying on Counting the number
00:25:17.640 of encryptions or decryptions to detect
00:25:19.720 potential
00:25:21.760 failures um yes and the bug has been
00:25:24.679 fixed so you can rest well knowing
00:25:28.600 that uh it will only be called once per
00:25:30.799 encryption which is uh I think a really
00:25:32.960 good and appropriate number of times for
00:25:34.640 it to be called so what we learned it
00:25:38.200 just works means right it did just work
00:25:41.080 for most
00:25:42.120 cases but we had a really special case
00:25:45.760 and monitoring can help detect special
00:25:48.399 cases but the problem with monitoring is
00:25:51.120 it's too late this is in production the
00:25:54.080 special case has hit production and
00:25:55.960 could now be affecting your customer
00:25:57.440 data
00:26:00.120 some data is just extra special and
00:26:03.760 you need to take extra time and care to
00:26:05.520 get it
00:26:06.960 right so what can you learn from our
00:26:09.880 experience uh maybe you know don't
00:26:11.520 meddle with the internals but maybe you
00:26:13.960 like us can't help it encrypted data is
00:26:16.919 being encrypted for a reason there might
00:26:19.039 be special monitors or side effects
00:26:21.279 associated with these
00:26:24.200 records but despite all of this all of
00:26:27.440 these bad things that seem to have
00:26:29.320 happened and um quite a bit of sweating
00:26:31.600 on my part we delivered a seamless
00:26:34.320 column encryption strategy we're no
00:26:36.480 longer maintaining divergent column
00:26:38.159 encryption code we have this new easy to
00:26:40.679 use process to upgrade columns from
00:26:42.880 plain text to encrypted which before we
00:26:45.679 could provide but was difficult and a
00:26:47.679 bit arduous and not appealing to our
00:26:50.399 developers um we greatly sped up
00:26:53.480 development time of new encrypted
00:26:55.360 columns which greatly increased adoption
00:26:57.559 of encrypted columns um and we
00:27:00.679 maintained our service level uh with no
00:27:03.200 service interruptions at all um so I
00:27:05.480 think if you keep a couple things in
00:27:07.080 mind you can too um and the two that I
00:27:10.399 think are most important probably are to
00:27:13.080 build with key rotation and key
00:27:15.000 management in mind first I mentioned
00:27:17.240 this before active record encryption
00:27:20.039 makes this very very easy and you should
00:27:22.320 take advantage of it um you could
00:27:25.440 probably build your own column
00:27:26.799 encryption strategy right but building
00:27:29.399 your own key rotation strategy without
00:27:31.760 the support of active record encryption
00:27:33.600 is very difficult and I do not advise
00:27:36.480 you go down that path um and then the
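To make the key rotation point concrete: the Active Record Encryption guide describes configuring a list of keys, where the last key encrypts new content and earlier keys can still decrypt existing content. The sketch below imitates that idea in plain Ruby with the standard library only. `ToyKeyRing` and its `rotate` method are hypothetical names for illustration, not the Rails API, and real key management (storage, derivation, access control) is out of scope here.

```ruby
require "openssl"

# Hypothetical key ring (not the Rails API): the LAST key encrypts new values,
# and every key is tried on decrypt so older cipher text stays readable.
class ToyKeyRing
  def initialize(keys)
    @keys = keys
  end

  def encrypt(plaintext)
    cipher = OpenSSL::Cipher.new("aes-256-gcm").encrypt
    cipher.key = @keys.last
    iv = cipher.random_iv
    cipher.auth_data = ""
    ciphertext = cipher.update(plaintext) + cipher.final
    [iv, cipher.auth_tag, ciphertext].map { |part| [part].pack("m0") }.join(".")
  end

  def decrypt(raw)
    iv, tag, ciphertext = raw.split(".").map { |part| part.unpack1("m0") }
    # Try the newest key first, then fall back to older ones.
    @keys.reverse_each do |key|
      begin
        cipher = OpenSSL::Cipher.new("aes-256-gcm").decrypt
        cipher.key = key
        cipher.iv = iv
        cipher.auth_tag = tag
        cipher.auth_data = ""
        return (cipher.update(ciphertext) + cipher.final).force_encoding(Encoding::UTF_8)
      rescue OpenSSL::Cipher::CipherError
        next # wrong key: the GCM authentication tag does not verify
      end
    end
    raise ArgumentError, "no key in the ring can decrypt this value"
  end

  # Re-encrypt an existing value under the newest key.
  def rotate(raw)
    encrypt(decrypt(raw))
  end
end

old_key = OpenSSL::Random.random_bytes(32)
new_key = OpenSSL::Random.random_bytes(32)

legacy = ToyKeyRing.new([old_key]).encrypt("totp-secret")
ring = ToyKeyRing.new([old_key, new_key])

puts ring.decrypt(legacy)      # still readable with the old key in the ring
rotated = ring.rotate(legacy)  # now encrypted under new_key only
puts ToyKeyRing.new([new_key]).decrypt(rotated)
```

Once every stored value has been rotated, the old key can be dropped from the ring. Doing that re-encryption safely across live rows is the hard part that a hand-rolled strategy has to solve on its own.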
00:27:39.640 next most important thing I think is
00:27:42.159 understand why a column should be
00:27:43.559 encrypted and what side effects there
00:27:45.279 may be on those records right this is
00:27:47.480 live data this affects your customers
00:27:49.519 this is the livelihood of your
00:27:53.080 application deploying seamless column
00:27:56.240 encryption was not seamless but it's
00:27:59.440 very doable and I really believe that
00:28:01.880 it's
00:28:03.120 worthwhile um if you enjoyed this
00:28:05.320 presentation uh if you'd like to learn
00:28:07.360 more we have two blog posts that
00:28:09.720 detail um first uh all the changes
00:28:13.000 that we made and why and then um we have
00:28:15.760 another one that tells you in more
00:28:17.279 detail about a kind of like simplified
00:28:19.480 version of our key rotation strategy and
00:28:22.039 it has some sample code um which you
00:28:24.120 can use and it relies on this really
00:28:25.640 handy gem that I like from Shopify
00:28:27.480 called maintenance tasks that helps you
00:28:29.640 handle these kind of like special
00:28:31.159 transitional migrations um the active
00:28:33.799 record encryption guide which I joked
00:28:35.960 earlier but it really will take you very
00:28:37.760 far um this presentation is just about
00:28:40.480 you know the handful of things that
00:28:41.720 weren't in there um we gave a
00:28:44.080 presentation earlier this year uh which
00:28:46.240 kind of maybe sells you on active record
00:28:48.320 encryption if you're not sold and
00:28:50.159 then two of my absolute favorite uh
00:28:52.159 security engineering books that I've
00:28:53.600 read Real-World Cryptography by David
00:28:56.200 Wong um I opened this book probably like
00:28:58.960 every day or every other day while
00:29:00.279 working on this project and then I
00:29:02.480 really enjoyed um Google's Building
00:29:04.679 Secure and Reliable Systems book it's
00:29:06.640 kind of their security answer to the SRE
00:29:09.000 book um yeah I don't know that we'll
00:29:12.039 have time for questions but I would love
00:29:13.640 for you to come ask me them in person or
00:29:15.559 if you feel shy to come uh you can
00:29:17.440 message me on the conference
00:29:19.480 slack thank you so
00:29:26.320 much