
Generate Anonymised Databases with MasKING

Chikahiro Tokoro • June 09, 2024 • Hamburg, Germany • Talk

The video titled Generate Anonymised Databases with MasKING, presented by Chikahiro Tokoro at Ruby Unconf 2024, focuses on the importance and methodology of creating anonymised databases using a technique called masking. The speaker discusses various scenarios in which anonymising data is essential, particularly emphasizing the need to protect sensitive information in development environments. Key points from the presentation include:

  • Reasons for Anonymisation: Tokoro shares a story in which the main website went down because a database migration locked a main table, taking far longer in production than in local or staging environments. He highlights various reasons for needing anonymised data, such as complying with regulations like GDPR, optimizing SQL performance, and identifying bugs caused by production data.

  • Strategies for Implementing Anonymisation: The speaker outlines multiple strategies for generating anonymised datasets, including:

    • Copying production databases and updating records.
    • Utilizing database triggers to anonymise data during replication.
    • Using a proxy connection to modify data on-the-fly.
    • Generating dumps of anonymised databases directly.
  • Evaluation of Strategies: Tokoro evaluates these methods based on effort, data-leak risk, configuration cost, performance impact, and time gap, concluding that generating an anonymised dump suits his needs best: it requires little effort and configuration and carries a low risk of data leakage, at the cost of not producing real-time data.

  • Introduction to MasKING Tool: The video presents MasKING, Tokoro’s open-source tool designed for database anonymisation. He explains its configuration and ease of use, demonstrating how to set it up with YAML files to define how sensitive data, such as email addresses and phone numbers, should be anonymised.

  • Design Principles: Tokoro elaborates on the design philosophies guiding the development of MasKING, including keeping it simple (the Unix idea of doing one thing and doing it well) and avoiding external dependencies in favour of Ruby's standard library. He emphasizes the importance of Test-Driven Development (TDD) in ensuring code quality and reliability.

  • Challenges and Solutions: The speaker discusses challenges faced when parsing SQL data for anonymisation, describing a complex issue he encountered with production data and how community resources helped him find a solution.

In conclusion, Tokoro underlines the successful implementation of MasKING, now stable for production use. He invites feedback and support as he continues to enhance his project. The session not only provides insights into practical anonymisation techniques but also reflects on the code development philosophy that can bolster software quality.


Ruby Unconf 2024

00:00:00.199 Chikahiro Tokoro, who will enlighten us, I hope that was right, Tokoro, I'm sorry, who will enlighten us about how to generate anonymised databases with MasKING. So please give him a warm applause.
00:00:21.039 (Applause) Great. Now, okay, can you hear me?
00:00:27.599 Great. Then thank you for the voting. I will present to you how to generate anonymised databases with MasKING. MasKING is the name of my open source project.
00:00:42.039 It's funny, this is my second talk as an unconference talk: at EuRuKo 2023 I was also chosen, but as an unconference talk, so it's only unconference talks so far. Okay. But why do we want to anonymise a database?
00:01:04.479 Let me share a real story. One day you are doing some release, and then you notice, or rather everybody notices, that your main website is timing out.
00:01:18.240 The metrics show something taking really long, but you don't know what is happening. After the investigation you see a bunch of waiting queries in the database.
00:01:33.159 What happened is that your main table was locked during a database migration. In the post-mortem you check the delivery process to see how it happened.
00:01:46.479 The process itself is quite normal, I think: local, then staging/QA, then production. The point is that each environment has a different quantity of data.
00:02:01.680 On local the migration takes maybe 200 milliseconds; on staging, okay, you check it, 3 seconds; in production, 1 hour. So then it crashes, right?
00:02:23.599 And I think this is not a one-time thing; it is common, as far as I can see. But what if you had a data set similar to production in your development process? Then of course you would notice already on your local machine that something is wrong.
00:02:43.480 But okay, "then let's just copy the production database." Please don't. There are a bunch of reasons, but I will pick just one: GDPR. Emails, phone numbers, please don't copy them to your local machine, or even to staging or testing environments.
00:03:03.799 But the problem is still there. So instead, you can use an anonymised data set.
00:03:09.959 Now the question is how to generate it. But first, a word about the use cases. Database migration is for sure one use case, as in the story I showed you, but it is not the only one.
00:03:30.640 Sometimes a migration fails because it depends on the data. Let's say you want to add a foreign key, and in production there are some violations; then the migration can fail. Another typical use case is SQL performance optimization: you have a slow query log in production, but to optimize it you need a really accurate execution plan, and for that you need a data set similar to production.
00:03:56.000 Sometimes bugs come from the data. For instance, you have a product table and you don't know why a price is negative. Or you have order and payment records and an order record is missing: why did that happen? Or there is a duplicate payment. These kinds of bugs are sometimes harder to find or reproduce.
00:04:30.199 Another use case might be a feature preview for your stakeholders, or a BI use case; it is sometimes useful for analysis purposes only, or for a stress test. There might be a bunch of other use cases; your own idea fits here too.
00:04:55.479 Actually, there is a startup doing the same kind of thing, founded in 2020, but for PostgreSQL. The catch is that when I started MasKING it was 2019, so I was faster than them.
00:05:17.000 So, okay, but how do we implement anonymisation? One strategy is to copy the database and update the records: you have the production database, you copy it, connect to the copy, and update it. When I look at similar tools, this is the most common way. But it is not the only way I can imagine.
00:05:44.479 You can set up a replica and database triggers: production is here, the replica is here, and you set up database triggers on the replica that overwrite your sensitive data.
00:06:05.120 Or a proxy connection: your production database is here, you set up a connection proxy, and the SQL coming from the user or application goes to production as it is, but the proxy intercepts the SQL on the fly and masks the sensitive data.
00:06:32.039 Another strategy is similar to the first one, but you read the data through a database client and then generate an anonymised database from it. This is possible, and it is the second most common way among the similar products I have seen. And a little bit different, but similar: you don't even need a client, you just take a dump, anonymise it, and generate the result. That is also possible.
00:07:05.479 Okay, now we have four strategies. Which one do you want to use? Let's evaluate them. This is my evaluation.
00:07:16.199 Effort: for copy-and-update it is obvious to me that the effort is low; for the others I am not so sure, and the database proxy may be more tricky, because you have to find a good, solid one, or write it yourself.
00:07:34.319 Then think about the data leak risk. The first one is the worst, because you have to keep a full copy of your production data, and during the update you still have the problem that it can leak. The database trigger also holds the data in a replica, and a trigger can fail, so there is a slight risk there too.
00:08:01.720 Configuration: the first and the fourth are easier, since you are just connecting to a database or making a dump; the proxy needs only one setup, but the cost is very high, because you have to implement the masking logic there.
00:08:14.840 A performance regression can happen with the database proxy, because your production database now sits behind a proxy; think about that.
00:08:26.720 And then the time gap: the first one is a two-step copy-and-update, so it is going to take longer; the trigger and the proxy are almost real time, which is quite beneficial; and the last one, I would say, is medium.
00:08:47.120 So which one do you want to choose? For me, I don't need real-time data, so I chose the relatively low-risk one.
00:09:03.440 I chose this one, and then I tried to do it.
00:09:11.160 Here is the result, my tool: MasKING. It is written in Ruby and compatible with MySQL and MariaDB.
00:09:16.800 So how do you use it?
00:09:22.480 It is quite simple. You just define, in a YAML file, which database table and column you want to mask and how you want to anonymise that value.
00:09:36.399 The tricky part is email, because it often needs to be unique. There is a special placeholder for this: if you put a percent sign with n in curly braces, it will be replaced by a sequential number.
00:09:54.800 Then you pipe mysqldump into it, and it produces a dump file just like the normal mysqldump procedure; afterwards you restore it. If you are lazy you can do it in one line, as I will show you now.
00:10:21.040 For the demo I loaded a MySQL sample database, a pretty normal one with a customers table. Let's look at it. Sorry, it's a bit small, but here is the customers data: customer name, last name, first name, phone number, address. There is a bunch of sensitive data, right?
00:10:55.079 And this is the configuration: customers is the table name, and under it are the columns. For the customer name I can put the sequence number, like 1, 2, 3, 4, 5, 6, 7; the phone number can be a static value; then address line one and address line two.
00:11:20.079 There is one small point here: address line two is actually nullable. If you look at the data, address line two can be NULL. For a nullable column like this you can specify a question mark, and then the NULL will be respected.
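As a rough illustration of the kind of configuration being described here (the column names come from the MySQL classicmodels sample database; the exact key layout, the %{n} placeholder, and the question-mark marker for nullable columns are written as assumptions based on the talk, not as the gem's documented syntax):

```yaml
# masking.yml -- illustrative sketch only
customers:
  customerName: "Anonymised Customer %{n}"   # %{n} becomes a sequential number, keeping values unique
  contactLastName: "Lastname %{n}"
  contactFirstName: "Firstname"
  phone: "000-0000-0000"                     # static replacement value
  addressLine1: "Anonymised street 1"
  addressLine2: "?Anonymised street 2"       # '?' marks a nullable column so NULL rows stay NULL
```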
00:11:47.440 Now let's try it out. Okay, here you have mysqldump; I need the --complete-insert option, which I will explain later; then the database name, and MasKING piped after it, and it simply anonymises it. Okay, it works.
00:12:19.399 Now let's restore the anonymised classicmodels database and try it out. Ah, I forgot "classicmodels". Ah, typos, right? Okay, so now I reload it.
00:12:41.880 Can you see? The customers are here, last name, first name, phone numbers, and they are anonymised. This is a small data set, that's why it is super fast, but this is how it works. Is it clear?
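The one-liner being demonstrated looks roughly like the following sketch, assuming the CLI executable is called masking, it reads the dump on standard input, and it picks up masking.yml from the current directory (flags and database names are illustrative):

```sh
# Dump with --complete-insert, anonymise, and restore into a scratch database in one pipeline.
mysqldump --complete-insert classicmodels \
  | masking \
  | mysql anonymized_classicmodels
```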
00:13:01.720 Okay, so I showed you the demo. Now you can make an anonymised database. But now I come to the second part: how is this implemented internally? After you hear this talk you can be a lazier developer.
00:13:19.959 There are design concepts that I intentionally put in when I developed this tool.
00:13:30.760 The first one is KISS: keep it simple, stupid. In other words, do one thing and do it well; this is from the Unix philosophy. What do I mean here?
00:13:48.519 Again, this is the interface of the tool: mysqldump, then masking, then a file. There is no database connection, because mysqldump does that for you, and no file handling; it deals only with standard input and output. That is one of the big ideas of the Unix philosophy.
00:14:09.440 It is also safe, because the processing happens in memory, so you never have a copy of the production-sensitive data sitting around. In other words, maybe this is laziness in development.
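To make the standard-input/standard-output idea concrete, a Unix-style filter of this shape can be reduced to a few lines of Ruby; this is a minimal sketch, the mask_line helper is hypothetical, and the real masking internals are organised differently:

```ruby
#!/usr/bin/env ruby
# Minimal filter: read the dump from STDIN, write the masked dump to STDOUT.
# Nothing touches the filesystem and nothing opens a database connection.

def mask_line(line)
  line # hypothetical placeholder for the real masking logic
end

STDIN.each_line do |line|
  STDOUT.write(mask_line(line))
end
```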
00:14:31.399 So: avoid reinventing the wheel. Alex explained very similar things in yesterday's nice talk. Handling files and writing output goes through standard output, and that is exactly what is needed here: parsing data and processing data. The same thing applies in my case to the database connection, which is much more complicated than a file, so this really lets me simplify the implementation.
00:15:04.279 No external dependencies; in other words, no gem dependencies, only the standard library. Rails, for instance, lists a bunch of dependencies here; to be fair, they are internal dependencies. And this is masking: nothing, right?
00:15:35.120 Again, this is laziness: avoid third-party dependencies. It is quite beneficial, especially when upgrading. From Ruby 3.0.1 to the current Ruby 3.3 there was almost nothing to worry about; I just point the CI at the next Ruby 3.x version and the acceptance tests pass, because it depends only on the standard library. And the Ruby standard library is quite rich; you can actually do many things with it.
00:16:15.120 Another concept is quality. I applied a lot of TDD technique; this is more of an internal process, but I found it very useful.
00:16:26.199 For me the most important thing about TDD is not testing first as such, but that you write the test and get very early feedback: if something feels wrong while writing the test, that is a signal that something in the design is often bad. That is what I feel is the core of TDD, and I tried to apply it strictly myself.
00:16:48.759 As for metrics: if you apply TDD strictly, 100% coverage comes automatically, but don't be dogmatic about it; you don't have to be strict, I just tried it out for myself.
00:17:03.079 Then there is the quality of the code. It is arguable what good code quality is, but there are tools that report code healthiness, and I am the person most surprised about this result, because I added this analysis after I had built the tool, and I was so surprised: okay, nothing to fix.
00:17:34.440 And for acceptance tests we are lucky, because we are in the era of containers, so we can use the real process, in my case MySQL or MariaDB. I can write tests against a Docker container, and the acceptance test is a shell script. It looks hard, but it is simple: it just sets things up as I showed you in the demo, imports the data, runs the real mysqldump command, anonymises, and then compares the result in an assertion.
00:18:07.880 When I put this on CI, I can test every version automatically, and this makes me more confident.
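A rough sketch of what such a container-based acceptance test could look like, assuming Docker and the mariadb image; the file names, credentials, and masking invocation are illustrative rather than the project's actual test script:

```sh
#!/bin/sh
set -e

# Start a throwaway MariaDB container and load the sample data (names are illustrative).
docker run -d --name masking-test -e MARIADB_ROOT_PASSWORD=secret -p 3306:3306 mariadb:latest
sleep 20   # crude wait for the server to come up
mysql -h 127.0.0.1 -u root -psecret < fixtures/classicmodels.sql

# Run the real pipeline and compare against an expected, already-anonymised dump.
mysqldump -h 127.0.0.1 -u root -psecret --complete-insert classicmodels \
  | masking > actual.sql
diff expected.sql actual.sql && echo "acceptance test passed"
```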
00:18:21.440 So how do I implement it? Again, this is MasKING as an overview: it needs the configuration file, takes an SQL dump file on standard input, and produces an anonymised dump. That is the overview of the tool.
00:18:45.039 When I started implementing it there was some uncertainty for me. Mapping the configuration was fairly obvious, but how to parse SQL was not, and I was also not sure how to build the whole architecture. And again TDD came to my rescue; I found it very helpful.
00:19:07.280 TDD actually has two styles, classic TDD and mockist TDD, or inside-out and outside-in, let's say. I won't go into much detail, but classic TDD uses real objects, like layered unit tests, while outside-in, as the name says, mocks the external dependencies.
00:19:42.440 So when I worked on the architecture I tried to do it with TDD. The input and output were clear to me, so I started outside-in, with just standard input and standard output. Next I worked on the modelling, the config mapping and the parse/rebuild parts, using inside-out.
00:20:05.159 After I had made these parts, I made some kind of connection between them; I called it the processor, but you might call it a controller, a service, or a use case, whatever. It is very thin and has no logic. That last step is easy, because everything is already mocked; it is just connecting the parts. This is how I built the internal application.
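As a hedged sketch of that shape, the thin processor only wires the parts together; the class and method names below are invented for illustration and are not the gem's actual API:

```ruby
# Illustrative wiring only; the real masking gem's class names differ.
class Processor
  def initialize(config:, parser:, builder:)
    @config  = config   # table/column -> replacement rules from the YAML file
    @parser  = parser   # turns an INSERT line into table, columns, rows
    @builder = builder  # turns masked rows back into an INSERT line
  end

  # Read the dump line by line, mask INSERT statements, pass everything else through.
  def run(input: $stdin, output: $stdout)
    input.each_line do |line|
      statement = @parser.parse(line)
      output.write(statement ? @builder.build(statement, @config) : line)
    end
  end
end
```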
00:20:38.039 Now, parsing and rebuilding SQL. Rebuilding is not hard, but parsing, I was not sure how to do that either, and it was a bit harder: how do you parse a dump?
00:20:50.799 This is a dump file, and if you analyse it, it is actually quite structured. The first part sets up the database, the MySQL options; the second part defines the table schema, types and constraints; then comes the data insertion; and that loops for however many tables you have. This is the basic structure of a dump.
00:21:23.159 And here is the trick. For my use case I need to know what is inside the table, but the data-inserting part carries no information about the table columns. That means I would have to parse the table schema. Yes, I could do that, but if possible I wanted to avoid it.
00:21:43.799 And there is a mysqldump option called --complete-insert. If you turn this option on, the structure does not change, but the INSERT lines now carry the column information, so I don't need to parse the schema any more. The problem becomes very simple: you parse this one INSERT line and then rebuild it, which makes things much, much easier. This is what I found out about how to parse the SQL.
00:22:31.440 But you still have to parse this line. What I tried to do is use regular expressions. Basically the line has three parts: the table information, the column information, and the actual data. You simply prepare a regular expression for each of them.
00:22:54.360 The values are still a bit tricky, because there can be many, many columns. But at this point you already know how many columns there are, because you have parsed them: one, two, three. So you prepare a regular expression for a single value and then multiply it on the fly; you build the regular expression dynamically, depending on how many columns there are. And with that it actually becomes possible to parse it.
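The dynamic-regex idea can be sketched roughly like this in Ruby; the sample INSERT line uses the classicmodels customers table, and the patterns are illustrative rather than the gem's actual implementation, ignoring several edge cases the talk goes on to describe:

```ruby
line = "INSERT INTO `customers` (`customerNumber`, `customerName`, `phone`) " \
       "VALUES (103,'Atelier graphique','40.32.2555'),(112,'Signal Gift Stores','7025551838');"

header = /\AINSERT INTO `(?<table>[^`]+)` \((?<columns>[^)]+)\) VALUES /
match  = line.match(header) or raise "not a --complete-insert INSERT line"

table   = match[:table]                              # => "customers"
columns = match[:columns].scan(/`([^`]+)`/).flatten  # => ["customerNumber", "customerName", "phone"]

# One sub-pattern per value (quoted string or bare token), repeated once per column.
value = /(?:'(?:[^'\\]|\\.)*'|[^,)]+)/
row   = /\((#{([value.source] * columns.size).join(',')})\)/

rows = match.post_match.scan(row).flatten
# => ["103,'Atelier graphique','40.32.2555'", "112,'Signal Gift Stores','7025551838'"]
```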
00:23:34.799 Okay, let's try it with production data. The reality is that just being able to parse is not enough, because production data is often much harder: you have many values that are not exactly invalid, but unexpected. So let's see the journey to get ready for production data.
00:24:03.600 Okay, here is the SQL, and... now, empty strings. Okay, I just forgot about those, so fix it. And next, okay, here there is already some problem, the same thing with splitting of strings. Fair enough, fair enough, let's fix it.
00:24:34.200 And the next one, do you see? This is called scientific notation; I didn't even know the name at the time, but okay, let's fix it. I think it is much better now. And then... come on, don't do that. But okay, it is reality: please don't put binary data in the database, but as a library creator I cannot complain about it. Okay, fix it.
00:25:07.760 And then I could parse the production data. Great, same thing, let's go for the restore, and somehow it complained that the SQL is invalid, and I was really not sure why. I really needed to investigate what was wrong.
00:25:33.159 After a while I realised the generated SQL itself already had the problem. Okay, here is the strategy I used for parsing the values. How do you distinguish the records? I used this closing-parenthesis, comma, opening-parenthesis pattern as a separator, so I split there.
00:25:54.919 And you can see it here, do you see? The problem is that a string value can contain this exact pattern, and that is why it was being parsed wrongly.
00:26:09.640 It was really hard for me; it kept crashing, and for me it was like the final boss. Oh my god, I was almost stuck on it; I really didn't know how to solve this issue, and for almost one or two weeks I had no idea how to fix it.
00:26:29.760 But after a while I noticed: if you move the parentheses around, this looks like CSV. This parsing problem, this pattern, is actually the same pattern as in CSV. So I tried to see how the standard library handles it.
00:26:51.240 Fortunately... sorry, it is in Japanese, but you can Google-translate it: a maintainer of the standard CSV library wrote a blog post, and he explains exactly the same problem in CSV, the same final boss, and he explains how to solve it. So let's pick his solution. Thank you, community.
00:27:39.919 Okay, now I have the problematic SQL, and after splitting there are still three pieces here. Now count the single quotes in each piece: two single quotes means the string, and therefore the record, is complete. If the count is even, this piece is okay.
00:28:04.720 But if the string is broken, the count is not even, it is an odd number, and that is how you can detect that something is wrong. And what do you do then? You concatenate the piece with the next element, like this, and count again: four single quotes, an even number, so now it is valid. And the next one also has four single quotes, also valid. Okay, now we can parse it.
00:28:39.720 So this is how I implemented it. It is quite complex, and actually this presentation documentation is also for me, because it helps me remember this very complex problem; it is very hard to explain in a code comment, so preparing it helps me too. But this is how it is implemented internally.
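That concatenate-and-recount step can be sketched in a few lines of Ruby; assume the row text has already been split on the "),(" separator, and note that this sketch ignores backslash-escaped quotes, which a real implementation has to account for (names are illustrative, not the gem's internals):

```ruby
# Re-join pieces that were split in the middle of a quoted string.
# A piece is complete only when it contains an even number of single quotes.
def rejoin_broken_rows(pieces)
  rows = []
  buffer = nil
  pieces.each do |piece|
    buffer = buffer ? "#{buffer}),(#{piece}" : piece  # restore the separator we split on
    if buffer.count("'").even?
      rows << buffer   # quotes are balanced: this is a complete row
      buffer = nil
    end
  end
  rows
end

pieces = "1,'plain'),(2,'tricky ),( value'),(3,'ok'".split("),(")
# => ["1,'plain'", "2,'tricky ", " value'", "3,'ok'"]
rejoin_broken_rows(pieces)
# => ["1,'plain'", "2,'tricky ),( value'", "3,'ok'"]
```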
00:29:06.440 But now the great thing is that I can parse production data and rebuild it. I am happy to say that it is now version 1.0; this already happened in 2019, and I am pretty confident that you can use it for production.
00:29:26.960 What's next? Randomized fake data: right now the values are static, with only the placeholders, so proper fake or randomized data is one idea. Parsing the table schema is another, for the types: if I do that, I can validate the types in the config, because right now it is type-agnostic and you only notice problems at restore time, and we also would not need the --complete-insert option any more.
00:30:02.120 And performance optimization: roughly it is now about 20% slower than a plain mysqldump, which is quite okay for me, but there is still room to improve the performance. Also PostgreSQL: I plan to do that, I actually tried, but I have not worked on it much yet.
00:30:23.159 If you want to support me, GitHub Sponsors is open, or just talk to me. Okay, then that's it. Thank you so much; any feedback is welcome.
00:30:46.360 Thank you, Chikahiro Tokoro, now I said it right, hopefully. Thanks a lot for enlightening us on your journey and your experience with that.