00:00:00.199
chikahiro tooco who will Enlighten us I hope that was right who tooro I'm sorry
00:00:07.960
um who will Enlighten us about how to create um generate anonymized databases
00:00:14.200
by with masking so please give him a warm
00:00:21.039
Applause great now okay okay can you hear
00:00:27.599
me great so then thank you for the voting so I would present to you how to
00:00:36.040
generate anonymized database with masking masking is my open source
00:00:42.039
project name it's funny this is my second talk as a unconference St so EUR
00:00:48.840
2023 also like I was chosen but as unconference talk so only unconference
00:00:54.079
stock so far so far now okay but for here why do we want to anonymize
00:01:04.479
database I give you the I share my story or real story one day you are doing for
00:01:11.320
some release and then you notice or everybody notice your main website is
00:01:18.240
time out time out okay and then metx shows something really long right but
00:01:27.320
you don't know what happens and then after the investigation you see in the database
00:01:33.159
there's a bunch of waiting query so what happen is like your main
00:01:39.119
table was log during the database migration and post mem check the
00:01:46.479
delivery process what it how it happens so it is quite normal I think local and
00:01:53.520
then staging QA and production the point is here there is a
00:02:01.680
def quantity of data set and the local maybe 200 MC
00:02:08.280
staging okay it's a check it staging 3 second production 1
00:02:16.800
hour so then crash right but this is I think it's not only
00:02:23.599
one time like know it's common I can see and but what if if you have similar data
00:02:29.920
set of production in your development process and then here in your local
00:02:35.360
Wonder of course you should notice something wrong right but but okay then
00:02:43.480
let's copy to the prodution database please don't there are a bunch of reasons but
00:02:50.080
if you I pick just one ddpr so email the phone numbers please
00:02:57.400
don't copy to your local or even staying testing data so but how but still
00:03:03.799
problem is here then you can use anonymized data
00:03:09.959
set now the question is how to generate it ah sorry but it's uh but for the use
00:03:17.519
case it's not on it's a database migration is for sure one of use case
00:03:24.360
like as I showed you the story but not only sometime so
00:03:30.640
migration could fail because of depends on data let's say for you want to the folding key and production there are
00:03:36.239
some like violation then migration can be fail or this is also typical use case
00:03:43.000
SQL performance optimization so you have a slow query log in production but if
00:03:49.400
you want to optimize you need really accurate execution plan for this you
00:03:56.000
need similar data set with production or sometime so some some bugs is coming
00:04:04.680
from data for instance you have the product table and then you don't know why price
00:04:12.480
is negative or sometime it's okay order and
00:04:18.160
payments records and there's a missing record of order why it
00:04:24.120
happens or duplicate payment this kind bug it's a it's
00:04:30.199
sometimes harder to find or de produce as well or another use case maybe better
00:04:36.440
feature pre preview to the your stakeholders or be use case so this is
00:04:42.880
qu it's actually like sometime it's useful because um like only analyze
00:04:48.080
analyzing purpose or stress test or there might be bunch of use case
00:04:55.479
your idea can be here and actually there a start for doing these same things found in
00:05:03.520
2020 but for the Post glass um but the catch is when I start
00:05:09.880
my masking the date is know 2019 I'm faster faster than
00:05:17.000
them so okay but how to implement
00:05:23.520
anonymization there okay one of strategy maybe copy database and upate record
00:05:30.720
so you have the production data database just copy it and connect and update your
00:05:38.360
database this is when I see the similar similar tools this is most common
00:05:44.479
way but this is not the only way I can imagine set up leca and database
00:05:52.960
trigger so production is here and leca is here you can set up database trigger
00:06:00.039
in this replica to update your sensitive
00:06:05.120
datas or proxy connection your production database here and set up
00:06:11.680
something connection proxy and skill is coming from from from
00:06:18.800
user or application then for the production go for as it is but for the
00:06:25.720
inter left scale on the fly to and might sensitive
00:06:32.039
datas another strategy is it's similar with first one but read database from
00:06:39.880
database client and then generate anonymize database this is a possible
00:06:46.800
and this is second common way when I see the similar product a little bit different it's
00:06:54.639
similar but also you don't need a client just make it dump
00:07:00.160
anonymize it and generate also
00:07:05.479
possible okay now we have four strategies which one do you want to use
00:07:11.039
it let's evaluate this is my evaluation the the
00:07:16.199
effort copy and data update it's for me it's obvious okay maybe effort is low
00:07:21.280
other yeah not sure but database proxy is may be more tricky because you have
00:07:26.800
to read good one and also like solid one and thinking about for the data data
00:07:34.319
leak risk and first one is worst because you have to have your production data as
00:07:41.680
a copy and then see during the update you still have the problem you can be
00:07:47.280
the leing this there um database to also during the
00:07:53.560
trigger there is a data here but trigger also can be fail there a slightly possible to risk here other than
00:08:01.720
not configuration first one and four one is easier just connecting the database
00:08:09.199
proxy yeah just one setup it's gone toer it's very very high because you have to
00:08:14.840
implement logic there performance regation can be happen
00:08:20.120
for the database proxy because your dat production database is a proxy now think
00:08:26.720
about it and then time Gap the first one is a kind of for the two step copy and
00:08:35.120
update it's going to be longer other like third second one to
00:08:41.519
and proxy is almost real time this is quite beneficial um then Al last one is
00:08:47.120
a I can say it's a medium so which one do you want to choose for me I don't need real time
00:08:55.640
data and then so I choose much more like relatively low disk one I choose this
00:09:03.440
one and then let's try to do that that's a result of my tool of the
00:09:11.160
masking so written by Luby and compatible with my skill and
00:09:16.800
Maria DB and how to use
00:09:22.480
this this is quite simple now you just Define for the yamu file which database
00:09:28.640
table and cab name and how how you want to anonymize that value but tricky part
00:09:36.399
is here email because it needs to be often
00:09:42.519
unique and there's a special press folder for this if you put this p
00:09:48.079
parcent and bracket and M then it will be signal
00:09:54.800
number so and so you put with my
00:10:00.480
dump pipe it and then makes the dump file like like a normal like myc dump
00:10:07.839
procedure and then destroy it if you
00:10:13.720
lazy you can do one line I show you
00:10:21.040
now okay just for inake for the um demo just I put from like my
00:10:28.440
scale sample database this is something for just prettyy normal one like customers and then let's take it here
00:10:35.279
sorry it's a bit small but here is a customers data and there's know customer
00:10:41.320
names last name first name phone number address you know there's a bunch of
00:10:46.880
sensitive datas right and then this is a
00:10:55.079
configuration customers is a t like table name and customer name is an
00:11:00.839
customers first name last name or maybe customers maybe I can put the s number
00:11:06.360
to be here like 1 2 3 4 5 6 7even and then phone number can be can be the
00:11:11.920
static address line address line two this is a bit Point here um it's address
00:11:20.079
line two is actually uh n Alo now it's n
00:11:27.680
it's okay so it's a n so if you see the data here so um address line two is yes
00:11:34.920
look it's a can be Nar or level three no it's a nullable column in this case you
00:11:42.240
can specify question mark then it will respect for the
00:11:47.440
now let's try it out okay now you have
00:11:53.839
here uh MySQL dump okay I need complete insert
00:11:59.480
so this I will explain L later and this is a table database name and masking to
00:12:04.920
be here and then just simply animize it okay works then
00:12:19.399
anonymized classic models try out ah I
00:12:26.240
forget classic models ah
00:12:31.399
typos right okay so now here I would
00:12:36.680
reload now can you
00:12:41.880
see so now it's you know customers here last name first one so phon and numbers
00:12:48.000
are here it's anonymized this is small data set that's why it's super fast but
00:12:53.120
it's this is how how works it's clear
00:13:01.720
okay so I show you the demo now okay now you can use you can make the animiz
00:13:07.519
database but now I come to the second part how to implement this internal
00:13:14.000
libraries after he hear this talk you can be more lazy
00:13:19.959
developer there be Design Concepts what I put
00:13:25.240
intentionally put it here in when I develop this tool
00:13:30.760
um first one kiss keep it simple
00:13:37.120
stupid in another word do one thing and do it
00:13:42.639
well this is from unique's philosophy what do I mean
00:13:48.519
here so again this is a interface of this two myc dump masking and then file
00:13:56.160
to be here no database connection because my skill dump is doing for
00:14:03.000
you and no fire Hing only dealing with standard input
00:14:09.440
and output this is one of the big Philosophy for The Unique philosophy so and also this is safe because process
00:14:17.600
happen on the memory so you don't you don't have the chance to the having the production sensitive data on the other
00:14:24.839
word maybe this is liness the development
00:14:31.399
so avoid the invent of oil the Alex also like explained yesterday's nice talk to
00:14:37.839
be it's exactly very similar things so for the handling for the files and for the outputs standard output but this is
00:14:45.920
what is exactly needed it's a passing data and process data same thing applies
00:14:52.560
for my case database connection which is much much complicated than file so but
00:14:57.880
this is a it's really really good for I can really simplify the
00:15:04.279
implementation no external dependencies on the other word no G
00:15:11.160
dependencies use only standard libraries uh L it's having bunch of
00:15:19.560
dependence here to be fair they are internal dependency though but it's like for I mean for the instance so James can
00:15:28.800
be explain what is the dependency here this is masking
00:15:35.120
nothing right again this is laziness so avoid
00:15:40.279
thir party dependencies but this is quite beneficial especially I upgrading
00:15:45.720
Ruby 3.01 to current wi Ruby
00:15:51.079
3.3 almost nothing to be worry about it so it just puts the CI to the Ruby 3
00:15:57.800
next version and then pass C acceptance stest okay because it's just depending
00:16:02.880
on the standard libraries and it's R standard Library it's quite Rich you can
00:16:09.160
do actually many things and uh another concept is a
00:16:15.120
quality I put a lot of technique for the tdd this is more internal process but I
00:16:20.839
found it's very useful tdd it's uh for me it's the most important thing is not
00:16:26.199
testing fast rather than you write test and then you see very addly feedback as
00:16:33.720
if something wrong to writing test then this is a signal that often something
00:16:39.920
bad for the design and this is what I feel for the most tdd and then so I just put I I Tred to
00:16:48.759
do it by by myself and then I tried to Matrix if you do strict apply for the tdd it's a car 100% it automatic comes
00:16:57.360
but don't be atic you don't have to be stick about it but I just try to try it
00:17:03.079
out by myself and then so this is more quality of the code there's it's it
00:17:10.360
is arguable what is a good Cod good good code quality but there's some like to
00:17:15.760
provide you C healthiness and I am the person who most surpris about this
00:17:21.000
result because I I put this analytics after I made this two and I was so
00:17:27.559
surprised okay nothing to fix it and acceptance test we are lucky
00:17:34.440
because we are in the era of the container and then we can use real process for my case my scale or Marb so
00:17:43.320
I can write uh inside for the against for the docker container and accept is
00:17:49.559
share script this is looks like hard but is it's simple just setting up like as I
00:17:55.120
showed you in the demo setting up for the importing data and then test it with real command
00:18:02.200
musular dump and anonymize and then compare the result
00:18:07.880
assion so and then put it when I set up here put it on CI I can test every
00:18:16.159
version in automatically in the CI this makes more
00:18:21.440
confident how do implement it so again so this is masking is as
00:18:28.320
overview it needs for the configuration file and then input as SQL Dum file use input for
00:18:37.679
p and then making anonymized damp this is the overview of the this this
00:18:45.039
two and when I start implementing there's a uncertainty for me um mapping
00:18:52.120
configuration this was more obvious for me but it's more how to pass SQL and also I was not sure how to build
00:19:00.400
whole architecture and again DDD comes to me like you know this is I found it's very
00:19:07.280
helpful um tdd is actually like having two types so classic tdd or Mo tdd
00:19:15.159
inside out outside in let's say so I don't go for very much detail for this
00:19:21.120
but it's classic tdd is a using real object like a layers unit test Mo t it's
00:19:29.360
like oh sorry outside in it's a as it means like it's mocking mocking to the
00:19:35.840
external like any like external dependencies so for here okay let's when
00:19:42.440
I do for the architecture I try to do with tdd and then this input and output
00:19:47.799
is clear for me and then I started from outside in and just for the standard
00:19:54.080
input and standard output and next I try to with the
00:19:59.200
modeling so map configuration and then go for the person rebuild I use inside
00:20:05.159
out after I make this kind of parts parts and then I try to make some kind
00:20:12.200
of connection for them I put the processor but might be this can call it
00:20:17.799
controller or service or use case but whatever but this is very th it doesn't
00:20:25.120
have the logic and then last one is easy so because it's moing so just for the
00:20:31.640
connecting the parts This is How I build the I build the internal
00:20:38.039
applications now pass and rebuild SQL rebuilding it's it's not hard but
00:20:44.679
passing was I was also I was not sure how to do that and also was yes it was
00:20:50.799
bit harder how to pass dump okay this is a dump file but if you
00:20:58.080
analyze so actually this is quite structured one the first one is like
00:21:04.200
setting up database like escale options and second one is a defining
00:21:09.360
table schemas types constraints and then inserting data and looping how until
00:21:17.799
like the table how many tables do you have this is a basic structure of the
00:21:23.159
dump and then so here is a trick when for my use case I need to have I need to
00:21:31.720
know what is inside of the table and then in in here inserting data has no
00:21:37.600
data about what is a table that means I have to pass the table schema that is
00:21:43.799
yes I can do that but I if we can I want to avoid it and then here there's a my
00:21:52.120
skill dump options called complete insert and if you put this option
00:21:59.960
so here structure was not not changing but so in the insert lines now
00:22:07.320
it has the table information then now I don't need to the
00:22:12.440
pass schema so it's now problem become very simple you pass this one line
00:22:18.559
inside line and then rebuild it so makes much much easier so this is what when
00:22:25.600
I'm finding out then how to pass
00:22:31.440
SQL but still you have to pass this what I try to do is I actually try to do with
00:22:39.480
regular expression so but basically this is three parts table information and column
00:22:46.960
information and actual datails just simply um prepare for the regular
00:22:54.360
expression for them but still values it's a bit tricky key because it's uh it
00:23:00.640
can be many many columns but it's already you know how many columns are
00:23:06.520
there in this here so because it's that one two three so then prepare for
00:23:14.120
the express leg expression for the value and then multiply on the fly so it's a
00:23:22.720
kind of dynamically um making the leg expression and how many columns are
00:23:29.240
there depends on and then now actually possible to
00:23:34.799
pass it okay let's try with production data
00:23:41.400
so reality so just as a part is not enough but uh because it's often
00:23:49.520
production having much more harder because you have many many datas kind of like invalid or not invalid but it's
00:23:55.880
unexpected datas but let's see okay let's try it without the journey journey to lady for
00:24:03.600
the production data okay here is the SQL and then okay
00:24:08.640
now empty empty string okay I just forgot it okay so then just fix
00:24:14.799
it and then next okay here there's already some
00:24:21.360
problem this is you know same with like split of of strings okay yes yes fair
00:24:27.320
enough fair enough okay let's fix it and next go
00:24:34.200
here do you see so here there's a it's called scientific notion I didn't know I
00:24:41.080
didn't know the name that time before but okay let's fix it I think okay now I
00:24:46.960
think much much okay then come on don't do
00:24:54.799
that but okay it's a reality don't put that don't put binary in the database but
00:25:00.840
okay as a library Creator I cannot complain for this okay they fix it and
00:25:07.760
then now I can pass it product to later yes same let's go for the restore and
00:25:16.520
somehow it complained that it is invalid SQL and then I was really not sure why
00:25:23.720
and I really need to the investigation what's wrong after a while I realize this escale
00:25:33.159
already have the problem so okay the strategy I use for the
00:25:41.559
passing the values okay how you distinguish the records I use this
00:25:47.880
parenes and camera as a separator so to be here like like spr to
00:25:54.919
here and you can see here do you see now so problem is here so string contains
00:26:04.039
this pattern so that's why it's actually passing
00:26:09.640
wrongly and it was really hard for me it's crushing and for me it's like a
00:26:16.640
like a last B and oh my God I was almost stucking about it I really don't know
00:26:22.960
how to solve this issue almost one two weeks I have no idea how to fix this
00:26:29.760
but after a while I notice if the move par to be
00:26:36.760
here this looks like CSV this passing problem or this pattern
00:26:44.600
is actually same pattern with CSV then now try to see try to see the
00:26:51.240
how to the standard Library fortunately so sorry but
00:26:57.039
Japanese but you can Google translate but but he is a maintainer of
00:27:04.679
the standard CSV liability and he wrote A Blog and ex sorry sorry go ahead
00:27:13.080
oh to be he here oh oh sorry
00:27:18.600
yeah he explains exactly same problem in the CSV
00:27:25.000
if having so you know that say same latest here and then he explain how to solve it
00:27:32.640
let's pick his solution thank you for the community okay now okay I have the
00:27:39.919
programmatics SQL and then still be here there three
00:27:45.559
separators and then now count single
00:27:50.840
quote here so there are two single quote means this is body string and it's body
00:27:57.039
record it's it's if it's even okay this is okay but it's a next next if it's
00:28:04.720
broken broken string then it's not even it's a OD number so that's why you can
00:28:10.679
detect okay this is something wrong and after that how you do that
00:28:16.000
it's actually concatenate with next element now to be here and then count it
00:28:24.799
again four string sorry four single quotes now even number so now it's
00:28:31.960
valid and then next one also like four four single quot also B it okay then now
00:28:39.720
pass it so this is how I Implement here it's quite complex but it's actually you know
00:28:48.039
um kind of like this presentation documentation is actually for me too because it's I can remember this like
00:28:54.799
very complex problem so Al like how to it's very hard to explain in the comment
00:28:59.840
so it's also helping to me too so but this is how I Implement internally
00:29:06.440
so and then but now the great thing is now I can pass it production data and
00:29:14.480
build it now I can happy to say that now
00:29:19.840
version 1.0 so this is already happened in 2019 and then I'm pretty confident that
00:29:26.960
you can use for the production what's next this can be
00:29:33.080
randomized for the fake data because now the static values and only for the price holders but it's like fake faking data
00:29:40.279
lomize data and passing table schema Al one of them like type if you do that you can
00:29:47.799
validate the type in the config because now it's type agnostic so you don't know like and l l the
00:29:55.760
restore and also if you do that we don't need complete insert option anymore and
00:30:02.120
performance optimization um roughly now it's uh slower than slower than 20% for the
00:30:10.159
normal my scale dump which is quite okay for me but it's like still there's a space for the improve for the
00:30:17.279
performance and also postgress I'm plan to do that it's actually like try to do
00:30:23.159
but it I didn't work so much for this here uh if you want to support me GitHub
00:30:28.840
support is sponsor is open or so but or just talk to
00:30:34.440
me okay then that's it thank you so
00:30:40.919
much any feedback is welcome thank you
00:30:46.360
chaho tooro now it's right hopefully thanks a lot for um enlightening us on
00:30:52.279
your journey and your experience with that