Generate Anonymised Databases with MasKING

00:00:12.420 Hello, everyone. My name is Chikahiro Tokoro. I’m from Japan, and currently, I live in Berlin, Germany. Today, I would like to thank you all for your support and hope that you find something insightful or helpful in my talk. I will be discussing generating an anonymized database with masking, which is my open-source project.

00:00:37.980 Let’s begin by discussing why we want an anonymized database. This comes from my past experience. One day, during a release, our main service timed out and became unreachable. The metrics showed all red, indicating a critical issue. Naturally, we needed to investigate the root cause. We found that the main database was locked during migration, and there were issues with queries on other tables as well. We conducted a postmortem where we reassessed our delivery processes for both local development and staging, including QA and production environments. However, this is not an uncommon situation; many teams experience similar problems.

00:01:37.920 The challenge we faced was that local development had a smaller dataset, which often resulted in missing important data points that would have otherwise been caught in a production environment. If production data and local development data are similar, you could expect to encounter problems within an hour. This is particularly difficult since we should avoid using production datasets for testing. There are several reasons for this, including GDPR compliance, as production datasets often contain sensitive information like emails and names.

00:02:56.200 This leads us to the importance of working with anonymized datasets. A common use case for these anonymized datasets is simulating database migrations, as I previously mentioned. You can effectively recreate migration scenarios in a controlled environment. Sometimes, you may encounter unique constraints that don’t exist in your local environment due to duplicated records, which can lead to failures.

00:03:10.200 Using anonymized datasets can also assist with SQL performance optimization. If you notice slow queries taking around 25 seconds, you will need to ensure you have a representative dataset; otherwise, you will not achieve an accurate execution plan. There could also be bugs that arise from unexpected data values, such as negative prices in orders or missing foreign key records, which can lead to further complications in development and testing.

00:04:27.180 There are many potential use cases for anonymized datasets, including stress testing and the analysis of beta feature previews. There's a service called Snaplet that was founded in 2020 for PostgreSQL that offers similar functionalities but with additional features. It is a paid platform and caters to a more commercial market. My own open-source project, which I began in 2019, aims to provide similar capabilities.

00:05:51.420 Now let’s discuss how to implement the anonymization process. The first approach involves copying the database and updating the records to anonymize sensitive information. This is a common method used when deciding on which tools or libraries to implement for anonymization.

00:06:36.800 Another approach involves setting up database triggers that activate when records are created or updated. This ensures that the new records are automatically anonymized as they are inserted into the database. Alternatively, you can plug into the database connection, intercept SQL commands, and anonymize the data on the fly.

00:07:17.020 Additionally, there’s the option to create an anonymous dump of the database. This involves using your database client to read records and then create an anonymized version of that data. This client can enact the necessary anonymization processes as it interacts with the database.

00:08:09.600 When evaluating these approaches, consider the effort involved. Copying and operating data may seem straightforward, but it comes with significant risks of data leakage as you handle production data directly. Such practices should be treated with caution and high priority around security.

00:09:05.340 Connecting to databases and implementing triggers can also be tricky, given that each environment may require different configurations. A database proxy may add some overhead but can offer real-time updating of data, which is advantageous for data management.

00:09:34.980 Now, I want to introduce my open-source project, which is compatible with MySQL and MariaDB. It is structured with YAML configuration settings that define how to anonymize specific fields, such as emails with unique constraints. You can utilize standard tools to create a dump and perform the anonymization simply by following the provided configuration guidelines.

00:10:55.960 The sample database provided for demonstration includes common schemas, customer names, phone numbers, and addresses. It also contains employee data, which is critical for anonymization scenarios. The configuration details specify the table and relevant column names, along with the type of anonymization needed.

00:12:19.560 After proceeding with the anonymization process, it is essential to validate the content. If you don’t specify a configuration, the existing entries remain unchanged. Compliance checks can be automated during this process for clarity and assurance.

00:12:46.740 Another element of my approach involves utilizing Test-Driven Development (TDD) principles. When I started developing this library, I was not initially familiar with TDD; however, it has proven valuable for detecting design issues early in the development stage.

00:14:25.260 I maintain thorough test coverage and have a structure in place that shows a logical flow, promoting reliability. With the use of Docker, I can create acceptance tests that are executed automatically in a CI/CD pipeline to ensure all versions function as expected.

00:15:12.720 In terms of implementation, the system is straightforward: it relies on configurations for mapping and processing data, generating anonymous dumps appropriately. Initially, I faced uncertainty regarding the process of passing configurations and SQL queries, but adhering to TDD facilitated overcoming these challenges.

00:16:20.160 As I delved into testing, I found various intricacies that required careful handling, such as SQL value sections. Regular expressions help with dynamically creating the structure to suit different datasets, and overcoming issues regarding escaping characters is critical. This reinforces the need for thorough testing during development.

00:18:06.240 Through continuous improvements and iterations, the anonymous solution has become more robust. The end goal is to ensure the library allows for effective anonymization across a variety of data contexts while maintaining compliance with data protection standards, such as GDPR.

00:20:07.620 Moving forward, I plan to expand the library's features to allow for more flexibility, such as supporting random or fake letters and implementing dynamic schemas to enhance data validation capabilities. This means extending the library's reach to accommodate broader contexts while ensuring ease of use and clear documentation.

00:21:22.260 I want to remind you how crucial your feedback is as you utilize this library. If you encounter any issues or have suggestions for improvement, I encourage you to share them with me. I believe in continuous improvement and would love to hear your experiences.

00:22:20.520 As we approach the end of my presentation, I would like to open the floor to any questions you may have. I appreciate your engagement during this talk and am here to clarify any points or delve deeper into specific areas you are curious about.

00:23:09.360 One attendee raises a question regarding GDPR compliance. The speaker responds, acknowledging the importance of GDPR and states that the project is designed with compliance in mind. They conclude affirmatively that they have succeeded in aligning the tool with GDPR standards.

00:24:08.880 To sum up, thank you for your time and engagement. The concepts I’ve shared are fundamental not just for this tool but for broader applications in ensuring data security and privacy in software development. I'm keen on receiving any feedback and insights you may want to share. Let's continue supporting one another as we navigate these important challenges in our field.