RailsConf 2022

Open the gate a little: strategies to protect and share data

by Fernando Perales

In this talk titled "Open the gate a little: strategies to protect and share data," Fernando Perales explores the challenges of granting access to production data while maintaining security and compliance with regulations like HIPAA. He emphasizes the importance of protecting personally identifiable information, especially in industries like healthcare, and shares strategies for safely sharing data when necessary. Key points discussed include:

  • Understanding Data Restrictions: The need to comprehend the specific reasons why someone requests access to production data to ensure only necessary information is shared.
  • Health Regulations: An overview of HIPAA and its implications, emphasizing its role in protecting health information.
  • Case Studies: Perales highlights significant incidents involving unauthorized access to sensitive data, such as unencrypted laptops leading to severe fines.
  • Anonymization Techniques: He suggests using data anonymization to share subsets of information securely, introducing tools like PostgreSQL Anonymizer to mask sensitive data.
  • Static Masking and Dynamic Masking: Perales explains the concepts of static and dynamic masking, demonstrating how to change or hide sensitive data based on user roles or needs.
  • Generalization: He illustrates how data can be generalized to protect individuals while still allowing for necessary analysis or research.
  • Key Takeaways: It is crucial to understand what data is actually needed before granting access and ensure that only minimum necessary data is shared to reduce risk.

Ultimately, Perales emphasizes the need for careful consideration when handling sensitive data, underlining that data can be hard to protect once it is out of production control. Participants are reminded that being cautious with data distribution is paramount to maintaining privacy and compliance.

00:00:00.900 Can you name a more terrifying set of three words in software development than "HIPAA violation fines"? I bet you can't.
00:00:12.660 Let's start with a little bit more about me. My name is Fernando, or Fer, because I know we are all about productivity.
00:00:14.120 That name is three times shorter, so that's good. My last name is Perales, which I came to realize one month ago is Spanish for "pear trees." I didn't like the Earth.
00:00:20.939 I'm coming from Guadalajara, Mexico, which is not very far from here. There’s a flight, a four-hour flight. It's a nice place, and I've been doing programming for pretty much eight years, mostly consulting.
00:00:32.340 Of those eight years, five months were spent in the U.S. working at a startup. I didn't like the startup life. I’m part of the Boost team, and I also host the Ruby MX community. You probably saw something in the schedule regarding meetups or community events, so we recorded that yesterday.
00:00:58.440 It was really cool to meet more people who happen to speak Spanish, and this is my fifth RailsConf as a speaker, so it's really important for me. The picture I have is not really a picture; it’s an illustration by Sarah. That's your Instagram, like your space. So, let’s do some warm-up questions.
00:01:28.020 Raise your hand if you have access to a production server or database. That’s interesting. Now raise your hand if you would feel more comfortable not having access to that production server. Yes, it's a big responsibility to have. Again, raise your hand if you're comfortable with the security measures your organization takes.
00:01:45.420 Then you can say, "Okay, I can sleep easy every night without a problem. I know if there's an issue, someone’s going to take care of it." Okay, not a lot of hands... that was expected. Regardless of your answer, this might not be the talk for you. There are more capable people who can help you or your organization prevent unauthorized access to your data services from outsiders, also known as hackers.
00:02:21.180 There are consulting companies that make a living doing this, and they are very good at letting you know what you can improve. Don't assume that, because you’re a small company, you are not a target of interest for hackers. However, raise your hand if you have a copy of production data on your machine.
00:02:49.680 Okay, that’s interesting. Raise your hand if someone from your organization has asked you for a copy of production data. Okay, interesting. Now raise your hand if you have provided a copy of production data to someone in your organization. No judgment, so feel free to raise your hands.
00:03:00.600 Last question: raise your hand if you are concerned about copies of production data being in someone’s hands. Yes, so if you answered yes to at least one question, this is a talk for you. The inspiration for this talk comes from a few cases, and I believe the reason you decided to attend is something called HIPAA.
00:03:43.620 What is HIPAA? It’s the Health Insurance Portability and Accountability Act. It's a United States federal statute that was signed into law in 1996. It modernizes the flow of healthcare information and specifies how personally identifiable information maintained by healthcare and insurance industries should be protected.
00:04:31.260 In general, it prohibits healthcare providers and businesses from disclosing protected health information to anyone other than a patient or the patient's authorized individuals. A related term is protected health information, or PHI, also referred to as personal health information. It includes demographic information, medical histories, test results, laboratory results, mental health conditions, and other data that healthcare professionals use to identify individuals and determine appropriate care.
00:05:57.300 What is considered protected information? Pretty much anything like names, addresses, dates of birth, dates of hospitalization, insurance numbers, fax numbers, email addresses, and social security numbers. All of that needs to be protected. Any unique identifying characteristics must also be protected, as they could lead to identification. One case that caught my attention is the situation where an unencrypted stolen laptop led to over one million dollars in fines because an employee's computer went missing with protected health information of about 20,000 records.
00:07:46.800 Last time I checked, I didn't have one million dollars to spare on fines. You might assume that if your machines are encrypted, there shouldn’t be an issue, right? However, if you have protected information and cannot document that the device was encrypted, you still need to meet the HIPAA requirements.
00:08:36.660 You might think you don’t have to worry if you don't have any health information in your hands; maybe you work in fintech or another kind of industry. Am I safe if my app is not health-related? One aspect of consulting, which I have been doing for the last eight years, is that you may work with clients from outside the States, where you must worry about local legislation.
00:10:01.979 In my country, Mexico, for example, we have a law called the Federal Law of Protection of Personal Data Held by Private Parties, which was approved in 2010. It aims to regulate the practice of informational self-determination. This means that various organizations, including banks, insurance companies, hospitals, schools, and telecommunications companies, are required to comply with its provisions.
00:12:01.200 This law is quite similar to HIPAA. The law has some interesting cases; I was affected along with 93.4 million Mexicans when our personal data was exposed on Amazon due to a vulnerability. A government entity regrettably provided a copy of a complete database to political parties, and one party uploaded that data to Amazon without protection, leading to the exposure of 132 gigabytes of sensitive information.
00:13:46.020 So one of the first lessons I want to impart here is: don’t give production copies to everyone. That could be pretty much the end of the talk. That’s the safest thing to do. But what if we could provide only what’s needed? A general term we can use is "anonymization of data." If you think about the reasons someone from an organization might want access to production data, most of the time they don’t require the whole dataset; they just need specific pieces of information.
00:14:32.640 Perhaps they want to do some research or calculations on specific datasets. It’s up to your organization, but we must do some data anonymization before providing production data. One tool that I have been using recently is called PostgreSQL Anonymizer. It masks or replaces personally identifiable information or any commercially sensitive data.
00:16:25.140 This extension is specifically designed for Postgres. There’s a repository with a demo, so you don’t have to follow everything closely. I’m going to share the link on Twitter and Slack, so if you miss anything, you can visit the repo and see what’s going on. For this example, I have created a sample application with a table called users that includes an ID, first name, last name, street line one, street line two, zip code, email, and salary.
00:18:47.700 Let's say someone needs to get a copy of this data. Maybe they want to analyze the structure or perform calculations with salaries for determining bonuses or company finances. We can anonymize this data using the extension. We have to be careful when doing this in production to avoid messing anything up. I suggest altering the database, pre-loading the extension, and ensuring that we create the extension if it doesn't already exist.
00:20:37.020 As we proceed, we can define rules for static masking, such as applying masking rules and shortening columns. For instance, we may not need precise address data for certain calculations, so we can provide only the necessary information. We can shuffle columns or add noise so that calculations over the data still yield the same or similar results without disclosing the real values behind them. We should avoid using real data in most cases, while making sure the statistics of the salary data remain intact.
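The static masking ideas above (partial values, shuffling, noise) can be sketched in plain Ruby. This is only an illustration under stated assumptions: the USERS rows and the helper names partial_mask, shuffle_column, add_noise, and static_mask are hypothetical and are not the extension's API, which applies such rules inside Postgres.

```ruby
# Hypothetical in-memory stand-in for the talk's users table (made-up rows).
USERS = [
  { id: 1, first_name: "Ana",   zip_code: "44100", salary: 52_000 },
  { id: 2, first_name: "Luis",  zip_code: "44160", salary: 61_000 },
  { id: 3, first_name: "Marta", zip_code: "45010", salary: 75_000 },
  { id: 4, first_name: "Pablo", zip_code: "45050", salary: 48_000 }
].freeze

# Partial masking: keep the first `keep` characters and hide the rest.
def partial_mask(value, keep: 2, filler: "*")
  value[0, keep] + filler * [value.length - keep, 0].max
end

# Shuffling: reassign one column's values across rows.
# The set of values is unchanged, so sums and averages stay the same.
def shuffle_column(rows, column, rng: Random.new)
  shuffled = rows.map { |row| row[column] }.shuffle(random: rng)
  rows.each_with_index.map { |row, i| row.merge(column => shuffled[i]) }
end

# Noise: scale each value by a random factor within +/- ratio.
# Aggregates are only approximately preserved.
def add_noise(rows, column, ratio: 0.05, rng: Random.new)
  rows.map do |row|
    factor = 1.0 + rng.rand(-ratio..ratio)
    row.merge(column => (row[column] * factor).round)
  end
end

# Static masking: build a new, masked copy and leave the input untouched.
def static_mask(rows, rng: Random.new)
  masked = rows.map do |row|
    row.merge(first_name: "REDACTED", zip_code: partial_mask(row[:zip_code]))
  end
  shuffle_column(masked, :salary, rng: rng)
end

masked_users = static_mask(USERS)
```

Shuffling keeps the salary total and average unchanged because the same values are only reassigned to different rows. Noise alters each value slightly, so it trades exact aggregates for extra distance from the real figures.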
00:23:49.680 While this method provides a good approach to data protection, we also have dynamic masking that allows us to adjust data visibility based on user roles. With this framework, we can define who can see what data in a very controlled manner. Moreover, the Anonymous Database Tool offers an alternative for exporting masked data directly without needing separate processing.
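The idea of dynamic masking, where the stored data stays intact and what a reader sees depends on their role, can be sketched as follows. This is a minimal illustration in plain Ruby, not how the extension implements it: the role names, the VISIBLE_COLUMNS table, the MASK placeholder, and the view_for helper are all assumptions made for this example.

```ruby
# Hypothetical per-role rules: the columns each role may see unmasked.
VISIBLE_COLUMNS = {
  admin:   %i[id first_name last_name email salary],
  analyst: %i[id salary]
}.freeze

# Placeholder shown instead of a hidden value (assumed for this sketch).
MASK = "CONFIDENTIAL"

# Return a masked view of a row for the given role.
# The original row is not modified; unknown roles see nothing unmasked.
def view_for(role, row)
  visible = VISIBLE_COLUMNS.fetch(role, [])
  row.map { |column, value| [column, visible.include?(column) ? value : MASK] }.to_h
end

row = { id: 7, first_name: "Ana", last_name: "Lopez",
        email: "ana@example.com", salary: 52_000 }

analyst_view = view_for(:analyst, row)
```

Defaulting unknown roles to an empty allow-list means a role that was never configured sees only masked values, which matches the talk's advice to share the minimum necessary.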
00:27:03.480 To wrap up, the key takeaways from this talk are to first understand the reasons why someone needs data before saying yes or no. Always provide only what is needed without compromising user information. Lastly, regardless of the tools you use, be careful with the data you handle, especially once it leaves the server, as it becomes harder to ensure its protection.
00:27:32.460 Once the data is shared, remember that you lose control over it and what others do with that information. Thank you.