Sergey Sergyenko

Data Management with Ruby

wroc_love.rb 2022

00:00:15.599 Good morning, everyone! Wow, it's all set up. That's very unusual. This is the very first conference I am attending since the pandemic, and I haven't given an in-person talk in front of real people for a while. Usually, things like that never go smoothly. The first challenge is usually connecting your computer to the projector, which can take up to 30 minutes, but we managed to do it quickly today, which is cool. Thank you all for showing up; it's great to see so many people here!
00:01:00.000 I expected there to be maybe 10 or 15 people after yesterday's party, so I really appreciate your attendance. It makes me a little more nervous, but it's still very nice. Speaking of the lightning talks we're having, for me this feels like getting struck by lightning while presenting to you all. I have been giving lectures for 10 years, mostly academic stuff for students, so I'm trying to make sure my talk isn't boring or overly academic. I have a timer set so I don't go over one minute per slide, and I have about 30 slides!
00:01:44.340 If you see me lingering over a slide for more than 10 minutes, please raise your hand and tell me to move on. The interesting stuff is usually toward the end. My name is Sergey Sergyenko, and I've been working with Ruby for 15 years. I want to highlight two projects that I am really proud of. One is the Belarus Ruby User Group, which is about 12 years old. It's still growing and hasn't matured completely, but I hope to see the community develop further. The other is Ruby News, a news aggregator that we launched a year ago, which provides weekly updates.
00:02:21.900 If you're interested, check it out, and if you have ideas, feel free to talk to me about it. Currently, I work for Cybergizer, a software consulting company that has been around for six years. We initially went fully remote because of the pandemic, and now we're trying to reopen our offices as we find new projects that inspire us. I will share some findings from one of those projects during my talk.
00:02:58.500 My talk today is about data management, which has become a buzzword in recent years, especially alongside terms like data science, machine learning, AI, and big data. Before we dive in, could I get a show of hands from anyone who works with data? Not too many? Okay, about 20 people, which means I must be mindful of how I present this. Data management is a broad topic, and it's essential to clarify what it is not. It's not simply database management, which is a different set of concepts. It's also not data governance, which is higher level, nor is it ETL (Extract, Transform, Load). Those who know what ETL is, can you raise your hands?
00:03:44.340 Ah, good! People who work with data often mistakenly equate ETL with data management. Although the two intersect, they are not the same. In terms of data engineering, it's worth emphasizing that a data engineer is distinct from a database administrator, data analyst, or data scientist. Data management encompasses a wide range of disciplines, including structuring data, architecture, shaping data, protecting it, and transforming it.
00:05:02.760 If you search for job opportunities related to data, you'll find an extensive variety with attractive salaries. There are roles like data engineer, database administrator, data architect, and data analyst, among others. Interestingly, Ruby has made a resurgence in this field, which raises the question: is Ruby still relevant for data management, or is it a mere zombie? I believe it is relevant, and that it opens new horizons for Ruby developers to work with data. In the past, Ruby engineers were required to handle every aspect of a project, from choosing a framework to back-end and front-end work, infrastructure, and data handling. Nowadays, there's more specialization, leading to the emergence of roles such as Ruby API engineers.
00:06:54.780 As a Ruby engineer today, you can focus more strictly on Ruby without delving into everything else. This has led to a division in responsibilities. For example, if you ask a Rails engineer if they use Ruby, you might be surprised to find out many of them don’t, as they work primarily with Rails without a solid grasp of Ruby itself. In the context of data, Ruby engineers can step up to fill emerging roles in data.
00:08:56.660 Let's dive deeper into some practical aspects of Ruby and data management. You may need to work with databases, perform migrations, or prepare data from time to time. Saying that you do ETL can be misleading, as many people don't want to be seen as merely database administrators or confined to data management roles. When working with Ruby, you must consciously consider how your applications handle data, especially when designing your systems to be adaptable to future data needs.
00:10:57.960 When it comes to maintaining data integrity and performance, let me ask: how many of you clean data regularly? Data management often becomes messy because instead of cleaning up, we tend to just add more data, thinking we might need it later. However, deleting unnecessary data is crucial for maintaining a healthy database, as excessive data can slow down systems and complicate analysis. Moreover, if you've ever migrated data from one database to another, you'll know how significant that task can be, especially when it involves sensitive information.
00:13:24.400 As part of my insights today, we also need to address compliance and security concerns regarding data. When they start building applications, many people plan to address security issues later on. This reactive approach can lead to serious vulnerabilities if security is not handled correctly from the start. I'll discuss how Ruby can manage sensitive data and how emerging data trends impact Ruby engineers.
00:15:30.420 The great news is that Ruby can indeed be used for data management. Reports indicate that there's been an increased need for languages that can assist in data-related jobs, and Ruby is one of them! This shows Ruby's relevance in the data world and suggests that Ruby can keep up with trends in data science, machine learning, and other related technologies.
00:17:48.420 To work effectively in data management with Ruby, knowing SQL is beneficial. While many recruiters might ask about SQL during interviews, it's not always necessary for every task in your job. Still, knowing about ETL processes and tools for working with data helps you navigate data management and apply your knowledge effectively.
00:19:57.680 General tips for being a successful Ruby data engineer include avoiding the pitfall of N+1 queries, which can hurt performance. Getting the database design right from the start is also essential, because data shapes the architecture of applications. Understanding and properly using indexes can also drastically improve query performance. So always ask questions, explore, and check that your implementation performs well.
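The N+1 pattern mentioned here can be illustrated without a real database. The following is a minimal sketch, not code from the talk: `FakeDb`, its methods, and the sample data are hypothetical stand-ins that only count how many "queries" each approach issues.

```ruby
# Hypothetical in-memory "database" that counts the queries it serves.
class FakeDb
  attr_reader :query_count

  def initialize(authors, posts)
    @authors = authors # array of { id:, name: }
    @posts = posts     # array of { id:, author_id:, title: }
    @query_count = 0
  end

  def all_posts
    @query_count += 1
    @posts
  end

  # One lookup per call: this is what produces N+1 when used in a loop.
  def find_author(id)
    @query_count += 1
    @authors.find { |author| author[:id] == id }
  end

  # One batched lookup, comparable to WHERE id IN (...).
  def find_authors(ids)
    @query_count += 1
    @authors.select { |author| ids.include?(author[:id]) }
  end
end

# 1 query for the posts, then 1 more per post for its author.
def titles_with_authors_n_plus_one(db)
  db.all_posts.map do |post|
    "#{post[:title]} by #{db.find_author(post[:author_id])[:name]}"
  end
end

# 2 queries in total, regardless of how many posts there are.
def titles_with_authors_batched(db)
  posts = db.all_posts
  ids = posts.map { |post| post[:author_id] }.uniq
  authors_by_id = db.find_authors(ids).to_h { |author| [author[:id], author] }
  posts.map { |post| "#{post[:title]} by #{authors_by_id[post[:author_id]][:name]}" }
end

AUTHORS = [{ id: 1, name: "Ann" }, { id: 2, name: "Bob" }].freeze
POSTS = [
  { id: 1, author_id: 1, title: "Indexes" },
  { id: 2, author_id: 2, title: "ETL" },
  { id: 3, author_id: 1, title: "Cleanup" }
].freeze

slow = FakeDb.new(AUTHORS, POSTS)
fast = FakeDb.new(AUTHORS, POSTS)
titles_with_authors_n_plus_one(slow)
titles_with_authors_batched(fast)
puts "N+1: #{slow.query_count} queries, batched: #{fast.query_count} queries"
# prints "N+1: 4 queries, batched: 2 queries"
```

In a Rails application the batched version roughly corresponds to eager loading, for example `Post.includes(:author)`, where the model names are again only illustrative.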
00:22:36.540 Now, for a practical example: if you're dealing with healthcare applications that need to comply with regulations like HIPAA, you must ensure data security and privacy measures are in place. This often leads to challenges when integrating with third-party vendors or analytics tools. It requires a thorough understanding of compliance needs and best practices in handling sensitive user data while keeping the data usable.
00:24:30.780 Lastly, as we innovate, we must find ways to obfuscate data when legally necessary to protect sensitive information while still allowing access to realistic data for testing and development, keeping in mind the principles of data usability versus data visibility.
00:27:51.360 With that in place, I'd like to shift gears and illustrate a recent project where we utilized these approaches for managing healthcare data. Throughout the journey of this specific project, we encountered various hurdles, particularly with sensitive personal information. We noticed an underlying exposure risk that we had to address and manage appropriately.
00:30:58.680 We looked at using data obfuscation to comply with regulations and at the same time ensure that our data integrity remained intact. Techniques included using libraries like Faker to generate realistic but fictitious data, allowing user testing without exposing sensitive user information.
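As a rough illustration of replacing sensitive fields with realistic but fictitious values, here is a minimal sketch. It is not the project's code: to stay dependency-free it swaps Faker for two small hand-written name pools, and every name in it (`fake_name`, `fake_email`, `obfuscate_patient`, the salt, the sample record) is an assumption made for the example. Values are derived from a hash of the original, so the same input always maps to the same fake output, which keeps repeated values consistent across records.

```ruby
require "digest"

# Small hypothetical pools standing in for a generator such as Faker.
FIRST_NAMES = %w[Alex Maria Jonas Priya Tomasz Elena].freeze
LAST_NAMES  = %w[Novak Kim Silva Bauer Okafor Lind].freeze

# Map a real value to a stable integer: same input, same number.
def stable_number(value, salt)
  Digest::SHA256.hexdigest("#{salt}:#{value}").to_i(16)
end

def fake_name(real_name, salt: "dev-only-salt")
  n = stable_number(real_name, salt)
  first = FIRST_NAMES[n % FIRST_NAMES.size]
  last = LAST_NAMES[(n / FIRST_NAMES.size) % LAST_NAMES.size]
  "#{first} #{last}"
end

def fake_email(real_email, salt: "dev-only-salt")
  "user#{stable_number(real_email, salt) % 1_000_000}@example.test"
end

# Replace only the sensitive fields; ids and other columns are kept.
def obfuscate_patient(patient)
  patient.merge(
    name: fake_name(patient[:name]),
    email: fake_email(patient[:email])
  )
end

patient = { id: 42, name: "Jane Roe", email: "jane.roe@clinic.example", visits: 3 }
puts obfuscate_patient(patient)
```

With pools this small, different people can end up with the same fake name, and a salted hash on its own is not a formal anonymization guarantee; a real setup would need a larger generator and a review against the applicable regulations.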
00:36:35.600 To refine this approach, we created a tool called Grazer that works with our application data. With Grazer, we were able to systematically scan through our database models, apply obfuscation techniques, and ensure data was consistently anonymized, generating records while preserving structural integrity.
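The general idea of scanning models and applying per-column obfuscation rules can be sketched as follows. This is not Grazer's actual interface, which the talk does not show; the rule format, `obfuscate_dataset`, and the sample tables are all hypothetical, and plain hashes stand in for database models.

```ruby
# Hypothetical rule set: table => column => callable producing a replacement.
RULES = {
  patients: {
    name:  ->(row) { "Patient #{row[:id]}" },
    phone: ->(_row) { "000-000-0000" }
  },
  visits: {
    notes: ->(_row) { "[redacted]" }
  }
}.freeze

# Walk every table and replace only the columns that have a rule.
# Row counts, keys, and untouched columns (ids, foreign keys) stay as they were.
def obfuscate_dataset(dataset, rules)
  dataset.to_h do |table, rows|
    column_rules = rules.fetch(table, {})
    masked_rows = rows.map do |row|
      row.to_h do |column, value|
        rule = column_rules[column]
        [column, rule ? rule.call(row) : value]
      end
    end
    [table, masked_rows]
  end
end

DATASET = {
  patients: [
    { id: 1, name: "Jane Roe", phone: "555-0101" },
    { id: 2, name: "John Doe", phone: "555-0102" }
  ],
  visits: [
    { id: 10, patient_id: 1, notes: "Follow-up in two weeks" }
  ],
  clinics: [
    { id: 7, city: "Wroclaw" }
  ]
}.freeze

obfuscate_dataset(DATASET, RULES).each do |table, rows|
  puts "#{table}: #{rows.inspect}"
end
```

Keeping the rules in one declarative structure makes it easy to review which columns are treated as sensitive, and tables without rules (here `clinics`) pass through unchanged.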
00:37:45.500 I appreciate your engagement and questions today, which are crucial as we work toward refining our practices in data management. Please go ahead, ask any questions you may have regarding the talk.
00:38:39.080 Thank you all for your attentive participation. Engaging in these discussions helps bridge knowledge gaps and creates a robust learning community. Questions are welcome: anything related to Ruby, data management, and our experiences applying them to real projects can broaden our perspective.