Of Mice and Metrics


by Kara Bernert

The video "Of Mice and Metrics" features a talk by Kara Bernert at GoRuCo 2015, in which she draws parallels between her background in biomedical research and her experience as a software developer. Bernert discusses developers' perpetual dissatisfaction and their pursuit of quality and efficiency, and she emphasizes the crucial challenge of turning data into actionable insights, akin to the challenge the scientific method addresses in research.

Key Points:

- Background: Kara Bernert introduces herself as a software developer with experience in biomedical research and a degree in chemistry.

- Developer Dissatisfaction: Developers have a constant drive to improve their processes and tools but often struggle to connect their questions with actionable data.

- Scientific Method: Bernert outlines the scientific method as a framework for rigorous data analysis and hypothesis testing, though she notes the limitations in strictly adhering to it.

- Data Hierarchy: She introduces the concept of a data hierarchy that ranges from basic observations to more structured, controlled data, asserting that valuable insights exist at every level.

- Case Study: Bernert shares a case study regarding user interaction with lists on Gust’s platform.

  - Initial negative user feedback prompted a thorough analysis of user behavior using basic observational data such as usage logs.

  - Middle-tier data collection assessed the perceived effectiveness of proposed features based on user activity.

  - Higher-tier testing with a controlled rollout of new vs. old lists was considered but modified due to feasibility concerns.

- Feedback Analysis: In place of a rigorous experimental design, Bernert describes comparing user feedback from before and after the lists' overhaul to gauge success, transforming qualitative feedback into quantifiable metrics that can be compared against retention rates.

- Statistical Techniques: She mentions techniques such as analysis of covariance to manage confounding variables and refine the understanding of the data's implications.

Conclusion: Kara Bernert's talk underscores the importance of adapting scientific principles to software development processes. By recognizing the diverse sources and levels of data, developers can leverage both qualitative and quantitative insights to improve software features effectively. This approach facilitates learning and iterative enhancements in the software ecosystem.

00:00:00.280 Hello, everyone.
00:00:04.310 My name is Kara Bernert, and I am a software developer. Until recently, I was working in biomedical research, and I hold a degree in chemistry. Today, I will share some insights I gained in the lab and discuss how I apply them to the way we collect data about our processes and the software we build.
00:00:18.540 I've been learning a lot about software development and about software developers themselves. One thing I've noticed is that developers are often perpetually dissatisfied. There is a constant drive to improve, optimize, and iterate. We want high quality, efficiency, usability, and reusability for both our users and our fellow engineers.
00:00:39.210 This pursuit leads us to many questions about the best methods to achieve these goals. On the other hand, software developers truly love good tools. We enjoy building, finding, and utilizing them. Our toolkit provides us with plenty of data; however, we often struggle to connect our questions with the data to find meaningful answers. This is essentially the challenge that scientists face continuously.
00:01:02.610 Scientists have developed a framework of best practices for addressing this challenge, known as the scientific method. It is probably familiar to many of you from grade school: make observations, formulate a hypothesis to explain those observations, and then, through a rigorous set of best practices for controlling experiments, test your hypothesis. The aim is to prove it, or at least support it, and continue iterating on this process. While effective, this method is often time-consuming, expensive, and can sometimes feel impossible. I found in my lab experience that only about ten percent of the data we collected strictly adhered to this scientific method, while getting a paper published typically requires that around ninety percent of data comes from such processes.
00:01:39.900 What, then, was the source of all the data we were producing that wasn’t included in publications? Were we simply bad scientists wasting our time? I would argue not. I believe this can be explained through the concept of a data hierarchy, which represents a continuum of data—from basic observations at the bottom, which ideally are objective, to more specific observations that may not have resulted from carefully controlled experiments at the top.
00:02:03.400 I contend that meaningful data can be obtained at every level of this hierarchy. The key is to acknowledge the origin of the data within the hierarchy to inform how we use it. There are all kinds of useful data, and I'd like to illustrate this with examples of how we utilized data from across the hierarchy to change lists at Gust. We had a substantial amount of data on our site, where users interact with various lists, but we received negative feedback about the lists being confusing. Many customers even resorted to exporting their data to Excel to avoid using them altogether. This prompted us to overhaul our lists.
00:02:35.300 As part of the overhaul, we wanted to incorporate some data collection to evaluate the success of the new features we introduced. The first type of data we gathered came from the bottom of the hierarchy: cheap and abundantly available. We logged everything users did with these lists: when they filtered, sorted, took actions, and how long they spent on the platform. This data was meticulously gathered and organized, although it did not answer any specific, targeted questions. Still, it was valuable for observing user behavior patterns.
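The bottom-of-the-hierarchy logging described here can be sketched as a minimal append-only event log; the class and method names below are illustrative, not from the talk or Gust's codebase.

```python
import time

class EventLog:
    """Append-only log of user interactions with a list UI."""

    def __init__(self):
        self.events = []

    def record(self, user_id, action, **details):
        # Log everything users do: filters, sorts, actions, dwell time.
        self.events.append({"user_id": user_id,
                            "action": action,
                            "timestamp": time.time(),
                            **details})

    def count(self, action):
        # Cheap observational query over the raw log.
        return sum(1 for e in self.events if e["action"] == action)

log = EventLog()
log.record("u1", "filter", field="stage")
log.record("u1", "sort", field="name")
log.record("u2", "filter", field="stage")
print(log.count("filter"))  # → 2
```

This kind of log answers no targeted question by itself, but any later question about usage patterns can be run over it after the fact.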
00:03:03.700 We also engaged in some middle-of-the-hierarchy data collection when considering whether to implement a certain filter. We were uncertain about its effectiveness and worried that it might produce empty results for users if they filtered by it. Recognizing this as an assumption, we added a measure on our site to determine how many results users received from that filter. This addressed a specific question, but it was not a rigorous test. Instead, it provided insights that might inform future improvements or additional testing.
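The middle-tier measurement she describes, checking how many results the questionable filter actually returns, can be sketched as a simple rate; the result counts here are invented for illustration.

```python
def empty_result_rate(result_counts):
    """Fraction of filter applications that returned zero results."""
    empty = sum(1 for n in result_counts if n == 0)
    return empty / len(result_counts)

# Hypothetical counts of results each time a user applied the filter.
counts = [0, 3, 0, 12, 7, 0, 1]
print(round(empty_result_rate(counts), 3))  # → 0.429
```

A high rate would confirm the worry that the filter often strands users on an empty page, motivating a redesign or a more rigorous test.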
00:03:37.340 To evaluate whether the new lists were successful, we needed data from the top of the hierarchy. Our hypothesis was straightforward: if we improved the lists, users would not leave our site as frequently. We planned a simple test to validate this hypothesis. By rolling out the new lists to half of our users while keeping the other half on the old lists, we could compare the two groups without altering anything else on the site.
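One common way to implement the 50/50 rollout she describes is deterministic hash-based assignment, so each user always sees the same variant; this sketch is an assumption about implementation, not Gust's actual code.

```python
import hashlib

def assign_variant(user_id: str) -> str:
    """Deterministically split users 50/50 between old and new lists."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return "new_lists" if digest[0] % 2 == 0 else "old_lists"

# The same user always lands in the same group across sessions,
# so nothing else about their experience changes mid-experiment.
for uid in ("alice", "bob", "carol"):
    assert assign_variant(uid) == assign_variant(uid)
    print(uid, assign_variant(uid))
```

Hashing the user ID, rather than flipping a coin per request, keeps the two groups stable for the duration of the experiment.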
00:04:03.800 However, this approach would require a long wait for statistically significant results, which could lead to financial difficulties. Thankfully, even when the most rigorous scientific tests are impractical or inadvisable, we have many other tools at our disposal. We decided to explore user feedback as a potential indicator. It was reasonable to expect a correlation between positive feedback and user retention—if users felt positively about the changes, they would likely remain on the site.
00:04:36.320 We evaluated the feedback received prior to implementing the new lists against the feedback after implementation. However, a major challenge was that user feedback consists of qualitative data, which can be challenging to quantify. Fortunately, we can transform qualitative feedback into quantitative data using scoring systems often utilized in science, allowing us to statistically compare our feedback quantities with the customer retention rates.
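One simple way to turn qualitative feedback into numbers, in the spirit of the scoring systems she mentions, is crude keyword scoring; the word lists and feedback strings below are invented for illustration.

```python
# Map each feedback message to a score in {-1, 0, +1}.
POSITIVE = {"love", "great", "clear", "easy"}
NEGATIVE = {"confusing", "hate", "slow", "broken"}

def score(feedback: str) -> int:
    words = set(feedback.lower().split())
    # bool arithmetic: +1 if any positive word, -1 if any negative word.
    return (len(words & POSITIVE) > 0) - (len(words & NEGATIVE) > 0)

before = ["the lists are confusing", "I hate these lists"]
after = ["the new lists are great", "filtering is easy now", "still slow"]

mean_before = sum(map(score, before)) / len(before)
mean_after = sum(map(score, after)) / len(after)
print(round(mean_before, 2), round(mean_after, 2))  # → -1.0 0.33
```

Once feedback is numeric, the before/after means can be compared statistically and correlated against retention.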
00:05:10.080 While we can analyze user feedback quantitatively, it does present challenges: feedback is a self-selecting sample, which may not accurately represent our entire population. To address this concern, we could solicit feedback from a randomly selected sample of users, helping ensure that our sample represents the user base and controlling for numerous variables.
00:05:31.600 If confounding variables still affect our analysis despite these adjustments, we can address them via our analytical methods. Techniques like analysis of covariance allow us to tease out the effects of these variables. This, essentially, is how we leverage the scientific method to tackle our list issues.
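Analysis of covariance adjusts group means for a confounding covariate before comparing them. A minimal sketch, with invented retention and prior-activity numbers standing in for real data, might look like this:

```python
from statistics import mean

def pooled_within_slope(groups):
    """Pooled within-group slope of the outcome on the covariate."""
    num = den = 0.0
    for g in groups:
        mx, my = mean(g["activity"]), mean(g["retention"])
        num += sum((x - mx) * (y - my)
                   for x, y in zip(g["activity"], g["retention"]))
        den += sum((x - mx) ** 2 for x in g["activity"])
    return num / den

# Invented data: retention (outcome) and prior activity (covariate).
old = {"retention": [0.50, 0.55, 0.60], "activity": [10, 12, 14]}
new = {"retention": [0.70, 0.72, 0.80], "activity": [11, 13, 15]}

b = pooled_within_slope([old, new])
grand = mean(old["activity"] + new["activity"])

def adjusted_mean(g):
    # Remove the part of mean retention explained by prior activity.
    return mean(g["retention"]) - b * (mean(g["activity"]) - grand)

print(round(adjusted_mean(new) - adjusted_mean(old), 3))  # → 0.165
```

A full ANCOVA would also test whether this adjusted difference is statistically significant, but the adjustment step above is the part that teases the covariate's effect out of the group comparison.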
00:06:08.970 I hope that through this process we can learn a great deal about our features and their success going forward. Thank you for your time. You can find me on Twitter and GitHub.