Euruko 2017

Data-driven production apps

by Sai Warang

In this talk at Euruko 2017, Sai Warang from Shopify discusses the evolution of data-driven production apps, specifically highlighting the creation of the Capital application aimed at supporting merchants with access to funding. Key points covered in the presentation include:

  • Introduction to Shopify: Sai Warang describes Shopify’s role as a platform for setting up online commerce and how it has evolved over the years.

  • Transition to Data-Driven Products: He explains how, over the past decade, Shopify has collected significant data from its merchants and transformed from a basic application to a data-driven platform.

  • Challenges of Legacy Applications: The presentation addresses the challenges developers face when working with legacy applications versus starting fresh on new projects.

  • Data’s Role in Success: Warang emphasizes the importance of understanding merchant data, especially regarding order processing, in determining which merchants to support financially.

  • Implementation of a Data Warehouse: To better leverage data for decision-making, the introduction of a data warehouse using the ETL (Extract, Transform, Load) principle is discussed, allowing for cleaner and more reliable datasets for analysis.

  • Dimensional Modeling: The talk covers the concept of dimensional modeling for better data organization, facilitating easier access to key metrics without complex SQL queries.

  • Automation and Predictive Analysis: Warang highlights the importance of automating financial insights and the complexities involved in ensuring data quality for reliable predictions.

  • Crucial Takeaways: Warang concludes that a structured approach to data is essential for sustainable development. It enables experimentation and continuous improvement in a legacy environment and builds confidence in automated processes.

Overall, the talk emphasizes the journey of using existing data to create robust applications that enhance merchant experiences and streamline business processes in e-commerce.

00:00:08.040 I would like to introduce Sai Warang and give him a warm welcome.
00:00:15.000 Thank you.
00:00:26.340 I'm Sai, as it says on the slides. I work at a company called Shopify, and I came here from Ottawa, Canada. I’m really excited to be here; this is my first time in Hungary and my first time speaking in front of so many people.
00:00:43.860 To give you a little background about what I'm going to talk about, let me just get accustomed to the feedback.
00:01:11.250 I also want to give everyone some time to settle in before we dive into it. So, a little about Shopify: it's a platform that makes it easy for people to set up online commerce experiences, from online stores to point-of-sale systems at fairs.
00:01:26.140 I work as a developer on a product team called Capital. In the simplest terms, Capital finds promising merchants who are selling on Shopify and provides them with cash so that they can expand their businesses. Entrepreneurship is quite challenging, and one of the biggest obstacles for ambitious people is access to capital.
00:01:48.150 Capital makes that happen, and it is entirely a data-driven product. That is how I ended up here in front of all of you.
00:02:06.130 Shopify has been around for about 10 to 12 years. It’s one of the oldest Rails applications. When it was first created, it wasn't a data-driven product; it was an application built to solve a specific problem. People wanted to set up online stores, process orders, and enable their customers to check out seamlessly. That's what Shopify accomplished for a long time.
00:02:31.870 Over the past decade, Shopify has accumulated a lot of data about merchants doing real business on the platform. We now know what makes merchants successful and what leads to their failures, because we have had a front-row seat to the world of commerce, especially online.
00:02:58.870 With this knowledge, we've built a data-driven product called Capital, which has disbursed $100 million to about 5,000 merchants in just the past year. This rapid growth has increased our confidence in the data and what we've built.
00:03:31.290 This talk is primarily aimed at people working within a legacy application that has been solving real user problems for a long time. If you're starting a greenfield project from scratch, there are numerous tools available that can allow you to structure your application without facing the same constraints and challenges as someone maintaining an older app.
00:03:53.250 Many of the examples I'll discuss are inspired by true events, particularly the story of how Capital was developed over time. I have simplified these examples for clarity. So, let’s rewind to when Shopify was not in a position to build Capital. It was simply a platform that allowed many people to sell online with aspirations of streamlining the business setup process for merchants.
00:04:58.030 We wanted to leverage the data we already possessed because we had the intuition that we could create something valuable by enhancing the merchant experience on Shopify and offering them an unfair advantage.
00:05:39.789 We had a Rails application that allowed people to set up online stores and process orders. The reason I mention this is that a merchant’s success is often tied to the volume of orders they process. If we identify merchants likely to succeed, we can make investments in them.
00:06:11.679 To identify these successful merchants, we cared about aspects such as order processing. Shopify follows the Model-View-Controller architecture, where models are Ruby classes whose behavior mirrors real-world entities, such as users in a web app.
00:06:46.479 In the context of building Capital, we examined important models such as a shop model that organizes each shop's records and its associated orders. The controllers manage the logic that determines how data is presented to various consumers of the app, and the views, which are templates, deliver the data to users.
00:07:28.530 This separation is crucial because it ensures that views don't need to concern themselves with where the data comes from. This concept resonates with me. Still, every time I discuss something seemingly obvious or grand, my skeptical cat pops up in my thoughts, reminding me to be cautious.
00:08:07.210 What’s the big deal? We now expect a lot from existing products. Platforms like Spotify should know your music preferences, Facebook knows your interests, and Google completes your searches for you. We desire the automation of repetitive tasks and value-added recommendations.
00:09:05.019 Our focus is on determining which merchants pose risks and which do not before providing them with money. This is crucial because giving money to unreliable candidates can lead to undesirable outcomes.
00:09:56.230 Once we identify a merchant as low-risk, the next question is how much capital we can provide them. This is similar to the classic banking underwriting process.
00:10:16.329 We want to automate the process rather than having a person manually evaluate each merchant's status and execute financial transactions. This is essentially a classic data problem.
00:10:38.010 One might imagine a system where you simply tell a data system to predict how much money can be allocated to a specific shop or merchant. But is that possible with the capabilities we had at the time—a typical Rails application?
00:11:06.210 If you want such predictions, you have some heavy lifting to perform, and it's not always clear where to begin. When faced with a problem, I often put on my thinking hat and brainstorm potential solutions, reaching for intriguing technologies like artificial intelligence and machine learning.
00:11:52.000 However, implementing these cutting-edge tools within a legacy environment can be a daunting endeavor. This realization leads to the question: are we overshooting our goals by aiming for a complete data-driven solution?
00:12:31.760 Let's consider a fundamental question: do we simply want to know how many orders a merchant processes? Are we content with basic aggregations, or do we seek more complex insights? While these calculations can be performed within a Rails app, growing apps often encounter challenges with performance.
00:13:41.990 As we pursue ambitious goals, running complex calculations on the production app could cause us to drop essential traffic from our established users. Alternatively, you might decide to create a presentation layer for your data, which does not require extensive additional development.
00:14:00.000 But what if the data cannot be trusted? Creating reliable dashboards for business processes—like financial reporting—demands high-quality data. It’s a frustrating experience when test data leaks into production systems, skewing reports with absurd values.
00:14:39.860 We need to evaluate our data quality and ensure we have mechanisms for complex analysis and real-time predictions. While datasets can certainly be flawed, we can aim for an exceptional standard by filtering out unwanted data and validating that the insights generated are sound.
00:15:22.040 Considering the state of our vanilla Rails app, we might find data quality to be acceptable, provided we filter carefully. Complex analysis introduces performance challenges, and real-time predictions are rarely feasible without structured data.
00:16:06.029 To tackle these challenges, I propose implementing a data warehouse, which functions on the extract, transform, load (ETL) principle. You take production tables and populate another system, allowing for cleaner data that does not carry the burdens of legacy decisions.
00:16:31.110 In this setup, you can extract records that have been updated since the last checkpoint, thus maintaining up-to-date information in your data warehouse without causing stress on your production database.
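A minimal sketch of the checkpoint-based extraction described here, assuming a hypothetical `orders` table with an `updated_at` column stored as ISO-8601 text. The table, column names, and sample rows are illustrative and not taken from Shopify's actual schema.

```python
import sqlite3


def extract_updated_orders(conn, last_checkpoint):
    """Return orders changed since last_checkpoint, plus the new checkpoint."""
    rows = conn.execute(
        "SELECT id, shop_id, total, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_checkpoint,),
    ).fetchall()
    # Advance the checkpoint only as far as the newest row actually seen.
    new_checkpoint = rows[-1][3] if rows else last_checkpoint
    return rows, new_checkpoint


# Hypothetical production table, kept in memory for the example.
production = sqlite3.connect(":memory:")
production.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, shop_id INTEGER, total REAL, updated_at TEXT)"
)
production.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 10, 25.0, "2017-09-01T10:00:00"),
        (2, 10, 40.0, "2017-09-02T09:30:00"),
        (3, 11, 15.0, "2017-09-03T12:00:00"),
    ],
)

# Only rows touched after the previous run are read from production.
batch, checkpoint = extract_updated_orders(production, "2017-09-01T23:59:59")
```

Because each run reads only the rows changed since the stored checkpoint, the load on the production database stays proportional to recent activity rather than to the full table size.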
00:17:08.700 You can filter out sensitive information and maintain compliance with regulations while still having accessible metrics and analysis capabilities. The result is a robust warehouse that allows you to effectively rule out unwanted noise and focus on reliable insights.
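The transform step could look roughly like the sketch below. The field names (`customer_email`, `card_last4`, `is_test`) are assumptions made for illustration; a real pipeline would take the list of sensitive columns from its own compliance requirements.

```python
# Hypothetical names for columns that must never reach the warehouse.
SENSITIVE_FIELDS = {"customer_email", "card_last4"}


def transform(record):
    """Return a warehouse-safe copy of a production record, or None to skip it."""
    if record.get("is_test"):
        return None  # test orders would skew reports, so they are dropped
    return {
        key: value
        for key, value in record.items()
        if key not in SENSITIVE_FIELDS and key != "is_test"
    }


raw = [
    {"id": 1, "shop_id": 10, "total": 25.0,
     "customer_email": "a@example.com", "card_last4": "4242", "is_test": False},
    {"id": 2, "shop_id": 10, "total": 999999.0,
     "customer_email": "qa@example.com", "card_last4": "0000", "is_test": True},
]

# Keep only the records that survive the transform.
clean = [row for row in map(transform, raw) if row is not None]
```

Doing the filtering once, during the load, means every downstream report starts from the same cleaned rows instead of repeating the same exclusions in each query.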
00:17:55.340 With the ETL process in place, you allow for periodic snapshots of fresh information, providing the clean datasets required for accurate analysis. It also enables experimentation with new algorithms and tools because the legacy constraints are lifted.
00:18:57.300 Dimensional modeling comes next. It allows you to consolidate data meaningfully based on business dimensions rather than the schemas of the production environment.
00:19:26.700 You aim to present facts—numerical values that measure business performance—without being burdened by the need to always perform joins in SQL.
00:19:59.040 When utilizing dimensional modeling, you end up creating an order dimension as a primary way to evaluate your data, allowing for the quick lookup of sales and trends over time.
00:20:36.950 The goal is to have data readily accessible for analysis. You can efficiently derive numerical values, such as how much a specific store sold daily, without constantly performing joins.
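To illustrate the idea, here is a small sketch of a denormalized daily sales table queried without joins. The table name `fact_daily_shop_sales`, its columns, and the sample rows are hypothetical and chosen only for the example.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
# One row per shop per day; the shop name is copied in so no join is needed.
warehouse.execute(
    "CREATE TABLE fact_daily_shop_sales ("
    "shop_id INTEGER, shop_name TEXT, sale_date TEXT, "
    "order_count INTEGER, total_sales REAL)"
)
warehouse.executemany(
    "INSERT INTO fact_daily_shop_sales VALUES (?, ?, ?, ?, ?)",
    [
        (10, "Cat Toys", "2017-09-01", 3, 75.0),
        (10, "Cat Toys", "2017-09-02", 5, 120.0),
        (11, "Tea House", "2017-09-01", 1, 15.0),
    ],
)


def daily_sales(conn, shop_id):
    """Return (sale_date, total_sales) pairs for one shop, oldest first."""
    return conn.execute(
        "SELECT sale_date, total_sales FROM fact_daily_shop_sales "
        "WHERE shop_id = ? ORDER BY sale_date",
        (shop_id,),
    ).fetchall()


cat_toys_sales = daily_sales(warehouse, 10)
```

The aggregation work (counting orders, summing totals) happens when the table is loaded, so the question "how much did this store sell each day" becomes a single-table lookup.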
00:21:07.300 By establishing these dimensions and facts, you are structuring the findings in a way that can be reused across different analyses, leading to cleaner, more maintainable data architecture.
00:21:39.240 It's crucial to remember that these dimensions are denormalized representations, making it easy to access the information you need without complicated queries.
00:22:14.110 Maintaining simplicity in the data retrieval process means your calculations can run seamlessly, helping you to derive insights and experiment effectively.
00:22:49.100 Dimensional modeling enables clear definitions of metrics. It also encourages thinking through your data and metrics long-term, enabling informed decision-making based on stable and meaningful constructs.
00:23:28.130 When rapidly building features, the quick-and-dirty approach often results in complex data structures, such as JSON blobs that frustrate efficient querying and understanding.
00:24:23.880 For example, in Shopify, it's possible for a merchant to leave or sell their business, making it crucial to have accurate tracking of their financial history. Relying solely on a generic event table with uncontrolled JSON data leads to difficulties in analysis.
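One way to tame such an event table during the load is to flatten each JSON payload into explicit columns. The sketch below assumes a hypothetical `ownership_transferred` event and a `new_owner_id` key; neither name comes from the talk.

```python
import json


def flatten_event(event):
    """Turn a generic event row with a JSON payload into a flat record."""
    payload = json.loads(event["payload"])
    return {
        "shop_id": event["shop_id"],
        "event_type": event["event_type"],
        "occurred_at": event["created_at"],
        # Pull out only the keys the analysis needs; missing keys become None.
        "new_owner_id": payload.get("new_owner_id"),
    }


event = {
    "shop_id": 10,
    "event_type": "ownership_transferred",
    "created_at": "2017-09-05T08:00:00",
    "payload": '{"new_owner_id": 77, "note": "sold business"}',
}

row = flatten_event(event)
```

Once the interesting keys live in named columns, an ownership change can be found with an ordinary filter instead of parsing JSON inside every query.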
00:25:12.860 Dimensional modeling helps you establish the correct pathways to capitalize on business changes and provides a structured approach to ensure your product is effectively designed for tracking key metrics.
00:25:54.790 Over time, you can build and use testing frameworks for algorithms that help identify the most effective models through experimentation—affording the opportunity for actionable insights without burdening your production environment.
00:26:41.480 Validation mechanisms should run alongside the automation of financial insights, preventing scenarios like inadvertently issuing substantial loans because of incorrect predictions from the system.
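A guard of this kind can be as simple as a few sanity checks applied to every prediction before money moves. The function name, the thresholds (`max_ratio`, `hard_cap`), and the rules themselves are assumptions for illustration, not Capital's actual underwriting logic.

```python
def validate_offer(predicted_amount, trailing_sales, max_ratio=0.5, hard_cap=50_000):
    """Return (ok, reason) for a predicted funding amount.

    max_ratio and hard_cap are hypothetical limits chosen for the example.
    """
    if predicted_amount <= 0:
        return False, "non-positive amount"
    if predicted_amount > hard_cap:
        return False, "exceeds hard cap"
    if predicted_amount > trailing_sales * max_ratio:
        return False, "too large relative to sales"
    return True, "ok"


# A plausible prediction passes; an absurd one is stopped before disbursement.
reasonable = validate_offer(5_000, trailing_sales=20_000)
absurd = validate_offer(1_000_000, trailing_sales=20_000)
```

Checks like these are independent of the prediction model, so they keep working as a safety net while the algorithms behind the predictions are swapped out and experimented with.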
00:27:24.160 The structure of your design matters, so the architecture you've built must be reliable enough to withstand the complexities of enterprise-level operations, creating reliable boundaries in the processes.
00:27:53.140 A layered approach offers the flexibility necessary to encourage experimentation and help balance the risk of potential pitfalls. Resiliency should be built into your operations from the beginning, ensuring sustainable pathways to reach your key objectives.
00:28:36.930 As you navigate the integration of insights into your existing application, pairing custom algorithms with your interface provides a continuous feedback loop for enhancements.
00:29:20.349 Finally, the validation procedures can enhance confidence in the automation efforts, permitting growth and seamless interactions within the application. Experimentation is essential to continue evolving your data-driven platform, allowing fluid improvements and innovative outputs.
00:29:57.390 I loved discussing this topic, and while I could elaborate further, I appreciate your time. I would like to open the floor to any questions.