00:00:08.040
I would like to introduce Sai Warang and give him a warm welcome.
00:00:15.000
Thank you.
00:00:26.340
I'm Sai, as it says on the slides. I work at a company called Shopify, and I came here from Ottawa, Canada. I’m really excited to be here; this is my first time in Hungary and my first time speaking in front of so many people.
00:00:43.860
To give you a little background about what I'm going to talk about, let me first get used to the audio feedback in the room.
00:01:11.250
I also want to give everyone some time to settle in before we dive into it. So, a little about Shopify: it's a platform that makes it easy for people to set up online commerce experiences, from online stores to point-of-sale systems at fairs.
00:01:26.140
I work as a developer on a product team called Capital. In the simplest terms, Capital finds promising merchants who are selling on Shopify and provides them with cash so that they can expand their businesses. Entrepreneurship is quite challenging, and one of the biggest obstacles for ambitious people is access to capital.
00:01:48.150
Capital makes that happen, and it is entirely a data-driven product. That is how I ended up here in front of all of you.
00:02:06.130
Shopify has been around for about 10 to 12 years. It’s one of the oldest Rails applications. When it was first created, it wasn't a data-driven product; it was an application built to solve a specific problem. People wanted to set up online stores, process orders, and enable their customers to checkout seamlessly. That's what Shopify accomplished for a long time.
00:02:31.870
Over the past decade, Shopify has accumulated a lot of data about merchants doing real business on the platform. We now know what makes merchants successful and what leads to their failures, gaining front-row seats into the world of commerce, especially online.
00:02:58.870
With this knowledge, we've built a data-driven product called Capital, which has disbursed $100 million to about 5,000 merchants in just the past year. This rapid growth has increased our confidence in the data and what we've built.
00:03:31.290
This talk is primarily aimed at people working within a legacy application that has been solving real user problems for a long time. If you're starting a greenfield project from scratch, there are numerous tools available that can allow you to structure your application without facing the same constraints and challenges as someone maintaining an older app.
00:03:53.250
Many of the examples I'll discuss are inspired by true events, particularly the story of how Capital was developed over time. I have simplified these examples for clarity. So, let’s rewind to when Shopify was not in a position to build Capital. It was simply a platform that allowed many people to sell online with aspirations of streamlining the business setup process for merchants.
00:04:58.030
We wanted to leverage the data we already possessed because we had the intuition that we could create something valuable by enhancing the merchant experience on Shopify and offering them an unfair advantage.
00:05:39.789
We had a Rails application that allowed people to set up online stores and process orders. The reason I mention this is that a merchant’s success is often tied to the volume of orders they process. If we identify merchants likely to succeed, we can make investments in them.
00:06:11.679
To identify these successful merchants, we cared about aspects such as order processing. Shopify follows the Model-View-Controller architecture, where models are Ruby classes whose behavior mirrors real-world entities, such as a user in a web app.
00:06:46.479
In the context of building Capital, we examined important models such as a shop model that organizes each shop's records and its associated orders. The controllers manage the logic that determines how data is presented to various consumers of the app, and the views, which are templates, deliver the data to users.
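The shop and order models described above can be sketched in plain Ruby. This is a hedged illustration only: the real application uses ActiveRecord models with database-backed associations, and every name here (`Shop`, `Order`, `gross_merchandise_volume`) is hypothetical.

```ruby
# Illustrative plain-Ruby stand-ins for the ActiveRecord models discussed.
Order = Struct.new(:amount, :created_at)

class Shop
  attr_reader :name, :orders

  def initialize(name)
    @name = name
    @orders = []   # in Rails this would be a has_many association
  end

  # Record a processed order for this shop.
  def process_order(amount, at: Time.now)
    orders << Order.new(amount, at)
  end

  # Total sales volume -- the kind of signal used to gauge merchant success.
  def gross_merchandise_volume
    orders.sum(&:amount)
  end
end
```

The controllers and views then decide how such model data is presented, which is the separation discussed next.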
00:07:28.530
It's crucial because the separation ensures that views don't need to concern themselves with the data sources. This concept resonates with me—every time I discuss something seemingly obvious or grand, my skeptical cat pops up in my thoughts, reminding me to be cautious.
00:08:07.210
What’s the big deal? We now expect a lot from existing products. Platforms like Spotify should know your music preferences, Facebook knows your interests, and Google completes your searches for you. We desire the automation of repetitive tasks and value-added recommendations.
00:09:05.019
Our focus is on determining which merchants pose risks and which do not before providing them with money. This is crucial because giving money to unreliable candidates can lead to undesirable outcomes.
00:09:56.230
Once we identify a merchant as low-risk, the next question is how much capital we can provide them. This is similar to the classic banking underwriting process.
00:10:16.329
We want to automate the process rather than having a person manually evaluate each merchant's status and execute financial transactions. This is essentially a classic data problem.
00:10:38.010
One might imagine a system where you simply tell a data system to predict how much money can be allocated to a specific shop or merchant. But is that possible with the capabilities we had at the time—a typical Rails application?
00:11:06.210
If you want such predictions, you have some heavy lifting to perform, and it's not always clear where to begin. When faced with a problem, I often put on my thinking hat and brainstorm potential solutions, including intriguing technologies like artificial intelligence and machine learning.
00:11:52.000
However, implementing these cutting-edge tools within a legacy environment can be a daunting endeavor. This realization leads to the question: are we overshooting our goals by aiming for a complete data-driven solution?
00:12:31.760
Let's consider a fundamental question: do we simply want to know how many orders a merchant processes? Are we content with basic aggregations, or do we seek more complex insights? While these calculations can be performed within a Rails app, growing apps often encounter challenges with performance.
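The basic aggregations mentioned above can be pictured as follows. This is a minimal sketch in plain Ruby over an in-memory list; in a real Rails app these questions would be pushed into SQL via ActiveRecord, and the data shown is invented for illustration.

```ruby
# Invented sample data standing in for rows from an orders table.
orders = [
  { shop_id: 1, amount: 20.0 },
  { shop_id: 1, amount: 15.0 },
  { shop_id: 2, amount: 50.0 },
]

# "How many orders does each merchant process?" -- a simple count per shop.
counts = orders.group_by { |o| o[:shop_id] }.transform_values(&:size)

# Total sales per shop -- the slightly richer aggregation.
totals = orders.group_by { |o| o[:shop_id] }
               .transform_values { |os| os.sum { |o| o[:amount] } }
```

Queries like these are easy in isolation; the performance trouble arrives when they run repeatedly against a large production database, which is the concern raised next.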
00:13:41.990
As we pursue ambitious goals, running complex calculations against the production database could degrade performance and drop essential traffic from the users we already serve. Alternatively, you might decide to build a separate presentation layer for your data, which does not require extensive additional development.
00:14:00.000
But what if the data cannot be trusted? Creating reliable dashboards for business processes—like financial reporting—demands high-quality data. It’s a frustrating experience when test data leaks into production systems, skewing reports with absurd values.
00:14:39.860
We need to evaluate our data quality and ensure we have mechanisms for complex analysis and real-time predictions. While datasets can certainly be flawed, we can aim for an exceptional standard by filtering out unwanted data and validating that the insights generated are sound.
00:15:22.040
Considering the state of our vanilla Rails app, we might find data quality to be acceptable, conditional on careful filtering. Complex analysis introduces performance challenges, and real-time predictions are rarely feasible without structured data.
00:16:06.029
To tackle these challenges, I propose implementing a data warehouse, which functions on the extract, transform, load (ETL) principle. You take production tables and populate another system, allowing for cleaner data that does not carry the burdens of legacy decisions.
00:16:31.110
In this setup, you can extract records that have been updated since the last checkpoint, thus maintaining up-to-date information in your data warehouse without causing stress on your production database.
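The incremental extract step described above can be sketched like this. It is a simplified model, not Shopify's actual ETL code: records updated since the last checkpoint are pulled, and the checkpoint advances to the newest record seen, so the production tables are never re-scanned in full.

```ruby
# Pull only records touched since the last checkpoint.
def extract_since(records, checkpoint)
  records.select { |r| r[:updated_at] > checkpoint }
end

# One ETL run: extract the fresh batch, then advance the checkpoint
# to the newest record it saw (or keep the old one if nothing changed).
def run_etl(records, checkpoint)
  batch = extract_since(records, checkpoint)
  new_checkpoint = batch.map { |r| r[:updated_at] }.max || checkpoint
  [batch, new_checkpoint]
end
```

Running this on a schedule yields the periodic snapshots of fresh information mentioned later, without stressing the production database.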
00:17:08.700
You can filter out sensitive information and maintain compliance with regulations while still having accessible metrics and analysis capabilities. The result is a robust warehouse that allows you to effectively rule out unwanted noise and focus on reliable insights.
00:17:55.340
With the ETL process in place, you allow for periodic snapshots of fresh information, providing the clean datasets required for accurate analysis. It also enables experimentation with new algorithms and tools because the legacy constraints are lifted.
00:18:57.300
Dimensional modeling comes next. It allows you to consolidate data meaningfully based on business dimensions rather than the schemas of the production environment.
00:19:26.700
You aim to present facts—numerical values that measure business performance—without being burdened by the need to always perform joins in SQL.
00:19:59.040
When utilizing dimensional modeling, you end up creating an order dimension as a primary way to evaluate your data, allowing for the quick lookup of sales and trends over time.
00:20:36.950
The goal is to have data readily accessible for analysis. You can efficiently derive numerical values, such as how much a specific store sold daily, without constantly performing joins.
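The join-free lookup described above can be pictured with a denormalized fact table. This is a hedged sketch with invented names and data: each order fact already carries the dimension attributes it needs (shop name, day), so daily sales for a store fall out of a single pass, with no join back to production tables.

```ruby
require 'date'

# Invented rows standing in for a denormalized order fact table.
order_facts = [
  { shop_name: "acme", day: Date.new(2017, 5, 1), amount: 30.0 },
  { shop_name: "acme", day: Date.new(2017, 5, 1), amount: 20.0 },
  { shop_name: "acme", day: Date.new(2017, 5, 2), amount: 10.0 },
]

# How much a specific store sold each day -- one pass, no joins.
daily_sales = order_facts
  .select { |f| f[:shop_name] == "acme" }
  .group_by { |f| f[:day] }
  .transform_values { |fs| fs.sum { |f| f[:amount] } }
```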
00:21:07.300
By establishing these dimensions and facts, you are structuring the findings in a way that can be reused across different analyses, leading to cleaner, more maintainable data architecture.
00:21:39.240
It's crucial to remember that these dimensions are denormalized representations, making it easy to access the information you need without complicated queries.
00:22:14.110
Maintaining simplicity in the data retrieval process means your calculations can run seamlessly, helping you to derive insights and experiment effectively.
00:22:49.100
Dimensional modeling enables clear definitions of metrics. It also encourages thinking through your data and metrics long-term, enabling informed decision-making based on stable and meaningful constructs.
00:23:28.130
When rapidly building features, the quick-and-dirty approach often results in complex data structures, like JSON blob data that frustrates efficient querying and understanding.
00:24:23.880
For example, in Shopify, it's possible for a merchant to leave or sell their business, making it crucial to have accurate tracking of their financial history. Relying solely on a generic event table with uncontrolled JSON data leads to difficulties in analysis.
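The contrast above can be made concrete. The sketch below is illustrative only (the event names and fields are invented): a free-form JSON blob in a generic event table must be parsed and its keys guessed at, while an explicitly modeled event gives analysis a stable structure to rely on.

```ruby
require 'json'

# Quick-and-dirty: an untyped blob. Keys drift over time and nothing
# guarantees "new_owner" is present or spelled consistently.
blob_event = { "type" => "ownership_change", "data" => '{"new_owner":"bob"}' }
new_owner  = JSON.parse(blob_event["data"])["new_owner"]

# Modeled alternative: an explicit structure the warehouse can depend on.
OwnershipChange = Struct.new(:shop_id, :new_owner, :occurred_at)
event = OwnershipChange.new(42, "bob", Time.now)
```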
00:25:12.860
Dimensional modeling helps you establish the correct pathways to capitalize on business changes and provides a structured approach to ensure your product is effectively designed for tracking key metrics.
00:25:54.790
Over time, you can build and use testing frameworks for algorithms that help identify the most effective models through experimentation—affording the opportunity for actionable insights without burdening your production environment.
00:26:41.480
Validation mechanisms should run before any automated financial action, preventing scenarios like inadvertently issuing substantial loans because of incorrect predictions from the system.
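One way to picture such a validation gate is a sanity check that sits between the prediction and the money movement. This is a hypothetical sketch; the threshold values and the rule of never advancing more than a multiple of observed sales are invented for illustration, not Capital's actual policy.

```ruby
# Invented hard ceiling on any single advance.
MAX_ADVANCE = 50_000.0

# Gate an automated disbursement behind simple sanity bounds.
def safe_to_disburse?(predicted_amount, recent_sales)
  return false unless predicted_amount.positive?
  return false if predicted_amount > MAX_ADVANCE
  # Illustrative rule: never advance more than twice observed sales.
  predicted_amount <= recent_sales * 2
end
```

A wildly wrong model output (negative, absurdly large, or out of proportion to the merchant's sales) is stopped here instead of becoming a real loan.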
00:27:24.160
The structure of your design matters: the architecture you build must be reliable enough to withstand the complexities of enterprise-level operations, with clear boundaries between its processes.
00:27:53.140
A layered approach offers the flexibility necessary to encourage experimentation and help balance the risk of potential pitfalls. Resiliency should be built into your operations from the beginning, ensuring sustainable pathways to reach your key objectives.
00:28:36.930
As you navigate the integration of insights into your existing application, pairing custom algorithms with your interface provides a continuous feedback loop for enhancements.
00:29:20.349
Finally, the validation procedures can enhance confidence in the automation efforts, permitting growth and seamless interactions within the application. Experimentation is essential to continue evolving your data-driven platform, allowing fluid improvements and innovative outputs.
00:29:57.390
I loved discussing this topic, and while I could elaborate further, I appreciate your time. I would like to open the floor to any questions.