RailsConf 2017

Implementing the Web Speech API for Voice Data Entry

by Cameron Jacoby

In his talk at RailsConf 2017, Cameron Jacoby discusses the implementation of the Web Speech API for voice data entry, exploring its potential and challenges in web applications. Key points include:

  • Introduction to the Web Speech API: Cameron opens by polling the audience about their experience with voice dictation software, setting the stage for a deep dive into the Web Speech API. He highlights the accuracy issues with existing software, motivating the need for a more robust solution.

  • Case Study from Stitch Fix: Jacoby presents a case study on how Stitch Fix uses voice dictation to streamline the measurement process of garments in their warehouse. He explains the company's model and the initial challenges faced with existing measurement data entry methods, which relied on Google Sheets.

  • User Research and Prototyping: The presentation details the user research phase where they observed associates measuring garments. One significant takeaway was that associates frequently shifted between measuring items and typing entries, leading to inefficiencies. This revelation led to testing a voice dictation solution, starting with pair measuring.

  • Voice Usability Studies: Jacoby discusses the results from testing prototypes among warehouse associates, comparing keyboard entry with voice dictation. Results showed increased efficiency and reduced discomfort when using voice dictation, especially for associates who could maintain a natural measuring posture instead of hunching over a keyboard.

  • Implementation of Web Speech API: Cameron explains how to initialize the Web Speech API, emphasizing its ease of use in Chrome. The technical walkthrough covers several challenges encountered, such as contextual formatting errors and dictation errors due to varied user enunciation, and he emphasizes the importance of training users for effective implementation.

  • Challenges with Voice Dictation: Throughout the project, Jacoby identified challenges related to accuracy, reliability, and the need for fallback options. Despite achieving substantial improvements in data entry efficiency with voice entry, he notes issues like unexpected halts in recording functionality, stressing the need for ongoing evaluation.

  • Conclusion & Takeaways: The presentation concludes with several takeaways, including the importance of user experience (UX) and engineering collaboration. Jacoby reinforces that utilizing voice technology is experimental, noting that it works best when user context and technological capabilities align.

Through this exploration of voice dictation within the context of a Ruby on Rails application, attendees learned valuable insights about integrating emerging technologies with established user practices in a real-world setting.

00:00:11.809 All right, so we're going to go ahead and get started. Thank you guys for coming. I know it's the end of the day and everyone's probably pretty tired, but I really appreciate you guys being here for this talk. I'm going to start out with a short poll. Can you guys raise your hand for me if you've ever used Siri, Google Voice, or any other type of voice dictation software? So, show of hands. Okay, great! Almost everyone in the room.
00:00:35.820 Now, I want you to raise your hand again if you would characterize that voice dictation software as 100% accurate all the time. Anyone? Okay, that's pretty much what I expected, and that was basically my impression of voice dictation as well. For example, when I was back in San Francisco in my apartment putting together my slides, I tried to get Siri to turn on the lights above my desk. I had to ask about four different times, and in the end, she didn't even turn on the correct light. But that being said, today we are going to talk about voice dictation specifically with a technology called the Web Speech API.
00:01:08.510 But first, introductions. Hi, I'm Cameron, and I build expert-use software for Stitch Fix in Ruby on Rails. You may have also heard expert-use software described as internal tools, so what this means is that I build applications for other Stitch Fix employees. I'm going to be spending most of our time today talking about a project that I worked on recently at Stitch Fix, so I thought it would be a good idea to give a brief overview of what the company does so everyone can level set and be on the same page.
00:01:41.250 Stitch Fix is an online personalization company currently focused on men's and women's apparel, accessories, and footwear. The way that it works is you go online, fill out a survey with your size and fit preferences, and then we match you with a personal stylist who puts together a box for you of five items. We pick the five items from our warehouse, send them to your house, you try them on at home, and you get to keep what you love and send the rest back free of charge.
00:02:21.180 The previous slide showed a picture of a typical Stitch Fix box, also known as a fix. Here, I want to show the lifecycle of one of those items in a fix. Before the item gets to the client, there are several different steps it goes through. At the very beginning, it involves a choice by the buyer to bring in the style to sell to our clients. The buyer places the order for the style to come in on a certain date, and then the vendor ships the items of the style to our warehouse on that date.
00:03:05.459 Next, the warehouse receives and puts the items into inventory, making them available for the stylist to send out to the client. Once the stylist picks the item to go into the client's fix, we are back at the warehouse, and the warehouse picks the items out of inventory, packs them up, and ships them to the client. Then, as I mentioned before, the client can try on the items at home, and the warehouse will process anything that the client returns.
00:03:19.019 Now that we've talked a little bit about Stitch Fix, here's a brief overview of what we're going to cover today. First, I'm going to go through a case study featuring data entry by associates at our warehouse. Then, I'll show you how you can get started with the Web Speech API to experiment with voice dictation on your own.
00:03:30.359 I'll discuss some voice dictation challenges that we ran into and the solutions we implemented, and I'll answer the question: Is voice the right solution for you? So jumping into the case study, like many retail companies, Stitch Fix takes measurements of the items that we bring into inventory to eventually sell to our clients. This is a diagram of a men's long-sleeved woven shirt, and you can see six marks across the shirt called points of measure. These are specific technical retail measurements we take on the shirt.
00:04:14.160 There are actually hundreds of measurements that can be taken, ranging from something as specific as the width of a button to as generic as the length or width of the shirt. For any given men's shirt, we typically take about 15 to 20 measurements, and the part of the process where we take these measurements, if we go back to the lifecycle of an item, is at our warehouse when we receive the inventory from the vendor. The way that we collect the measurements is simply with a basic sewing measuring tape.
00:04:38.880 Here, you can see one of the men's shirts laid out flat on a table as we measure across the shoulders with the measuring tape. When I started working on this project, the goal was to build an application to start capturing these measurements at our warehouses. The process was already in place before I began working on the project. Measurements were being taken and collected, but the team was using Google Sheets to record these measurements.
00:05:01.770 You'll see that this is a recurring theme in internal software: we're taking existing processes and making them more efficient and scalable by building software to support them, and that's exactly what we did in this project. Throughout the project, I had the opportunity to partner with my coworker on the user experience (UX) design team at Stitch Fix, and we worked together throughout the entire project, from the user research phase to the prototyping phase and the development phase.
00:05:27.389 Here's a picture from our initial user research session where we went to the warehouse to observe the current measurement process before determining what type of tool we would build to support the process. We had a couple of main takeaways from the first user research session. The pictures on the left and right show some handmade props that the warehouse associates created to aid them in the measuring process. We took inspiration from these props and carried that through into the application.
00:06:05.639 The middle photo shows one of the warehouse associates taking measurements, and our main takeaway was that they recorded these measurements on very small laptop screens. There was a lot of hunching over and shifting of body language between measuring the garment and entering data into the keyboard. Before I explain the rest of the process we went through to build this application, I wanted to give you some context so you can understand what we ended up building.
00:06:59.970 Here's a quick demo of the final solution that we came up with: 23, 18 and 1/4, 9 and 3/4, 8 and 1/2, 4 and 7/8, 2 and 3/4, 16 and 1/2, 2 and 5/8, save. In case you haven't figured it out already, you are at one of the JavaScript talks at RailsConf! We ended up going with voice dictation as our solution. Although this is a Rails app, all of the voice dictation is built on the front end.
00:07:44.729 But in all honesty, this isn't really a talk about JavaScript or even about voice dictation. It's a story about how to leverage the UX design process in engineering to build the best products for our users. So how did we do that? Well, let's finish the story. After our initial user research session, we focused on the fact that users were hunched over the small laptop and had to switch back and forth from measuring to entering measurements into the keyboard.
00:08:23.840 We wanted to test out measuring in pairs, so we asked the associates to pair up, allowing one to continuously measure and dictate the measurements aloud while the other typed into the laptop. The reasoning behind this was our hypothesis that if one associate could spend 100% of their time measuring, they wouldn't have to break their flow or concentration. They wouldn't have to reset their body language or their hand position on the measuring tape, making them more efficient.
00:09:03.540 What we found from this test was that the associates disliked the concept of measuring in pairs. The person typing on the laptop felt like they were just waiting around, not really doing much, and believed they could be more efficient grabbing another shirt to measure themselves. However, we did notice that the associate focused on measuring seemed to be more efficient and didn't have to break their flow. They could continuously measure without breaking their concentration or shifting their body language.
00:09:58.050 Due to this finding, we wanted to move forward with a voice usability study. These two screenshots show our initial prototypes that we brought to the warehouse. The one on the left is a basic keyboard entry, and the one on the right is the voice dictation prototype. They don't look that much different because this isn't so much of a UI change as it is an input change. But in the voice dictation prototype, there's a 'click to speak' button at the top for the associates to press when they're ready to start speaking into the application.
00:10:43.690 In this voice usability study, there were three main questions we wanted to answer. The first was around efficiency: how does voice entry affect the overall time to measure a garment? The second question was about accuracy. Our warehouses are pretty noisy environments; the associates often like to play music, talk aloud, or chat with their friends during their shifts. We were wondering if this would work for voice dictation or if it would be difficult to capture the input.
00:11:06.210 The last question concerned culture and workflow: how would the warehouse associates feel about voice entry? For context, any associate working on measurements usually does this for about a four-hour shift, which includes breaks. We didn't know if talking aloud for hours at a time would feel exhausting or if they would prefer to type into a keyboard instead.
00:11:55.440 So let's take a look at the results. Regarding efficiency, we tested these prototypes with two warehouse associates. Participant one saw a dramatic increase in efficiency, shaving about three minutes off his measurement time using voice data entry. Participant two also saw some lift in efficiency, but not as dramatic.
00:12:22.040 Interestingly, participant two was already quite experienced in measuring, so he was already very fast at taking measurements; that's why he didn't see quite the increase in efficiency as the less experienced associate. Yet, we thought these promising results indicated that there could be significant efficiency gains, especially when onboarding new people into the process.
00:12:49.589 The next thing we wanted to investigate was accuracy. We found that investing in the right headset was crucial here, enabling us to mitigate the accuracy issues caused by the noisy environment. This is the headset we ended up purchasing for our warehouse associates. The microphone has a narrow input range, and importantly, it can be flipped up into the headset to stop recording.
00:13:15.140 This was important in maintaining the culture of the warehouse, allowing associates to seamlessly transition between measuring, singing along to music, and chatting with their friends, without feeling trapped by the voice dictation device. Lastly, we wanted to know how the associates felt about voice dictation.
00:13:51.579 Here are some photos: the left one shows the keyboard entry prototype, and the right one shows the voice dictation prototype. This is participant number one in the study, whose main comment was that the voice dictation felt much better for his back. In the voice dictation photo, you can see that he is standing up straighter and not hunched over the laptop.
00:14:22.330 This is participant two, the most experienced and already quite efficient at using the keyboard. His main comment was that he liked that he never had to remove his hands from the measuring tape. You can see in the photo that even when using keyboard entry, he has a one-handed approach to typing. As he is more experienced with measuring, he capitalized on the ability of voice dictation to allow him to use both hands at all times for measurements.
00:15:06.360 Now that you've seen how we utilized voice dictation with our warehouse associates, I want to talk a little bit about how you can get started with the Web Speech API on your own. First, here's a bit of CoffeeScript showing how to initialize the Web Speech API. The great thing about this API is that there's no external library or anything you need to pull in; it's available natively in JavaScript if you're using the Chrome browser.
00:15:35.950 This means it is as simple as initializing the WebKit speech recognition. That said, this feature is only available in Chrome. This is why internal tools are great candidates for using the Web Speech API, as we can fully control our users' browsers. However, it might not be the best solution for something customer-facing where you need to support every browser. Below is a code snippet showing that we only initialize the speech recognition if it's defined.
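The slide showed CoffeeScript; a rough JavaScript equivalent of that guarded initialization might look like the following sketch (the property names are standard Web Speech API options, but the exact settings here are assumptions, not the production code):

```javascript
// Feature-detect: webkitSpeechRecognition is only defined in Chrome,
// so skip initialization entirely in other browsers.
let recognition = null;

if ('webkitSpeechRecognition' in window) {
  recognition = new webkitSpeechRecognition();
  recognition.continuous = true;      // keep listening across pauses
  recognition.interimResults = false; // only report finalized phrases
  recognition.lang = 'en-US';
}
```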
00:16:38.630 Pretty much, the only other thing that you have to do is start the recognition and record voice results. You can also see a piece of code in the middle that restarts the recognition every time it ends. This ensures that the associates can continuously measure without having to click any buttons or actively turn on and off the voice recognition. They can move in and out of voice dictation just by flipping the microphone up on the headset.
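A minimal sketch of that restart-on-end behavior, with a hypothetical `listening` flag that the rest of the app would toggle:

```javascript
let listening = false;

function startListening() {
  listening = true;
  recognition.start();
}

function stopListening() {
  listening = false;
  recognition.stop();
}

// Chrome ends a recognition session after a stretch of silence, so
// restart it whenever it ends while we still intend to be listening.
recognition.onend = function () {
  if (listening) {
    recognition.start();
  }
};
```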
00:17:22.160 The last step is getting results from the API and returning the transcript. It's a pretty straightforward setup. I now want to go into some of the challenges we encountered with voice dictation and the solutions we implemented. The first challenge was around contextual formatting. You may have noticed that the results from the Web Speech API come back as an array because it records context as the user speaks and returns snippets along with the final result.
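Pulling the transcript out of the result event might look roughly like this; `handleTranscript` is a hypothetical hook for feeding the text into the measurement form:

```javascript
recognition.onresult = function (event) {
  // event.results accumulates over the session; start at
  // event.resultIndex so we only process the new entries.
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    if (result.isFinal) {
      // Each result holds ranked alternatives; take the top one.
      handleTranscript(result[0].transcript.trim());
    }
  }
};

// Hypothetical handler that would feed the text into the form.
function handleTranscript(text) {
  console.log('heard:', text);
}
```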
00:18:38.140 Let’s look at two basic examples. In the first, the user starts speaking and says something like 'two and a half ice cream.' The API recognizes this as a sentence and returns the words as they were spoken, with no formatting issue because it made sense in context. However, in the second example, the user starts the same way but then pauses mid-sentence.
00:19:53.670 The problem is that in our application, our users only record numbers; hence the API assumes they are saying a number due to the lack of context. Consequently, it tries to transform the text into numeric format, but it doesn’t always work 100% of the time. I believe we experienced about a 50/50 success rate with this, which meant users would say 'two and a half,' but the text would appear as the words 'two and a half' instead of the number, creating confusion.
00:20:27.670 Fortunately, we could solve this issue relatively easily due to the structured data we expected—only numbers. We set up an object mapping words to their numeric counterparts, and every time we received transcripts back from the API, we iterated through the object, checking for matches, and replaced the words with their numeric values when we found them.
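A simplified sketch of that replacement step; the `WORD_TO_NUMBER` table here is hypothetical and far smaller than a real deployment would need:

```javascript
// Map spoken phrases to the numeric text the form expects.
const WORD_TO_NUMBER = {
  'three quarters': '3/4',
  'and a half': 'and 1/2',
  'one': '1',
  'two': '2',
  'three': '3'
};

function normalizeTranscript(transcript) {
  let text = transcript.toLowerCase();
  // Replace longer phrases first so 'and a half' wins over 'one'.
  const phrases = Object.keys(WORD_TO_NUMBER)
    .sort((a, b) => b.length - a.length);
  for (const phrase of phrases) {
    text = text.replace(new RegExp('\\b' + phrase + '\\b', 'g'),
                        WORD_TO_NUMBER[phrase]);
  }
  return text;
}

// normalizeTranscript('two and a half') => '2 and 1/2'
```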
00:21:09.660 Another challenge we encountered was dictation errors from our users, which were more complex to solve. For example, if a user dictated '35 8,' the API could return '35/8,' exactly what was said, whereas they might have actually meant '30 and 5/8' but just didn't enunciate the 'and' clearly. This required training users on how the API would interpret their speech, since it transcribes exactly what they say, not what they intend.
00:22:31.510 There's a similar challenge where 'four quarters' comes back as '4/4,' when the user might have meant '4 and 1/4.' These errors are hard to catch because both results are valid fractions. We do have front-end validation that requires users to reduce their fractions, so we can catch some non-reduced fractions like '4/4,' but other valid-looking outputs from dictation errors can slip through.
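The reduced-fraction validation can be built on a greatest common divisor check; this sketch flags values like '4/4' or '2/8' but, as noted above, lets valid-looking fractions such as '35/8' through:

```javascript
// Euclid's algorithm for the greatest common divisor.
function gcd(a, b) {
  return b === 0 ? a : gcd(b, a % b);
}

// Flags fractions like '4/4' or '2/8' that should have been reduced,
// which often signals a dictation error rather than a real value.
function hasUnreducedFraction(text) {
  const match = text.match(/(\d+)\s*\/\s*(\d+)/);
  if (!match) return false;
  const num = Number(match[1]);
  const den = Number(match[2]);
  return den === 0 || gcd(num, den) > 1;
}

// hasUnreducedFraction('4/4')  => true
// hasUnreducedFraction('35/8') => false (slips through)
```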
00:23:09.550 The last challenge we faced with voice dictation was reliability. On the MDN documentation for the Web Speech API, you'll see a notice indicating that this is an experimental technology, with caveats about possible breaking changes. After a few weeks in production, we noticed some unexpected behavior, where users would get through about half a page of measurements and the recording would stop working altogether.
00:23:57.740 This was tricky to debug as there weren't any errors in the JavaScript console to indicate a problem. We initially turned to hardware as a potential issue by testing various headsets but didn't find anything. We also tried different laptops to see if the problem persisted across devices, ensuring all users had updated versions of Chrome, but it was still inconclusive.
00:24:48.750 Nonetheless, we knew we had to have a fallback plan from the very beginning when working with an experimental technology. We never blocked users from simply entering the data into the form you saw in the demo, which is what they're using for the most part right now while we address the reliability issues. We wanted to implement voice dictation primarily for user comfort and positive body language.
00:25:25.620 Thus, we purchased monitors with large screens that could stand in the warehouse, allowing associates to see the measurement form clearly without having to hunch over. This provided a much better experience. I also want to highlight a different challenge regarding users entering data into the form.
00:26:09.730 If users are typing on the keyboard and intend to type '10,' they might accidentally slip and type an extra '0,' resulting in invalid data—like measuring '100' instead of '10.' It can be tricky to catch since both '100' and '10' are valid numbers. We implemented suggested ranges for each measurement; every single point of measure has a minimum and maximum value.
00:26:49.239 This allows us to issue front-end warnings when measurements are out of range. We show an orange warning, but we only warn users rather than blocking them from submitting the form, since legitimate measurements can occasionally fall outside the suggested range. What we want to flag are extreme outliers that would never happen, such as '104' across the shoulder.
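A sketch of that non-blocking range check, with hypothetical minimum and maximum values per point of measure:

```javascript
// Hypothetical suggested ranges, in inches, per point of measure.
const SUGGESTED_RANGES = {
  'across shoulder': { min: 14, max: 24 },
  'sleeve length':   { min: 20, max: 40 }
};

// Returns a warning string when a value falls outside its suggested
// range; the form displays it but never blocks submission.
function rangeWarning(pointOfMeasure, value) {
  const range = SUGGESTED_RANGES[pointOfMeasure];
  if (range && (value < range.min || value > range.max)) {
    return value + '" is outside the expected range ' +
           range.min + '"-' + range.max + '" for ' + pointOfMeasure;
  }
  return null; // in range, or no range defined: no warning
}
```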
00:27:41.590 Now, on whether voice is the right solution for you, there are a few things to consider when looking into voice dictation. The first is control: the Web Speech API is currently only supported in Chrome, which made it much easier for us to experiment with internal tools, where we control our users' browsers, than it would be for customer-facing software.
00:28:40.850 The structured data we worked with was also beneficial for contextual formatting; I'm not sure how we would have tackled that if we allowed freeform input and the API returned unexpected data. It's also important to have a flexible user base and a fallback plan when working with experimental technology, and to build trust with your users so they're willing to experiment alongside you.
00:29:19.610 Communicating that it might not be perfect, especially during the initial iterations, while ensuring users understand the fallback plan is essential. As I prepared my key takeaways, I realized this talk serves as a bit of a post-mortem for this project.
00:29:50.860 There’s a lot to learn here, but a couple things I hope you take away from this story center around UX and engineering collaboration. The first point is that the collaboration we had allowed us to empathetically develop expert-use software. This was my first experience considering users’ body language and comfort levels while using the app, and I hope to bring this focus into more of my applications in the future.
00:30:37.169 The collaboration also enabled us to create prototypes quickly and iterate on solutions throughout the process. The initial prototypes I showed at the beginning were made entirely in code through true UX and engineering collaboration. This approach allowed us to deliver realistic prototypes for user testing quickly, iterate directly in the code, and even carry some of that code into our production version.
00:31:23.070 This collaboration helped us move through the process swiftly and examine the challenges from both engineering and user experience perspectives. Finally, I want to thank a few people for their collaboration on this project, especially Sarah Poon, my coworker on the UX design team, who was with me every step of the way on this.
00:31:53.159 Everyone else on this list was also instrumental in getting this project off the ground. And with that, thank you guys!