
Building HAL: Running Ruby with Your Voice


by Jonan Scheffler

In "Building HAL: Running Ruby with Your Voice", delivered at RubyConf 2016, Jonan Scheffler explores using speech recognition to run Ruby code, moving away from conventional input methods like keyboards and mice. The fun of automating tasks with voice commands is the central theme of the presentation.

Key points discussed include:

- Personal Background and Automation: Jonan shares his journey in home automation, citing examples such as using Alexa to manage household tasks and deploying voice commands to trigger various functions.

- Understanding Speech Recognition: The speaker differentiates between speech recognition (understanding spoken language) and voice recognition (identifying a speaker’s voice), highlighting two main categories of speech recognition: isolated word recognition and continuous recognition.

- Concept of Sound: A simplified overview of sound generation, frequency, and amplitude is provided to help the audience grasp the technicalities behind how speech is processed.

- Phonemes and Vocabulary: Jonan explains the breakdown of utterances into phonemes, diphones, and triphones, emphasizing the complexity of language processing in recognition systems.

- Fourier Transforms and Markov Chains: Insights into Fourier transforms and their role in analyzing sound waves, as well as hidden Markov models, which serve as the basis for modern speech recognition systems.

- Live Demonstration: Jonan demonstrates an integration of Alexa with Hubot and its deployment plug-in, allowing voice commands to trigger deployments and showcasing the practical application of these technologies in a real-world scenario.

- Alternative Systems: The presentation wraps up with an introduction to Marvin, a robot utilizing CMU Sphinx for speech recognition, which operates locally and highlights different approaches to voice command systems.

In conclusion, the session emphasizes how voice technology can transform interaction with software systems, enabling developers to utilize voice commands for deployment and automation tasks. Jonan encourages audience engagement and future exploration of these technologies and their capabilities in both communication and operational contexts.

00:00:14.990 I'd like to welcome you all to this talk about how we're going to learn to run some Ruby code with our voices today. It's kind of a lot of fun, but also terribly difficult, so I hope you will forgive me if probably all of my demos fail. But maybe one of them will be funny to make up for it. I am Jonan, and I live in Portland, Oregon. I go by the Jonan Show on Twitter, and you can find me there. Technically, I don't live in Portland proper; I actually live in Beaverton, which is a very big deal if you're from Portland. Is anyone here from Portland? Except Jason? Yes, we have a couple of other Portlanders here. If you are not from Portland proper and live outside of the city limits in one of the suburbs, you are very likely to be forcibly tattooed with an owl by the Hipsters Union, so please, nobody tell them that I said that. I am actually from Beaverton, and I apologize.
00:01:13.500 I work for a company called Heroku. They invented the color purple; you may have heard of it. They're quite kind to me. They send me to fancy places like Cincinnati to talk to people and attend conferences, and I love it, so use Heroku and things. If you have questions about Heroku, I'm always happy to talk.
00:01:33.299 My wife and I bought our first home about a year ago. We're very proud of it, and one of the first things I wanted to do was automate it. The first thing I did was renovate the foyer: we have a ball pit now. It helps to disarm people as they enter; once the drawbridge comes up, you need to calm them down a little bit. Then I started working on the automation itself. I have smart light switches and Alexa, which I can use to trigger things throughout the house. For example, if I say, "Alexa, trigger movie mode," it sends a command to turn off all the lights in the house and start Netflix. It's about 10 AM back home, so my wife is probably wondering what's going on.
00:02:00.359 When I started doing this, I wanted to explore home automation. I was writing a little bit of code here and there, but a lot of the stuff is already pre-built. I wanted to get more into it, and I definitely did not want to use one of those draconian interfaces—this terminal—to try and deploy my code to production. The idea is absurd! We live in 2016; this is the future! We can just use our voices. So I started to study speech recognition, and it was a terrible, terrible mistake. Just kidding! Actually, speech recognition is wonderful, but it is intensely complicated.
00:02:43.710 If I were to talk to you for the next 60 days without stopping, we wouldn't cover all of it. So I'm going to glance quickly over some concepts and start at a very low level—actually a very high level, in software terms. I'm going to talk very simply about sound and about voice. When we're talking about speech recognition, we're not talking about voice recognition. Voice recognition is when my voice is my password; it's about verifying someone's voice to unlock something, while we are talking specifically about speech recognition.
00:03:15.930 There are two major categories of speech recognition: isolated word recognition and continuous recognition. Isolated word recognition is this stuff you’ve probably dealt with for a long time. For instance, when you call an airline and say, "I would like to check a bag," and they respond with, "Did you say cancel all your flights for 2016?" That’s isolated word recognition, and it’s not very good but simple to implement—it just matches waveforms. Essentially, you try to get the other human to say the word exactly the same way you said it, and if the waveforms match, you have an isolated word recognition system.
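As a toy illustration of how crude that waveform matching can be, here is a minimal Ruby sketch; it assumes both recordings are already the same length and reduced to plain arrays of samples, and the template words and numbers are made up for the example.

```ruby
# Toy isolated word matcher: compare an utterance against stored templates
# by summed squared difference between waveform samples. Real systems use
# dynamic time warping at minimum, but the basic idea is the same.
def waveform_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

def recognize(utterance, templates)
  # Pick the template whose waveform is closest to what was said.
  templates.min_by { |_word, samples| waveform_distance(utterance, samples) }.first
end

templates = {
  "check"  => [0.0, 0.4, 0.9, 0.3, -0.2],   # pretend these are real recordings
  "cancel" => [0.0, -0.5, -0.8, 0.1, 0.6]
}
puts recognize([0.1, 0.3, 0.8, 0.2, -0.1], templates)  # => "check"
```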
00:03:47.520 Continuous recognition, on the other hand, is the sort of speech recognition that devices like Alexa do today. But before we dive into that, I want to briefly discuss how speech is generated. Remember when you were a child on a swing set, pumping your legs in the middle of each swing? That small movement produced a large result: you swung back and forth. Now imagine you're holding a marker out to the side against a long sheet of paper, and the paper is being dragged past you quickly. The shape you trace is a sine wave. It's not a perfect sine wave, but it comes out as that squiggly line we've all seen before.
00:04:41.040 The reason I mention getting a big result from a small amount of energy is that the same principle is behind our vocal cords. They close as you breathe out, and your exhale forces air through them, creating sound, kind of like when you fart. The vibration happens very quickly. If you want an example, hum a low tone like "ah" and you can feel those vibrations start; those are the sinusoidal waves we're talking about. As a quick review for many of you, we're going to talk about frequency today: frequency is how many of those bumps there are and how close together they sit. When we listen to music, we perceive frequency as pitch.
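To make frequency concrete (and amplitude, which comes up next), here is a small Ruby sketch that samples a pure sine tone; the 440 Hz frequency and the 16 kHz sample rate are arbitrary choices for the example.

```ruby
# Sample a pure tone: amplitude controls how loud it is (the height of the wave),
# frequency controls the pitch (how close together the peaks are).
SAMPLE_RATE = 16_000            # samples per second, a common rate for speech

def sine_wave(frequency_hz, amplitude, duration_s)
  (0...(SAMPLE_RATE * duration_s).to_i).map do |i|
    t = i.to_f / SAMPLE_RATE
    amplitude * Math.sin(2 * Math::PI * frequency_hz * t)
  end
end

samples = sine_wave(440, 0.8, 0.01)   # 10 ms of an A440 tone
puts samples.first(5).inspect
```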
00:05:22.760 We're also going to touch on amplitude, which we perceive as volume. Now that we have a little vocabulary about sound, let's talk about utterances. An utterance is a word or a stretch of spoken language that a speech recognition system tries to understand. Utterances break down into phonemes, and there are about 44 of them in English. That may seem surprising if you don't have a linguistics background, so let me quickly explain how some of these sounds arise, specifically how the same phoneme can sound different depending on context.
00:06:06.110 For example, in the words "bad" and "ban," we hear what seems like the same vowel. But the 'a' I make in 'bad' is not quite the 'a' in 'ban,' where the vowel picks up some of the nasal sound that follows it. These context-dependent variants of a phoneme are called allophones. So we break speech down into phonemes, and better yet, into diphones and triphones: a diphone spans the transition from one phoneme into the next, and a triphone models a phoneme together with the sounds on either side of it. Most modern systems work with triphones.
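Here is a rough sketch of the phoneme and triphone idea in Ruby, using a couple of hand-written, ARPAbet-style pronunciations; the tiny dictionary and the silence padding are assumptions for illustration, not a real lexicon.

```ruby
# Toy pronunciations written as ARPAbet-style phonemes (hand-written, not a real dictionary).
DICTIONARY = {
  "bad" => %w[B AE D],
  "ban" => %w[B AE N]
}

# A triphone is a phoneme with its left and right neighbors (left-context - phone + right-context),
# which is how modern recognizers capture context-dependent variation like allophones.
def triphones(phonemes)
  padded = ["SIL"] + phonemes + ["SIL"]   # pad with silence at the word edges
  (1..phonemes.length).map { |i| "#{padded[i - 1]}-#{padded[i]}+#{padded[i + 1]}" }
end

DICTIONARY.each { |word, phones| puts "#{word}: #{triphones(phones).join(' ')}" }
# bad: SIL-B+AE B-AE+D AE-D+SIL
# ban: SIL-B+AE B-AE+N AE-N+SIL
```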
00:07:22.710 Next, let's review Fourier transforms. A square wave can be built out of several sine waves added together, and the Fourier transform lets us go the other way: it extracts the sine waves that make up the waveform. We can then graph those sine waves without time as a component, describing a segment of sound purely in terms of amplitude and frequency. In speech recognition we work with short windows of audio, a few milliseconds at a time, and the Fourier transform turns each window into frequency data we can compare against existing models.
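Here is what that decomposition looks like as code: a direct, deliberately slow discrete Fourier transform in Ruby that turns a short window of samples into per-frequency magnitudes. It's a sketch of the math rather than anything you'd use in production.

```ruby
# Naive discrete Fourier transform: for each frequency bin k, correlate the
# samples against a complex sinusoid at that frequency. O(n^2), but it shows
# how a time-domain window becomes amplitudes at individual frequencies.
def dft(samples)
  n = samples.length
  (0...n).map do |k|
    (0...n).sum(Complex(0, 0)) do |t|
      samples[t] * Complex.polar(1.0, -2 * Math::PI * k * t / n)
    end
  end
end

# A window containing two tones; the corresponding frequency bins stand out.
n = 64
window = (0...n).map do |t|
  Math.sin(2 * Math::PI * 3 * t / n) + 0.5 * Math.sin(2 * Math::PI * 5 * t / n)
end
magnitudes = dft(window).map(&:abs)
peaks = magnitudes.each_with_index.select { |mag, _i| mag > n / 8.0 }.map(&:last)
puts peaks.inspect  # => [3, 5, 59, 61]: the two tones plus their mirror-image bins
```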
00:08:03.090 It goes against intuition that a wave of any shape or size can be made of sine waves. Think of filling a sphere with golf balls until no more fit, then pouring in grains of sand to fill the gaps: by adding more and smaller sine waves, you can get arbitrarily close to any shape you like. Computing Fourier transforms at scale demands significant computing power, though, and the technique wasn't really practical until the 1960s.
00:09:03.480 In fact, two gentlemen named Cooley and Tukey came up with the fast Fourier transform, dramatically improving our ability to process waveforms. Its main benefit is that it reduces the work needed to analyze a signal from O(n²) to O(n log n). I encourage you to research it further if you're interested; it's a fascinating area.
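For the curious, here is a bare-bones recursive radix-2 Cooley-Tukey FFT in Ruby. It assumes the input length is a power of two and skips every optimization a real FFT library would have; it's only meant to show the divide-and-conquer trick.

```ruby
# Recursive radix-2 Cooley-Tukey FFT: split the samples into even- and
# odd-indexed halves, transform each half recursively, then combine them
# with "twiddle factors". Requires a power-of-two input length.
def fft(samples)
  n = samples.length
  return samples.map { |s| Complex(s) } if n <= 1

  even = fft((0...n).step(2).map { |i| samples[i] })
  odd  = fft((1...n).step(2).map { |i| samples[i] })

  out = Array.new(n)
  (0...n / 2).each do |k|
    twiddle = Complex.polar(1.0, -2 * Math::PI * k / n) * odd[k]
    out[k]         = even[k] + twiddle
    out[k + n / 2] = even[k] - twiddle
  end
  out
end

# A single tone at bin 3: the FFT finds the same peak the naive DFT would.
n = 64
window = (0...n).map { |t| Math.sin(2 * Math::PI * 3 * t / n) }
puts fft(window).map(&:abs)[3].round  # => 32, i.e. (n/2) times the tone's amplitude of 1.0
```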
00:09:32.100 Let's talk about Markov chains real quick. This is the familiar idea where we take English sentences and predict the next word based on the previous one: we assign probabilities to the possible next words and chain them together to build up sentences. Speech recognition, though, typically uses hidden Markov models rather than plain Markov chains. In a hidden Markov model the underlying states aren't visible; we only observe their outputs, in our case the audio, and have to infer the states from them. That's the model modern speech recognition systems are built on.
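Here is a minimal Ruby sketch of the plain, non-hidden Markov chain idea: count which word follows which in some training text, then sample the next word in proportion to those counts. The training sentences are invented for the example.

```ruby
# Build a first-order Markov chain over words: for each word, count the words
# that follow it, then sample the next word in proportion to those counts.
def build_chain(corpus)
  chain = Hash.new { |h, k| h[k] = Hash.new(0) }
  corpus.each do |sentence|
    sentence.downcase.split.each_cons(2) { |a, b| chain[a][b] += 1 }
  end
  chain
end

def next_word(chain, word)
  followers = chain[word]
  return nil if followers.empty?
  # Weighted random choice based on how often each follower was seen.
  followers.flat_map { |w, count| [w] * count }.sample
end

chain = build_chain([
  "deploy the app to staging",
  "deploy the app to production",
  "roll back the app"
])
puts next_word(chain, "the")   # => "app" (it always followed "the")
puts next_word(chain, "to")    # => "staging" or "production"
```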
00:10:30.050 The models we're building aren't about predicting the weather, of course; their states are the phonemes we've discussed, the triphones and diphones, and they guide us in matching the sounds extracted by the fast Fourier transform to words. We can also define grammars that tell the recognition system what to expect, a little like how isolated word systems work. Newer systems, like Alexa, use input grammars to significantly narrow and correct the direction of the recognition process.
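To show how much a grammar constrains things, here is a toy Ruby version of the idea: the "recognizer" only has to choose among a handful of app and environment names, so anything outside that space is rejected outright. The app names other than Steve are made up for the example.

```ruby
# A tiny command "grammar": the recognizer only has to choose among these
# alternatives, which constrains the search the same way an input grammar does.
APPS         = %w[steve hal marvin]
ENVIRONMENTS = %w[staging production]
COMMAND      = /\Adeploy (#{APPS.join('|')}) to (#{ENVIRONMENTS.join('|')})\z/

def parse(utterance)
  match = COMMAND.match(utterance.downcase)
  match && { app: match[1], environment: match[2] }
end

p parse("deploy steve to production")  # => {:app=>"steve", :environment=>"production"}
p parse("deploy steve to the moon")    # => nil, not in the grammar
```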
00:11:39.780 Let's move on to building something. I'll be using Alexa, which I used earlier to turn the lights off in my house, much to my wife's surprise. We're also using Hubot, specifically the deployment plug-in written by my colleague Atmos, a brilliant guy from Heroku. His work lets Hubot handle deployments and other functions from chat commands, so I can say, "Hubot, deploy Steve to staging," and it will do that.
00:12:31.560 The architecture I'm describing chains together Amazon Alexa, an IFTTT integration with Slack, and GitHub webhooks. When I speak a command, it relays a message through those channels to kick off a deployment and notify the right people in Slack, which makes for a seemingly seamless experience.
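To give a feel for the glue code involved in a chain like this, here is a hypothetical Ruby (Sinatra) sketch of one link: a small web service that accepts a deployment webhook and relays a status line to a Slack incoming webhook. The route, payload fields, and environment variable are invented for illustration; this isn't the actual integration from the talk.

```ruby
require 'sinatra'
require 'json'
require 'net/http'

# Hypothetical Slack incoming-webhook URL, supplied via the environment.
SLACK_WEBHOOK_URL = ENV.fetch('SLACK_WEBHOOK_URL')

# Receive a deployment event (for example, from a GitHub webhook) and relay
# a human-readable status line into a Slack channel.
post '/deployments' do
  event = JSON.parse(request.body.read)
  message = "Deploying #{event['app']} to #{event['environment']}..."

  Net::HTTP.post(URI(SLACK_WEBHOOK_URL),
                 { text: message }.to_json,
                 'Content-Type' => 'application/json')

  status 204
end
```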
00:13:21.080 Hopefully, everything will go according to plan as we transition to a live demonstration. I’ll start off by showing the Slack channel, and if I tell Alexa, "Trigger deploy Steve to production," assuming it all connects correctly, we should see the intended deployment with logs reflecting the action in real-time. If all goes well, we will confirm that Steve is indeed deploying to production.
00:14:35.580 Moving back to the screen with the deployment logs, we should see the details of Steve's deployment. It's worth noting that I debugged the network communication before this demonstration to make sure everything would respond. If you're curious, I'm happy to talk afterwards about this process and the hurdles I hit along the way.
00:16:01.250 Thank you for bearing with me while I showed off some robots today. The idea is to be able to hold conversations with these systems. For example, if we connected another platform, we could treat them as interactive conversational interfaces and expand their integrations well beyond deployment monitoring.
00:17:04.550 Finally, I want to introduce Marvin, another robot, but one built on CMU Sphinx for speech recognition. Marvin offers a different take on speech integration, using keyword spotting to pick out commands. Where Alexa does all of its recognition in the cloud, Marvin runs entirely locally, which lets us experiment with speech recognition and voice commands in a very different way.
00:18:37.100 Let's look at how Marvin works. Marvin listens continuously for a keyword, in this case "Marvin," and reacts when it hears it. Once the keyword is recognized, Marvin starts listening for a command to execute, such as a phrase that kicks off a specific action or a deploy. Because everything runs locally, we're free to plug in different recognition methods and keep expanding the toolset.
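As a sketch of what local keyword spotting looks like in Ruby, here is roughly how it's done with the pocketsphinx-ruby gem's keyword-spotting configuration; the class names follow that gem's documented API, and this is an assumption about the approach rather than Marvin's actual source.

```ruby
require 'pocketsphinx-ruby'  # assumes the pocketsphinx-ruby gem and a working microphone

# Listen continuously for the wake word using CMU Sphinx keyword spotting.
configuration = Pocketsphinx::Configuration::KeywordSpotting.new('marvin')
recognizer = Pocketsphinx::LiveSpeechRecognizer.new(configuration)

recognizer.recognize do |speech|
  puts "Heard the wake word: #{speech}"
  # In a real assistant you would now swap to a grammar or language model
  # and listen for the actual command (for example, a deploy phrase).
end
```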
00:19:43.500 In any event, deploying effectively with this kind of architecture relies on responsive communication, whether the processing happens in the cloud or locally. Background jobs and WebSocket connections, for example, keep the interaction fluid while we watch the responses come back, and we can even add speech synthesis so the system talks back to us.
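The speech synthesis piece can be as simple as shelling out to whatever synthesizer the machine already has; here is a minimal sketch that assumes either the macOS `say` command or `espeak` on Linux is available.

```ruby
# Speak a status line back to the user using whichever synthesizer is installed.
def speak(text)
  if system('which say > /dev/null 2>&1')
    system('say', text)       # macOS built-in synthesizer
  else
    system('espeak', text)    # common on Linux; assumes espeak is installed
  end
end

speak("Steve is deploying to production")
```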
00:20:57.230 In summary, we explored how voice technology can support deployment processes by interacting with various systems via spoken commands. Analyzing concepts behind speech recognition, coupled with practical integrations such as Alexa or Marvin, showcases innovative solutions in today's development landscape. I encourage you all to think about how these technologies might change not just how we communicate, but how we interact and work in our environments.
00:22:38.970 If anyone has further questions about automating these systems, deploying with voice, or simply exploring where to go next with these technologies, feel free to reach out. I'm here and would love to discuss possibilities! Thank you all again for your time and attention.