Now Hear This! Putting Real-Time Voice, Video and Text into Rails

00:00:12.400 So clearly, you're in the session now about putting voice, video, and text into Rails. A quick introduction: My name is Ben Klang, and I'm actually very proud to be from the city of Atlanta. Welcome! I hope you all have enjoyed Atlanta so far.

00:00:24.000 You may also know me through some of my open-source contributions. Just a quick show of hands: has anyone heard of Adhearsion? Has anyone used it? A couple, alright, cool.

00:00:36.960 I'm not going to talk about Adhearsion today, but I do want to just quickly mention it because it bears relevance to the talk. Adhearsion is an open-source framework for voice applications. You can think of it as Rails, but for voice and real-time communications.

00:00:50.559 I'm also the founder of a company called Mojo Lingo, based here in Atlanta. This is what we do: we work with voice applications. We build them, scale them, and work on their usability. So this is a topic close to my heart: communications applications in particular.

00:01:02.160 Alright, today I want to tell you why the web is a lot like outer space: on the web, no one can hear you scream. So let me paint a scenario: back at work, you're working with your app, and all of a sudden something happens. You realize you need to speak with one of your customers.

00:01:16.080 Now, what most of you are going to do is pick up a telephone. The main problem with this is that when you pick up that phone, any communication that you have is now outside of your business process. It's not noted within the business application; it's not recorded.

00:01:31.519 The fact that the call happened is often not reflected in the state of whatever you're doing for your customer. Additionally, the communication itself is fairly limited; you've got this narrowband audio signal to talk through.

00:01:42.399 You can't easily share pictures or links; you really don't have a very rich communication experience. Wouldn't it be cool if we could, instead of having that phone call happen outside of the app, put the communication right into the application itself?

00:02:00.240 So that brings us to something called WebRTC. Show of hands: has anyone heard of WebRTC? Cool, that number goes up every time I ask, which is absolutely a happy thing for me to see. Has anyone actually tried it? Okay, well hopefully, by the end of this talk, I'll have some resources for you that will inspire you and give you information on how you can try it.

00:02:31.519 For those who aren't familiar, WebRTC is fundamentally about audio, video, and data communication in the browser. It allows you to use the camera, microphone, and speaker without any plugins. This means that if you want to build a real-time communications app, you don't actually need Flash or Java.

00:02:48.560 All the bad things that come with having plugins, such as crashes, are eliminated because it's built right into the browser. WebRTC additionally has functionality built in to establish peer-to-peer connectivity between two or more parties, which is really interesting.

00:03:02.640 However, connectivity across the internet can be tricky with NAT firewalls and other issues. WebRTC has functionality built in to help traverse these connection environments. It provides a common set of codecs for exchanging high-definition media.

00:03:20.800 Opus, G.711, H.264, and VP8 are the codecs involved, and Opus, H.264, and VP8 make very high-quality audio and video possible. Opus, in particular, is an amazing product that comes from a lot of research, including significant contributions from Skype, which many of you have likely used.

00:03:38.560 Opus is designed to transmit voice efficiently, using the minimum amount of bandwidth while preserving quality. It can also scale up to transmit music, so it offers very high-quality audio. Opus is royalty-free.

00:03:53.040 H.264 and VP8 are competing standards for transmitting video. H.264 does have licensing fees, but Cisco has paid for licenses so that open-source software like Firefox and Chrome will support it. VP8, on the other hand, is fully open, as Google acquired a company and released its IP.

00:04:06.000 These built-in capabilities provide very high-quality audio and video experiences in the browser. There are several more protocols in the WebRTC standard, which are used for exchanging information about the session.

00:04:22.559 For example, SDP (Session Description Protocol) is the mechanism by which the two endpoints exchange information. ICE (Interactive Connectivity Establishment) helps traverse firewalls, and DTLS/SRTP provides encryption by default.

00:04:39.280 So, what is WebRTC? A lot of people get really excited about the idea of putting a telephone into a web browser. Please, if you take one thing away from today, do not take that away. It isn’t just about putting a telephone in a web browser because we can do so much more.

00:05:02.479 The web is a rich palette of user interface possibilities, so think of it as communications in general, not just telephony.

00:05:14.320 A quick note on the relevance of WebRTC: I have a chart showing the projected growth of WebRTC adoption. Dean Bubley put together this chart projecting the number of devices that support WebRTC. The gray line at the bottom represents browsers, and we're approaching a point where about a billion desktop browser-based devices worldwide support WebRTC.

00:05:40.640 The interesting part is the growth of tablets and smartphones, because these communications options won't just be in the browser — they will also be embedded in devices, whether that's mobile web or native apps, and there will soon be a lot of WebRTC devices out there.

00:06:01.440 Now, before we dive further into WebRTC, I want to provide a quick background on how communications are facilitated today. Most of you pick up a phone and use a carrier like AT&T, which manages the call by sending signaling to connect people.

00:06:21.600 This is called a 'trapezoid' design, and it relies on every subscriber having an established relationship with a carrier. The advantage is that everybody can call everyone else, generally, as long as all of the carriers are federated.

00:06:36.480 However, this system has significant drawbacks. The overhead of having to coordinate companies around the world often hinders innovation.

00:06:52.320 It's not particularly user-friendly, either. If you think about identity in the form of a cell phone, your identity is basically just a string of random digits assigned to you by your carrier.

00:07:09.440 The next kind of architecture is more like a triangle, and Skype is a good example of this model. You have one centralized service, and the endpoints connect to it.

00:07:20.560 This arrangement allows for faster innovation as they control both the network and endpoints, leading to features like video calls and friendly usernames.

00:07:35.760 However, a significant drawback of this model is its walled garden nature. You can't easily build an app that integrates with Skype. It is not built into your business process.

00:07:49.360 With WebRTC, we actually get something that resembles more of an open architecture. The signaling goes back to the website without needing any plugins. You just go to the web application and it provides the tools you need.

00:08:06.400 The signaling and the media are separated. So when a call wants to be set up, Alice sends a request containing her information to the web service, and that service shares that information back to Bob.
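
The signaling path described here can be sketched as a small relay. This is only an illustration under stated assumptions: the `SignalMessage` shape and the `SignalingRelay` class are hypothetical, since WebRTC leaves the signaling format and transport to the application. The relay only forwards session descriptions; no media passes through it.

```typescript
// Hypothetical message shape; WebRTC does not define the signaling format.
type SignalKind = "offer" | "answer";

interface SignalMessage {
  kind: SignalKind;
  from: string;
  to: string;
  sdp: string; // treated as an opaque blob by the relay
}

// In-memory stand-in for the web service that sits between Alice and Bob.
class SignalingRelay {
  private inboxes: Record<string, SignalMessage[]> = {};

  // Queue a message for its recipient. No media ever passes through here.
  send(message: SignalMessage): void {
    if (!this.inboxes[message.to]) {
      this.inboxes[message.to] = [];
    }
    this.inboxes[message.to].push(message);
  }

  // Hand over and clear everything waiting for one user.
  receive(user: string): SignalMessage[] {
    const pending = this.inboxes[user] || [];
    this.inboxes[user] = [];
    return pending;
  }
}

const relay = new SignalingRelay();

// Alice sends her information to the service, addressed to Bob.
relay.send({ kind: "offer", from: "alice", to: "bob", sdp: "(Alice's session description)" });

// Bob picks it up and replies through the same service.
const bobInbox = relay.receive("bob");
relay.send({ kind: "answer", from: "bob", to: bobInbox[0].from, sdp: "(Bob's session description)" });

const aliceInbox = relay.receive("alice");
console.log(aliceInbox[0].kind, "from", aliceInbox[0].from);
```

In a real application the relay would sit behind HTTP or WebSocket endpoints in the web app; the in-memory inboxes only stand in for that.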

00:08:21.200 If both parties are behind the same firewall, the media is exchanged behind that firewall rather than going out to the internet and back, which has interesting implications for performance and quality, especially in low-bandwidth situations.

00:08:37.279 Some of you may have had to communicate over a poor connection. An example would be working on an island with poor connectivity to the outside world but still being able to talk to each other, because the media exchange happens locally.

00:08:56.559 Let’s take a look further into how a WebRTC session is set up. We start with Alice using Firefox; she sends a request to initiate communication with Bob.

00:09:09.200 This request contains something called an SDP, or Session Description Protocol. For practical purposes, think of it as an opaque blob of text that contains details like her contact information and a list of supported codecs.
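
To make the "opaque blob of text" a little less opaque, here is a minimal sketch that pulls the codec list out of a session description. It relies on the standard `a=rtpmap:<payload type> <codec>/<clock rate>` attribute lines; the function name `listCodecs` and the abbreviated sample are illustrative assumptions, and the sample is not a complete, valid session description.

```typescript
interface CodecEntry {
  media: string;       // "audio" or "video", taken from the enclosing m= line
  payloadType: number;
  codec: string;
  clockRate: number;
}

// Collect every codec advertised in a=rtpmap lines, grouped by media section.
function listCodecs(sdp: string): CodecEntry[] {
  const codecs: CodecEntry[] = [];
  let currentMedia = "";
  const lines = sdp.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    if (line.indexOf("m=") === 0) {
      // m=<media> <port> <proto> <payload types...>
      currentMedia = line.slice(2).split(" ")[0];
      continue;
    }
    const match = /^a=rtpmap:(\d+) ([^\/]+)\/(\d+)/.exec(line);
    if (match) {
      codecs.push({
        media: currentMedia,
        payloadType: parseInt(match[1], 10),
        codec: match[2],
        clockRate: parseInt(match[3], 10),
      });
    }
  }
  return codecs;
}

// Abbreviated, made-up offer: only the lines this sketch cares about.
const sampleOffer = [
  "v=0",
  "m=audio 9 UDP/TLS/RTP/SAVPF 111 0",
  "a=rtpmap:111 opus/48000/2",
  "a=rtpmap:0 PCMU/8000",
  "m=video 9 UDP/TLS/RTP/SAVPF 96",
  "a=rtpmap:96 VP8/90000",
].join("\r\n");

const offeredCodecs = listCodecs(sampleOffer);
console.log(offeredCodecs.map(c => c.media + ":" + c.codec).join(", "));
```

Application code rarely needs to parse SDP like this; it is usually enough to pass the blob along unchanged.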

00:09:22.560 Upon receiving that offer, Bob generates his own response with largely the same information and passes it back via the web service to Alice.
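
The offer and answer exchange amounts to the two sides agreeing on what they have in common. As a simplified sketch, assuming codecs are compared by name only and the answer keeps the offerer's order, Bob's side could be modeled like this. The function `answerCodecs` is hypothetical; in a browser, this negotiation happens inside the WebRTC implementation.

```typescript
// Keep the offered codecs that the answering side also supports,
// preserving the order of the offer. Names are compared case-insensitively.
function answerCodecs(offered: string[], supported: string[]): string[] {
  const supportedLower = supported.map(c => c.toLowerCase());
  return offered.filter(c => supportedLower.indexOf(c.toLowerCase()) !== -1);
}

const aliceOffer = ["opus", "PCMU", "VP8", "H264"];
const bobSupports = ["H264", "opus", "PCMA"];

const agreed = answerCodecs(aliceOffer, bobSupports);
console.log(agreed.join(", ")); // opus, H264
```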

00:09:38.000 At this point, a series of packets starts flying between Alice and Bob using the ICE, STUN, and TURN protocols. ICE enumerates all the network interfaces available on a device.

00:09:55.639 It helps establish how Alice can be reached by Bob on the local network. If they can't make a direct connection due to firewalls, STUN is utilized, and in the worst-case scenario, TURN servers act as relay points.
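
The preference for direct paths over STUN-discovered addresses, with TURN relays as the last resort, shows up in how ICE ranks candidates. The sketch below uses the candidate priority formula and the recommended type preferences from RFC 8445; the `Candidate` shape and the sample list are illustrative assumptions. Real ICE goes further and ranks pairs of candidates from both sides.

```typescript
type CandidateType = "host" | "prflx" | "srflx" | "relay";

// Recommended type preferences from RFC 8445: direct addresses rank highest,
// relayed (TURN) addresses lowest.
const TYPE_PREFERENCE: Record<CandidateType, number> = {
  host: 126,
  prflx: 110,
  srflx: 100,
  relay: 0,
};

// priority = 2^24 * type preference + 2^8 * local preference + (256 - component ID)
// Multiplication is used instead of bit shifts to avoid signed 32-bit overflow.
function candidatePriority(type: CandidateType, localPreference: number, componentId: number): number {
  return Math.pow(2, 24) * TYPE_PREFERENCE[type] + Math.pow(2, 8) * localPreference + (256 - componentId);
}

interface Candidate {
  type: CandidateType;
  description: string;
}

// Made-up candidates for one endpoint.
const candidates: Candidate[] = [
  { type: "relay", description: "address allocated on a TURN server" },
  { type: "host", description: "address on a local network interface" },
  { type: "srflx", description: "public address discovered via STUN" },
];

// Highest priority first.
const ranked = candidates.slice().sort(
  (a, b) => candidatePriority(b.type, 65535, 1) - candidatePriority(a.type, 65535, 1)
);

console.log(ranked.map(c => c.type).join(" > ")); // host > srflx > relay
```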

00:10:09.680 The media conveyed between the parties is encrypted with keys that the two endpoints negotiate during session setup, so the communication remains confidential even when it passes through TURN servers.

00:10:29.920 WebRTC includes strong security measures, making it difficult for outside entities to intercept conversations. When properly implemented, the actual media exchanged remains entirely private.

00:10:46.800 Using web servers for signaling is common, but it doesn’t have to be restricted to that. In fact, you could use services like Redis or even text files to pass SDP around for establishing a communication session.

00:11:05.360 That's enough about plumbing; what truly excites me is the applications: how we use all of this. Over the past couple of years, I’ve pondered what it takes to build applications like this.

00:11:14.640 I came up with a few fundamental principles that you should consider when designing communications applications. A modern voice application should be adaptable, meaning it takes advantage of various device capabilities.

00:11:31.040 It should be fluid; this suggests it can transition across devices, time, and users while preserving the context of the conversation. Furthermore, it should be contextual, which is crucial as it enhances the communication experience.

00:11:49.000 Trustworthiness is also integral because the worst thing is to communicate something sensitive only to see it revealed to unintended recipients. Lastly, the application should be referenceable.

00:12:08.720 So what does it mean to be adaptable? For instance, if Alice is using Firefox, she has various input options available: a keyboard, a camera, and speakers for a rich communication interface.

00:12:26.400 If she’s communicating with someone on an iPad with similar input options, the application should seamlessly enable video, audio, and text conversation, accommodating all types of device capabilities.

00:12:43.679 Consider another participant using a smartphone, who may not have a camera or enough bandwidth. They can still join the conversation through text or by voice, allowing inclusive participation regardless of device limitations.

00:13:02.720 This adaptability allows the user experience to enhance or degrade gracefully, depending on the capabilities presented by the users.
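
One way to picture this graceful enhancement and degradation is a function that decides, per participant, which modes their device can take part in. This is a sketch only: the `ParticipantDevice` shape and the bandwidth thresholds are assumptions made up for illustration, not values from WebRTC.

```typescript
type ConversationMode = "text" | "audio" | "video";

interface ParticipantDevice {
  label: string;
  hasCamera: boolean;
  hasMicrophone: boolean;
  bandwidthKbps: number;
}

// Assumed thresholds for illustration only; real values depend on codec and settings.
const MIN_AUDIO_KBPS = 40;
const MIN_VIDEO_KBPS = 500;

// Text is the baseline; audio and video are added when the device allows it.
function availableModes(device: ParticipantDevice): ConversationMode[] {
  const modes: ConversationMode[] = ["text"];
  if (device.hasMicrophone && device.bandwidthKbps >= MIN_AUDIO_KBPS) {
    modes.push("audio");
  }
  if (device.hasCamera && device.bandwidthKbps >= MIN_VIDEO_KBPS) {
    modes.push("video");
  }
  return modes;
}

const laptop: ParticipantDevice = { label: "Alice on Firefox", hasCamera: true, hasMicrophone: true, bandwidthKbps: 5000 };
const tablet: ParticipantDevice = { label: "iPad", hasCamera: true, hasMicrophone: true, bandwidthKbps: 2000 };
const phone: ParticipantDevice = { label: "smartphone", hasCamera: false, hasMicrophone: true, bandwidthKbps: 100 };

[laptop, tablet, phone].forEach(d => {
  console.log(d.label + ": " + availableModes(d).join(", "));
});
```

The smartphone user still joins by text and voice while the other two keep video between themselves.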

00:13:21.280 Next, let’s discuss fluidity. Conversations often begin today with chat; when you want to reach someone, more often than not, you’ll start with a text message.

00:13:39.120 However, when chat becomes cumbersome, you should be able to switch to audio, allowing for smoother communication. For example, if a discussion needs to involve more participants, video can be incorporated seamlessly.

00:13:56.640 Once you're done with the video, you should easily revert back to chat without losing the overall context of the conversation.

00:14:10.960 Shifting devices during a conversation is also crucial. If you need to leave your desk, you should transition the call effortlessly to a mobile device so the conversation continues without interruptions.
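
A minimal way to model this fluidity is to treat the medium and the device as properties of one ongoing conversation, so that switching either one just adds an entry to a shared timeline. The `Conversation` class below is a hypothetical sketch of that idea, not an existing API.

```typescript
type Medium = "chat" | "audio" | "video";

interface TimelineEntry {
  at: number;   // seconds since the conversation started, supplied by the caller
  what: string;
}

// One conversation whose context survives changes of medium and device.
class Conversation {
  medium: Medium = "chat";
  device: string;
  readonly timeline: TimelineEntry[] = [];

  constructor(device: string) {
    this.device = device;
  }

  say(at: number, who: string, message: string): void {
    this.timeline.push({ at, what: who + ": " + message });
  }

  switchMedium(at: number, next: Medium): void {
    this.timeline.push({ at, what: "medium: " + this.medium + " -> " + next });
    this.medium = next;
  }

  moveToDevice(at: number, next: string): void {
    this.timeline.push({ at, what: "device: " + this.device + " -> " + next });
    this.device = next;
  }
}

const talk = new Conversation("laptop");
talk.say(0, "alice", "Can you look at this invoice?");
talk.switchMedium(30, "audio");   // chat became cumbersome
talk.moveToDevice(120, "phone");  // Alice leaves her desk
talk.switchMedium(300, "chat");   // back to chat, history intact
talk.say(310, "bob", "Sent you the corrected copy.");

console.log(talk.timeline.map(e => e.at + "s " + e.what).join("\n"));
```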

00:14:28.720 Being contextual is my favorite aspect. A friend of mine, Jeff Pondworth, says that in the future, communicating isn’t what you’ll be doing; it’s what you'll be doing while doing something else.

00:14:47.360 This highlights that our dedicated communications devices are becoming outdated since our phones, for instance, aren’t primarily communication tools but integrate many other functions.

00:15:06.640 Contextual communication involves integrating relevant information into conversations. For example, if you have a contact queue to check, knowing how many callers are waiting might be useful.

00:15:22.240 Also, features like easily adding your manager into a call or gaining insights based on current discussions improve the experience immensely by providing relevant context.

00:15:37.680 These applications should accommodate not only the people in the conversation but also third-party services that add context to the discussion.

00:15:55.360 Additionally, making conversations referenceable is vital. Every conversation should ideally have a unique URL for accessing past discussions in their context.

00:16:09.840 For example, if a call is scheduled, there should be a link to that call, and upon completion, users should receive transcripts, enabling easy revisiting and reference to all shared information.
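
As a rough sketch of what "referenceable" could look like in code, the helpers below build a stable link for a conversation and render a plain-text transcript. The URL scheme (`/conversations/<id>`), the `Utterance` shape, and the function names are all assumptions made for illustration.

```typescript
interface Utterance {
  speaker: string;
  text: string;
  offsetSeconds: number; // time since the start of the call
}

// Hypothetical URL scheme: <base>/conversations/<id>
function conversationUrl(baseUrl: string, conversationId: string): string {
  return baseUrl.replace(/\/+$/, "") + "/conversations/" + encodeURIComponent(conversationId);
}

// Format seconds as mm:ss for the transcript.
function formatOffset(totalSeconds: number): string {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return (minutes < 10 ? "0" : "") + minutes + ":" + (seconds < 10 ? "0" : "") + seconds;
}

function renderTranscript(utterances: Utterance[]): string {
  return utterances
    .map(u => "[" + formatOffset(u.offsetSeconds) + "] " + u.speaker + ": " + u.text)
    .join("\n");
}

const callUrl = conversationUrl("https://example.com/", "call 42");
const transcript = renderTranscript([
  { speaker: "Alice", text: "Thanks for joining.", offsetSeconds: 5 },
  { speaker: "Bob", text: "Happy to help.", offsetSeconds: 75 },
]);

console.log(callUrl);
console.log(transcript);
```

The same link can be shared before the call to schedule it and kept afterwards as the place where the transcript lives.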

00:16:24.640 Now let me share some practical applications of these principles. One idea is a live anonymous matchmaking service, similar to Tinder but featuring video.

00:16:42.080 Imagine individuals in a video session who could see some information matching them but also retain their anonymity through playful filtering or concealment tools. This allows for safe introductions without needing to exchange phone numbers.

00:17:01.200 This setup eliminates friction: people can join immediately, without having to download apps or install plugins.

00:17:22.080 Another example would be an incident response app, which would enable teams to discuss service outages effectively. We can enable communication through chat for participants to coordinate responses.

00:17:39.360 Third-party monitoring services could also feed real-time data into this communication thread, giving everyone situational awareness of the overall incident.

00:17:58.720 Moreover, the tool would also allow vendors to join the conversation through unique URLs, bypassing the need for user accounts and facilitating immediate engagement.

00:18:16.560 Lastly, a third example is for medical records and patient services. Imagine a simple-looking website allowing patients to review advice from their medical professionals.

00:18:35.680 Patients need not worry about tracking phone numbers or logging into accounts. They can connect with the doctors who already have access to their information.

00:18:54.560 Secure authentication methods can facilitate these connections, making the process secure and efficient. By logging into the website, patients transition seamlessly into voice conversations with their healthcare providers.

00:19:12.160 Once the discussion is complete, recordings and transcripts could be automatically stored in their records for future reference, leading to improved service quality.

00:19:31.040 I hope I haven't put you to sleep! Now, I’d like to show you a demo of this WebRTC concept to highlight how effective these communications can be.

00:19:52.400 Here we have a live demonstration where WebRTC requests permission to use the camera. It's important to note that the site needs to be served over HTTPS for the browser to grant access automatically on later visits.

00:20:06.560 If there isn’t HTTPS in place, users might be forced to approve permissions each time, affecting user experience. Running a secure environment makes this process smoother.

00:20:21.840 We can see the video feed from one browser being transmitted to another, demonstrating the core capabilities of WebRTC.

00:20:37.200 The audio transmission utilizes another JavaScript API which is exclusive to Chrome. This demonstrates the versatility of the technology.

00:20:54.000 By melding various capabilities, we can command devices using voice, connecting multiple components in an interesting way.

00:21:13.680 So, that’s my presentation on WebRTC. I want to share some resources with you for further learning.

00:21:32.280 The first link is an official set of samples from WebRTC.org, which serves as an excellent resource for creating demos.

00:21:51.760 The WebRTC.org site centralizes all resources related to WebRTC. There's also an initiative called the WebRTC Challenge aimed at getting 1 million developers using WebRTC by 2020.

00:22:12.160 Finally, if you're into Ruby, check out Adhearsion, the Rails-like framework for voice, and 'Ruby Speech' for speech recognition solutions.

00:22:31.640 You can reach me as Ben Klang on Twitter and GitHub. If you have any questions, I would love to answer them!