Giving My Wife a Voice with Ruby and AI Cloning (+ Discussion on The AI and Ruby Landscape)

Summarized using AI

by Kane Hooper

Giving My Wife a Voice with Ruby and AI Cloning follows Kane Hooper's journey in building an AI voice cloning application to help his wife, Peggy, regain her ability to communicate after losing 80% of her speech capacity to a neurological condition. Delivered at RubyConf AU 2024, the talk explores the intersection of AI technology and Ruby programming.

Key Points Discussed:

- Personal Story: Kane opens with personal anecdotes about his wife, a trained classical singer who was diagnosed with a condition affecting her voice. He shares emotional moments, including the last song she recorded before losing her voice.

- Technology Overview: The presentation moves on to advancements in AI voice cloning. Kane uses ElevenLabs' service, showcasing the difference between AI-generated voices and genuine recordings, and emphasizes that high-quality source recordings are key to accurate cloning results.

- Challenges and Developments: Kane shares the ongoing development of an application that lets his wife communicate by phone, particularly with call centers, demonstrating the potential of AI in real-life situations. He conducts a live demo that shows the voice cloning in action.

- Ethical Concerns of AI: The dark side of AI voice cloning is addressed, including risks associated with deep fakes and identity fraud. Kane recounts real-life examples where AI cloning caused significant losses to companies, stressing the importance of vigilance and validation from multiple sources.

- Technical Architecture: Kane walks through the technical details of his application: a Ruby backend that integrates Twilio for calls and ElevenLabs for voice processing. He boils the process down to a single API call, making the technology accessible to Ruby developers.

- Open-Source and Future Prospects: The conversation expands to the broader AI landscape and the emergence of open-source tools, encouraging Ruby developers to embrace AI technologies for innovation and contribution.

- Prompting and Knowledge Retrieval in AI: Kane explains the nuances of prompting AI systems and how effective instructions yield precise responses. He discusses the potential of AI in context-based knowledge retrieval, enhancing user experiences in various applications.

- Conclusion: Kane concludes with the message that AI is within reach for Ruby developers, inviting collaboration to harness its capabilities for societal benefits. He offers consultations for developers looking to integrate AI into their projects, aiming to inspire a community-focused approach to innovation.

Overall, the presentation emphasizes a blend of personal narrative, technical insight, and ethical considerations in the rapidly evolving field of AI and its practical applications in programming with Ruby.

00:00:03.480 Good day, everyone! It's awesome to be here. Just a bit of housekeeping before we start: I do have a bit of a tremor, but that's not because I'm nervous; I just have a slight neurological condition. I'm actually really excited! I've been working on this project for the whole week, building demos every day. I even came up with a new idea for an AI demo, but I had to stop because it was getting out of hand.
00:00:14.280 Before I dive into today's topic, which begins with my wife and the situation we’ve had, I really want to talk about the AI ecosystem in relation to Ruby. There are things we can do now that we couldn't do a year ago, and I want to show that it’s actually pretty easy.
00:00:38.760 Before we get into that, we had our game out back that some of you participated in. You got your ducks and some received gift cards. We had some gift cards left, which I wanted to present to those who deserved it most over the last two days. I would like to present those gift cards to the Ruby Australia team.
00:00:56.440 As they come up, I’d like to play a quick song. It's the last song recorded by my wife before she lost her voice. It’s a rough recording, but it's very special to me. While they come up, I would like Errol to present them with the gift cards. Thank you all; it’s been an amazing two days, and you all did a fantastic job!
00:02:47.720 So, I thought, 'I need to do something about this,' which led me to the technology I’ll discuss today. I've always been interested in AI, but this situation pushed me down the rabbit hole to find solutions. I want to play that last bit again because it truly captures her talent.
00:03:41.659 I found a one-minute interview she did many years ago. Other than that and some outdated voice messages which were low quality, that was all I had. I'd like to play a segment of the interview so you can hear her voice. "Hi, I’m Peggy, and I’m a mom of three girls. I look after my daughters full time. Most of the time, I'll be helping them do their homework or they'll help me bake; they love baking cookies, cupcakes—all sorts of things."
00:04:07.600 I took that one-minute clip and used a service called ElevenLabs, which is the leader in voice cloning technology. Let’s take a look at how it compares: "Hello Ruby Australia! I hope you are all having an amazing time this week. I'd like to introduce to you my husband, Kane. He has been working very hard on artificial intelligence for the past year. I hope you enjoy his talk today."
00:05:08.560 There's a slight British twang in the AI's voice. I’ve noticed that the AI models tend to take the Australian accent and British-ize it a little bit. If I had more audio to work with, I'd be able to refine it further. Luckily, I have also spoken with the ElevenLabs crew, and they have the original recordings, about an hour's worth of audio.
00:05:35.080 With this professional cloning service, we can train the AI extensively, and I believe that the results from an hour's worth of quality recordings will be highly impressive.
00:06:00.880 My philosophy through all of this has been: "When life gives you lemons, make an app." The next piece of AI I’m working on is just a phone call away because one of the biggest challenges my wife faces is communicating with call centers.
00:06:27.600 I’m currently developing something for that purpose, and it's still in progress. Caitlyn, I think you have a phone call coming in now. I'm breaking the cardinal rule of live presentations by doing a live demo.
00:06:43.760 If you can come up here, I'll run my script and show you something special. "Hello, Caitlyn! It's Matt calling all the way from Japan. Hello to all my Ruby friends from Australia! I hope you have been having an amazing time the last two days. Caitlyn, I heard you and Toby did a great job MC-ing the event, and the audience should give you a big round of applause!"
00:07:00.560 So, Caitlyn, there’s a gift card for you. Toby already got one earlier; otherwise, I would have given him one as well.
00:07:44.360 Now, we do need to discuss the dark side of AI. Think about how easy it is to clone someone’s voice; it brings up a lot of questions about where this technology can lead us. AI is a double-edged sword, particularly with deep fakes, voice cloning, and the new lip-sync AI emerging. In six months, we may not be able to tell the difference between a real person online and a fake.
00:08:16.880 This is why it's important to remain vigilant and validate information from multiple sources, not just a single video. To illustrate this, I cloned my own voice and created a message for my daughter, and another for my accounts team.
00:08:57.360 Here’s an example of the message intended for my daughter: "Hi Emily, there has been an emergency. Mom is in hospital, and I can't pick you up from school. My friend Tom will pick you up, so please go to the IGA near the school. He'll get you in his white van. Something has happened to Mom."
00:09:22.120 My daughter said she would have known the difference had she heard that as a voicemail. I also tried a message to my accounts team about an urgent bill payment; we'll see how they respond to that.
00:09:54.160 This technology has real-world implications. A company in Hong Kong recently lost $10 million because an accounts person received a deep fake call from someone claiming to be the CFO. They were convinced by the AI-generated voice that they needed to transfer funds immediately.
00:10:36.560 This highlights the need for companies to innovate quickly, as the technology is outpacing traditional methods. Regarding the architecture of what I built: I used Ruby for the backend, along with ElevenLabs for voice cloning and Twilio for call transfers. It’s quite simple.
00:11:01.040 With just an API call to ElevenLabs and some text input, the service quickly returns the audio as a binary file, which is then processed through Twilio for the phone call. One hurdle I'm still trying to solve is real-time voice input through the web.
00:11:26.560 I’d like to achieve something like a Google Meet, where my wife can type and have it spoken in real time, but getting the audio through the browser remains a challenge.
00:12:06.320 Now, I want to show you the code behind this technology, not necessarily for you to understand the specifics, but to prove that voice cloning is accessible. Once you’ve trained the model, you just make an API call and send the necessary data.
00:12:57.720 The code is straightforward, demonstrating that AI is not just a phone call away but simply an API call away. Previously, tasks like this required machine learning experts and extensive training effort, but now Ruby developers can accomplish them with ease.
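[Editor's note: the code shown on screen isn't captured in the transcript. As a rough illustration of the kind of call described, here is a minimal Ruby sketch against ElevenLabs' text-to-speech HTTP API; the voice ID, model name, and file handling are assumptions, not the talk's actual code.]

```ruby
# Minimal sketch (not the talk's actual code): send text to ElevenLabs'
# text-to-speech endpoint for a pre-trained cloned voice and save the
# returned binary audio, ready to be played into a Twilio call.
require "net/http"
require "json"

ELEVENLABS_KEY = ENV.fetch("ELEVENLABS_API_KEY")  # assumed environment setup
VOICE_ID       = ENV.fetch("ELEVENLABS_VOICE_ID") # ID of the cloned voice

def cloned_speech(text)
  uri = URI("https://api.elevenlabs.io/v1/text-to-speech/#{VOICE_ID}")
  request = Net::HTTP::Post.new(uri)
  request["xi-api-key"]   = ELEVENLABS_KEY
  request["Content-Type"] = "application/json"
  request.body = { text: text, model_id: "eleven_multilingual_v2" }.to_json

  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
    http.request(request)
  end
  response.body # binary audio (MP3 by default)
end

# Write the audio to disk; in the phone-call setup this file would be served
# from a URL and referenced by a TwiML <Play> verb on the Twilio side.
File.binwrite("cloned_voice.mp3", cloned_speech("Hello, Ruby Australia!"))
```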
00:13:37.440 One area I started exploring was the Vision API, known for its complexities. Creating a laundromat quality control demo took me only an hour. The concept involves taking photos before and after cleaning garments to leverage AI for quality checks.
00:14:51.720 In my demo, I show the AI running a script to determine if a stain is gone or still present, along with a quality score for the cleaning. I didn’t even train the AI model, and I managed to get good results in under an hour!
00:15:43.360 The architecture of the AI system for this demo involves a Ruby backend sending an image URL and prompt to the Vision API, which then returns a response. This process is straightforward and doesn't require extensive machine learning knowledge.
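[Editor's note: a minimal sketch of that flow, assuming OpenAI's vision-capable chat completions endpoint; the talk doesn't specify the exact service, and the model name and prompt wording here are illustrative.]

```ruby
# Sketch only: send a prompt plus an image URL to a vision-capable model and
# read back its assessment. Model name and prompt wording are assumptions.
require "net/http"
require "json"

def garment_quality_check(image_url)
  uri = URI("https://api.openai.com/v1/chat/completions")
  request = Net::HTTP::Post.new(uri)
  request["Authorization"] = "Bearer #{ENV.fetch('OPENAI_API_KEY')}"
  request["Content-Type"]  = "application/json"
  request.body = {
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          { type: "text",
            text: "Is the stain still visible on this garment? Give a verdict and a cleaning quality score out of 10." },
          { type: "image_url", image_url: { url: image_url } }
        ]
      }
    ]
  }.to_json

  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
    http.request(request)
  end
  JSON.parse(response.body).dig("choices", 0, "message", "content")
end

puts garment_quality_check("https://example.com/garment-after-cleaning.jpg")
```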
00:16:44.680 Now, I want to outline some basics about neural networks. While I won't go too deep, understanding how these models work is essential when discussing current AI technologies. My journey into AI started when I was around 15, working with a clunky chatbot that struggled to engage in meaningful conversation.
00:17:33.800 We also created a fintech model years ago that required thousands of lines of code; today, you can accomplish the same results in about 20 lines. We've also been collaborating with the New South Wales State Library to transcribe oral histories using AI, achieving around 80% accuracy with community corrections.
00:18:31.720 Regarding AI models, you might hear about parameters—GPT-3 has around 500 billion parameters, while GPT-4 is rumored to reach up to 1.2 trillion. The more nodes you have with various weights and biases, the more complex patterns the model can determine.
00:19:15.120 Some models can classify images with just a few lines of code, while complex tasks require deep learning models. The power of AI lies within these models' abilities to analyze large data sets for patterns.
00:19:47.960 As AI continues to develop, it raises concerns, such as the uniformity of AI-generated content. AI often finds patterns among previous data, leading to a loss of unique voices in things like blogging. The rise of chatbots has made all AI-generated blogs sound similar, which is disheartening.
00:20:49.520 So, what I want to cover in the remaining time is how to get started with AI: prompting, knowledge retrieval, and utilizing open-source options. Prompting is more of an art and requires trial and error. Precise instructions yield precise responses while vague instructions lead to vague outputs.
00:21:40.720 When working on production-grade applications, take time to fine-tune your prompts—change words, try different phrasings, and optimize until you get the desired output.
00:22:44.120 JSON schema allows you to define how the AI should respond using structured guidelines, which is particularly useful when you need precise data. Prompting has also evolved: AI models now handle vast context windows, so we can include far more detail.
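[Editor's note: as a hedged sketch of what that looks like in practice, here is a request body constrained by a JSON schema, assuming OpenAI's structured-output response format; the schema and field names are made up for illustration.]

```ruby
# Sketch of a structured-output request body (OpenAI's JSON schema response
# format assumed; the schema and field names are illustrative only).
require "json"

request_body = {
  model: "gpt-4o",
  messages: [
    { role: "user", content: "Assess the cleaning result for this garment." }
  ],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "cleaning_report",
      strict: true,
      schema: {
        type: "object",
        properties: {
          stain_removed: { type: "boolean" },
          quality_score: { type: "integer" }
        },
        required: %w[stain_removed quality_score],
        additionalProperties: false
      }
    }
  }
}

# Sent to the same chat completions endpoint as the vision example above, the
# model is forced to reply with JSON matching the schema, for example:
# {"stain_removed": true, "quality_score": 8}
puts request_body.to_json
```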
00:23:42.800 In terms of knowledge retrieval, AI systems can pull data from your knowledge base. For instance, querying API documentation or customer info becomes seamless with AI assistance. This includes applications like Zendesk where you can quickly access specific tickets or past client interactions.
00:24:44.440 Overall, AI can work with the meaning behind data, which makes operations more intuitive. Knowledge retrieval systems enhance traditional keyword searches by leveraging semantic understanding, allowing for better responses to user questions.
00:25:39.760 Today, I quickly set up a retrieval-based AI system over our blog content; asking it about running AI models returns accurate sources. Similar systems are emerging in places like AWS, where you can query endpoints directly to save time.
00:26:53.760 Vector embeddings are vital for searching by meaning and context. Embeddings map concepts into a space where related meanings sit close together, allowing the AI to recognize the connection between words like 'cat,' 'kitten,' and 'dog,' which is useful for determining user intent and retrieving relevant documents.
00:28:15.880 When analyzing large datasets, we can measure similarity and retrieve specific information quickly. AI is just a prompt and a knowledge retrieval system away, efficiently querying databases based on contextual understanding.
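[Editor's note: a minimal sketch of that idea, assuming OpenAI's embeddings endpoint (any embedding model would do) and a small in-memory document list rather than a real vector database.]

```ruby
# Sketch of semantic search over a tiny knowledge base using vector
# embeddings; the embedding model and documents are illustrative.
require "net/http"
require "json"

def embed(text)
  uri = URI("https://api.openai.com/v1/embeddings")
  request = Net::HTTP::Post.new(uri)
  request["Authorization"] = "Bearer #{ENV.fetch('OPENAI_API_KEY')}"
  request["Content-Type"]  = "application/json"
  request.body = { model: "text-embedding-3-small", input: text }.to_json
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
  JSON.parse(response.body).dig("data", 0, "embedding")
end

# Cosine similarity: vectors pointing in similar directions mean similar text.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

docs  = ["How to run Llama models locally", "Resetting your account password"]
index = docs.map { |doc| [doc, embed(doc)] }

query = embed("running AI models on my own machine")
best  = index.max_by { |_, vector| cosine(query, vector) }
puts best.first # => the most semantically similar document
```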
00:29:36.640 As I conclude, I stress again that AI is just a prompt, knowledge retrieval, fine-tuning, and an API call away. We're exploring exciting territory that could pave the way for more tailored AI experiences, catering to users on a personal level.
00:30:19.480 The final point I'd like to touch on is the importance of open-source AI models. Tools like Llama allow anyone with proprietary data to explore AI locally without compromising sensitive information. Many of these models are progressing rapidly, which leads me to believe open source will dominate in the near future.
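[Editor's note: as one hedged example of what running a model locally can look like, here is a sketch that queries a Llama model served on the local machine. It assumes Ollama, which is not mentioned in the talk, is running with a Llama model already pulled; the endpoint and model name are assumptions.]

```ruby
# Sketch only: query a locally hosted Llama model over HTTP. Assumes Ollama is
# listening on localhost:11434 with a Llama model pulled; no data leaves the
# machine, which is the point for proprietary or sensitive information.
require "net/http"
require "json"

def local_llama(prompt)
  uri = URI("http://localhost:11434/api/generate")
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/json"
  request.body = { model: "llama3", prompt: prompt, stream: false }.to_json

  response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
  JSON.parse(response.body)["response"]
end

puts local_llama("Summarise why open-source models matter for sensitive data.")
```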
00:31:01.680 Platforms like AWS Bedrock streamline access to AI models, enabling users to switch seamlessly between different services with minimal effort. This process encourages exploration and integration of AI into various applications.
00:31:56.960 Now for the demonstration, I developed a mini call center in just a few hours. The goal is to improve AI assistance—Toby will be calling in to select an issue, demonstrating how AI can route him to the correct agent.
00:33:31.600 [Failed live demo recording] Welcome to Telra. Before we proceed, can I check your address? Thanks! How can I assist you today? I have a mobile issue. Perfect! I will connect you with Beverly, our mobile team specialist.
00:36:03.680 [Demo successful] The key takeaway is how we could revolutionize call centers with AI, providing targeted information to the right expert at the right time and freeing them up to focus on urgent human interactions where they'll have the highest impact.
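[Editor's note: a hedged sketch of the routing pattern described in the demo, using TwiML via the twilio-ruby gem; the greeting wording, webhook path, and phone number are hypothetical, and the issue-classification step (for example an LLM call) is elided.]

```ruby
# Sketch of call routing like the demo describes, using the twilio-ruby gem
# (gem install twilio-ruby). Menu wording, webhook path, and the specialist's
# phone number are hypothetical.
require "twilio-ruby"

# TwiML returned when the call first connects: greet the caller and gather
# their spoken request, posting the transcription to a /route webhook.
menu = Twilio::TwiML::VoiceResponse.new do |response|
  response.gather(input: "speech", action: "/route", method: "POST") do |gather|
    gather.say(message: "Welcome to Telra. How can I assist you today?")
  end
end
puts menu.to_s

# TwiML returned from the /route webhook once the issue has been classified:
# hand the caller to the right specialist.
route = Twilio::TwiML::VoiceResponse.new do |response|
  response.say(message: "Perfect, I will connect you with our mobile team specialist.")
  response.dial(number: "+61400000000")
end
puts route.to_s
```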
00:36:49.760 To conclude, AI is waiting for Ruby to embrace it as extensively as Python has. We have an opportunity here within our community to innovate and contribute positively to society, enhancing user experiences and personal interactions.
00:37:08.320 If anyone is interested, I'm offering consultations over the next couple of weeks for those who want to integrate AI into their projects—this won’t be a sales pitch; I’m simply here to help.
00:37:40.880 Thank you all for this brilliant opportunity, and it has been an amazing experience interacting with such a fantastic community!