
Voice-to-Voice AI: The how and why of automating Customer Support

Imagine calling a customer service line and speaking with an AI agent that sounds indistinguishable from a human representative. This agent can understand your queries, regardless of your accent or the complexity of your issue, and respond with natural-sounding speech that adapts to your tone and emotional state. That's the promise of voice-to-voice AI!

Voice-to-voice models are AI systems designed to process spoken language input and generate human-like speech output in real-time. These models can understand context, emotion, and nuance in spoken language, and respond with appropriate intonation, pacing, and accents!

Siri and Alexa have been around for years, so why is this conversation back in the spotlight? The paradigm shift comes from today's voice-to-voice models leveraging large language models, which lets them handle tasks far more complex and context-dependent than setting reminders or playing music.

The mechanics behind this magic are broadly as follows (a minimal code sketch of how the pieces fit together follows the list):

  1. Speech Recognition: Speech-To-Text (STT) algorithms first convert the spoken input into text, enriching it with metadata about pitch variations, volume changes, pauses and hesitations, and other possible spectral characteristics that indicate emotional states.
  2. Natural Language Processing (NLP): The model analyzes the converted text to understand the context, intent, and sentiment of the input audio.
  3. Generating Responses: Based on the conversation history and predefined goals, the most appropriate response is generated and passed on.
  4. Text-To-Speech (TTS): This response is then synthesized into speech with special attention to prosody, intonation, and emotional cues, allowing the system to ‘speak’ the response back to the user.
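
To make the flow concrete, here is a minimal sketch of one conversational turn through such a pipeline. The `transcribe`, `generate_reply`, and `synthesize` functions are hypothetical stand-ins, stubbed out here so the example runs; a real deployment would swap in whichever STT, LLM, and TTS providers it uses, and nothing below reflects a specific vendor's API.

```python
from dataclasses import dataclass

# Stub implementations so the sketch runs end to end; a real deployment
# would replace these with actual STT, LLM, and TTS providers.
def transcribe(audio_chunk: bytes) -> tuple[str, dict]:
    """Hypothetical STT stage: returns text plus paralinguistic metadata."""
    return "I'd like to cancel my subscription", {"emotion": "frustrated", "pauses": 2}

def generate_reply(history: list[dict]) -> str:
    """Hypothetical LLM stage: picks a response given the conversation so far."""
    return "I can help with that. Could you confirm the email on your account?"

def synthesize(text: str, style: str = "neutral") -> bytes:
    """Hypothetical TTS stage: returns audio bytes (placeholder here)."""
    return text.encode("utf-8")

@dataclass
class Turn:
    """One round trip through the STT -> NLP -> TTS loop."""
    transcript: str      # what the caller said, as text
    prosody: dict        # pitch/volume/pause metadata from the audio
    reply_text: str      # response chosen by the language model
    reply_audio: bytes   # synthesized speech sent back to the caller

def handle_call_turn(audio_chunk: bytes, history: list[dict]) -> Turn:
    # 1. Speech recognition: audio in, text plus emotional metadata out.
    transcript, prosody = transcribe(audio_chunk)

    # 2-3. NLP and response generation: the model sees the transcript,
    #      the detected emotional cues, and the running conversation history.
    history.append({"role": "user", "content": transcript, "prosody": prosody})
    reply_text = generate_reply(history)
    history.append({"role": "assistant", "content": reply_text})

    # 4. Text-to-speech: speak the reply, steering tone off the caller's state.
    reply_audio = synthesize(reply_text, style=prosody.get("emotion", "neutral"))
    return Turn(transcript, prosody, reply_text, reply_audio)

turn = handle_call_turn(b"...raw caller audio...", history=[])
print(turn.reply_text)
```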

Most commonly deployed voice-to-voice models, including ChatGPT-4o, work largely within this framework. However, a new approach to how voice-to-voice operates is picking up pace: textless NLP.

In this approach, models process speech signals directly, without the intermediate step of text conversion. The model learns to map the acoustic features of speech to meaning and to generate responses, which are often more natural-sounding because the system works with the full richness of the speech signal throughout.
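
For a feel of what working directly with the speech signal looks like, the sketch below pulls frame-level representations out of a pretrained wav2vec 2.0 encoder using the Hugging Face transformers library (a common open-source route, chosen here purely for illustration, not something these models prescribe). It covers only the understanding half; a full textless pipeline would also need a speech-unit language model and a vocoder to produce the reply audio.

```python
# Sketch: extracting frame-level speech representations with wav2vec 2.0,
# i.e. working with the raw waveform instead of a text transcript.
# Requires: pip install torch transformers
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

# One second of placeholder 16 kHz audio; replace with a real recording.
waveform = np.random.randn(16000).astype(np.float32)

inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    # Roughly one 768-dimensional vector per 20 ms of audio.
    features = model(**inputs).last_hidden_state

print(features.shape)  # e.g. torch.Size([1, 49, 768])
```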

To date, textless NLP models have mostly seen research-first implementations, for instance Meta's Wav2Vec and Wav2Vec 2.0 and Google's Translatotron. While impressive, deploying them at the scale and speed required for commercial use remains close to infeasible because of the computational costs involved. Moreover, unlike chat agents, their end-to-end nature means they cannot simply augment employee workflows; they have to take over the conversation entirely.

This calls for some degree of component-wise monitoring and control, which is exactly what STT-TTS voice-to-voice models allow. Text as an intermediate step makes debugging and modification easier. For instance, it's much simpler to implement profanity filters and hardcoded guardrails on text data than on raw audio.
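
As an illustration of how lightweight such text-level guardrails can be, here is a small sketch that screens both the intermediate transcript and the drafted reply before anything is synthesized. The word list and escalation message are placeholders invented for the example; a production system would use a maintained filter and policy engine rather than a hardcoded set.

```python
import re

# Placeholder blocklist; a real deployment would use a maintained
# profanity/PII filter rather than a hardcoded set of terms.
BLOCKED_TERMS = {"badword", "swearword"}
ESCALATION_REPLY = "Let me connect you with a human agent who can help further."

def violates_policy(text: str) -> bool:
    """Return True if the text contains any blocked term (whole words only)."""
    words = re.findall(r"[a-z']+", text.lower())
    return any(word in BLOCKED_TERMS for word in words)

def guarded_reply(transcript: str, draft_reply: str) -> str:
    """Check both what the caller said and what the bot is about to say."""
    if violates_policy(transcript) or violates_policy(draft_reply):
        return ESCALATION_REPLY  # fall back instead of synthesizing the draft
    return draft_reply

print(guarded_reply("I want to update my plan", "Sure, I can help with that."))
```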

This is why these voice-to-voice models are increasingly being deployed as AI voice agents. Often branded as 'voice bots', much like their textual counterparts, they find strong use cases in customer service. Whether it's providing step-by-step solutions in technical support, answering FAQ enquiries about company policy and offerings, or routing service-activation requests like managing subscriptions or appointments, voice agents can automate these calls and free up company resources. When personalized and chosen aptly, voice-to-voice AI solutions can also go beyond these somewhat mundane tasks.
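
Call routing in particular is easy to picture once the conversation exists as text. The keyword-based router below is a deliberately simple, hypothetical sketch; deployed agents would more likely ask the language model itself to classify the caller's intent.

```python
# Hypothetical keyword-based intent router, for illustration only.
INTENT_KEYWORDS = {
    "technical_support": ["error", "not working", "crash", "reset"],
    "billing_faq": ["invoice", "refund", "charge", "price"],
    "service_activation": ["subscribe", "cancel", "appointment", "upgrade"],
}
DEFAULT_INTENT = "general_enquiry"

def route_intent(transcript: str) -> str:
    """Map a caller's transcript to the queue or workflow that should handle it."""
    text = transcript.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return DEFAULT_INTENT

print(route_intent("Hi, I'd like to cancel my subscription from next month"))
# -> service_activation
```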

Imagine a voice AI agent automatically identifying a potential lead and initiating a call. As the conversation begins, the agent’s natural tone immediately puts the customer at ease. It then inquires about the customer's needs, identifies suitable insurance plans, and pitches them effectively. Integrated with existing pipelines, the AI seamlessly sends the customer the desired details over WhatsApp. Noticing a hint of overwhelm in the customer's tone, the agent wisely concludes the conversation, ensuring a follow-up plan is communicated clearly.

Although it sounds fantastical, this is an achievable state for voice AI solutions today, and it actually describes one of our demo showcases at Nurix. These agents can be maximally efficient, working around the clock and ensuring timely assistance and follow-ups. Integration with existing CRMs and communication channels, like WhatsApp and email, allows information to be conveyed accurately and promptly. By recognizing emotional cues and adjusting their approach on the go, they can mimic the performance of human agents, a key capability that even intelligent chatbots fail to provide in customer support.
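
To give a sense of how simple the channel-integration piece can be, here is a sketch of pushing the promised follow-up over WhatsApp. Twilio is used only as one example of a messaging provider (the article does not prescribe a vendor), and the phone numbers and credentials are placeholders.

```python
# Sketch: sending the agent's follow-up message over WhatsApp via Twilio.
# Requires: pip install twilio, plus TWILIO_ACCOUNT_SID / TWILIO_AUTH_TOKEN set.
import os
from twilio.rest import Client

def send_followup(to_number: str, plan_summary: str) -> str:
    """Send the plan details discussed on the call; returns the message SID."""
    client = Client(os.environ["TWILIO_ACCOUNT_SID"], os.environ["TWILIO_AUTH_TOKEN"])
    message = client.messages.create(
        from_="whatsapp:+14155238886",   # Twilio sandbox sender (placeholder)
        to=f"whatsapp:{to_number}",
        body=(
            "Thanks for your time today! Here are the plan details we discussed:\n"
            f"{plan_summary}"
        ),
    )
    return message.sid

# Example (placeholder number and plan details):
# send_followup("+15551234567", "Silver Health Cover, cashless claims, monthly billing")
```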

However, being sophisticated end-to-end solutions, they do come with constraints. To start with, they can falter when faced with conversations quite different from their training data. Without continuous updating, they may struggle to grasp context-dependent nuances or cultural references that humans easily understand. As customer-facing services, they need to be tested rigorously for performance. And, as with all generative AI solutions, they need guardrails around sensitive data handling, language choice, and hallucinations.

Nevertheless, continuous improvements are expected over time. Researchers are focusing on enhancing contextual understanding, expanding cultural knowledge bases, and developing more sophisticated emotion recognition capabilities. Advances in few-shot and zero-shot learning may help these systems adapt more quickly to unfamiliar scenarios. Ongoing work in responsible AI is addressing concerns around data privacy, bias mitigation, and ethical deployment. While the current limitations are real, they are not insurmountable obstacles but rather opportunities for innovation and refinement.

With 42% of customers still preferring to call (over texting or emailing) for support, the market for voice AI agents is alive and well; the gap to bridge is from POC to production. A good number of the constraints can be worked around simply by carefully selecting the right models for your specific use case. Contact us at hello@nurix.ai to learn more!

Written by Anurav Singh
Created on 03 August, 2024
