The biggest gap to bridge between cold, awkward automated calls and a wholesale handover to voice-to-voice agents is emotional intelligence: if an AI can react to my emotional cues in real time and steer the conversation accordingly, it has a foot in the door. So it is worth getting at least a bird's-eye view of the methods and frameworks that let these agents decode emotion purely from a user's voice. As outlined in our voice-models blog, the deployments possible today follow either a text-based or a textless approach to conversation, and hence to emotion detection.
Text-Based Approach
This class of voice models first converts audio to text, processes that text using NLP methods, and then synthesizes natural-sounding speech.
Because all the input the model gets in this case is textual, any emotional information in the audio has to be documented alongside the transcript. The first step is identifying what to capture: typically pitch variations, volume changes, pauses, and other spectral characteristics that indicate stress or relaxation.
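To make this concrete, here is a minimal sketch of what documenting those vocal cues could look like, using the open-source librosa library. The thresholds and the exact set of features are illustrative assumptions, not a prescribed recipe.

```python
# Sketch: extracting prosodic/spectral cues from one audio turn (illustrative thresholds).
import librosa
import numpy as np

def extract_vocal_cues(path: str, sr: int = 16000) -> dict:
    y, sr = librosa.load(path, sr=sr)

    # Pitch contour (fundamental frequency) via probabilistic YIN.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )

    # Loudness proxy: root-mean-square energy per frame.
    rms = librosa.feature.rms(y=y)[0]

    # Pauses: gaps between non-silent intervals longer than ~300 ms.
    intervals = librosa.effects.split(y, top_db=30)
    gaps = [
        (start - prev_end) / sr
        for (_, prev_end), (start, _) in zip(intervals[:-1], intervals[1:])
    ]
    long_pauses = [g for g in gaps if g > 0.3]

    return {
        "pitch_mean_hz": float(np.nanmean(f0)),
        "pitch_variability": float(np.nanstd(f0)),   # wide swings can indicate stress
        "energy_mean": float(rms.mean()),
        "energy_variability": float(rms.std()),
        "num_long_pauses": len(long_pauses),
    }
```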
Once all of this is in text form, it is analyzed for further cues. The lowest-hanging fruit is sentiment keywords: a customer mentioning being 'frustrated', 'confused', 'delighted', and so on. Then sentence structure comes into play: questions might indicate confusion, while short, clipped sentences could signal irritation. The spectral features extracted earlier are also mapped to emotions using patterns learned during model training.
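A toy version of that text-side analysis might look like the following; the keyword lists and heuristics are purely illustrative placeholders, not a production lexicon.

```python
# Sketch: sentiment keywords and sentence shape from a transcript (illustrative lists).
import re

FRUSTRATION_WORDS = {"frustrated", "annoyed", "ridiculous", "unacceptable"}
CONFUSION_WORDS = {"confused", "unclear", "don't understand", "lost"}
POSITIVE_WORDS = {"delighted", "great", "thanks", "perfect"}

def text_cues(utterance: str) -> dict:
    text = utterance.lower()
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

    return {
        "frustration_keywords": sum(w in text for w in FRUSTRATION_WORDS),
        "confusion_keywords": sum(w in text for w in CONFUSION_WORDS),
        "positive_keywords": sum(w in text for w in POSITIVE_WORDS),
        # Questions often signal confusion or a request for clarification.
        "is_question": utterance.strip().endswith("?"),
        # Short, clipped sentences can signal irritation.
        "avg_sentence_length": (
            sum(len(s.split()) for s in sentences) / len(sentences) if sentences else 0
        ),
    }

print(text_cues("This is ridiculous. I already gave you my order number."))
```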
All of this information is then mapped onto an emotional landscape. The framework used varies, but the essential idea is to express every possible emotion with a minimal set of basic dimensions. Commonly, this looks like representing emotions as vectors in a three-dimensional space of i) Valence (is the emotion positive or negative), ii) Arousal (how intense is it), and iii) Dominance (how much control the speaker feels), shortened to VAD. Alternatively, complex emotional states can be expressed as combinations of the eight basic emotions on Plutchik's Wheel of Emotions: joy, trust, fear, surprise, sadness, disgust, anger, and anticipation.
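As a rough illustration, the cues from the two sketches above could be folded into a VAD estimate like this. The weights are hand-picked for readability, whereas a deployed system would learn such a mapping during training, and the vocal feature values are assumed to be pre-normalized to roughly [0, 1].

```python
# Illustrative VAD mapping; weights are hand-picked, vocal features assumed normalized.
def to_vad(vocal: dict, text: dict) -> dict:
    valence = 0.5 * text["positive_keywords"] - 0.5 * (
        text["frustration_keywords"] + text["confusion_keywords"]
    )
    arousal = (
        0.4 * vocal["pitch_variability"]      # agitated speech tends to vary more in pitch
        + 0.4 * vocal["energy_variability"]
        + 0.2 * text["frustration_keywords"]
    )
    dominance = (
        0.3 * vocal["energy_mean"]            # louder, steadier speech reads as assertive
        - 0.3 * float(text["is_question"])    # questions cede control of the exchange
    )
    clamp = lambda x: max(-1.0, min(1.0, x))
    return {"valence": clamp(valence), "arousal": clamp(arousal), "dominance": clamp(dominance)}

vocal = {"pitch_variability": 0.8, "energy_variability": 0.7, "energy_mean": 0.6}
text = {"positive_keywords": 0, "frustration_keywords": 2, "confusion_keywords": 0, "is_question": False}
print(to_vad(vocal, text))   # low valence, high arousal: likely an angry customer
```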
Impressively, this entire process of extraction, conversion, and analysis happens in real time, continuously throughout an interaction. Achieving conversational latency is mostly a feat of the model's architecture and how well it leverages the available hardware. There is a lot to be said about both; if you're interested, check out our three-part blog on GPUs and how they help with latency here.
Textless Approach
Some recent voice-to-voice models, true to their name, skip the intermediate conversion to text entirely. Built on transformer-based architectures (the same technology that powers LLMs), these textless NLP models learn to map the acoustic features of speech directly to meaning.
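A heavily simplified sketch of that idea in PyTorch: raw audio becomes a mel spectrogram, a transformer encoder processes the frames, and a classification head emits emotion logits. The dimensions, the 16 kHz sample rate, and the eight-class output are illustrative assumptions; real voice-to-voice models are far larger and generate speech, not just labels.

```python
# Toy sketch of a textless pipeline: audio -> mel spectrogram -> transformer -> emotion logits.
import torch
import torch.nn as nn
import torchaudio

class ToySpeechEmotionModel(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 256, n_emotions: int = 8):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=n_mels)
        self.proj = nn.Linear(n_mels, d_model)          # one embedding per spectrogram frame
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_emotions)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        mel = self.melspec(waveform)                    # (batch, n_mels, frames)
        frames = mel.transpose(1, 2)                    # (batch, frames, n_mels)
        hidden = self.encoder(self.proj(frames))        # (batch, frames, d_model)
        pooled = hidden.mean(dim=1)                     # average over time
        return self.head(pooled)                        # (batch, n_emotions) logits

model = ToySpeechEmotionModel()
logits = model(torch.randn(1, 16000 * 3))               # 3 seconds of dummy audio
```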
To mimic emotional intelligence, they analyze prosody (the rhythm and cadence of speech), speech rate, energy, and more, in addition to the features analyzed in the text-based approach.
The model is pre-trained on annotated speech datasets. This just means exposing it to, and letting it learn patterns from, a large amount of data that maps particular variations in sound to interpretable emotional information; a crude example could be a fall in pitch being mapped to disappointment, as sketched below.
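Continuing the toy model above, the supervision step could be as plain as a cross-entropy update against the annotated label; the emotion list, batch shapes, and labels here are hypothetical placeholders.

```python
# Sketch of one supervised training step on annotated clips (labels are placeholders).
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "joy", "anger", "sadness", "fear", "surprise", "disgust", "disappointment"]

model = ToySpeechEmotionModel(n_emotions=len(EMOTIONS))   # from the sketch above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def training_step(waveforms: torch.Tensor, labels: torch.Tensor) -> float:
    """waveforms: (batch, samples); labels: (batch,) indices into EMOTIONS."""
    optimizer.zero_grad()
    logits = model(waveforms)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# One dummy step: two 3-second clips annotated as "anger" and "disappointment".
dummy_audio = torch.randn(2, 16000 * 3)
dummy_labels = torch.tensor([EMOTIONS.index("anger"), EMOTIONS.index("disappointment")])
print(training_step(dummy_audio, dummy_labels))
```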
These models have the advantage of not losing information that cannot be captured in text: every nuance that can be heard in the customer's voice is used to the fullest extent. Latency also naturally drops, because two conversions (to and from text) are skipped.
However, it can be beneficial to complement the audio with some text (transcriptions, chat logs, case summaries) to increase the available context. This also helps the AI agent navigate situations where how a customer sounds does not accurately reflect their emotional state, such as sounding calm while expressing frustration in words. The context-cost tradeoff is therefore a deciding factor when building tailored solutions.
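One simple way to combine the two signals is late fusion, sketched below: the transcript's sentiment pulls the acoustic estimate up or down, so clearly negative words can override a deceptively calm tone. The weighting scheme is an illustrative assumption.

```python
# Sketch of late fusion: blend acoustic valence with text sentiment (illustrative weights).
def fuse_signals(audio_emotion: dict, text_sentiment: float, text_weight: float = 0.6) -> dict:
    """
    audio_emotion: e.g. {"valence": 0.4, "arousal": 0.2, "dominance": 0.1}
    text_sentiment: -1.0 (very negative) to 1.0 (very positive), e.g. from the transcript
    """
    fused = dict(audio_emotion)
    # A calm-sounding voice paired with clearly negative words pulls valence down.
    fused["valence"] = (1 - text_weight) * audio_emotion["valence"] + text_weight * text_sentiment
    return fused

# Customer sounds calm but says "this is the third time I've had to call about this".
print(fuse_signals({"valence": 0.4, "arousal": 0.2, "dominance": 0.1}, text_sentiment=-0.8))
```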
The scope of this technology is naturally vast. Any voice AI solution can benefit from being emotionally intelligent, be it healthcare, where agents can make call-routing decisions more effectively upon detecting a distressed patient, or tutoring, where an agent (like OpenAI's GPT-4o) can change its approach when the student sounds confused.
With this missing piece in place, businesses can also rely more heavily on agentic solutions for customer service. Efficient rerouting of calls, matching tone and language to personalize interactions, adjusting voice to de-escalate tense situations: this is the future we are building at Nurix. Write to hello@nurix.ai to assist your customers with agents that can read the room!