Voice: Strategies for Hard-to-Understand Phrases

This article describes strategies to improve the speech-to-text recognition of information required for user verification and other processes. 

 

 

 


 

In voice projects, you may need to recognize words and phrases that are not part of everyday vocabulary and that the STT (speech-to-text) systems therefore do not recognize reliably out of the box. A few examples are:

  • Uncommon names or names which are spelled with an accent
  • Addresses, especially street names
  • Product names
  • Uncommon spelling in general

Often these need to be recognized as part of a user verification process. For example, the user must give their name and address to confirm their identity. 

Anyone who has tried to set this up knows that the voice channel is especially error-prone here, which is why we have developed several strategies to compensate. Some rely on functionality already available in the STT systems, and others use Cognigy-specific functionality. 

Speech to Text Provider Information used in this article

The functionality described on this page uses Microsoft Azure as an example. However, similar strategies should be possible with Google and other speech providers.

 

Context Phrases/Hints

AudioCodes and Cognigy's native Voice Gateway allow you to use so-called "Context Phrases" or "Hints" to help the virtual agent recognize certain phrases. 

In other words, you tell the Speech to Text provider to expect certain words that it might otherwise not recognize. The use cases below show how.

 

Static Phrases within Context

Static Phrases: Cognigy Voice Gateway

In Cognigy Voice Gateway, these settings are in the "Recognizer (STT)" section of the "Set Session Config" Node. Add your phrases to the "STT Hints" fields; this example uses company names. The number of fields grows dynamically, and each phrase must be added to its own field.  

Cognigy_Voicegateway_STT_Azure_Config.png

If you already have a custom speech model in your Microsoft Azure instance, you can add these settings in the "Enable Advanced STT" section: 

mceclip0.png

Static Phrases: AudioCodes

This can also be done in AudioCodes. 

In the Set Session Parameters Node, go to the "Azure Configuration" section. 

Cognigy_Azure_STT_Azure_Config.png

Open this section and activate the "Use Azure Context Phrases" toggle. 

Cognigy_Azure_STT_Context_Phrases_Section.png

You can then add phrases to the "Azure STT Context Phrases" fields to improve the recognition of certain words. The number of fields grows dynamically. Each phrase must be added to its own field; otherwise, the phrases will not be recognized. 

The value in the "Azure STT Context Boost" field determines how strongly the context phrases are favored during recognition. It can be between 1 (low) and 20 (high). A higher boost helps when the phrase sounds similar to other, more common words (such as `AudioCodes` vs. `Audio Codes`). 

Cognigy_Azure_STT_Context_Phrases_Fields.png

You can also add information if you already have a custom speech model set up in your Azure account. Contact your Customer Success Manager if you have more questions about this. 

 

Dynamic Phrases within Context

You might also have situations where you don't know ahead of time which phrases need to be recognized, for example, when a verification process asks for the user's name or street address. 

If you can look up the user's information, for example through an API, you can set the phrases to be recognized dynamically.

For both Cognigy Voice Gateway and AudioCodes, we assume you can create an array of phrases in a pattern similar to this and save it to the context:

{
  "names":[
    "Heltewig",
    "Satoshi"
  ]
}
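
As a rough illustration of how such an array might be assembled, the sketch below collects hint phrases from a user record returned by an API. The record shape (`firstName`, `middleName`, `lastName`) and the function name `buildHintPhrases` are assumptions made for this example; they are not part of any Cognigy or speech provider API.

```javascript
// Hypothetical user record, e.g. the parsed body of an API response.
const userRecord = {
  firstName: "Satoshi",
  middleName: "",
  lastName: "Heltewig",
};

// Collect the values we expect the caller to say, dropping empty
// entries and duplicates so every hint is listed exactly once.
function buildHintPhrases(record, fields) {
  const phrases = fields
    .map((field) => record[field])
    .filter((value) => typeof value === "string")
    .map((value) => value.trim())
    .filter((value) => value.length > 0);
  return [...new Set(phrases)];
}

const names = buildHintPhrases(userRecord, ["lastName", "middleName", "firstName"]);
console.log(JSON.stringify({ names }));
// prints {"names":["Heltewig","Satoshi"]}
```

The resulting object has the same shape as the context example above, so it could be saved to the context as is.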

 

Dynamic Phrases: Cognigy Voice Gateway

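In the native Cognigy Voice Gateway, we can again use the "Recognizer (STT)" section of the "Set Session Config" Node, this time referencing the context instead of typing in fixed phrases: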

Cognigy_Set Session_Config_Dynamic_Hints.png

In this case, we assume an array of phrases is already stored in the context under the key "names".

 

Dynamic Phrases: AudioCodes

In AudioCodes, go into your "Set Session Parameters" node and open the "Advanced" section. You will find a field called "Additional Session Parameters", which can be used to add JSON information.

Cognigy_AudioCodes_STT__Advanced_Config.png

In this field, add the following JSON pattern: 

{
    "sttSpeechContexts":[
        {
            "phrases":"{{context.names}}",
            "boost":15
        }
    ]
}

The "phrases" variable is the list of phrases as an array. We assume the array is in the context of the session and is called "names", and we use CognigyScript to change this value dynamically. In the context itself, it should look something like this: 

{
  "names":[
    "Heltewig",
    "Satoshi"
  ]
}

Cognigy_Context.png

This will allow us to change the Context Phrases dynamically for each session. 
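
For reference, the sketch below builds the object that the JSON pattern above is meant to describe once the context value is filled in: one speech context with an array of phrases and a boost. The helper name `buildSpeechContexts` is hypothetical, and clamping the boost to the range of 1 to 20 mentioned for the "Azure STT Context Boost" field is an assumption about how out-of-range values should be handled.

```javascript
// Hypothetical helper: wraps a phrase list in the "sttSpeechContexts" structure.
function buildSpeechContexts(phrases, boost) {
  if (!Array.isArray(phrases)) {
    throw new TypeError("phrases must be an array of strings");
  }
  // Assumption: keep the boost within the range of 1 (low) to 20 (high).
  const clampedBoost = Math.min(20, Math.max(1, Math.round(boost)));
  return {
    sttSpeechContexts: [{ phrases: [...phrases], boost: clampedBoost }],
  };
}

// Stand-in for the session context used in this article.
const sessionContext = { names: ["Heltewig", "Satoshi"] };
console.log(JSON.stringify(buildSpeechContexts(sessionContext.names, 15), null, 2));
```

Printing the result is a quick way to check that the phrases arrive as a real array rather than as a single string.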

 

Regarding Names

Something to remember when recognizing common names with uncommon spellings (e.g., Patrick vs. Patrik, Eddie vs. Edi): even with context phrases/hints defined, the more common spelling usually wins in the recognition. This means you might need to use fuzzy matching (described below) in addition to context phrases/hints.

 

Fuzzy Search / Fuzzy Matching

If context phrases alone do not work for your use case, you can also use the Fuzzy Search Node to estimate what the user meant, that is, to match what the STT understood against the values you expect. 

The Fuzzy Search Node expects an array of the values to match against, in a pattern similar to the array we used for the context phrases: 

[
    "Heltewig",
    "Satoshi"
]

 

Fuzzy Search: Static Phrases

Fuzzy_Search.png

The node compares the input against each value in the array and returns the matching values, each with a score that indicates how close it is to the input. The closer the score is to "1", the closer the match. 

Fuzzy_Search_results.png

Of course, the value in "Search Pattern" can also be replaced by Tokens or CognigyScript. 
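
To make the scoring idea concrete, here is a minimal sketch of fuzzy matching based on Levenshtein edit distance, normalized so that 1 means an exact match. This is not the algorithm used by the Fuzzy Search Node, whose internals are not described in this article; the function names and the 0.7 threshold are assumptions for illustration.

```javascript
// Classic dynamic-programming Levenshtein distance between two strings.
function levenshtein(a, b) {
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1,        // deletion
        current[j - 1] + 1,     // insertion
        previous[j - 1] + cost  // substitution
      );
    }
    previous = current;
  }
  return previous[b.length];
}

// Similarity in [0, 1]; 1 means identical after lowercasing.
function similarity(input, candidate) {
  const a = input.toLowerCase();
  const b = candidate.toLowerCase();
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Return the best-scoring candidate, or null if nothing reaches the threshold.
function bestMatch(input, candidates, threshold = 0.7) {
  let best = null;
  for (const value of candidates) {
    const score = similarity(input, value);
    if (score >= threshold && (best === null || score > best.score)) {
      best = { value, score };
    }
  }
  return best;
}

const expected = ["Heltewig", "Satoshi"];
console.log(bestMatch("Heltewick", expected)); // best candidate is "Heltewig"
console.log(bestMatch("Miller", expected));    // prints null
```

A threshold keeps clearly unrelated input from being accepted as a match; the right value depends on how long and how similar your expected values are.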

 

Fuzzy Search: Dynamic Phrases

If you wish to add the values in the array dynamically, replace the array in the Source Data field with the following: 

{
  "$cs":{
    "script":"context.names",
    "type":"array"
  }
}

This will tell the search to look in the "names" array in the context object.

Fuzzy_Search_dynamic.png 

 

NLU Transformers

You can also use NLU Transformers to dynamically change how certain phrases are recognized. NLU Transformers are often used to integrate external NLU systems into Cognigy. However, they can also be used to manipulate data directly for the Cognigy NLU. 
You can read more about transformers here.

A good example of when to use Transformers is numbers, which the STT does not always transcribe properly. 

Here is a short example of using a Pre-NLU transformer to recognize numbers such as 'septante' and 'nonante' in Belgian French, which often present a challenge to STT systems.

preNlu: async ({ text, data, language }) => {
    if (language === 'fr-FR') { // Only convert French input
        const numbers = { // Spoken number words and their digits
            "zéro": 0,
            "un": 1,
            "deux": 2,
            "trois": 3,
            "quatre": 4,
            "cinq": 5,
            "six": 6,
            "sept": 7,
            "huit": 8,
            "neuf": 9,
            "septante": 70,
            "nonante": 90,
            "et": "" // Filler word, removed from the result
        };

        function convertToDigit(customerNumber) {
            return customerNumber
                .split(" ")
                .map(word => {
                    const group = word.toLowerCase();
                    // hasOwnProperty avoids matching inherited keys such as "constructor"
                    if (Object.prototype.hasOwnProperty.call(numbers, group)) {
                        return numbers[group].toString();
                    }
                    // Space out numbers the STT already returned as digits, e.g. "42" becomes "4 2"
                    const parsed = parseInt(group, 10);
                    return isNaN(parsed) ? group : parsed.toString().split("").join(" ");
                })
                .filter(group => group !== "") // Drop removed words so no double spaces remain
                .join(" ");
        }

        // Convert once and use the result as both the suggestion and the new input text
        const converted = convertToDigit(text);
        data["suggestion"] = converted;
        text = converted;
    }
    return {
        data,
        text
    };
}

 

This manipulates the user input before it reaches the virtual agent, so the virtual agent itself only sees the converted text. If you need help setting this up, please contact your consultant or write to our support team.
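
To see what the conversion does to a transcript, the standalone sketch below is a simplified variant of the word-to-digit step that can be run with Node.js outside of a transformer. The name `toDigits` and the sample sentences are made up for illustration. Like the transformer above, it maps each word on its own and does not add values together.

```javascript
const numberWords = new Map([
  ["zéro", "0"], ["un", "1"], ["deux", "2"], ["trois", "3"], ["quatre", "4"],
  ["cinq", "5"], ["six", "6"], ["sept", "7"], ["huit", "8"], ["neuf", "9"],
  ["septante", "70"], ["nonante", "90"],
  ["et", ""], // filler word, dropped from the result
]);

function toDigits(transcript) {
  return transcript
    .toLowerCase()
    .split(" ")
    .map((word) => {
      if (numberWords.has(word)) return numberWords.get(word);
      // Numbers the STT already produced as digits are spaced out, e.g. "42" -> "4 2".
      const parsed = parseInt(word, 10);
      return Number.isNaN(parsed) ? word : String(parsed).split("").join(" ");
    })
    .filter((word) => word !== "")
    .join(" ");
}

console.log(toDigits("Nonante deux"));           // prints "90 2"
console.log(toDigits("septante et un"));         // prints "70 1"
console.log(toDigits("mon numéro est 42 sept")); // prints "mon numéro est 4 2 7"
```

Running a few real transcripts through a function like this before deploying the transformer helps to spot words that should not be converted, such as "un" used as an article.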

 

Lexicons

Lexicons can also be used to recognize phrases that sound similar. A great explanation can be found here: Voice Gateway – Handling Homophones.

