5   Using Multiple Recognition

This chapter describes how the speech-recognition engine used by the BeVocal VoiceXML interpreter can provide multiple recognition results. This capability consists of two features: N-best recognition and multiple interpretations of spoken input.

This chapter describes:

 •  Multiple Recognition Features
 •  Using N-Best Recognition
 •  Using Multiple Interpretations
 •  Using Both Features Together

Multiple Recognition Features

A VoiceXML application typically obtains a single value as the result of each speech-recognition event. If the user's response is not clear, the result is the one utterance that the speech-recognition engine judges to be the most likely. This utterance may provide values for several slots in a grammar, but it is still a single recognized utterance. If the application uses an ambiguous grammar and the utterance matches more than one rule, slots are filled according to an arbitrary one of those rules.

Two features improve recognition, providing multiple recognition results:

 •  N-best recognition: Instead of returning the single most likely utterance, the speech-recognition engine can return a list of the most likely utterances.
 •  Multiple interpretations: If any given utterance matches multiple grammar rules, the speech-recognition engine returns those alternative interpretations of the utterance.

Both these features are disabled by default. They can be enabled separately or jointly in applications that want to accept multiple recognition results.

N-Best Recognition

In some advanced voice applications, a single result may not be sufficient. For example, an airline reservation application might ask the user for destination and departure cities. If a speaker mumbles, the speech-recognition engine might not be able to distinguish between two possible utterances, "Austin" and "Boston." Ideally, the application would obtain both these possible results so that it could prompt for clarification, "Did you mean Austin, Texas; or Boston, Massachusetts?"

Using N-best recognition, the speech-recognition engine returns a list of different possible utterances whose confidence levels are high enough for consideration. See Using N-Best Recognition.

Multiple Interpretations

In some applications, a single recognized utterance may have multiple interpretations, indicating that the utterance is ambiguous. For example, an application might include a GSL grammar with two rules that match the utterance "Portland."

 Cities [
   ...
   (portland ?maine)  {<city Portland> <state ME>}
   (portland ?oregon) {<city Portland> <state OR>}
   ...
 ]

If the user clearly says "Portland," this utterance does not allow the speech-recognition engine to choose between the two possible interpretations. Ideally, the application would obtain both interpretations so it could prompt for more information, "Do you mean Portland, Maine; or Portland, Oregon?"

The multiple-interpretations feature lets an application access the different interpretations of a given recognized utterance. If multiple grammar rules match the recognized utterance, all resulting interpretations are returned. See Using Multiple Interpretations.

Combining the Features

The two multiple-recognition features can be used together. If both features are enabled, each possible utterance may have multiple interpretations. For example, an airline reservation application might enable both features for a field whose ambiguous GSL grammar includes two rules that match the utterance "Austin."

 Cities [
   ...
   (austin ?texas)      {<city Austin> <state TX>}
   (austin ?california) {<city Austin> <state CA>}
   (boston ?massachusetts) {<city Boston> <state MA>}
   ...
 ]

If the user mutters something that sounds like either "Austin" or "Boston," the speech-recognition engine would find two possible results, "Austin" and "Boston." The first of these results would have two possible interpretations: "Austin, Texas" and "Austin, California." A sophisticated application could prompt the user, "Did you mean Austin, Texas; Austin, California; or Boston, Massachusetts?"

Combining both features provides the maximum flexibility. See Using Both Features Together.

Working with Multiple Recognition

An application can selectively enable the two multiple-recognition features, specifying the maximum number of results to be returned. The following VoiceXML language features support recognition of multiple results:

 •  The property maxnbest controls whether N-best recognition is enabled.
 •  The property bevocal.maxinterpretations controls whether multiple interpretations is enabled.
 •  The read-only variable application.lastresult$ is set by the speech-recognition engine. It contains information about the result of the most recent speech-recognition event. If multiple recognition was enabled, this variable may contain more than one result.
 
 •  If only N-best recognition is enabled, the results represent different recognized utterances, each with a single interpretation.
 •  If only multiple interpretations is enabled, the results represent a single recognized utterance with a number of different interpretations.
 •  If both features are enabled, the results can represent a number of different recognized utterances, some or all of which can have multiple interpretations.

The two properties maxnbest and bevocal.maxinterpretations control how many results are returned from speech recognition.

 •  If only N-best recognition is enabled, maxnbest is the maximum number of results to be returned.
 •  If only multiple interpretations is enabled, bevocal.maxinterpretations is the maximum number of results to be returned.
 •  When both features are enabled, the two properties are used together. You can set these properties either to limit the total number of results, without distinguishing whether a particular result is a different utterance or a different interpretation of a given utterance, or to specify the maximum number of distinct utterances and the maximum number of distinct interpretations for any given utterance, as shown in the sketch below.
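
For example, a field that uses the ambiguous Cities grammar shown above might enable both features as follows. This is a minimal sketch; the prompt and the <log> element are placeholders rather than part of the examples later in this chapter.

 <field name="city">
   <!-- Up to 3 distinct utterances ... -->
   <property name="maxnbest" value="3"/>
   <!-- ... each with up to 2 interpretations -->
   <property name="bevocal.maxinterpretations" value="2"/>
   <!-- Ambiguous grammar from the example above -->
   <grammar type="application/x-nuance-gsl">
     <![CDATA[([
       ( austin ?texas )         { <city Austin> <state TX> }
       ( austin ?california )    { <city Austin> <state CA> }
       ( boston ?massachusetts ) { <city Boston> <state MA> }
     ])]]>
   </grammar>
   <prompt>Which city?</prompt>
   <filled>
     <!-- Report how many results the engine returned -->
     <log>Number of results: <value expr="application.lastresult$.length"/></log>
   </filled>
 </field>

With these settings, the speech-recognition engine can return at most 3 distinct utterances, each with at most 2 interpretations.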

Using N-Best Recognition

N-best recognition can be invoked whenever the user's spoken input is matched against a grammar. By default, N-best recognition is disabled. If you want to use this feature, you must explicitly enable it. After recognition in which this feature is enabled, you check to see whether more than one result was recognized. If so, you can prompt the user to select among the possible results.

Enabling N-Best Recognition

You enable N-best recognition by setting the maxnbest property to a value greater than one. If only N-best recognition is enabled, the value is the maximum number of distinct utterances that the speech-recognition engine should return. If multiple interpretations is also enabled, the interpretation of the maxnbest value depends on the value of the bevocal.maxinterpretations property. This section describes using N-best recognition alone. Using Both Features Together describes how to combine N-best recognition with multiple interpretations.

By default, when you set maxnbest to a number greater than one, you enable both N-best recognition and multiple interpretations. To disable multiple interpretations, set the bevocal.maxinterpretations property to 1.
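
For example, the following property settings, placed in a field or form, request up to three alternative utterances while keeping a single interpretation per utterance (a minimal sketch of the settings just described):

 <!-- N-best recognition only: up to 3 utterances, one interpretation each -->
 <property name="maxnbest" value="3"/>
 <property name="bevocal.maxinterpretations" value="1"/>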

When N-best recognition is enabled, the speech-recognition engine may find multiple utterances. The most common use of N-best recognition is in recognizing input in <field> and <initial> elements. It also can be used in recognizing input that matches a <link> or <choice> grammar.

N-best recognition slows down the recognition process; you should enable this feature only when you need it. For example, you might enable it for a particular field or form in which you anticipate that user inputs might sound similar to more than one expected response. You should set maxnbest to a fairly small number and your application should be able to handle the specified number of results.

Checking for Multiple Utterances

After speech recognition occurs while N-best recognition is enabled, you should check whether multiple likely utterances were found.

 •  If recognition occurs in a <field> element, you check the results in the <filled> element of that field.
 •  If recognition occurs in an <initial> element, you check the results in the <filled> element of the containing form.
 •  If recognition occurs in a <link> or <choice> element, you check the results in a <block> at the top of the dialog or document to which the <link> or <choice> element sent you.

In the most common case, you check for multiple utterances to decide how to set input variables following speech recognition in a <field> or <initial> element. Whether or not N-best recognition is enabled, the most likely recognized utterance is used to set relevant input variables. If the most likely utterance matches more than one grammar rule, the relevant input variables are set according to an arbitrary one of those rules.

To check whether more than one result was found, you examine the application.lastresult$ array, which may contain up to maxnbest elements; in most cases, fewer results are returned.

The application.lastresult$ array contains at least one element, namely, application.lastresult$[0]. You can check application.lastresult$.length to see how many elements are in the array. For a given index i, application.lastresult$[i] is undefined if the array contains no object at that index.

If you find that only one result was returned, you do not need to take special actions; you can use the results in the input variables just as if N-best recognition were disabled. Otherwise, you can ask the user to select among the various results.
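
For example, the <filled> element of a field might branch on the number of results along the following lines. This is a sketch; myResults is assumed to be a form-level variable, as in the complete example later in this section.

 <filled>
   <if cond="application.lastresult$.length &gt; 1">
     <!-- Several likely utterances; save them for disambiguation -->
     <assign name="myResults" expr="application.lastresult$"/>
   <else/>
     <!-- A single utterance; the input variable is already set -->
   </if>
 </filled>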

Selecting an Utterance

Once you have determined that the application.lastresult$ array contains more than one result, your application can interact with the user to determine which result was intended. Each object in the array corresponds to one likely result; its utterance property is the recognized utterance and its interpretation property is the interpretation of that utterance.

You can ask the user to select among the possible results. Each result corresponds to a different possible utterance; the utterances are ordered by speech-recognition engine confidence level.

After the user selects a result, you can set input variables accordingly. For example, if the user selects the third recognition result, the interpretation is in:

 application.lastresult$[2].interpretation

The interpretation has a property for each slot that is filled in by the matching grammar rule. You can access these properties to get the values for input variables. Typically, a slot name is identical to the name of an input variable. For example, the value for the city field of the interpretation is in:

 application.lastresult$[2].interpretation.city

If speech recognition occurs in a <link> or <choice> element, you typically don't use the selected result to set input variables. Instead, you use it to decide which dialog or document to visit.

Example

This application allows a user to schedule a visit with one of the company's offices, identified by the city where the office is located. The grammar includes three cities whose names have somewhat similar sounds: Austin, Boston, and Houston. To allow for the situation in which the speech-recognition engine cannot distinguish among those names, the maxnbest property for the office field is set to 3. Note that N-best recognition is enabled only during interpretation of the office field.

If the application receives more than one recognition result for the office field, it prompts the user to select a number corresponding to one of the possible utterances. It also lets the user start over (in case none of the possible utterances is correct).

The application keeps track of the number of recognized utterances. If the user gives an inappropriate number when asked for clarification, the application prompts again.

Sample Interactions

In this interaction, the user's answer is clear.

Application:

Which office would you like to visit?

User:

Denver.

Application:

Scheduling a visit to the Denver office.

In this interaction, the application cannot distinguish among possible responses.

Application:

Which office would you like to visit?

User:

(Garbled) estin.

Application:

Please answer 1 if you said Austin; 2 if you said Boston; 3 if you said Houston; if you want to start over, answer 0.

User:

Two.

Application:

Scheduling a visit to the Boston office.

In this interaction, the user wants to give the city again instead of selecting one of the options.

Application:

Which office would you like to visit?

User:

(Garbled) ahstin.

Application:

Please answer 1 if you said Boston; 2 if you said Austin; if you want to start over, answer 0.

User:

Zero.

Application:

Which office would you like to visit?

User:

Houston.

Application:

Scheduling a visit to the Houston office.

In this interaction, the user enters an invalid selection when asked for clarification.

Application:

Which office would you like to visit?

User:

(Garbled) ahstin.

Application:

Please answer 1 if you said Boston; 2 if you said Austin; if you want to start over, answer 0.

User:

Three.

Application:

Unrecognized option.

Please answer 1 if you said Boston; 2 if you said Austin; if you want to start over, answer 0.

User:

Two.

Application:

Scheduling a visit to the Austin office.

Application Code

 <?xml version="1.0" ?>
 <!DOCTYPE vxml PUBLIC "-//BeVocal Inc//VoiceXML 2.0//EN"
  "http://cafe.bevocal.com/libraries/dtd/vxml2-0-bevocal.dtd">
 <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
   <form>
     <var name="myResults"/> <!-- Array of recognition results -->
     <var name="choicePrompt"/> <!-- Prompt for clarification -->
     <!-- The listResults function creates a prompt for clarification. -->
     <!-- The findResult function returns the utterance chosen by the user. -->
     <script>
       <![CDATA[
         function listResults(allResults) {
           var promptmsg = "Please answer ";
           var promptIndex = 1;
           for (var i = 0; i < allResults.length; i++) {
             promptmsg = promptmsg + promptIndex + " if you said " +
                   allResults[i].utterance + "; ";
             ++promptIndex;
           }
           promptmsg = promptmsg + "If you want to start over, answer 0."
           return promptmsg;
         }
         function findResult(allResults, strindex) {
           return allResults[strindex - 1].utterance;
         }
       ]]>
     </script>
     <field name="office" >
       <property name="maxnbest" value="3"/>
       <property name="bevocal.maxinterpretations" value="1"/>
       <grammar type="application/x-nuance-gsl">
         <![CDATA[([
           ( austin )  { <office austin> }
           ( boston )  { <office boston> }
           ( chicago ) { <office chicago> }
           ( denver )  { <office denver> }
           ( houston ) { <office houston> }
         ])]]>
       </grammar>
       <prompt>Which office would you like to visit?</prompt>
       <filled> 
         <if cond="application.lastresult$.length &gt; 1">
           <!-- More than one recognition result was returned. -->
           <assign name="myResults" expr="application.lastresult$"/>
           <!-- Construct prompt from the recognized utterances. -->
           <assign name="choicePrompt" expr="listResults(myResults)"/>
         <else/>
           <!-- Only one result; skip the "choice" field. -->
           <assign name="choice" expr="0"/>
         </if>
       </filled>
     </field>
     <field name="choice" type="digits">
       <prompt> <value expr="choicePrompt"/> </prompt>
       <filled>
         <if cond="choice == 0">
           <!-- Start over. -->
           <clear/>
         <elseif cond="choice &gt; myResults.length"/>
           <!-- No such utterance; prompt again. -->
           <prompt>Unrecognized option.</prompt>
           <clear namelist="choice"/>
         <else/>
           <assign name="office" expr="findResult(myResults, choice)"/>
         </if>
       </filled>
     </field>
     <block>
       <prompt>Scheduling a visit with the <value expr="office"/> 
office</prompt>
     </block>
   </form>
 </vxml>  

Using Multiple Interpretations

If the grammar that is used for a particular field or form is ambiguous, you can enable multiple interpretations when the user's spoken input is matched against that grammar. By default, multiple interpretations is disabled. If you want to use this feature, you must explicitly enable it.

After recognition in which this feature is enabled, you check to see whether more than one interpretation was found. If so, you can prompt the user to select among the possible interpretations.

Enabling Multiple Interpretations

You enable multiple interpretations by setting the bevocal.maxinterpretations property to a value other than one. If only multiple interpretations is enabled, the value is the maximum number of distinct interpretations that the speech-recognition engine should return. If N-best recognition is also enabled, the value of bevocal.maxinterpretations is used in conjunction with the value of the maxnbest property to determine how many results to return. This section describes using multiple interpretations alone. Using Both Features Together describes how to combine N-best recognition with multiple interpretations.

When multiple interpretations is enabled, if the user's utterance matches more than one rule in an ambiguous grammar, all corresponding interpretations are included in the recognition results.

The most common use of ambiguous grammars is in recognizing input in <field> and <initial> elements. You typically should avoid using an ambiguous <link> or <choice> grammar. The remainder of this chapter assumes that multiple interpretations are found for spoken input in a <field> or <initial> element.

You should enable multiple interpretations only when you need it, namely in a particular field or form in which you use an ambiguous grammar. Your application should be able to handle the specified number of interpretations.
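
For example, a field might combine the property setting with an ambiguous grammar along the lines of the Portland grammar shown earlier. This is a sketch; the slot values are illustrative only.

 <field name="city">
   <!-- Return up to 2 interpretations of the most likely utterance -->
   <property name="bevocal.maxinterpretations" value="2"/>
   <!-- Ambiguous grammar: "portland" by itself matches both rules -->
   <grammar type="application/x-nuance-gsl">
     <![CDATA[([
       ( portland ?maine )  { <city portland> <state maine> }
       ( portland ?oregon ) { <city portland> <state oregon> }
     ])]]>
   </grammar>
   <prompt>Which city?</prompt>
 </field>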

Checking for Multiple Interpretations

You should check for multiple interpretations in the <filled> element of a field or mixed-initiative form that has an ambiguous grammar. If the user's response matched more than one grammar rule, the relevant input variables are set according to an arbitrary one of those rules.

If multiple interpretations is enabled, you should check whether additional results were returned; the number of results returned from the speech-recognition engine does not necessarily equal bevocal.maxinterpretations.

To check whether more than one result was found, you examine the application.lastresult$ array, which may contain up to bevocal.maxinterpretations elements; in most cases, fewer results are returned.

The application.lastresult$ array contains at least one element, namely, application.lastresult$[0]. You can check application.lastresult$.length to see how many elements are in the array. For a given index i, application.lastresult$[i] is undefined if the array contains no object at that index.

If you find that only one result was returned, you do not need to take special actions; you can use the values of the input variables just as if multiple interpretations were disabled. Otherwise, you can ask the user to select among the various interpretations.

Selecting an Interpretation

Once you have determined that the application.lastresult$ array contains more than one result, your application can interact with the user to determine which result was intended. Each object in the array corresponds to one likely result; its utterance property is the recognized utterance and its interpretation property is the interpretation of that utterance.

You can ask the user to select among the possible interpretations. Each result corresponds to a different interpretation of the most likely utterance; the different interpretations are in an undefined order.

After the user selects a result, you can set input variables accordingly. For example, if the user selects the third recognition result, the interpretation is in:

 application.lastresult$[2].interpretation

The interpretation has a property for each slot that is filled in by the matching grammar rule. You can access these properties to get the values for input variables. Typically, a slot name is identical to the name of an input variable. For example, the value for the city field of the interpretation is in:

 application.lastresult$[2].interpretation.city

Example

This application prompts the user for an employee. The grammar allows the user to identify an employee by first name only, by first name and last name, by nickname, or by nickname and last name. The first name "Robert" is ambiguous: it could mean either Bob Smith or Rob Black. To allow for an ambiguous answer, the bevocal.maxinterpretations property for the employee field is set to 2. Multiple interpretations is enabled only during interpretation of the employee field.

Sample Interactions

In this interaction, the user's answer is unambiguous.

Application:

Which employee do you want to call?

User:

Alice.

Application:

Placing call to Alice Brown.

In this interaction, the user gives an ambiguous name.

Application:

Which employee do you want to call?

User:

Robert.

Application:

Please say 1 if you mean Robert Smith; 2 if you mean Robert Black.

User:

One.

Application:

Placing call to Robert Smith.

In this interaction, the user enters an invalid selection when asked for clarification.

Application:

Which employee do you want to call?

User:

Robert.

Application:

Please say 1 if you mean Robert Smith; 2 if you mean Robert Black.

User:

Three.

Application:

Unrecognized option.

Please say 1 if you mean Robert Smith; 2 if you mean Robert Black.

User:

Two.

Application:

Placing call to Robert Black.

Application Code

 <?xml version="1.0" ?>
 <!DOCTYPE vxml PUBLIC "-//BeVocal Inc//VoiceXML 2.0//EN"
  "http://cafe.bevocal.com/libraries/dtd/vxml2-0-bevocal.dtd">
 
 <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
   <form>
   
     <var name="myResults"/>     <!-- Array of recognition results -->
     <var name="nInterps"/>      <!-- Number of interpretations to consider -->
     <var name="choicePrompt"/>  <!-- Prompt for clarification -->
 
     <script>
       <![CDATA[
   
         // Create a prompt for clarification, asking the user to choose
         // an interpretation of the most likely utterance
         
         function listInterps(allInterps) {
           var promptmsg = "Please say ";
           var promptIndex = 0;
           for (var i = 0; i < allInterps.length; i++) {
             if (allInterps[i].utterance == allInterps.utterance) { 
               promptmsg = promptmsg + ++promptIndex + " if you mean " +
               allInterps[i].interpretation.employee + "; ";
             }
           }
           nInterps = promptIndex;
           return promptmsg;    
         }
 
         // Return the interpretation chosen by the user
         
         function findInterp(allInterps, strindex) {
           return allInterps[strindex - 1].interpretation.employee;
         }
 
         // Count the number of recognition results whose utterance matches 
         // the most likely utterance
         
         function countInterps(allResults) {
           var i, c = 0;
           for (i = 0; i < allResults.length; i++) {
             if (allResults[i].utterance == allResults.utterance) {
               c++;
             }
           }
           return c; 
         }
       ]]>
     </script>
 
     <field name="employee" >
       <property name="bevocal.maxinterpretations" value="2"/>
 
       <!-- Note that grammar is ambiguous if the user says "Robert". -->
       <grammar type="application/x-nuance-gsl">
         <![CDATA[([
           ( alice  ?brown ) {<employee "alice brown"> }
           ( robert ?smith ) {<employee "robert smith"> }
           ( bob    ?smith ) {<employee "robert smith"> }
           ( robert ?black ) {<employee "robert black"> }
           ( rob    ?black ) {<employee "robert black"> }
           ( joe    ?jones ) {<employee "joseph jones"> }
           ( joseph ?jones ) {<employee "joseph jones"> }
         ])]]>
       </grammar>
       <prompt>Which employee do you want to call?</prompt> 
       <filled>      
         <var name="count" expr="countInterps(application.lastresult$)"/>
         <if cond="count &gt; 1">
           <!-- More than one recognition result was returned. -->
           <assign name="myResults" expr="application.lastresult$"/>
           <!-- Construct prompt from the possible interpretations. -->
           <assign name="choicePrompt" expr="listInterps(myResults)"/>
         <else/>
           <!-- Only one interpretation; skip the "choice" field. -->
           <assign name="choice" expr="0"/>
         </if>
       </filled>
     </field>
 
     <field name="choice" type="digits">
       <prompt> <value expr="choicePrompt"/> </prompt>
       <filled>
         <if cond="choice == 0 || choice &gt; nInterps">
           <!-- No such interpretation; try again. -->
           <prompt>Unrecognized option.</prompt>
           <clear namelist="choice"/>
         <else/>
           <assign name="employee" expr="findInterp(myResults, choice)"/>
         </if>
       </filled>
     </field>
 
     <block>
       <prompt>Placing call to <value expr="employee"/></prompt>
     </block>
   </form>
 </vxml>   

Using Both Features Together

You can enable both multiple-recognition features for a particular field or form that has an ambiguous grammar and in which you anticipate that user inputs might sound similar to more than one expected response.

Enabling Both Features

You enable N-best recognition by setting the maxnbest property to a value greater than one. You enable multiple interpretations by setting the bevocal.maxinterpretations property to a value other than one. Speech-recognition results are returned in the application.lastresult$ array. Each element corresponds to one interpretation of one likely utterance. The same utterance may have different interpretations, and two or more different utterances may have a common interpretation.

The values of the two properties are used together to limit the number of results that are returned by the speech-recognition engine.

 •  If bevocal.maxinterpretations is undefined or less than one, up to maxnbest results are returned.
 •  If bevocal.maxinterpretations is greater than one, up to maxnbest distinct utterances are returned, each of which can have up to bevocal.maxinterpretations distinct interpretations. The maximum number of results is the product of maxnbest and bevocal.maxinterpretations.

For example, if maxnbest is 3 and bevocal.maxinterpretations is 0, a maximum of 3 results can be returned; if maxnbest is 3 and bevocal.maxinterpretations is 2, a maximum of 6 results can be returned.
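
For example, the following settings (a minimal sketch) request up to 3 distinct utterances with up to 2 interpretations each, for a maximum of 6 results:

 <property name="maxnbest" value="3"/>
 <property name="bevocal.maxinterpretations" value="2"/>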

Checking for Multiple Results

You should check for multiple results in the <filled> element of a field or mixed-initiative form for which both multiple-recognition features are enabled. As always, the most likely recognized utterance is used to set relevant input variables. If the most likely utterance matches more than one grammar rule, the relevant input variables are set according to an arbitrary one of those rules.

To check whether more than one result was found, you examine the application.lastresult$ array.

The application.lastresult$ array contains at least one element, namely, application.lastresult$[0]. You can check application.lastresult$.length to see how many elements are in the array. For a given index i, application.lastresult$[i] is undefined if the array contains no object at that index.

If you find that only one result was returned, you do not need to take special actions; you can use the results in the input variables just as if multiple recognition were disabled. Otherwise, you can ask the user to select among the various results.

Selecting a Result

Once you have determined that the application.lastresult$ array contains more than one result, your application can interact with the user to determine which result was intended. Each object in the array corresponds to one likely result; its utterance property is the recognized utterance and its interpretation property is the interpretation of that utterance.

You can ask the user to select among the possible results. You should assume that the different recognition results may correspond to different possible utterances as well as different interpretations of some utterances. Elements for different possible utterances are ordered by speech-recognition engine confidence level; elements for the different interpretations of a given utterance are in an undefined order.

After the user selects a result, you can set input variables accordingly. For example, if the user selects the third recognition result, the interpretation is in:

 application.lastresult$[2].interpretation

The interpretation has a property for each slot that is filled in by the matching grammar rule. You can access these properties to get the values for input variables. Typically, a slot name is identical to the name of an input variable. For example, the value for the city field of the interpretation is in:

 application.lastresult$[2].interpretation.city

Simple Example

This application allows a user to schedule a visit with one of the company's offices, identified by the city where the office is located. The grammar includes three cities whose names have somewhat similar sounds: Austin, Boston, and Houston. The grammar allows the user to identify an office by city only or by city and state. The city "Austin" is ambiguous; the company has offices in both Austin, Texas and Austin, California.

To allow for the situation in which the speech-recognition engine cannot distinguish among similar-sounding names, or recognizes an ambiguous answer, the maxnbest property for the office field is set to 4. The bevocal.maxinterpretations property is not set; because it is undefined by default, multiple interpretations is also enabled and the maximum number of results is 4. Multiple recognition is enabled only during interpretation of the office field.

If the application receives more than one recognition result for the office field, it prompts the user to select a number corresponding to one of the possible results. Any result whose confidence level is within 0.3 of the highest confidence level is considered.

If the grammar included rules in which different similar-sounding utterances could produce the same interpretation, the application could ensure that it only asks the user about unique interpretations.
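
For example, a script function along the following lines could collect only the distinct interpretations before the application builds its clarification prompt. This is a sketch, not part of the example code below; it assumes that each interpretation fills an office slot, and uniqueOffices is a hypothetical helper name.

 <script>
   <![CDATA[
     // Hypothetical helper: return the distinct values of the "office"
     // slot across all recognition results, preserving their order.
     function uniqueOffices(allResults) {
       var unique = new Array();
       for (var i = 0; i < allResults.length; i++) {
         var value = allResults[i].interpretation.office;
         var seen = false;
         for (var j = 0; j < unique.length; j++) {
           if (unique[j] == value) {
             seen = true;
           }
         }
         if (!seen) {
           unique[unique.length] = value;
         }
       }
       return unique;
     }
   ]]>
 </script>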

Sample Interactions

In this interaction, the application cannot distinguish between two possible responses, one of which is ambiguous.

Application:

Which office would you like to visit?

User:

(Garbled) ahstin.

Application:

Please say 1 if you mean Boston Massachusetts; 2 if you mean Austin Texas; 3 if you mean Austin California. If you want to start over, answer 0.

User:

Two.

Application:

Scheduling a visit with the Austin Texas office.

Application Code

 <?xml version="1.0" ?>
 <!DOCTYPE vxml PUBLIC "-//BeVocal Inc//VoiceXML 2.0//EN"
  "http://cafe.bevocal.com/libraries/dtd/vxml2-0-bevocal.dtd">
 <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
   <form>
     <var name="myResults"/>     <!-- Array of recognition results -->
     <var name="nInterps"/>      <!-- Number of interpretations to consider -->
     <var name="choicePrompt"/>  <!-- Prompt for clarification -->
     <!-- The listInterps function creates a prompt for clarification,   -->
     <!-- asking the user to choose an interpretation of one of the      -->
     <!-- likely utterances. It considers any utterance whose confidence -->
     <!-- is within 0.3 of the highest confidence level (that of the     -->
     <!-- first array element). The findInterp function returns the      -->
     <!-- interpretation chosen by the user.                             -->
     <script>
       <![CDATA[
         function listInterps(allInterps) {
           var promptmsg = "Please say ";
           var index = 0;
           var promptIndex = 1;
           var maxConfidence = allInterps[0].confidence;
           while (allInterps[index] != undefined &&
               maxConfidence - allInterps[index].confidence < 0.3) {
             promptmsg = promptmsg + promptIndex + " if you mean " +
                   allInterps[index].interpretation.office + "; ";
               ++index;
               ++promptIndex;
           }
           nInterps = index;
           promptmsg = promptmsg + "if you want to start over, say 0."
           return promptmsg;
         }
         function findInterp(allInterps, strindex) {
           return allInterps[strindex - 1].interpretation.office;
         }
       ]]>
     </script>
 
     <field name="office" >
       <property name="maxnbest" value="4"/>
       <!-- Note that grammar is ambiguous if the user says "Austin". -->
       <grammar type="application/x-nuance-gsl">
         <![CDATA[([
           ( austin  ?texas )         {<office "austin texas"> }
           ( austin  ?california )    {<office "austin california"> }
           ( boston  ?massachusetts ) {<office "boston massachusetts"> }
           ( chicago ?illinois )      {<office "chicago illinois"> }
           ( denver  ?colorado )      {<office "denver colorado"> }
           ( houston ?texas )         {<office "houston texas"> }
         ])]]>
       </grammar>
       <prompt>Which office would you like to visit?</prompt>
       <filled> 
         <if cond="application.lastresult$.length &gt; 1 &amp;&amp;
             application.lastresult$[0].confidence - 
             application.lastresult$[1].confidence &lt; 0.3">
           <!-- More than one likely recognition result was returned. -->
           <assign name="myResults" expr="application.lastresult$"/>
           <!-- Construct prompt from the possible interpretations. -->
           <assign name="choicePrompt" expr="listInterps(myResults)"/>
         <else/>
           <!-- Only one result; skip the "choice" field. -->
           <assign name="choice" expr="0"/>
         </if>
       </filled>
     </field>
     <field name="choice" type="digits">
       <prompt> <value expr="choicePrompt"/> </prompt>
       <filled>
         <if cond="choice == 0">
           <!-- Start over. -->
           <clear/>
         <elseif cond="choice &gt; nInterps"/>
           <!-- No such utterance; prompt again. -->
           <prompt>Unrecognized option.</prompt>
           <clear namelist="choice"/>
         <else/>
           <assign name="office" expr="findInterp(myResults, choice)"/>
         </if>
       </filled>
     </field>
     <block>
       <prompt>Scheduling a visit with the <value expr="office"/> 
office</prompt>
     </block>
   </form>
 </vxml>  

Generating a Subdialog

The preceding examples asked the user to enter a number corresponding to the intended response. You can produce a more sophisticated interaction by generating a subdialog from the value of application.lastresult$ and using the subdialog to request disambiguation.

This application consists of a mixed-initiative form that prompts the user for a city and state. As in the preceding example, the grammar is ambiguous and some possible city names sound similar. After the user has filled in the city and state, the application checks application.lastresult$ to see whether multiple recognition results were found and, if so, whether the confidence levels of the first two results are within 0.3 of each other. If so, the application calls a subdialog, which is generated from the value of application.lastresult$ by a Perl script. The Perl script receives the array of recognition results as a POST parameter named results.

Sample Interaction

In this interaction, the application cannot distinguish between two possible responses, both of which are ambiguous.

Application:

Please name a city and state.

User:

(Garbled) ahstin.

Application:

I didn't quite get that. Please say "that one" when you hear the city you want.

Boston, Massachusetts. (Pause)

Boston, Maine. (Pause)

Austin, Texas.

User:

Yes.

Application:

You chose Austin Texas.

Application Code

 <?xml version="1.0" ?>
 <!DOCTYPE vxml PUBLIC "-//BeVocal Inc//VoiceXML 2.0//EN"
  "http://cafe.bevocal.com/libraries/dtd/vxml2-0-bevocal.dtd">
 
 <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
   <form>
     <property name="maxnbest" value="5"/>
     <var name="results"/>
 
     <!-- This mixed-initiative form's grammar does the       -->
     <!-- initial job of filling in the city and state fields -->
     <grammar mode="voice" type="application/x-nuance-gsl">
       <![CDATA[([
         ( austin  ?california )    { <city austin> <state california> }
         ( austin  ?texas )         { <city austin> <state texas> }
         ( boston  ?maine )         { <city boston> <state maine> }
         ( boston  ?massachusetts ) { <city boston> <state massachusetts> }
         ( chicago ?illinois )      { <city chicago> <state illinois> }
         ( denver  ?colorado )      { <city denver> <state colorado> }
         ( houston ?texas )         { <city houston> <state texas> }
       ])]]>
     </grammar>
     <initial>
       <prompt>Please name a city and state</prompt>
     </initial>
     <field name="city">
       Choose a city
       <grammar type="application/x-nuance-gsl">
         [ austin boston chicago denver houston]
       </grammar>
     </field>
     <field name="state">
       Which state?
       <grammar type="application/x-nuance-gsl">
         [ california colorado illinois maine massachusetts texas]
       </grammar>
     </field>
     <filled namelist="city state" mode="all">
       <!-- Both city and state have been filled in. Check  -->
       <!-- whether multiple recognition results need to be -->
       <!-- disambiguated.  If so, execute subdialog.       -->
       <log>Last result is <value expr="application.lastresult$"/> </log>
 
       <if cond="application.lastresult$.length &gt; 1 &amp;&amp;
           application.lastresult$[0].confidence - 
           application.lastresult$[1].confidence &lt; 0.3">
         <assign name="results" expr="application.lastresult$"/>
         <goto nextitem="disambig"/>
       </if>
     </filled>
 
     <!-- This subdialog, which is generated by a Perl script -->
     <!-- from value of results, does the actual work         -->
     <!-- of disambiguation.  Since it has cond="false", it   -->
     <!-- is never executed unless there is an explicit       -->
     <!-- transition to it, as in the <filled> element above. -->
     <subdialog name="disambig" cond="false"
         src="disambig.pl" 
         namelist="results" method="post" >
       <filled>
         <!-- The subdialog is finished; put its return   -->
         <!-- values in the field instance variables.     -->
         <assign name="city"  expr="disambig.city"/>
         <assign name="state" expr="disambig.state"/>
       </filled>
     </subdialog>
     <block>
       <prompt>You chose <value expr="city"/>, <value expr="state"/>.</prompt>
     </block>
   </form>
 </vxml>  

Perl Script

 #!/usr/local/bin/perl5
 
 # This sample Perl-based CGI demonstrates a server-side technique
 # for disambiguating the multiple utterances or multiple interpretations
 # from speech recognition.  This script assumes the recognition was
 # from a grammar that filled two slots: "city" and "state".
 #
 # The expected HTTP request parameters are:
 #   results.length  - The number of results from the recognition.
 #
 #   For i from 0 to results.length - 1:
 #   results[i].confidence           - The confidence of this result, 0 to 1
 #   results[i].interpretation.city  - The city recognized for this result
 #   results[i].interpretation.state - The state recognized for this result
 
 use CGI;
 
 # Print out the HTTP content-type header, then the XML and VoiceXML headers
 #
 print "Content-type: text/xml\n\n";
 print "<?xml version=\"1.0\"?>\n\n";
 print "<!DOCTYPE vxml PUBLIC \"-//BeVocal Inc//VoiceXML 2.0//EN\"\n";
 print "\"http://cafe.bevocal.com/libraries/dtd/vxml2-0-bevocal.dtd\">\n\n";
 print "<vxml version=\"2.0\">\n";
 
 # Print the beginning of the disambiguation dialog
 #
 print "  <form>\n";
 print "    <block>\n";
 print "      I didn't quite get that.\n";
 print "      Please say 'that one' when you hear the city you want.\n";
 print "    </block>\n";
 
 # Create two variables to store the disambiguation results
 #
 print "    <var name=\"city\"/>\n";
 print "    <var name=\"state\"/>\n";
 
 # Figure out how many results we have to deal with
 $length = CGI::param("results.length");
 
 # Save the maximum confidence of any result (which we know is always the
 # first one)
 #
 $maxConfidence =  ( CGI::param("results[0].confidence") );
 
 for ($i = 0; $i < $length; $i++) {
 
   my ($fname) = ( "f$i" );  # field name
   my ($city)  = ( CGI::param("results[$i].interpretation.city") );
   my ($state) = ( CGI::param("results[$i].interpretation.state") );
   my ($level) = ( CGI::param("results[$i].confidence") );
 
   # If the confidence of this result is close enough to the maximum
   # one, then use it in the disambiguation
   #
   if (($maxConfidence - $level) < 0.3) {
 
     # Generate a field that prompts this city/state name
     # and then pauses briefly to let the user say "that one".
     # If the user remains silent, our custom <noinput> handler
     # simply goes to the next field.
     #
     print "    <field name=\"$fname\">\n";
     print "      <grammar>[ yes ( that one ) ]</grammar>\n";
     print "      <prompt timeout=\"0.75s\">\n";
     print "        $city, $state\n";
     print "      </prompt>\n";
 
     # If the user said "that one" (or "yes"), fill in the result variables.
     # The form-level <filled> block below will then return them.
     #
     print "      <filled>\n";
     print "        <assign name=\"city\"  expr=\"'$city'\"/>\n";
     print "        <assign name=\"state\" expr=\"'$state'\"/>\n";
     print "      </filled>\n";
 
     # The user didn't say anything.  Tell the interpreter to go to the
     # next field by setting this field's form item variable to true.
     #
     print "      <noinput>\n";
     print "        <reprompt/>\n";
     print "        <assign name=\"$fname\" expr=\"true\"/>\n";
     print "      </noinput>\n";
     print "    </field>\n";
   } # End if 
 }; # End for 
 
 # If we get here, it means the user didn't say "that one" on any of the fields.
 # Clear them all out and try again.
 #
 print "    <block>\n";
 print "      <clear/>\n";
 print "    </block>\n";
 
 # This form-level filled block returns the city and state variables
 # as soon as any of the fields are filled by the user saying "that one"
 #
 print "    <filled mode=\"any\">\n";
 print "      <return namelist=\"city state\"/>\n";
 print "    </filled>\n";
 
 print "  </form>\n";
 print "</vxml>\n";
 print "\n";
 
 exit 0;
   

