The BeVocal VoiceXML Voice Enrollment facility allows the interpreter to convert the user's utterance directly to an ABNF grammar, without going through the intermediate text representation, and to recognize utterances against that grammar. This chapter describes:
- Text-Based Grammars
- Limitations of Text-Based Grammars
- Enrollment Basics
- Usage Model
- Recognizing Enrolled Grammars
- Deleting Enrolled Grammars
Note: The Voice Enrollment facility is an experimental extension to VoiceXML; its implementation and behavior are subject to change. The current BeVocal VoiceXML implementation contains the feature before it has been standardized so that developers may provide feedback. If this capability becomes a standard part of a future version of VoiceXML, the BeVocal VoiceXML implementation will change as necessary to match the VoiceXML standard.
When a VoiceXML application interacts with its user, the application typically plays some prompts, waits for the user to speak, and then recognizes the user's utterance against one or more active grammars. As described earlier, you can specify these grammars explicitly in one of the standard grammar languages, or they can be provided implicitly by a built-in field or grammar type. For example, the following snippet of VoiceXML uses an inline ABNF grammar in one field and a built-in number grammar in the next.
<form>
<field>
Say a color
<grammar>
#ABNF 1.0 en-US;
root $color;
$color = red | green | blue;
</grammar>
</field>
<field>
Say a number
<grammar src="builtin:grammar/number"/>
</field>
</form>
These types of text-based grammars give enough flexibility for the vast majority of applications. Because grammars can be loaded from external URLs and those external URLs can point to JSPs or CGIs that generate the grammars dynamically, you can write an application even when the exact grammars that are needed won't be known until runtime.
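As a minimal sketch of the dynamic case, a field can point its grammar at a server-side script that generates the grammar text for the current caller. The URL, the userid parameter, and the field name below are hypothetical; only the src attribute of <grammar> comes from the examples in this chapter.

```xml
<field name="contact">
  Who do you want to call?
  <!-- Hypothetical URL: the JSP is assumed to return a complete
       grammar generated for this user at request time. -->
  <grammar src="http://www.example.com/grammars/contacts.jsp?userid=1234"/>
</field>
```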
The one limitation to text-based VoiceXML grammars is that they can't handle recognition of phrases for which you do not have a text representation. For example, imagine the following user interaction:
| User | |
| Computer | |
| User | |
| Computer | |
| User | |
| User | |
| |
In this application, the user's address book grammar needs to be generated dynamically at runtime. However, there's a catch. The application can't simply generate a grammar with the text "jane smyth" in it, because it doesn't know that's the text it should use.
The application could use the <record> tag to get a WAV file containing the name of the person to be added to the address book, but there is still no way to transcribe the sounds in the WAV file into the text to insert in a grammar. In general, VoiceXML provides no speech-to-text facility for converting arbitrary recordings into text.
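For comparison, a <record> item along the following lines would capture the spoken name, but only as audio. This is a sketch: the item name and the attribute values are illustrative assumptions.

```xml
<!-- Captures the caller's speech as a recording. The result is audio
     only; nothing here produces text that could go into a grammar. -->
<record name="newname" beep="true" maxtime="5s">
  <prompt>After the beep, say the name you want to add.</prompt>
</record>
```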
What is needed is a way to convert the user's utterance directly to a grammar, without going through the intermediate text representation.
You may know that a speech grammar is ultimately a statistical model of the low-level phonemes that make up the phrases the user is expected to say. A similar model can be built from a user's utterance. The Voice Enrollment facility puts these pieces together and lets you create grammars based on user utterances.
To use voice-generated grammars, you first create a grammar in your application. When the application runs, you have the user create (or enroll) a phrase in the grammar. (The phrase is often a name, for example an address book entry.) At that time, the interpreter collects two or more utterances of each phrase from the user and uses those utterances to build a statistical model of the phrase. The interpreter associates a phrase id with each phrase; the phrase id identifies the phrase if it is recognized later when the grammar is activated.
You use the <bevocal:enroll> tag to enroll phrases in a grammar:
<bevocal:enroll name="enroll"
grammarname="addresses"
speakeridexpr="'1234'"
phraseidexpr="'firstname'">
<prompt count="1">
Please say the name you want to add</prompt>
<prompt count="2">Please say the name again</prompt>
</bevocal:enroll>
With this snippet, the system will prompt the user until it gets enough utterances to have a good statistical model of the phrase. If the model doesn't converge, an error will eventually be thrown.
When you want to perform recognition with a grammar containing enrolled phrases, you refer to the grammar using a special syntax in the ABNF grammar format. (Currently, enrolled grammar access is supported only in ABNF; in a future release, the interpreter will support the XML grammar format.) For example:
<field>
Who do you want to call?
<grammar>
<![CDATA[
#ABNF 1.0 en-US;
root $call;
$call = call $<enrolled:/addresses?speaker=1234>
[on [his | her] (cell | home) phone];
]]>
</grammar>
</field>
This ABNF grammar contains a reference to the enrolled grammar named "addresses". In addition, it identifies the current speaker's ID as 1234. Because enrolled phrases are speaker-dependent, the speaker ID is a required parameter. For more information, see Recognizing Enrolled Grammars.
See the <bevocal:enroll> tag in the VoiceXML Programmer's Guide for details of the syntax, attribute definitions, and definitions of shadow variables and exceptions for this tag.
The <bevocal:enroll> tag must collect several utterances until it has a consistent statistical model of the phrase the user is trying to enroll. In this it is unlike other VoiceXML input items, which only collect a single piece of input from the user. This could have been implemented as one atomic operation; that is, the tag would begin execution, collect as many utterances as it needs, and then return control to the Form Interpretation Algorithm. Instead, the implementation causes collection of each utterance to be done with a separate iteration of the FIA.
The first time a <bevocal:enroll> item is visited, it collects one utterance and then returns control to the FIA. Because a consistent enrollment requires at least two matching utterances (the minimum value of minconsistencies is 2), the input variable is not set after this first pass. The FIA's next iteration therefore selects the same <bevocal:enroll> item, which collects a second utterance. This continues until a consistent enrollment is achieved, maxtries is reached, or an error occurs.
A major advantage of this approach is that it gives you very fine-grained control over the enrollment behavior. In particular, you can use the <bevocal:enroll> item's prompt counter to supply tapered prompts for different iterations. For example:
<bevocal:enroll ... maxtries="5">
<prompt count="1">
Please say the name you want to add</prompt>
<prompt count="2">Please say the name again</prompt>
<prompt count="5">
I'm having trouble understanding you.
Please try one more time.</prompt>
<catch event="error.enrollment.max_tries">
...
</catch>
</bevocal:enroll>
You need to be aware that a <bevocal:enroll> item will typically be executed several times before it is successful. You will have to take this behavior into account if you want to manipulate the item's variable yourself, or if you use the item's cond attribute to control when it is executed.
Once enrollment succeeds, the input variable is filled with the audio from one of the user's consistent utterances. Your application can send the audio to its back-end server using <submit> or <data> and store the audio in a database or file for later use in the user interface. Since there is no way to retrieve a text representation of an enrolled phrase, the audio recording is very useful for user interface purposes, for example in messages like "Now calling Jane Smyth".
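The following sketch shows one way the recording might be sent to a back-end server once enrollment succeeds. The server URL and the phraseid variable are hypothetical; the enrollment attributes mirror the examples in this chapter. In VoiceXML, recorded audio is submitted with the post method and the multipart/form-data encoding.

```xml
<bevocal:enroll name="en1" grammarname="ADDRESSBOOK"
                speakeridexpr="'speaker10'" phraseidexpr="'tom'">
  <prompt count="1">Say a name</prompt>
  <prompt count="2">Say the name again</prompt>
  <filled>
    <!-- Send the phrase ID along with the audio so the server can
         store them together. The URL is a placeholder. -->
    <var name="phraseid" expr="'tom'"/>
    <submit next="http://www.example.com/addressbook/save.jsp"
            namelist="en1 phraseid" method="post"
            enctype="multipart/form-data"/>
  </filled>
</bevocal:enroll>
```

The server can then return the stored audio later, for example to play the name back in a confirmation prompt.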
When an enrollment utterance clashes with an existing phrase in the enrolled grammar, an error.enrollment.clash event is thrown. You can use the <bevocal:enroll> tag's shadow variables, name$.clash and name$.clashedPhraseIds, to find out how many clashes occurred and which phrase IDs the utterance clashed with.
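A clash handler might look like the sketch below. It assumes that the shadow variables can be read from a catch element inside the item; their exact types are not described here, so the example only logs them. The wording of the prompt and the decision to exit are illustrative choices.

```xml
<bevocal:enroll name="en1" grammarname="ADDRESSBOOK"
                speakeridexpr="'speaker10'" phraseidexpr="'tom'">
  <prompt count="1">Say a name</prompt>
  <prompt count="2">Say the name again</prompt>
  <catch event="error.enrollment.clash">
    <!-- Log the clash details for debugging (assumed readable here). -->
    <log>
      clashes: <value expr="en1$.clash"/>
      phrase IDs: <value expr="en1$.clashedPhraseIds"/>
    </log>
    That name sounds too much like one already in your address book.
    <exit/>
  </catch>
</bevocal:enroll>
```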
The bevocal.security.key property controls access to enrollment grammars. In this case, a security key can be thought of as a namespace that qualifies the grammarname attribute of the <bevocal:enroll> tag. Applications using one security key cannot access enrollment grammars created by an application using a second key, because their grammars live in separate namespaces. When you develop applications for one of BeVocal's commercial hosting services such as Enterprise Hosting, you will need a security key in order to use enrollment.
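As a sketch, the key could be supplied with a standard <property> element. The value shown is a placeholder, and where the element is placed (document level or application root) is an assumption.

```xml
<!-- Placeholder value: substitute the security key issued to you. -->
<property name="bevocal.security.key" value="mykey"/>
```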
When you develop on Café, you can use enrollment without a key; however, there are limitations. First, there will be an implied key derived from your Café account number. This means that even if you use the same enrollment grammar name from two different Café accounts, you will not be able to access the same enrolled phrases. Even though the grammar names appear to be the same, they are two separate grammars in two separate namespaces. Second, when you use enrollment in Café without a security key, each grammar is limited to 10 enrolled phrases. Attempting to enroll more than 10 phrases causes an error.noauthorization event to be thrown.
Here is a more complex example of the <bevocal:enroll> tag:
<?xml version="1.0" ?>
<!DOCTYPE vxml PUBLIC "-//BeVocal Inc//VoiceXML 2.0//EN"
"http://cafe.bevocal.com/libraries/dtd/vxml2-0-bevocal.dtd">
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
xmlns:bevocal="http://www.bevocal.com/">
<form id="enroll_names">
<block>
Welcome to the address book demo.
Let's add some names to the address book.
</block>
<!-- This event is thrown when there is a clash with one -->
<!-- of the existing phrases in the enrolled grammar. -->
<catch event="error.enrollment.clash">
Oops! There was a clash for the enrollment sample.
<exit/>
</catch>
<!-- This event is thrown when the minimum number of -->
<!-- consistent utterances are not obtained within maxtries. -->
<catch event="error.enrollment.max_tries">
Maximum tries reached. Please try again.
<exit/>
</catch>
<catch event="error.noauthorization">
Maximum phrases enrolled.
<exit/>
</catch>
<catch event="noinput">
<prompt> In the noinput handler </prompt>
<reprompt/>
</catch>
<!-- Prompts for a phrase to be enrolled. -->
<!-- Executes this item at least twice to get 2 consistent -->
<!-- samples for the phrase; that value is controlled by -->
<!-- minconsistencies. The grammarname and speakeridexpr uniquely -->
<!-- identify an enrollment grammar. The phraseidexpr uniquely -->
<!-- identifies a phrase in enrolled grammar and is returned -->
<!-- when recognized against the enrollment grammar. -->
<bevocal:enroll name="en1"
minconsistencies="2" maxtries="4"
grammarname="ADDRESSBOOK" speakeridexpr="'speaker10'"
phraseidexpr="'tom'" type="audio/wav">
<prompt count="1"> Say a name </prompt>
<prompt count="2"> Say the name again. </prompt>
<prompt count="3"> Please say the name again. </prompt>
<filled>
The enrolled phrase is <value expr="en1"/>
</filled>
</bevocal:enroll>
<bevocal:enroll name="en2"
minconsistencies="2" maxtries="4"
grammarname="ADDRESSBOOK" speakeridexpr="'speaker10'"
phraseidexpr="'jackson'" type="audio/wav">
<prompt count="1"> Say a name </prompt>
<prompt count="2"> Say the name again. </prompt>
<prompt count="3"> Please say the name again. </prompt>
<filled>
The enrolled phrase is <value expr="en2$.enrollAudio"/>
</filled>
</bevocal:enroll>
</form>
</vxml>
Once you have enrolled phrases in a grammar, the next step is to perform speech recognition using that grammar. Currently, this is done by inserting a reference to the enrollment grammar in an ABNF grammar. For example:
<grammar>
<![CDATA[
#ABNF 1.0 en-US;
root $call;
$call = call $<enrolled:/addresses?speaker=1234>
[on [his | her] (cell | home) phone];
]]>
</grammar>
In general, a reference to an enrolled grammar has the following form:
$<enrolled:/grammarname?speaker=speakerid;key=securitykey>
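As a sketch of a reference with every part filled in, the grammar below names the "addresses" grammar, speaker 1234, and a security key. The key value mykey is a placeholder, not a real key.

```xml
<grammar>
<![CDATA[
#ABNF 1.0 en-US;
root $call;
$call = call $<enrolled:/addresses?speaker=1234;key=mykey>;
]]>
</grammar>
```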
When a grammar that refers to an enrolled phrase is matched, a slot whose name is the same as the enrollment grammar name will be filled with the phrase ID of the phrase that was recognized. When the grammar is used as a field grammar and no other slots are defined, this slot value will be used to fill the field's item variable.
As an alternative, you can use the grammar as a form grammar and perform the recognition using mixed initiative. This way, each slot returned by the recognition will be used to fill a field whose name (or slot attribute) matches a slot name in the grammar. This lets you use enrollment in complex grammars where you want to recognize not only an enrolled name but also other actions (for example, an action to perform, a modifier for the action, and so on).
Finally, note that in all other respects, grammars using enrollment behave just like any other grammar. Fields containing an enrollment grammar can also contain other grammars. When such a field is active, grammars in the enclosing form, document, and application are also active unless you have explicitly set the field's modal attribute to true. This gives you the flexibility to enable universal commands during your applications that use enrollment. In the address book example, in a single recognition state all of the following utterances might be valid:
| Utterance | Recognizing Grammar |
The following example performs recognition with the ADDRESSBOOK grammar that was populated in the enrollment example earlier in this chapter. Since the enrollment grammar is used inside a field and defines no other slots, the phrase ID of the recognized enrollment entry is used to fill the field f1.
<?xml version="1.0" ?>
<!DOCTYPE vxml PUBLIC "-//BeVocal Inc//VoiceXML 2.0//EN"
"http://cafe.bevocal.com/libraries/dtd/vxml-bevocal.dtd">
<vxml version="2.0" xmlns:bevocal="http://www.bevocal.com/">
<link next="addressbook.vxml">
<grammar>
#ABNF 1.0;
root $abook;
$abook = address book;
</grammar>
</link>
<form id="recognize_names">
<field name="f1">
Let's recognize the enrolled names.
Say one of the enrolled names
<grammar>
<![CDATA[
#ABNF 1.0;
root $call;
$call= [call] $<enrolled:/ADDRESSBOOK?speaker=speaker10>
[on|at] [his|her|its|else] [home|work|cell] [phone];
]]>
</grammar>
<filled>
Calling <value expr="f1"/>
Thanks for using address book. Good Bye.
</filled>
</field>
</form>
</vxml>
If you use enrollment to maintain a voice address book or other dynamic lookup mechanism, you need to be able to delete phrases from the grammar in addition to adding them. The BeVocal interpreter provides support for this via a JavaScript function that you can use to delete enrolled phrases.
bevocal.enroll.removeEnrolledPhrase(grammar,speakerid,phraseid,key)
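A call to this function might be made from a <script> element, as in the sketch below. The arguments follow the order shown in the signature above; the grammar name, speaker ID, and phrase ID are taken from the enrollment example, while the key value is a placeholder. The function's return value and error behavior are not described here, so the sketch does not rely on them.

```xml
<form id="delete_name">
  <block>
    <!-- Arguments: grammar, speakerid, phraseid, key.
         'mykey' is a placeholder for your security key. -->
    <script>
      bevocal.enroll.removeEnrolledPhrase(
          'ADDRESSBOOK', 'speaker10', 'tom', 'mykey');
    </script>
    <prompt>The entry has been removed.</prompt>
  </block>
</form>
```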
Part No. 520-0004-02 | © 1999-2007, BeVocal, Inc. All rights reserved