Over our many years working with VoiceXML 2.0/2.1, we at Nu Echo have found a number of annoyances in the specification that we would very much like to see addressed in the upcoming VoiceXML 3.0. None of them is far-fetched or difficult to implement, but fixing them would certainly let us implement some frequent customer requirements more easily and yield much better VUIs in the end.
And fixing these issues does not contradict the new direction that VoiceXML 3.0 is taking.
(I could have titled this post “some complaints about VoiceXML 2.1”, but I decided to give it a more positive spin.)
So here is a first attempt at a VoiceXML wishlist:
- Flag indicating use of DTMF termchar. Many customers ask us to enforce the use of a termination character like ‘#’ at some point in the dialog (when entering a PIN, for example). If we simply use the built-in grammars and specify a termchar property, it is not possible to know from the result whether the key was pressed at all. Of course, the application can use a custom DTMF grammar, but it then has to explicitly strip the term char from the returned DTMF sequence. And on some platforms, custom DTMF grammars require a speech recognition engine to work, so one must be provisioned even for DTMF-only applications (which makes no sense). Note that this feature already exists for the <record> element, whose termchar shadow variable reports the key that ended the recording.
- DTMF nomatch or speech nomatch? When a nomatch event occurs in a form allowing either speech or DTMF input, it is not possible to know whether it results from a wrong DTMF sequence or from speech that did not match any of the active grammars. Such information would lead to better reprompting. For instance, if you activate some grammars for universal commands, hitting the wrong key could lead to a prompt like “Invalid command. Please say …” instead of the more generic (and speech-specific) “I didn’t understand. Please say …”. VoiceXML 2.0 specifies that application.lastresult$ is set in the case of a nomatch (though not a noinput), but the values stored there are platform-dependent.
- DTMF sequence entered on nomatch. When a DTMF nomatch event occurs, the application does not know what DTMF sequence was entered. Having such information would also lead to more precise reprompting. (Some platforms may already provide it through the application.lastresult$ object, but that is highly platform-specific. See the previous point.)
- Mark information on hangup. In VoiceXML 2.1, mark elements can be interspersed with the application prompts so the application can know during which segment the caller barged in. When callers hang up, however, the application cannot know the last mark reached. This information would improve reporting, letting us know whether people actually listen to messages before hanging up.
- Better support for RESTful services in data element. VoiceXML 2.1 only mandates support for GET and POST as the HTTP method in data elements. It should also support the other HTTP verbs (like PUT and DELETE) to enable the integration of VoiceXML applications with RESTful services. (Some platforms, like Voxeo Prophecy, already offer that kind of support.) Also, it would be nice if data returned from web services could be in JSON format instead of XML. (XML is so 2009, right? Just kidding.) VoiceXML interpreters already embed an ECMAScript interpreter, and it would be much more convenient to manipulate a JSON object than an XML object.
- Dynamic array of grammars. In VoiceXML 2.1, it is possible to build a sophisticated list of prompts using the foreach element. It would be handy to be able to do the same with grammars. Of course, one can dynamically generate a base grammar that references those grammars, but experience has shown us that, with certain ASR engines, speech recognition performs differently on such a grammar than on parallel grammars. (This one would certainly be more difficult to specify, as grammars are not part of executable code and can appear at different scopes: document, form, field. So it’s a bit more far-fetched.) The main use case for such a feature is the writing of AJAX-like applications in VoiceXML.
- Barge-in modality. In some cases, it’d be nice to control the modality of barge-in, like allowing barge-in with DTMF but not with speech. It would be as simple as specifying the barge-in as “voice”, “dtmf”, “voice dtmf”, or “none” (instead of only “true” or “false”).
- Flushing the prompt queue. Being able to explicitly flush the prompt queue would really be handy for prompts like “one moment please” <flush>.
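To make the first item concrete, here is a minimal sketch of the stripping workaround it mentions, written in ECMAScript since that is what VoiceXML interpreters embed. It assumes a custom DTMF grammar that accepts the termination key as part of the sequence. The name splitTermChar is hypothetical, not part of any platform API, and the whitespace removal is only an assumption about how a platform might tokenize the returned keys.

```javascript
// Hypothetical helper, not part of any VoiceXML platform API.
// Assumption: a custom DTMF grammar returned the raw key sequence,
// possibly including the termination key as its last character.
function splitTermChar(dtmf, termChar) {
  // Assumption: some platforms may separate keys with spaces ("1 2 3 4 #").
  var keys = String(dtmf).replace(/\s+/g, "");
  var last = keys.charAt(keys.length - 1);
  var terminated = keys.length > 0 && last === termChar;
  return {
    // The sequence without the termination key, if it was pressed.
    digits: terminated ? keys.substring(0, keys.length - 1) : keys,
    // The flag we wish the built-in grammars gave us directly.
    terminated: terminated
  };
}
```

A field’s <filled> handler could call something like splitTermChar(pin, "#") (pin being a hypothetical field name) and reprompt when terminated is false. That is exactly the bookkeeping a built-in flag would make unnecessary.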
That’s it for now. We have a couple more in store, but I wanted to keep the list short.
What do you think? Are there other small issues that you would like to be addressed by the upcoming VoiceXML 3.0?
I would like to thank my colleague Jean-Philippe Gariépy for bringing most of these issues to my attention. He’s the one who has to deal the most with the VoiceXML code that our internal dialog framework generates.