September 2nd, 2010

by Dominique Boucher

A wishlist for VoiceXML 3.0

Over our many years working with VoiceXML 2.0/2.1, we at Nu Echo have found a number of annoyances in the specification that we would very much like to see addressed in the upcoming VoiceXML 3.0. These are not far-fetched, difficult-to-implement things. But they would certainly make it easier for us to implement some frequent customer requirements and would yield much better VUIs in the end.

And fixing these issues does not contradict the new direction that VoiceXML 3.0 is taking.

(I could have entitled this post “some complaints about VoiceXML 2.1”, but I decided to turn it into one with a more positive bias.)

So here is a first attempt at a VoiceXML wishlist:

  1. Flag indicating use of DTMF termchar. Many customers ask us to enforce the use of a termination character like ‘#’ at some point in the dialog (when entering a PIN, for example). If we simply use the built-in grammars and specify a termchar property, it is not possible to know, when we get the result, whether the key was pressed at all. Of course, the application can use a custom DTMF grammar, but it then has to explicitly strip the termchar from the returned DTMF sequence. And custom DTMF grammars sometimes require a speech recognition engine to work, so one must be provisioned even for DTMF-only applications (which is nonsense). Note that this feature already exists for the <record> element.
  2. DTMF nomatch or speech nomatch? When a nomatch event occurs in a form allowing either speech or DTMF input, it is not possible to know whether it results from a wrong DTMF sequence or from speech that did not match any of the active grammars. Such information would lead to better reprompting. For instance, if you activate grammars for universal commands, hitting the wrong key could then lead to a prompt like “Invalid command. Please say …” instead of the more generic (and speech-specific) “I didn’t understand. Please say …”. VoiceXML 2.0 specifies that in the case of a nomatch (and a noinput as well), the value of application.lastresult$ is set, but the values are platform-dependent.
  3. DTMF sequence entered on nomatch. When a DTMF nomatch event occurs, the application does not know what DTMF sequence was entered. Having such information would also lead to more precise reprompting. (Well, some platforms may already provide it through the application.lastresult$ object. But that’s highly platform-specific; see the previous point.)
  4. Mark information on hangup. In VoiceXML 2.1, mark elements can be interspersed with the application prompts so the application can know during which segment the caller barged in. When callers hang up, however, the application cannot know the last mark reached. This information would improve reporting, letting us know whether people actually listen to messages before hanging up.
  5. Better support for RESTful services in the data element. VoiceXML 2.1 only mandates support for GET and POST as the HTTP method in data elements. It should support the other HTTP verbs as well (like PUT and DELETE) to enable the integration of VoiceXML applications with RESTful services. (Some platforms, like Voxeo Prophecy, already offer that kind of support.) Also, it would be nice if data returned from web services could be in JSON format instead of XML. (XML is so 2009, right? Just kidding.) VoiceXML interpreters already embed an ECMAScript interpreter, and it would be much more convenient to manipulate a JSON object than an XML object.
  6. Dynamic array of grammars. In VoiceXML 2.1, it is possible to build a sophisticated list of prompts using the foreach element. It would be handy to be able to do the same with grammars. Of course, one can dynamically generate a base grammar that references those grammars, but experience has shown us that, with certain ASR engines, speech recognition performs differently on such grammars than on parallel grammars. (This one would certainly be more difficult to specify, as grammars are not part of executable code and can appear at different scopes: document, form, field. So it’s a bit more far-fetched.) The main use case for such a feature is the writing of AJAX-like applications in VoiceXML.
  7. Barge-in modality. In some cases, it’d be nice to control the modality of barge-in, for example allowing callers to barge in with DTMF but not with speech. It would be as simple as specifying the barge-in value as “voice”, “dtmf”, “voice dtmf”, or “none” (instead of only “true” or “false”).
  8. Flushing the prompt queue. Being able to explicitly flush the prompt queue would really be handy for prompts like “one moment please” <flush>.
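
To make the first item concrete: with a custom DTMF grammar, the application ends up doing something like the following in ECMAScript. This is only a minimal sketch. The helper name splitTermChar is hypothetical, and it assumes the platform returns the keys as one string without separators (some platforms insert spaces between keys).

```javascript
// Hypothetical helper: split the raw DTMF string returned by a custom grammar
// into the digits and a flag telling whether the termination key was pressed.
// Assumption: keys arrive as one string without separators, e.g. "1234#".
function splitTermChar(rawDtmf, termChar) {
  var terminated = rawDtmf.length > 0 &&
                   rawDtmf.charAt(rawDtmf.length - 1) === termChar;
  return {
    // Drop the trailing termination key only when it is actually present.
    digits: terminated ? rawDtmf.substring(0, rawDtmf.length - 1) : rawDtmf,
    terminated: terminated
  };
}
```

With this, splitTermChar('1234#', '#') yields the digits 1234 with terminated set to true, while the same call on '1234' leaves terminated set to false. A built-in flag would make this kind of code unnecessary.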

That’s it for now. We have a couple more in store, but I wanted to keep the list short.

What do you think? Are there other small issues that you would like to be addressed by the upcoming VoiceXML 3.0?

I would like to thank my colleague Jean-Philippe Gariépy for bringing most of these issues to my attention. He’s the one who has to deal the most with the VoiceXML code that our internal dialog framework generates.

January 25th, 2010

by Dominique Boucher

Voice APIs: back to basics

We definitely live in interesting times. After years of pushing hard on VoiceXML (2.0 and 2.1), the industry comes up regularly with new approaches departing significantly from the newly proposed VoiceXML 3.0. And these approaches sometimes come from companies working hard on the VoiceXML standardization effort.

For instance, last week Voxeo announced a new interface to its Tropo platform, called Tropo WebAPI. To build a communications application, one simply has to write a web service/application producing JSON documents. These documents contain simple instructions for the communications platform like: play this prompt, ask a question, transfer the call, etc. Very simple instructions, indeed. The results are then sent back to the server-side application, which processes them and decides what to do next.

This approach reminds me of TwiML, Twilio’s own markup language for implementing voice applications, and (to a certain extent) FastAGI, the Asterisk way of developing server-side voice applications (the preferred way of deploying applications on the Cloudvox platform).

What do these approaches have in common? Well, they all offer a much simpler programming model than VoiceXML. In VoiceXML, there is the form-filling algorithm which tries to fill slots in a form automatically. VoiceXML applications can also contain a fair amount of scripting (in ECMAScript) with many scoping rules for variables. It also provides some exception mechanisms (with catch and throw elements), a root document for storing data, etc. No wonder most development environments targeting VoiceXML platforms only make use of a limited subset of VoiceXML.

In fact, the new approaches are not programming models; they essentially provide low-level instructions for the various voice platforms, much like a virtual machine. It’s up to the users of the platform to implement their own programming model on top of these instruction sets. And this is a very attractive offer, as it will most certainly ignite the development of new application programming environments and frameworks, some of which will be platform-agnostic.

We lived through a somewhat similar period at the end of the last century. There were many non-interoperable proprietary IVR platforms, and the industry came up with a solution: VoiceXML. Will we see something similar happen with these new approaches? I doubt it. I think that all these approaches are sufficiently similar that a good abstraction layer on the application side can suffice to support them all easily. In the 90s, by contrast, porting an application to a new platform was plainly impossible without a complete rewrite.
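
To illustrate the kind of abstraction layer I have in mind, here is a minimal sketch in ECMAScript. The application describes its prompts once, in a neutral form, and two small renderers turn that description into either an XML document in the style of TwiML (Response and Say elements) or a JSON document. Everything here is an assumption made for illustration: the step format, the function names, and in particular the JSON shape, which is invented and is not Tropo's actual schema.

```javascript
// Escape the characters that are special in XML text content.
function escapeXml(text) {
  return text.replace(/&/g, '&amp;')
             .replace(/</g, '&lt;')
             .replace(/>/g, '&gt;');
}

// Render a list of steps such as [{ say: 'Hello' }] as TwiML-style XML.
function renderXml(steps) {
  var out = '<Response>';
  for (var i = 0; i < steps.length; i++) {
    out += '<Say>' + escapeXml(steps[i].say) + '</Say>';
  }
  return out + '</Response>';
}

// Render the same steps as a JSON document (invented shape, for illustration).
function renderJson(steps) {
  var instructions = [];
  for (var i = 0; i < steps.length; i++) {
    instructions.push({ say: steps[i].say });
  }
  return JSON.stringify({ instructions: instructions });
}
```

The application logic only ever manipulates the neutral list of steps; supporting one more platform means adding one more renderer, not rewriting the application.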

Strangely, the programming languages community went through something similar a few years ago. From around 1997 to the start of the century, the craze for Java almost killed research in object-oriented programming language design not targeting Java or the JVM. Then, in 2003 or so, some leading researchers consciously decided that it was time to start a post-Java era. It’s at about that time that many programming languages started flourishing and that we saw a greater acceptance of dynamic/scripting languages (on the JVM or not). This period also coincided with the rise of Web 2.0 and a new culture of entrepreneurship, thanks to Paul Graham’s Y Combinator.

I think we are living through something similar today in the communications industry, though a few years later. We see young entrepreneurs and new startups with innovative ideas enter the market. By the way, a few of them presented their ideas at StartupCamp Telephony last week, an event sponsored by Twilio and PhoneTag as part of the ITExpo conference.

The years to come promise to be very exciting.

There are many free hosted VoiceXML platforms out there to try out new ideas, prototype applications, etc. I use one of them on a regular basis. Unfortunately, each time I need dynamically generated grammars in my application, I’m stuck. I have to roll my own solution (typically by launching a Web server on my machine, opening a temporary port in our firewall …). Ouch!

All of this is no longer necessary, thanks to our new NuGram Hosted Server, which we launched two weeks ago at SpeechTEK. In this post, I will show how to add dynamic grammars to a standard, VoiceXML 2.1 compliant application. You won’t need to install or deploy any Web server technology. All you’ll need is:

  • Eclipse 3.2 or higher with NuGram IDE installed;
  • an account on grammarserver.com;
  • an account on Evolution Developer Portal to deploy and test the VoiceXML application. (You can use any VoiceXML 2.1 platform, of course, but the example uses some non-standard objects exposed in ECMAScript by the Evolution VoiceXML interpreter.)

The sample application

I will illustrate the whole process of adding dynamic grammars to a VoiceXML application by developing a very simple-minded voice-activated auto-attendant-like application. The application will simply ask for a name and tell you the associated extension number.

Step 1 - Edit your grammar

You first need to create a new file in NuGram IDE to edit the grammar. We’ll call it name.abnf. (The actual name and location of the file in your workspace don’t really matter, as we will be able to choose a different name when publishing it on the grammar server.) The file should have the following content:

#ABNF 1.0 ISO-8859-1;

language en-US;
tag-format <semantics/1.0>;
root $name;

public $name =
  [$pre_filler] $directoryEntry [$post_filler]
  {out.extension = rules.directoryEntry.extension;}
;

$directoryEntry =
  @alt
      @for (entry : entries)
        ( [ @word entry.firstname ]
          @word entry.lastname
          @tag "out.extension = '" entry.extension "';" @end
        )
      @end
  @end
;

$post_filler = please;
$pre_filler  =  I would like to speak with  | can I talk to;

As you can see, this is mainly ABNF with some extensions for the dynamic parts of the grammar.
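
To give a feel for what the dynamic part does, here is a rough ECMAScript rendering of the expansion I would expect for the $directoryEntry rule: one alternative per entry, each with an optional first name, a last name, and a tag setting the extension. It only illustrates the idea. The function name expandDirectoryEntry is made up, and the exact text generated by NuGram Server may well differ.

```javascript
// Illustration only: mimic what the @alt/@for block is expected to expand to.
// Each entry is assumed to have firstname, lastname and extension properties.
function expandDirectoryEntry(entries) {
  var alternatives = [];
  for (var i = 0; i < entries.length; i++) {
    var e = entries[i];
    // Optional first name, mandatory last name, then the semantic tag.
    alternatives.push('( [' + e.firstname + '] ' + e.lastname +
                      " {out.extension = '" + e.extension + "';} )");
  }
  // Alternatives are separated by '|', as in any ABNF rule.
  return '$directoryEntry = ' + alternatives.join(' | ') + ';';
}
```

The point is that the template stays readable as a grammar, while the loop and the string concatenation above are left to the server.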

Step 2 - Publish your grammar

In the ABNF editor, press Alt-Ctrl-Shift-P or right-click in the editor and select the Publish menu item in the contextual menu. This will open a dialog box in which you enter the grammar name on NuGram Server. (Of course, you first need to configure the publishing feature appropriately in the Eclipse Preferences. You’ll need to specify the server address, which is http://www.grammarserver.com:8082, your user name, and password). Since this is an English grammar, we’ll call it en/name.abnf.

That’s it! We are now ready to write our VoiceXML application.

Step 3 - Add the grammar to your VoiceXML application

Dynamic grammars are instantiated by sending instantiation contexts to NuGram Server, together with the name of the grammar. An instantiation context is simply a set of key/value pairs encoded as a JSON object. The context is passed to NuGram Server using a very simple HTTP-based interface. In VoiceXML, we’ll use the data element for this. Once the dynamic grammar is instantiated, the URI of the generated grammar is returned to the VoiceXML application for use in a grammar element.

To simplify the application code, I wrote a few ECMAScript helper functions. You can get them here. They must be put in a file named gsapi.js in the same folder as the VoiceXML application itself. Note that some of these functions rely on global objects provided by the Voxeo VoiceXML interpreter.

Now let’s start writing the VoiceXML document. We must begin with the usual XML header, the root element, and a script element to include the ECMAScript helper functions:

<?xml version="1.0"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2001/vxml
                          http://www.w3.org/TR/voicexml21/vxml.xsd"
      version="2.1">

  <script src="gsapi.js"/>

The next step is to set up the connection with NuGram Server:

  <script>
    var grammarUri = null;
    setupGrammarServer('www.grammarserver.com:8082', 'UserName', 'Password');
  </script>

This only assigns values to a few variables. No magic here. The interesting part follows. We must now create a session on NuGram Server and instantiate the dynamic grammar. We will do this inside a form element:

  <form id="start">
   <block>
    <script>
      initiateSessionCreation();
    </script>
    <data name="createSessionResponse" srcexpr="serverUrl()"
          method="post" namelist="account password operation resource"/>
    <script>
      setupSessionId(createSessionResponse);
    </script>

The first script element sets up a number of variables, while the second one extracts the session ID from the response to the data element.

The instantiation context is then sent to NuGram Server in the same way:

    <script><![CDATA[
      initiateInstantiation('en/name.abnf',
                            {"entries":[{"firstname":"dominique",
                                         "lastname":"boucher",
                                         "extension":"4231"},
                                        {"firstname":"yves",
                                         "lastname":"normandin",
                                         "extension":"4225"}]});

    ]]></script>
    <data name="createGrammarResponse" srcexpr="serverUrl()"
          method="post" namelist="account password operation resource context"/>
    <script>
      grammarUri = getGrammarUri(createGrammarResponse);
    </script>
    <goto next="#ask"/>
   </block>
  </form>

Of course, the context is hard-coded here. In a real application, it would probably be the result of a request to a database or a web service.
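
As a minimal sketch of that idea, the following ECMAScript function builds the instantiation context from a list of records. The record layout (first_name, last_name, ext) and the function name buildContext are hypothetical; they stand in for whatever your database or web service actually returns.

```javascript
// Hypothetical records, as they might come back from a directory lookup.
var rows = [
  { first_name: 'Dominique', last_name: 'Boucher',   ext: 4231 },
  { first_name: 'Yves',      last_name: 'Normandin', ext: 4225 }
];

// Turn the records into the { entries: [...] } object expected by the grammar.
function buildContext(records) {
  var entries = [];
  for (var i = 0; i < records.length; i++) {
    entries.push({
      firstname: records[i].first_name.toLowerCase(),
      lastname:  records[i].last_name.toLowerCase(),
      extension: String(records[i].ext)   // the grammar tag expects a string
    });
  }
  return { entries: entries };
}
```

The result of buildContext(rows) has the same shape as the hard-coded context and can be passed as the second argument to initiateInstantiation.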

The initiateInstantiation function sets a few variables. In particular, the context variable is set to a JSON representation of the second argument to initiateInstantiation. (The Voxeo VoiceXML interpreter provides the JSON object, which can be used to serialize and deserialize JSON strings.)
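
As a rough idea of what such a helper can look like, here is a minimal sketch. It is not the actual gsapi.js code: the function name initiateInstantiationSketch and the 'instantiate' operation value are placeholders of mine. Only the variable names (taken from the namelist attribute of the data element) and the JSON serialization come from the description above.

```javascript
// Globals that the <data> element submits through its namelist attribute.
var operation = null;
var resource = null;
var context = null;

// Sketch of a helper in the spirit of initiateInstantiation.
function initiateInstantiationSketch(grammarName, instantiationContext) {
  operation = 'instantiate';      // placeholder value, not the real protocol
  resource = grammarName;         // e.g. 'en/name.abnf'
  // Serialize the context so it can travel as a plain request parameter.
  context = JSON.stringify(instantiationContext);
}
```

After a call such as initiateInstantiationSketch('en/name.abnf', {"entries": []}), the context variable holds the string {"entries":[]}, ready to be posted by the data element.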

The XML document returned by the data element will contain, upon successful completion, the URI of the generated grammar. The getGrammarUri function simply extracts this URI. We can now use this URI in a grammar element:

  <form id="ask">
    <field name="name">
      <prompt>Please say the name of the person you would like to reach.</prompt>
      <grammar srcexpr="grammarUrl(grammarUri)"/>
      <filled>
       <prompt>
         The extension is
         <value expr="application.lastresult$.interpretation.extension"/>.
       </prompt>
       <goto next="#end"/>
      </filled>
      <catch event="connection.disconnect.hangup">
         <goto next="#end"/>
      </catch>
      <catch event=".">
        Sorry. I did not understand.
        <goto next="#end"/>
      </catch>
    </field>
  </form>

The final step is to release the session on NuGram Server:

  <form id="end">
    <block>
      <script>
       initiateSessionDestroy();
      </script>
      <data name="deleteSessionResponse" srcexpr="serverUrl()"
            method="post" namelist="account password operation resource"/>
      <prompt>Bye Bye!</prompt>
      <disconnect/>
    </block>
  </form>
</vxml>

This is it! Plain VoiceXML 2.1 compliant code, no web application to deploy! You are ready to test the application.

Advantages

The advantages of this approach are manifold. They are explained in more depth in our latest whitepaper, but let me summarize them:

  • No web server to deploy, which means shorter development times;
  • Dynamic grammars can be tested and debugged using the same, very sophisticated IDE used for static grammars;
  • Static grammars can seamlessly evolve to dynamic grammars without sacrificing debugging and tuning capabilities;
  • Generated grammars can be output in various formats (ABNF, GrXML, Nuance GSL). You thus have a technology that is engine-agnostic (NuGram IDE fully supports the most popular semantic interpretation tags, like SISR, Nuance OSR, and Nuance 8.5).

What do you think? Let us know! Our NuGram Beta Program is an opportunity for you to help us enhance our offering and make sure that your needs will be fulfilled.