A proven yet simple grammar conversion process

Grammar Conversion Process

As old speech recognition engines are being replaced by newer ones, we see more and more organizations having to convert their old grammars to standard formats. Given the right process and set of tools, converting grammars from one engine to another should be a straightforward task with mostly no risk of breaking the associated IVR application.

The issues

There are several issues associated with the conversion of grammars:

  • Syntax. First, there is the syntax of the grammar itself. If we are converting a grammar between two engines that both support GrXML or ABNF, then there’s not much else to say. But if we are converting from Nuance GSL to GrXML or ABNF, that’s a different story. GSL’s operator precedence is very different from ABNF’s, for instance. We have to be careful.
  • Semantics. The second issue is the language used inside the semantic tags. Again, if both engines support SISR, we have nothing to do. But if we convert from GSL to ABNF+SISR, we may have a harder time. For example, SISR does not support the concept of a top-level slot that can be assigned from anywhere in the grammar (using the <slot value> syntax).
  • Pronunciation lexicons. Almost all speech engines use a different format for lexicons. Not to mention that even different versions of the same engine sometimes support different phonetic alphabets.

A proven process

If you follow a rigorous process, the first two issues above can be easily mitigated. Here is one that has proven very effective:

  1. A coverage test set is produced from the original grammar. The test set should ideally ensure that all semantic tags are executed at least once (this is not always sufficient if the semantic tags contain conditional code, but that’s a good starting point.)
  2. The grammar is converted to the new format.
  3. The converted grammar is tested against the coverage test set of the original grammar (and problems are fixed, if any, until all tests pass).
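
The three steps above amount to a regression check: record what the original grammar returns for each test sentence, then verify that the converted grammar returns the same thing. Here is a minimal sketch, assuming a hypothetical interpret(sentence) function that wraps whatever engine or tool parses a sentence and returns its semantic interpretation. Neither that function nor the test set layout comes from any particular product.

```javascript
// Hypothetical coverage test set: each entry pairs a sentence with the
// interpretation produced by the ORIGINAL grammar.
const coverageTests = [
  { sentence: "checking account", expected: { account: "checking" } },
  { sentence: "savings account", expected: { account: "savings" } },
];

// Runs every test through `interpret` (a stand-in for the converted grammar)
// and returns the list of mismatches. An empty list means all tests pass.
function runCoverageTests(tests, interpret) {
  const failures = [];
  for (const { sentence, expected } of tests) {
    const actual = interpret(sentence);
    // Compare serialized values; adequate for flat results with stable key order.
    if (JSON.stringify(actual) !== JSON.stringify(expected)) {
      failures.push({ sentence, expected, actual });
    }
  }
  return failures;
}

// Toy stand-in for the converted grammar, with a deliberate bug on "savings".
function convertedGrammar(sentence) {
  if (sentence === "checking account") return { account: "checking" };
  if (sentence === "savings account") return { account: "saving" };
  return null;
}

const failures = runCoverageTests(coverageTests, convertedGrammar);
console.log(failures.length + " failing test(s)");
```

Each reported failure points at a sentence whose semantic tags need another look in the converted grammar.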

Some tools

Some ASR engines already provide tools to convert grammars from old proprietary formats to the new standard ones. For instance, Nuance ships a tool to automatically convert GSL grammars to GrXML + SISR. It does not support all features of GSL, as some of them have no equivalent in GrXML and SISR. And one of the problems with this converter is that the semantic tags it produces are not easily maintainable.

NuGram IDE also provides some tools to help with the above process. In particular, it offers:

  • Great support for creating and running coverage tests.
  • A sophisticated sentence generation tool. The tags coverage strategy, for instance, is very effective when converting grammars as it helps generate sentences that will cover all semantic tags.
  • Support for all major semantic tags formats (GSL, Nuance OSR extensions, IBM and Microsoft, etc.).
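
To illustrate what a tags coverage strategy aims for, here is a greedy selection of sentences so that every semantic tag is exercised at least once. This is only a sketch of the idea, not NuGram's actual algorithm, and the candidate sentences and tag names are made up.

```javascript
// Each candidate lists the (hypothetical) semantic tags it executes when parsed.
const candidates = [
  { sentence: "yes", tags: ["confirm"] },
  { sentence: "yes please", tags: ["confirm", "polite"] },
  { sentence: "no", tags: ["deny"] },
  { sentence: "no thanks", tags: ["deny", "polite"] },
];

// Greedy cover: repeatedly pick the sentence that exercises the most
// not-yet-covered tags, until no candidate adds anything new.
function selectCoveringSentences(sentences) {
  const covered = new Set();
  const selected = [];
  for (;;) {
    let best = null;
    let bestGain = 0;
    for (const c of sentences) {
      const gain = c.tags.filter((t) => !covered.has(t)).length;
      if (gain > bestGain) {
        best = c;
        bestGain = gain;
      }
    }
    if (best === null) break;
    best.tags.forEach((t) => covered.add(t));
    selected.push(best.sentence);
  }
  return { selected, covered: [...covered] };
}

const coverage = selectCoveringSentences(candidates);
console.log(coverage.selected);
```

As noted earlier, covering every tag once is a starting point only: tags containing conditional code may need several sentences each.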

Of course, to use NuGram effectively, your grammars will need to be converted to ABNF first. No problem! NuGram provides GSL and GrXML to ABNF converters to help you, as well as converters from ABNF to GSL or GrXML. That means all you have to worry about is really the conversion of the semantic tags. In this case, the whole process now becomes:

  1. Grammars are first imported and converted to ABNF.
  2. A coverage test set is produced from the original grammar.
  3. Semantic tags are converted.
  4. The converted grammars are checked for errors by running the coverage tests of the original grammars. In case of errors, they are fixed and all tests are re-run.
  5. The grammars are converted to the desired target format.

What about pronunciation lexicons?

Unfortunately, converting phonetic dictionaries is still a manual and error-prone process, for which there are no good solutions as of this writing. This task really belongs to the tuning process that follows grammar conversion anyway. In most cases, a grammar’s pronunciation lexicon is used to fix incorrect or missing pronunciations in the ASR engine’s own dictionary for very specific words. The phonetic dictionary of the target ASR engine may not have the same limitations or deficiencies. At best, the original grammar’s pronunciation lexicon can serve as inspiration for the creation of the new pronunciation lexicon.

Simple HTML-based IM widget with Phono

The Voxeo team unveiled Phono just a few weeks ago. Phono is an open-source SDK to embed telephones and instant messaging services in your web pages. With Phono, you can (among other things) add click-to-call buttons to your call center or your applications hosted on one of Voxeo’s platforms (and you should soon be able to host your own server-side infrastructure).

To get a feel for how easy it is to embed these services in a web page, I decided to add an IM-like interface and a click-to-call button to a simple prototype application I developed some time ago. The voice application is VoiceXML-based and runs on Voxeo’s Evolution platform, while the IM application runs on IMified and uses grammarserver.com to interpret textual sentences. (Both applications are run by the same code, except for the presentation part. But what the application does is not relevant to what follows.)

In a matter of minutes, I was able to come up with an IM widget that looks like this:

(Please, no comments on the UI. If I was trying to be paid for my artistic skills, I’d be bankrupt by now…)

The code

The HTML code for this page is quite straightforward:

<html>
  <head>
    <title>Phono Demo</title>
    <link type="text/css" rel="stylesheet" href="im.css" />
    <script src="http://code.jquery.com/jquery-1.4.2.min.js"></script>
    <script src="http://s.phono.com/releases/0.1/jquery.phono.js"></script>
    <script src="im.js"></script>
  </head>
  <body>
    <div id="im_box">
      <h3 align=center>SiteWatch Application</h3>
      <div id="dialog" style="">
      </div>
      <input type="text" id="message" onkeydown="processKey(event)" disabled="true">
    </div>
    <button id="click2call" disabled="true" onclick="callApp()">Call Me!</button>
  </body>
</html>

The real meat is in im.js, which I will now describe.

The file starts with the following code:

var PHONO_OPTIONS = {
    apiKey: MY_API_KEY,
    onReady: function() {
        $("#message")[0].disabled = false;
        $("#click2call")[0].disabled = false;
        $("#message").focus();
    },
    messaging: {
        onMessage: function(event) {
            addInteraction('SiteWatch', event.message.body, 'server');
        }
    }
};

var phono = $.phono(PHONO_OPTIONS);

This code takes care of initializing the phono object. Once it is properly initialized and ready, the input area and the click-to-call button are both enabled. If something goes wrong during the initialization, they will remain disabled.

The next section handles the sending of messages to the application:

function processKey(event) {
  if (event.keyCode == 13) {
      sendMessage();
      return false;
  } else {
      return true;
  }
}

function sendMessage() {
    var msg = $('#message').val();
    if (msg == '') return;

    $('#message').val("");
    phono.messaging.send(MY_IM_ADDRESS, msg);
    addInteraction('Me', msg, 'client');
}

The processKey function is called when a key is pressed in the text input area. If the return key is pressed, the message is sent to the remote IM address (here MY_IM_ADDRESS).
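
Note that keyCode is deprecated in current browsers in favor of event.key. Here is a sketch of a check that accepts either, relying only on the standard KeyboardEvent properties. The helper names isEnterKey and processKeyWith are made up for this example.

```javascript
// Returns true when the keyboard event represents the Enter/Return key.
// Prefers the standard `key` property and falls back to the legacy `keyCode`.
function isEnterKey(event) {
  if (typeof event.key === "string") {
    return event.key === "Enter";
  }
  return event.keyCode === 13;
}

// Variant of processKey that takes the send action as a parameter;
// in the page, `send` would be sendMessage.
function processKeyWith(event, send) {
  if (isEnterKey(event)) {
    send();
    return false;
  }
  return true;
}
```

In the page, the handler body would simply become: return processKeyWith(event, sendMessage);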

The function that displays both the outgoing and the incoming messages is:

function addInteraction(who, msg, cls) {
    var dialogDiv = $('#dialog');
    dialogDiv.append($("<div class='msg_" + cls + "'><b>"
                       + who + ":</b> " + msg + "</div>"));
    dialogDiv[0].scrollTop = dialogDiv[0].scrollHeight;
}
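
One caveat: addInteraction inserts who and msg into the page as raw HTML, so any markup typed by the user or sent by the remote party is interpreted by the browser. If that is not wanted, the text can be escaped first. Here is a minimal sketch. The escapeHtml and formatInteraction helpers are made up for this example and are not part of jQuery or Phono.

```javascript
// Replaces the characters that are significant in HTML with entities.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Builds the same markup as addInteraction, with who and msg escaped.
function formatInteraction(who, msg, cls) {
  return "<div class='msg_" + cls + "'><b>" +
         escapeHtml(who) + ":</b> " + escapeHtml(msg) + "</div>";
}

console.log(formatInteraction("Me", "<b>hi</b> & bye", "client"));
```

addInteraction could then pass the result of formatInteraction(who, msg, cls) to dialogDiv.append. Whether server messages should be allowed to carry markup is a design decision, so this is only one option.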

Finally, the code that takes care of the click-to-call button is here:

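// The call in progress, or null when no call is active.
var call = null;

function callApp() {
    if (call == null) {
        // No active call: dial the VoiceXML application.
        call = phono.phone.dial("app:MY_VXML_APP", {
            onAnswer: function() {
                $("#click2call").text("Hangup");
            },
            onHangup: function() {
                call = null;
                $("#click2call").text("Call Me!");
            }
        });
    } else {
        // A call is active: the button acts as a hangup button.
        call.hangup();
    }
}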

Can that be simpler?

Ok, I must admit my first version was not as clean as this one. But it was functionally equivalent. And it was only 50 lines long (HTML and JavaScript included).

Conclusion

The Phono API is simple, yet powerful. The fact that it does not provide a standard UI is a very good thing too. This will let developers innovate and use Phono in very creative ways.

But there’s a single feature that IMHO will drive a lot of innovation: the session ID. Once the Phono object is properly initialized, it has a unique ID that is used as the caller ID when a phone call is made from the web page. And the ID is accessible from the phono object. Which means that the ID can be sent to the application server so both the HTML and the voice applications can share some data and communicate with each other. I’m pretty sure this will open the door to some really cool multi-channel AND multi-modal applications.
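
As a sketch of how that might look, the page could pass the session ID to its own back end once Phono is ready. This assumes the ID is exposed as phono.sessionId. The /link-session endpoint and the sid parameter are hypothetical.

```javascript
// Builds the URL used to tell the application server which Phono session
// (and therefore which caller ID) belongs to this web page.
function buildSessionUrl(baseUrl, sessionId) {
  return baseUrl + "?sid=" + encodeURIComponent(sessionId);
}

// In the page, once onReady has fired (property name is an assumption):
//   $.get(buildSessionUrl("/link-session", phono.sessionId));

console.log(buildSessionUrl("/link-session", "abc 123@example"));
```

The voice application would receive the same value as the caller ID, which gives both sides a shared key for looking up common state.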

Update: Here is the CSS file that I used to create the screen shot above:

#im_box {
    width: 300px;
    border: 1px solid black;
    padding: 0px;
    background-color: #ddd;
}

#dialog {
    height: 300px;
    overflow: auto;
    font-size: 10pt;
    font-family: Arial;
    padding: 5px;
}

#message {
    width: 300px;
    border: 1px solid;
    margin: 1px;
}

h3 {
    border-bottom: 1px solid;
    margin-top: 0px;
    background-color: #eee;
}

.msg_client {
    margin: 5px;
    border: 1px solid;
    padding: 3px;
    background-color: white;
}

.msg_client b {
    color: blue;
}

.msg_server {
    margin: 5px;
    margin-left: 15px;
    border: 1px solid;
    padding: 3px;
    background-color: white;
}

.msg_server b {
    color: red;
}

Update 2 (2010/11/01): The whole source code is available on GitHub.

Putting a VoiceXML gateway behind Asterisk

I’m a big fan of both Asterisk and VoiceXML. Each has its own sweet spot. Asterisk is great for building complete telephony systems (dial plans, conference calls, queues, voicemail, etc.), while VoiceXML is the standard way to develop full-blown telephony applications for large organisations.

But what if you want to bridge the two? There are situations where that would make sense. Consider a company using Asterisk as their front PBX. Now if they want to add a speech-enabled auto-attendant or some other self-service application, they could use a VoiceXML platform to run it instead of coding it in the Asterisk dialplan language. Of course, one could do the same using the Asterisk Gateway Interface (AGI) protocol, but one would then be limited to the capabilities of the Asterisk dialplan language. (For instance, the generic speech recognizer API only returns the matched text of each N-best entry, not the semantic interpretation. This can be ok for some trivial applications, but that’s clearly inadequate for serious speech application development.)

The other day, I decided to test this idea and try using a VoiceXML gateway (Voxeo Prophecy in this case) from behind Asterisk. Here is how I made things work.

Machine setup

My setup consists of a laptop running Ubuntu 9.04 with Asterisk 1.4.21. Since Prophecy is only supported on CentOS and Red Hat Enterprise Linux, I decided to run Prophecy on CentOS 5.5 inside a VMware virtual machine. The guest machine is configured to use a dedicated network between the guest and the host (the Host-only network configuration):

VMware guest network configuration

Asterisk configuration

On the Ubuntu (host) machine, in /etc/asterisk/sip.conf, I added the following entry:

[prophecy]
type=friend
username=prophecy
host=dynamic
canreinvite=yes
insecure=port,invite
qualify=yes
context=proph
auth=prophecy:none@asterisk

In /etc/asterisk/extensions.conf, I created a context proph with a dialplan that redirects all incoming calls to Prophecy:

[proph]
exten => _[A-Za-z].,1,Dial(SIP/prophecy/${EXTEN})
exten => _[A-Za-z].,n,Hangup
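
In Asterisk patterns, the leading _ marks a pattern, [A-Za-z] matches one letter and the trailing . matches one or more remaining characters. This context therefore only forwards extensions that start with a letter, such as application names. The same test expressed as a JavaScript regular expression, purely to illustrate which extensions match (the sample extensions are invented):

```javascript
// Mirrors the Asterisk pattern _[A-Za-z]. :
// one letter followed by at least one more character.
const PROPH_PATTERN = /^[A-Za-z].+$/;

function matchesProphContext(extension) {
  return PROPH_PATTERN.test(extension);
}

console.log(matchesProphContext("my-vxml-app")); // true
console.log(matchesProphContext("1234"));        // false
console.log(matchesProphContext("a"));           // false
```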

Configuring Prophecy

On the guest CentOS machine, in /opt/voxeo/prophecy/config/config.xml, I added the following lines in the VoIPCT category:

<category name="VoIPCT">
 ...
  <category name="Registrations">
    <category name="asterisk">
      <item name="Username">prophecy</item>
      <item name="AuthUsername">prophecy</item>
      <item name="Password">none</item>
      <item name="Domain">192.168.151.1</item>
      <item name="ContactIP">192.168.151.128:5060</item>
      <item name="ExpirationTimeout" type="int">3600</item>
      <item name="Registrar">192.168.151.1</item>
      <item name="ResolveRegistrar" type="int">0</item>
    </category>
  </category>
 ...
</category>

Here, the IP address 192.168.151.128 is the address assigned automatically by VMware to the guest, while 192.168.151.1 is the address of the host.

To call an application, I use SFLphone, an open-source softphone. One particularly appealing feature of this phone is its support for both the SIP and the IAX protocols. It is thus well suited for use with Asterisk.

Voilà! I am now able to make calls to VoiceXML applications from the comfort of my Ubuntu machine using only free/open-source solutions.

Get two NuGram IDE Pro licenses free when you purchase a grammar development course

Learn how to systematically deliver high-quality, high performance grammars by fully leveraging the features and tools available in NuGram IDE. Supported by hands-on exercises and numerous examples, Effective Grammar Development with NuGram IDE provides a breadth of knowledge, best practices, and tips and tricks that have shown their effectiveness at addressing the main challenges of grammar development and at delivering better grammars faster.

And if you order our on-site grammar development course before October 31st, you will get two licenses of NuGram IDE Professional Edition entirely free! There is only one catch: the course must be given before December 31st, 2010. Contact us for details.

Testing an Intervoice InVision app with Voxeo Prophecy

I’ve just started working on a DTMF-only VoiceXML application for one of our customers. The application is developed using Intervoice InVision Studio 3.1 (the native Windows version) and will be deployed on the Intervoice Voice Portal 5 (IVP5). The challenge in this project is three-fold:

  • Development is done on Nu Echo’s premises.
  • Nu Echo does not have IVP5 in its lab.
  • The only way to test the application is to connect to the customer’s network using VPN/pcAnywhere, deploy the application there and test using a local phone number.

Fortunately, except for all the VoiceXML code that handles attached data and transfers to the PBX, everything else can be easily tested on my own machine using only freely available tools.

The VoiceXML platform

InVision Studio is a tool that provides a graphical editor that maps an IVR call-flow to completely static, standards-compliant VoiceXML code (at least that’s the case for the application I have to develop). Once the application successfully passes the validation tests, it can be exported to VoiceXML code that can then be deployed on any web server.

InVision Studio

Since the resulting code does not depend on any proprietary extension, I decided to use Voxeo Prophecy to test it. It comes with a really decent ASR engine as well as a good TTS engine, both only for US English. The application is DTMF-only, so the ASR is not needed in my case, but TTS is handy when you don’t want to record all the application prompts (with InVision Studio, you have to specify a text for every prompt you define).

After installing Prophecy, I had to use Prophecy Commander, the web-based management console, to configure the application and the route to reach the application. The route is used to associate a number to call with the application. In my case, the app is CustomerApp and the route is test-customer-app:

Routing rules in Prophecy Commander

To call the application, I simply use the SIP phone that comes with Prophecy and dial test-customer-app.

Prophecy SIP phone

The Web server

For the web server, I use Yaws. It’s a web server written in Erlang. But it could have been Apache, Tomcat, Jetty, IIS, or any other web server. I chose Yaws mainly because I do some Erlang programming in my spare time and happen to know Yaws a bit better than the alternatives.

I configured Yaws to serve static files on port 8080 from the Runtime directory of my InVision project. So whenever I export the VoiceXML code for the project, I just take the SIP phone and make a call to test the application. The Yaws configuration for the virtual server is:

<server localhost>
        port = 8080
        listen = 0.0.0.0
        docroot = "C:/InvisionProjects/CustomerApp/Runtime"
</server>

Extensive logging

First off, let me say that when it comes to debugging an app, the Prophecy logviewer is of tremendous help. I was at first a bit overwhelmed by the vast quantity of information logged by the various parts that compose Prophecy, but the filtering capabilities make it easy to focus on only a fraction of it. (I have seen the logs of many VoiceXML platforms, and these are certainly among the most comprehensible.)

I’m writing this because I had to use the logviewer the minute I started testing the application interactively. Why don’t I just listen to the prompts? Well, the problem is that the prompt texts are in French, while the TTS is in English. The result is plainly incomprehensible, and trying to figure out where I am in the application is really painful and annoying. So I decided to add VoiceXML log elements extensively in the application, all starting with a very specific pattern: [CustomerApp].

Logging elements in application

It is then very easy to filter the logs based on this pattern and see only the progress of the application:

Prophecy Logviewer
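
The same filtering can be applied offline to an exported log. Here is a small sketch, assuming plain-text log lines. The sample lines are invented and do not reflect Prophecy's actual log format.

```javascript
// Keeps only the lines that contain the application's marker.
function filterAppLog(lines, marker) {
  return lines.filter((line) => line.includes(marker));
}

const sampleLog = [
  "10:00:01 fetching document",
  "10:00:02 [CustomerApp] entering main menu",
  "10:00:05 dtmf received: 2",
  "10:00:05 [CustomerApp] option 2 selected",
];

const appLines = filterAppLog(sampleLog, "[CustomerApp]");
console.log(appLines.join("\n"));
```

Using a literal substring match rather than a regular expression avoids having to escape the square brackets in the marker.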

A final remark

Yes, I could use the debugger that comes with InVision Studio. But frankly, I do not find it very intuitive to use. I prefer making calls and testing the user experience at the same time.