Monthly Archives: January 2010

Voice APIs: back to basics

We definitely live in interesting times. After years of pushing hard on VoiceXML (2.0 and 2.1), the industry regularly comes up with new approaches that depart significantly from the newly proposed VoiceXML 3.0. And these approaches sometimes come from companies working hard on the VoiceXML standardization effort.

For instance, last week Voxeo announced a new interface to its Tropo platform, called Tropo WebAPI. To build a communications application, one simply has to write a web service/application producing JSON documents. These documents contain simple instructions for the communications platform, such as: play this prompt, ask a question, transfer the call, etc. Very simple instructions, indeed. The results are then sent back to the server-side application, which processes them and decides what to do next.
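To make the idea concrete, here is a minimal Python sketch of such a service. The event fields and the instruction names (say, ask, hangup) are invented for illustration; this is not Tropo’s actual JSON schema, only the general shape of the exchange.

```python
import json

def handle_call_event(event):
    """Return a JSON document telling the platform what to do next.

    `event` is a hypothetical dict describing what just happened on the call.
    The instruction vocabulary below is made up for this sketch.
    """
    if event["type"] == "call_started":
        instructions = [
            {"action": "say", "text": "Welcome."},
            {"action": "ask", "text": "What is your PIN?", "name": "pin"},
        ]
    elif event["type"] == "answer" and event["name"] == "pin":
        # The platform posted the caller's answer back to us.
        instructions = [
            {"action": "say", "text": "Thanks, your PIN is " + event["value"]},
            {"action": "hangup"},
        ]
    else:
        # Anything we do not understand ends the call.
        instructions = [{"action": "hangup"}]
    return json.dumps({"instructions": instructions})
```

Each HTTP request from the platform would map to one call of handle_call_event, and the returned string would be the response body.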

This approach reminds me of TwiML, Twilio’s own markup language for implementing voice applications, and (to a certain extent) FastAGI, the Asterisk way of developing server-side voice applications (the preferred way of deploying applications on the Cloudvox platform).

What do these approaches have in common? Well, they all offer a much simpler programming model than VoiceXML. In VoiceXML, there is the form-filling algorithm which tries to fill slots in a form automatically. VoiceXML applications can also contain a fair amount of scripting (in ECMAScript) with many scoping rules for variables. It also provides some exception mechanisms (with catch and throw elements), a root document for storing data, etc. No wonder most development environments targeting VoiceXML platforms only make use of a limited subset of VoiceXML.

In fact, the new approaches are not programming models; they essentially provide low-level instructions for the various voice platforms, much like the instruction set of a virtual machine. It’s up to the users of the platform to implement their own programming model on top of these instruction sets. And this is a very attractive offer, as this will most certainly ignite the development of new application programming environments and frameworks, some of which will be platform agnostic.

We lived through a somewhat similar period at the end of the last century. There were many non-interoperable proprietary IVR platforms, and the industry came up with a solution: VoiceXML. Will we see something similar happen with these new approaches? I doubt it. I think that all these approaches are sufficiently similar that a good abstraction layer on the application side can suffice to support them all easily. In the 1990s, porting an application to a new platform was plainly impossible without a complete rewrite.
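As a rough illustration of what such an abstraction layer could look like, the Python sketch below describes a call as a list of platform-neutral steps and renders it for two back ends. The step tuples and the JSON layout are assumptions made up for this example. The XML output is TwiML-style, limited to the Response, Say and Hangup elements.

```python
import json
from xml.sax.saxutils import escape

# Platform-neutral steps (hypothetical): ("say", text) or ("hangup",).

def to_json(steps):
    """Render steps as a made-up JSON instruction list (not a real schema)."""
    out = []
    for step in steps:
        if step[0] == "say":
            out.append({"say": step[1]})
        elif step[0] == "hangup":
            out.append({"hangup": True})
        else:
            raise ValueError("unsupported step: %r" % (step,))
    return json.dumps(out)

def to_twiml_like(steps):
    """Render the same steps as TwiML-style XML."""
    parts = ["<Response>"]
    for step in steps:
        if step[0] == "say":
            # Escape &, < and > so the prompt text is valid XML content.
            parts.append("<Say>%s</Say>" % escape(step[1]))
        elif step[0] == "hangup":
            parts.append("<Hangup/>")
        else:
            raise ValueError("unsupported step: %r" % (step,))
    parts.append("</Response>")
    return "".join(parts)
```

The application only ever builds the list of steps. Supporting another platform means adding one more rendering function, not rewriting the application.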

Strangely, the programming languages community went through something similar a few years ago. From around 1997 to the start of the century, the craze for Java almost killed research in the field of object-oriented programming language design not targeting Java or the JVM. Then, in 2003 or so, some leading researchers decided consciously that it was time to start a post-Java era. And it’s at about that time that many programming languages started flourishing and that we saw a greater acceptance for dynamic/scripting languages (on the JVM or not). This period also coincided with the rise of Web 2.0 and a new culture of entrepreneurship, thanks to Paul Graham’s Y Combinator.

I think we are living through something similar today in the communications industry, though a few years later. We see young entrepreneurs and new startups with innovative ideas enter the market. By the way, a few of them presented their ideas at StartupCamp Telephony last week, an event sponsored by Twilio and PhoneTag as part of the ITExpo conference.

The years to come promise to be very exciting.

Reducing false accepts with decoys

As discussed in a previous post, one of the unfortunate consequences of out-of-grammar utterances is that they can cause many false accepts that may seriously degrade application performance and user experience. In order to illustrate this, let’s use the simple example of a small menu where callers must choose among three options: “validate”, “repeat”, and “cancel”.

We use a test set of 5042 field utterances distributed as follows:

Menu choice   Number of utterances   Proportion of test corpus
cancel                         367                       7.28%
repeat                         896                      17.77%
validate                      3478                      68.98%
OOG                            301                       5.97%

As we can see, this is a fairly clean test set with only about 6% of out-of-grammar utterances. As usual, these include background speech, various noises, side conversations, some common OOG utterances (“yes”, “no”, “okay”, “options”, “oh”, etc.), as well as a wide variety of rambling responses of different kinds.

Naturally, since the grammar can only recognize one of the three keywords (and legitimate variants), most of these OOG utterances are misrecognized as one of the keywords. That wouldn’t be a problem if the corresponding confidence scores were low and we could safely reject them, but that’s not always the case. In fact, many of these have a confidence score over 0.9, resulting in damaging false accepts.

An effective way to reduce false accepts is to add decoys to the grammar. For instance, you would normally want to start by adding common OOG responses, on the grounds that it’s easier to reject an OOG utterance if you can recognize it correctly. You could also add more “general” decoys, for instance a phoneme loop, to help reject hard-to-predict OOG utterances. There are more advanced techniques that can be used in order to come up with “optimal” decoys for a given grammar, but I won’t go into them now.
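On the application side, the effect of a decoy can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the recognizer is assumed to return a single hypothesis with a confidence score, and the function name, the decoy set and the threshold are made up for the example.

```python
# Common OOG responses added to the grammar as decoys (taken from the
# examples mentioned earlier in this post).
DECOYS = {"yes", "no", "okay", "options", "oh"}

def interpret(hypothesis, confidence, decoys, threshold):
    """Return the accepted keyword, or None when the utterance is rejected."""
    if hypothesis in decoys:
        # The recognizer matched a decoy: treat it as out-of-grammar,
        # however high the confidence score is.
        return None
    if confidence < threshold:
        # Usual confidence-based rejection.
        return None
    return hypothesis
```

Without the decoy, an “okay” would have to be matched against one of the three keywords, possibly with a high score. With the decoy in place, it is recognized for what it is and rejected.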

In all cases, it is of course absolutely necessary to evaluate, on a large enough test corpus, the impact of these decoys since they could easily end up reducing recognition accuracy, sometimes significantly. In particular, one should be careful not to add decoys that could be confused with legitimate sentences or keywords.

Let’s illustrate the impact of decoys using the set of field utterances described above. The graph below compares the performance of a grammar without decoys (red curve) to that of the same grammar to which appropriate decoys were added (blue curve).

As can be seen, even for a fairly clean test corpus with a low OOG rate, the addition of decoys can significantly improve performance. For instance:

  • For a False Accept rate of 0.5%, the Correct Accept rate increases from 95% to over 97.5%, which is equivalent to reducing the error rate by more than 50%.
  • For a Correct Accept rate of 97.5%, the addition of decoys decreases the False Accept rate from 1.5% to around 0.3%. That’s one fifth the False Accept rate for the same Correct Accept rate.
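The arithmetic behind these two statements can be checked with a short Python sketch. It assumes that “error rate” in the first bullet means 100% minus the Correct Accept rate; the function name is just for illustration.

```python
def relative_reduction(before, after):
    """Fraction by which a rate drops when going from `before` to `after`."""
    return (before - after) / before

# First bullet: Correct Accept goes from 95% to 97.5% at 0.5% False Accept,
# so the error rate goes from 5% to 2.5%.
error_before = 100.0 - 95.0
error_after = 100.0 - 97.5
error_cut = relative_reduction(error_before, error_after)  # 0.5: errors are halved

# Second bullet: False Accept goes from 1.5% to about 0.3% at 97.5% Correct Accept.
fa_cut = relative_reduction(1.5, 0.3)  # about 0.8, i.e. one fifth remains
```

Since the Correct Accept rate with decoys is “over 97.5%”, the actual error reduction is slightly more than one half.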

Another interesting observation is the impact of decoys on confidence thresholds. Let’s say we want to have a False Accept rate of 0.5%. Then, we would need to use a confidence threshold of 0.73 for the grammar without decoys, but only 0.25 for the grammar with decoys. That’s quite a difference! This clearly shows that using “default” threshold values may sometimes produce results that are quite inadequate.
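Here is a hedged Python sketch of how such a threshold could be derived from test results. It assumes each test utterance has been reduced to a pair (confidence, is_correct), where is_correct is false for OOG utterances and misrecognitions, and that both rates are computed over the whole test set. The function names and the toy data are made up; this is not the tooling used to produce the figures above.

```python
def rates_at(results, threshold):
    """Return (correct_accept_rate, false_accept_rate) at a given threshold.

    `results` is a list of (confidence, is_correct) pairs.
    """
    total = len(results)
    ca = sum(1 for conf, ok in results if conf >= threshold and ok)
    fa = sum(1 for conf, ok in results if conf >= threshold and not ok)
    return ca / total, fa / total

def lowest_threshold_for(results, max_fa_rate):
    """Smallest observed confidence whose False Accept rate meets the target.

    The False Accept rate can only go down as the threshold goes up, so the
    first candidate that meets the target (in ascending order) is the lowest.
    """
    for threshold in sorted({conf for conf, _ in results}):
        _, fa = rates_at(results, threshold)
        if fa <= max_fa_rate:
            return threshold
    return None  # even the highest observed score is not enough

# Toy data, for illustration only.
TOY_RESULTS = [
    (0.95, True), (0.90, False), (0.80, True), (0.60, True),
    (0.40, False), (0.30, True), (0.20, False), (0.10, False),
]
```

Running the same procedure on the results of two grammars, with and without decoys, is what reveals how far apart the thresholds can be for the same target rate.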

All of this once again demonstrates how important it is to pay close attention to out-of-grammar utterances in a tuning process and how decoys can provide an effective tool for containing the negative impact of such utterances on application performance.

Unit testing in the IVR world

Last summer, for a demo I gave at SpeechTEK, I wrote a prototype dialog-based application framework in Erlang. The framework features a synchronous API to write dialog applications that could be accessed via instant messaging (using either IMified or XMPP) or the phone (through a VoiceXML gateway). What do I mean by synchronous API? Well, an API giving you the illusion that your program simply has to ask a question using a procedure call, and the result of the call is a representation of the answer from the user of the application.

Too abstract a definition? Look at the following Java code (this is a rough and simplified translation of some Erlang code):

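void askPin() {
    // Returns once the caller has answered, stayed silent, or hung up.
    Answer answer = dialogController.ask("What is your pin?");
    if (answer instanceof DTMFAnswer) {
        // Read the digits back to the caller, then end the call.
        dialogController.play("Thanks, your pin is "
                                  + ((DTMFAnswer) answer).getDigits());
        dialogController.hangup();
        finish();
    }
    else if (answer instanceof NoInputAnswer) {
        // No answer from the caller: ask again.
        retryPin();
    }
    else if (answer instanceof Hangup) {
        // The caller hung up: nothing left to do.
        finish();
    }
}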

The askPin method calls the dialogController.ask method to play a TTS prompt to the caller and wait for an answer. It then processes the answer: on DTMF input it calls dialogController.play to read the digits back and hangs up, on no input it calls retryPin, and on hangup it calls finish.

This is essentially what platforms like Tropo, voicephp, and a few others provide to help develop telephony applications. This approach is very interesting for a number of reasons. For instance, it lets us use the abstraction mechanisms we are most familiar with: functions, classes, etc. And we can still use our favorite authoring tool. But more importantly, we don’t have to learn a new programming model, like VoiceXML (although the framework could itself produce VoiceXML in order to be executed on a standard VoiceXML platform, which is the approach taken in my prototype).

Dialog unit testing

An interesting feature of the prototype is its immediate support for dialog unit testing, due to its model-view-controller (MVC) architecture. Unit testing is a great technique for building robust software. (Unfortunately, the idea is not that widespread in the IVR world.)

To illustrate, here is an excerpt from a unit test for the code above:

    dialogController.send(Answer.Next);
    nextInteraction = dialogController.getInteraction(dialog);
    assertPrompts(nextInteraction, new String[]{
        "What is your pin?"
    });
    assertGrammars(nextInteraction, new String[]{ "pin.abnf" });

    dialogController.send(Answer.NoInput);
    nextInteraction = dialogController.getInteraction(dialog);
    assertPrompts(nextInteraction, new String[]{
        "Please answer the question.",
        "What is your pin?"
    });
    assertGrammars(nextInteraction, new String[]{ "pin.abnf" });

    dialogController.send(new DTMFAnswer("123456"));
    nextInteraction = dialogController.getInteraction(dialog);
    assertPrompts(nextInteraction, new String[]{
          "Thanks, your pin is 1 2 3 4 5 6"
    });
    assertGrammars(nextInteraction, new String[]{});

In this example, the dialogController.send method call simulates an answer from the caller, while the call to dialogController.getInteraction retrieves the next actions taken by the application. The result of the latter is then checked against the expected action.
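For readers who want to experiment with the idea, here is a small self-contained Python sketch of the same test. It is not the prototype’s implementation: the dialog is written as a generator to get the synchronous style, answers are simplified to plain strings ("noinput", "hangup", or DTMF digits), and the controller is started by its constructor, not by a first send call. All class and method names are assumptions.

```python
class Interaction:
    """What the application wants the platform to do next."""
    def __init__(self, prompts, grammars):
        self.prompts = prompts
        self.grammars = grammars

def pin_dialog():
    """Dialog in a synchronous style: each yield waits for the caller's answer."""
    prompts = ["What is your pin?"]
    while True:
        answer = yield Interaction(prompts, ["pin.abnf"])
        if answer == "noinput":
            prompts = ["Please answer the question.", "What is your pin?"]
        elif answer == "hangup":
            return
        else:  # DTMF digits: read them back, one by one
            yield Interaction(["Thanks, your pin is " + " ".join(answer)], [])
            return

class DialogController:
    """Test-side controller: feeds simulated answers to the dialog."""
    def __init__(self, dialog):
        self._dialog = dialog
        self._interaction = next(dialog)  # run up to the first question

    def send(self, answer):
        try:
            self._interaction = self._dialog.send(answer)
        except StopIteration:
            self._interaction = None  # the dialog is over

    def get_interaction(self):
        return self._interaction

def test_pin_dialog():
    controller = DialogController(pin_dialog())
    assert controller.get_interaction().prompts == ["What is your pin?"]

    controller.send("noinput")
    interaction = controller.get_interaction()
    assert interaction.prompts == ["Please answer the question.",
                                   "What is your pin?"]
    assert interaction.grammars == ["pin.abnf"]

    controller.send("123456")
    interaction = controller.get_interaction()
    assert interaction.prompts == ["Thanks, your pin is 1 2 3 4 5 6"]
    assert interaction.grammars == []
```

No telephony platform is involved: the test only observes prompts and grammars, which is exactly the observable behaviour of the dialog.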

At Nu Echo, we are compulsive about tests. So we have developed a practice around dialog unit testing that we try to apply whenever we can. Let me share some of our thoughts on the subject.

The “what”

There are a number of questions that arise when we start writing unit tests. The first is obviously: what do we want to test?

In the case of a dialog unit test, we’ll want to test the observable behaviour of the application, regardless of the way the code is organized. For example, we won’t want to test that the code is organized into classes and methods, that the application goes through a state X, etc. Doing so would make the tests more fragile to code reorganization (and we, as developers, do this all the time, right?). In fact, such dialog unit tests make us more confident in the application after refactoring parts of the application.

So we usually test:

  • which prompts are played,
  • which grammars are active,
  • the interaction properties (timeouts, maximum number of n-bests, etc.),
  • the attached data (when possible).

Stubs

Another interesting question is: what do we do with back-end calls (databases, web services, etc.)? In principle, unit tests should be repeatable, and ideally independent of the runtime environment.

Here the answer is simple: we stub everything. We completely simulate the back-end. However, in some cases this can be relatively difficult to do when there are complex relationships between the various pieces of information manipulated by the application.
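A stub can be as simple as the following Python sketch. The account service, its get_balance method and the prompt wording are all hypothetical; the point is only that the dialog logic receives the back end as a parameter, so a test can hand it canned data.

```python
class StubAccountService:
    """Stands in for a real back end; returns canned data only."""
    def __init__(self, balances):
        self._balances = balances
        self.calls = []  # lets the test check how the back end was used

    def get_balance(self, account_id):
        self.calls.append(account_id)
        return self._balances[account_id]

def balance_prompt(service, account_id):
    """Dialog logic under test: builds a prompt from back-end data."""
    balance = service.get_balance(account_id)
    return "Your balance is %d dollars." % balance
```

In a test, balance_prompt(StubAccountService({"42": 100}), "42") always produces the same prompt, whatever the state of the real database.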

The value of unit testing

Given a good framework, dialog unit tests should be very easy to write, which encourages developers to write them. It can take a few more minutes to code a unit test than to call the application directly, but this cost is soon amortized as we run the test again and again. Each run takes a fraction of a second, much faster than picking up the phone. This means we can run hundreds of tests in a matter of seconds.

Moreover, some tests are very hard to replicate, especially when we introduce speech recognition into the equation. If the application has several thresholds for a given question, how do we test each case systematically to make sure the application behaves as intended in the specification? Unit tests are invaluable in this case.
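As an illustration, here is a table-driven Python sketch for a question with two thresholds. The threshold values (0.3 and 0.7), the three outcomes and the function names are assumptions chosen for the example, not values from a real application.

```python
def next_action(confidence, reject_below=0.3, confirm_below=0.7):
    """Decide what the dialog does with a recognition result."""
    if confidence < reject_below:
        return "reject"    # re-prompt the caller
    if confidence < confirm_below:
        return "confirm"   # ask the caller to confirm
    return "accept"        # move on without confirmation

# One case on each side of every threshold, plus the extremes.
CASES = [
    (0.0, "reject"), (0.29, "reject"),
    (0.3, "confirm"), (0.69, "confirm"),
    (0.7, "accept"), (1.0, "accept"),
]

def run_threshold_cases():
    for confidence, expected in CASES:
        actual = next_action(confidence)
        assert actual == expected, (confidence, actual, expected)
```

Simulating a recognition result with a confidence of exactly 0.3 takes one line in a test. Producing it reliably over the phone is next to impossible.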

Again, the use of a programming language is very helpful in creating lots of unit tests in very few lines of code. Repeated parts of some tests can be abstracted away in methods/functions used many times. For example, testing that a sequence of DTMF inputs leads to the call being transferred to a given extension with some data attached to it can be encoded in a function. This function can then be called for each path in the menu tree, like this:

  test_path(["1","3","2"],           # DTMF sequence
            "4231",                  # Extension
            {"reason" => "support",  # Data
             "product" => "nugram",
             "language" => "french"})

But the real value of a good unit test suite is that once you have it, you are no longer afraid of inadvertently introducing a bug when you implement a change request. Of course, it is not a panacea. At some point, you’ll have to pick up the phone to test some functional aspects of the application (usability, recorded prompt content, etc.). But hopefully, you won’t stumble upon trivial bugs that should have been caught much earlier in the development process by your unit tests.

So let me ask you: how do you test your application? What techniques do you employ?