January 25th, 2010 5 Comments

by Dominique Boucher

Voice APIs: back to basics

We definitely live in interesting times. After years of pushing hard on VoiceXML (2.0 and 2.1), the industry comes up regularly with new approaches departing significantly from the newly proposed VoiceXML 3.0. And these approaches sometimes come from companies working hard on the VoiceXML standardization effort.

For instance, last week Voxeo announced a new interface to its Tropo platform, called Tropo WebAPI. To build a communications application, one has simply to write a web service/application producing JSON documents. These documents contain simple instructions for the communications platform like: play this prompt, ask a question, transfer the call, etc. Very simple instructions, indeed. The results are then sent server-side to the application for further processing and deciding what to do next.

This approach reminds me of TwiML, Twilio’s own markup language for implementing voice applications, and (to a certain extent) FastAGI, the Asterisk way of developing server-side voice applications (the preferred way of deploying applications on the Cloudvox platform).

What do these approaches have in common? Well, they all offer a much simpler programming model than VoiceXML. In VoiceXML, there is the form-filling algorithm which tries to fill slots in a form automatically. VoiceXML applications can also contain a fair amount of scripting (in ECMAScript) with many scoping rules for variables. It also provides some exception mechanisms (with catch and throw elements), a root document for storing data, etc. No wonder most development environments targeting VoiceXML platforms only make use of a limited subset of VoiceXML.

In fact, the new approaches are not programming models, they essentially provide low-level instructions for the various voice platforms. Much like a virtual machine. It’s up to the user of the platform to implement its own programming model on top of these instruction sets. And this is a very attractive offer, as this will most certainly ignite the development of new application programming environments and frameworks, some of which will be platform agnostic.

We lived a somewhat similar period at the end of the last century. There were many non-interoperable proprietary IVR platforms, and the industry came up with a solution: VoiceXML. Will we see something similar happen with these new approaches? I doubt it. I think that all these approaches are sufficiently similar that a good abstraction layer on the application side can suffice to support them all easily. In the 90’s, porting an application to a new platform was plainly impossible without a complete rewrite.

Strangely, the programming languages community lived something similar a few years ago. From around 1997 to the start of the century, the craze for  Java almost killed research in the field of object-oriented programming language design not targeting Java or the JVM. Then, in 2003 or so, some leading researchers decided consciously that it was time to start a post-Java era. And it’s at about that time that many programming languages started flourishing and that we saw a greater acceptance for dynamic/scripting languages (on the JVM or not). This period also coincided with the rise of the Web 2.0 and a new culture of entrepreneurship, thanks to Paul Graham Y Combinator.

I think we are living something similar today in the communications industry, though a few years later. We see young entrepreneurs and new startups with innovative ideas enter the market. By the way, a few of them presented their ideas at StartupCamp Telephony last week, an event sponsored by Twilio and PhoneTag as part of the ITExpo conference.

The years to come promise to be very exciting.

January 12th, 2010 No Comments

by Yves Normandin

Reducing false accepts with decoys

As discussed in a previous post, one of the unfortunate consequences of out-of-grammar utterances is that they can cause many false accepts that may seriously degrade application performance and user experience. In order to illustrate this, let’s use the simple example of a small menu where callers must choose among three options: “validate”, “repeat”, and “cancel”.

We use a test set of 5042 field utterances distributed as follows:

Menu choice Number of utterances Proportion of test corpus
cancel 367 7.28%
repeat 896 17.77%
validate 3478 68.98%
OOG 301 5.97%

As we can see, this is a fairly clean test set with only about 6% of out-of-grammar utterances. As usual, these include background speech, various noises, side conversations, some common OOG utterances (”yes”, “no”, “okay”, “options”, “oh”, etc.), as well as a wide variety of rambling responses of different kinds.

Naturally, since the grammar can only recognize one of the three keywords (and legitimate variants), most of these OOG utterances are misrecognized as one of the keywords. That wouldn’t be a problem if the corresponding confidence scores were low and we could safely reject them, but that’s not always the case. In fact, many of these have a confidence score over 0.9, resulting in damaging false accepts.

An effective way to reduce false accepts is to add decoys to the grammar. For instance, you would normally want to start by adding common OOG responses, on the ground that it’s easier to reject an OOG utterance if you can recognize it correctly. You could also add more “general” decoys, for instance a phoneme loop, to help reject hard to predict OOG utterances. There are more advanced techniques that can be used in order to come up with “optimal” decoys for a given grammar, but I won’t go into them now.

In all cases, it is of course absolutely necessary to evaluate, on a large enough test corpus, the impact of these decoys since they could easily end up reducing recognition accuracy, sometimes significantly. In particular, one should be careful not to add decoys that could be confused with legitimate sentences or keywords.

Let’s illustrate the impact of decoys using the set of field utterances described above. The graph below compares the performance of a grammar without decoys (red curve) to that of the same grammar to which appropriate decoys were added (blue curve).

As can be seen, even for for a fairly clean test corpus with a low OOG rate, the addition of decoys can significantly improve performance. For instance:

  • For a False Accept rate of 0.5%, the Correct Accept rate increases from 95% to over 97.5%, which is equivalent to reducing the error rate by more than 50%.
  • For a Correct Accept rate of 97.5%, the addition of decoys decreases the False Accept rate from 1.5% to around 0.3%. That’s one fifth the False Accept rate for the same Correct Accept rate.

Another interesting observation is the impact of decoys on confidence thresholds. Let’s say we want to have a False Accept Rate of 0.5%. Then, we would need to use a confidence threshold of 0.73 for the grammar without decoys, but only  0.25 for the grammar with decoys. That’s quite a difference! This clearly shows that using “default” threshold values may sometimes produce results that are quite inadequate.

All of this once again demonstrates how important it is to pay close attention to out-of-grammar utterances in a tuning process and how decoys can provide an effective tool for containing the negative impact of such utterances on application performance.

January 5th, 2010 2 Comments

by Dominique Boucher

Unit testing in the IVR world

Last summer, for a demo I gave at SpeechTEK, I wrote a prototype dialog-based application framework in Erlang. The framework features a synchronous API to write dialog applications that could be accessed via instant messaging (using either IMified or XMPP) or the phone (through a VoiceXML gateway). What do I mean by synchronous API? Well, an API giving you the illusion that your program simply has to ask a question using a procedure call, and the result of the call is a representation of the answer from the user of the application.

Too abstract a definition? Look at the following Java code (this is a rough and simplified translation of some Erlang code):

void askPin() {
    Answer answer = dialogController.ask("What is your pin?");
    if (answer instanceOf DTMFAnswer)  {
       dialogController.play("Thanks, your pin is "
                                 + ((DTMAnswer) answer).getDigits();
       dialogController.hangup();
       finish();
    }
    else if (answer instanceOf NoInputAnswer) {
       retryPin();
    }
    else if (answer instanceOf Hangup) {
       finish();
    }
}

The askPin method calls the dialogController.play method to play a TTS string to the caller, waits for an answer, and processes it by either calling the dialogController.play function or the finish function on hangup.

This is essentially what platforms like Tropo, voicephp, and a few others provide to help develop telephony applications. This approach is very interesting for a number of reasons. For instance, it lets us use the abstraction mechanisms we are most familiar with: functions, classes, etc. And we can still use our favorite authoring tool. But more importantly, we don’t have to learn a new programming model, like VoiceXML (although the framework could itself produce VoiceXML in order to be executed on a standard VoiceXML platform, which is the approach taken in my prototype).

Dialog unit testing

An interesting feature of the prototype is its immediate support for dialog unit testing, due to its model-view-controller (MVC) architecture. Unit testing is an great technique for building robust software. (Unfortunately, the idea is not that widespread in the IVR world.)

To illustrate, here is an excerpt from a unit test for the code above:

    dialogController.send(Answer.Next);
    nextInteraction = dialogController.getInteraction(dialog);
    assertPrompts(nextInteraction, new String[]{
        "What is your pin?"
    });
    assertGrammars(nextInteraction, new String[]{ "pin.abnf" });

    dialogController.send(Answer.NoInput);
    nextInteraction = dialogController.getInteraction(dialog);
    assertPrompts(nextInteraction, new String[]{
        "Please answer the question.",
        "What is your pin?"
    });
    assertGrammars(nextInteraction, new String[]{ "pin.abnf" });

    dialogController.send(new DTMFAnswer("123456");
    nextInteraction = dialogController.getInteraction(dialog);
    assertPrompts(nextInteraction, new String[]{
          "Thanks, your pin is 1 2 3 4 5 6"
    });
    assertGrammars(nextInteraction, new String[]{});

In this example, the dialogController.send method call simulates an answer from the caller, while the call to dialogController.getInteraction retrieves the next actions taken by the application. The result of the latter is then checked against the expected action.

At Nu Echo, we are compulsive about tests. So we have developed a practice around dialog unit testing that we try to apply whenever we can. Let me share some of thoughts on the subject.

The “what”

There are a number of questions that arise when we start writing unit tests. The first is obviously: what do we want to test?

In the case of a dialog unit test, we’ll want to test the observable behaviour of the application, regardless of the way the code is organized. For example, we won’t want to test that the code is organized into classes and methods, that the application goes through a state X, etc. Doing so would make the tests more fragile to code reorganization (and we, as developers, do this all the time, right?). In fact, such dialog unit tests make us more confident in the application after refactoring parts of the application.

So we usually test:

  • which prompts are played,
  • which grammars are active,
  • the interaction properties (timeouts, maximum number of n-bests, etc.),
  • the attached data (when possible).

Stubs

Another interesting question is: what do we do with back-end calls (databases, web-services, etc.)? In principle, unit tests should be replicable, and ideally independant of the runtime environment.

Here the answer is simple: we stub everything. We completely simulate the back-end. However, in some cases this can be relatively difficult to do when there are complex relationships between the various pieces of information manipulated by the application.

The value of unit testing

Given a good framework, dialog unit tests should be very easy to write to encourage their development. It can take a few more minutes to code a unit test than to call the application directly. But this cost is soon amortized as we run the test. Each run of the test will take a fraction of a second, much faster than taking the phone. This means we can run hundreds of tests in a matter of seconds.

Moreover, some tests are very hard to replicate, especially when we introduce speech recognition in the equation. If the application has several thresholds for a given question, how do we test each case systematically to make sure the application behaves as intended in the specification? Unit tests are invaluable in this case.

Again, the use of a programming language is very helpful in creating lots of unit tests in very few lines of code. Repeated parts of some tests can be abstracted away in methods/functions used many times. For example, testing that a sequence of DTMF inputs leads to the call being transfered to a given extension with some data attached to it can be encoded in a function. This function can then be called for each path in the menu tree, like this:

  test_path(["1","3","2"],           # DTMF sequence
            "4231",                  # Extension
            {"reason" => "support",  # Data
             "product" => "nugram",
             "language" => "french"})

But the real value of a good unit test suite is that once you have it, you are not afraid anymore of inadvertently introducing a bug when you implement a new request for modification. Of course, it is not a panacea. At one point, you’ll have to take the phone to test some functional aspects of the application (usability, recorded prompts content, etc). But hopefully, you’ll not stumble upon trivial bugs that should have been caught much earlier in the development process by your unit tests.

So let me ask you: how do you test your application? What techniques do you employ?

Ever wanted to get access to an easy to use and affordable platform for your automated IVR tests?

Today, the Nu Echo team is delighted to announce the general availability of the NuBot Platform to the developer community, following a successful beta period that ended on November 30th.

With the NuBot Platform, you can:

  • Get a free copy of the NuBot Integrated Testing Environment (ITE), a powerful Eclipse-based environment for developing test scripts of any complexity, managing tests, and performing extensive analysis of test results.
  • Run tests, small or large, using the NuBot Hosted Service, and only have to pay for your actual use of the service.

While hosted IVR testing services have been available for some time, NuBot is unique in giving away all the tools required in order to be in complete control of the entire testing process. This provides you with the best of both worlds: Complete autonomy and access to an on-demand platform to execute your tests.

To quote Andreas Volmer, Presales Manager EMEA at Voxeo VoiceObjects:

“I found NuBot easy to master and a very powerful addition to my automated testing portfolio. I can only recommend to get your hands on it and try it; it’s about time that we take automated testing more seriously in the IVR application business.”

Sounds interesting? Make sure to visit the product page at http://www.nuecho.com/nubot.

December 9th, 2009 No Comments

by Dominique Boucher

Grammar problem #1 - repeated tokens

It is quite easy to write a speech recognition grammar. After all, it’s only a text file. And with the help of a good editor, we can expect the grammar to be free of syntax errors, i.e. to conform to the SRGS specification (ABNF or XML).

The real challenge is in making sure that the grammar does not contain any “error” from the point-of-view of the set of sentences it accepts, and the semantic values it associates to these sentences. We also have to make sure that it does not over-generate (accept sentences that are not part of the usual spoken language or cannot be uttered in the context of the question asked).

This post is the first in a series that will show the most common types of errors made when developing grammars and how they can be found and fixed using the advanced features of NuGram IDE. The examples I’ll use are all variants of grammars we’ve seen in the course of our grammar developments projects, either internally or for our customers.

Problem 1: Repeated Tokens

We’ll start this series with a very simple one. Suppose I’m editing a long list of ordered tokens. It is very tempting to copy one of the items and paste it as many times as needed and edit the copies. I do this all the time. It’s such a common pattern in text editing (and programming, unfortunately…) Of course, it is very easy to forget editing one of the copies.

For example, let’s say I want to write a number grammar. I’ll start writing something like:

public $r1To9 =
  one {out.number=1} |

and then copy/paste the first item 8 times, replace “one” by the digits “two” to “nine” and do the same for their corresponding semantic value. I’ll get something like:

public $r1To9 =
  one {out.number=1} |
  two {out.number=2} |
  three {out.number=3} |
  four {out.number=4} |
  five {out.number=5} |
  five {out.number=6} |
  seven {out.number=7} |
  eight {out.number=8} |
  nine {out.number=9}
;

Of course, you’ve already seen the error (probably much faster than I have). It’s easy here since you have the offending fragment right before our eyes. But when developing a grammar with many rules, it may not be that obvious. And even carefully reviewing the grammar may not suffice. (How many times do we miss typos in our own texts that another reviewer finds in a matter of seconds?)

So how do we find the problem? In this case, I simply need to build a coverage test set with the sentence generator using the Tags Coverage strategy. The following video shows how to do that:

Of course, in this example, I knew how to proceed to find the problem. In practice, and this will be a recurring idea in the series, a grammar needs to be tested in many different ways. There are many techniques and tools that need to be applied. The Tags Coverage (or the All Paths) generation strategy is often the first we use. It has the advantage of exercising all the semantic actions and finding lots of potential problems very early on in the debugging process.

In my next post in this series, I’ll write about ambiguities, how they affect speech recognition performance, and how to detect and deal with them.