January 12th, 2010

by Yves Normandin

Reducing false accepts with decoys

As discussed in a previous post, one of the unfortunate consequences of out-of-grammar utterances is that they can cause many false accepts that may seriously degrade application performance and user experience. In order to illustrate this, let’s use the simple example of a small menu where callers must choose among three options: “validate”, “repeat”, and “cancel”.

We use a test set of 5042 field utterances distributed as follows:

Menu choice    Number of utterances    Proportion of test corpus
cancel                          367                        7.28%
repeat                          896                       17.77%
validate                       3478                       68.98%
OOG                             301                        5.97%

As we can see, this is a fairly clean test set with only about 6% of out-of-grammar utterances. As usual, these include background speech, various noises, side conversations, some common OOG utterances (“yes”, “no”, “okay”, “options”, “oh”, etc.), as well as a wide variety of rambling responses of different kinds.

Naturally, since the grammar can only recognize one of the three keywords (and legitimate variants), most of these OOG utterances are misrecognized as one of the keywords. That wouldn’t be a problem if the corresponding confidence scores were low and we could safely reject them, but that’s not always the case. In fact, many of these have a confidence score over 0.9, resulting in damaging false accepts.

An effective way to reduce false accepts is to add decoys to the grammar. For instance, you would normally want to start by adding common OOG responses, on the ground that it’s easier to reject an OOG utterance if you can recognize it correctly. You could also add more “general” decoys, for instance a phoneme loop, to help reject hard to predict OOG utterances. There are more advanced techniques that can be used in order to come up with “optimal” decoys for a given grammar, but I won’t go into them now.
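To make the idea concrete, here is a minimal sketch of how an application might treat decoy matches once the recognizer returns a result. It assumes the recognizer reports the matched text and a confidence score; the DecoyFilter class, its methods, and the threshold value are hypothetical names chosen for illustration, not part of any particular platform.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical post-recognition check: a result that matches a decoy is always rejected. */
final class DecoyFilter {
    private final Set<String> decoys = new HashSet<String>();
    private final double threshold;

    DecoyFilter(double threshold, String... decoyPhrases) {
        this.threshold = threshold;
        for (String phrase : decoyPhrases) {
            decoys.add(normalize(phrase));
        }
    }

    private static String normalize(String text) {
        return text.trim().toLowerCase(Locale.ROOT);
    }

    /** Accepts only a non-decoy result whose confidence reaches the threshold. */
    boolean accept(String recognizedText, double confidence) {
        if (decoys.contains(normalize(recognizedText))) {
            return false; // recognized correctly as OOG, so reject regardless of confidence
        }
        return confidence >= threshold;
    }
}
```

With decoys “yes”, “no”, and “okay”, a caller who says “yes” is rejected even at a confidence of 0.95, while “validate” is still accepted whenever its confidence reaches the threshold.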

In all cases, it is of course absolutely necessary to evaluate, on a large enough test corpus, the impact of these decoys since they could easily end up reducing recognition accuracy, sometimes significantly. In particular, one should be careful not to add decoys that could be confused with legitimate sentences or keywords.

Let’s illustrate the impact of decoys using the set of field utterances described above. The graph below compares the performance of a grammar without decoys (red curve) to that of the same grammar to which appropriate decoys were added (blue curve).

As can be seen, even for a fairly clean test corpus with a low OOG rate, the addition of decoys can significantly improve performance. For instance:

  • For a False Accept rate of 0.5%, the Correct Accept rate increases from 95% to over 97.5%, which is equivalent to reducing the error rate by more than 50%.
  • For a Correct Accept rate of 97.5%, the addition of decoys decreases the False Accept rate from 1.5% to around 0.3%. That’s one fifth the False Accept rate for the same Correct Accept rate.

Another interesting observation is the impact of decoys on confidence thresholds. Let’s say we want to have a False Accept rate of 0.5%. Then, we would need to use a confidence threshold of 0.73 for the grammar without decoys, but only 0.25 for the grammar with decoys. That’s quite a difference! This clearly shows that using “default” threshold values may sometimes produce results that are quite inadequate.
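Finding the threshold for a target False Accept rate can be automated once every test utterance has a confidence score and a correct/incorrect label. The sketch below is only an illustration under stated assumptions: an utterance is accepted when its confidence is at or above the threshold, a false accept is any accepted utterance that is OOG or misrecognized, and both rates are taken over all utterances in the test set. The ThresholdTuner class and its method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper that picks a confidence threshold from scored test utterances. */
final class ThresholdTuner {

    /** One test utterance; 'correct' is false for OOG or misrecognized utterances. */
    static final class Scored {
        final double confidence;
        final boolean correct;

        Scored(double confidence, boolean correct) {
            this.confidence = confidence;
            this.correct = correct;
        }
    }

    /** Fraction of all utterances that are accepted at this threshold and have the given label. */
    static double acceptRate(Scored[] data, double threshold, boolean correct) {
        if (data.length == 0) {
            return 0.0;
        }
        int count = 0;
        for (Scored s : data) {
            if (s.confidence >= threshold && s.correct == correct) {
                count++;
            }
        }
        return (double) count / data.length;
    }

    /**
     * Lowest observed confidence that keeps the false accept rate at or below the target.
     * Returns positive infinity when no observed score qualifies (reject everything).
     */
    static double lowestThreshold(Scored[] data, double maxFalseAcceptRate) {
        double[] candidates = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            candidates[i] = data[i].confidence;
        }
        Arrays.sort(candidates); // ascending, so the first qualifying score is the lowest
        for (double t : candidates) {
            if (acceptRate(data, t, false) <= maxFalseAcceptRate) {
                return t;
            }
        }
        return Double.POSITIVE_INFINITY;
    }
}
```

Calling acceptRate(data, threshold, true) at the returned threshold gives the corresponding Correct Accept rate, so running the same code on the results for both grammars reproduces the kind of comparison described above. The quadratic scan is kept for clarity and is fast enough for a few thousand utterances.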

All of this once again demonstrates how important it is to pay close attention to out-of-grammar utterances in a tuning process and how decoys can provide an effective tool for containing the negative impact of such utterances on application performance.

January 5th, 2010

by Dominique Boucher

Unit testing in the IVR world

Last summer, for a demo I gave at SpeechTEK, I wrote a prototype dialog-based application framework in Erlang. The framework features a synchronous API to write dialog applications that could be accessed via instant messaging (using either IMified or XMPP) or the phone (through a VoiceXML gateway). What do I mean by synchronous API? Well, an API giving you the illusion that your program simply has to ask a question using a procedure call, and the result of the call is a representation of the answer from the user of the application.

Too abstract a definition? Look at the following Java code (this is a rough and simplified translation of some Erlang code):

void askPin() {
    // Ask the question and block until the caller's answer is available.
    Answer answer = dialogController.ask("What is your pin?");
    if (answer instanceof DTMFAnswer) {
       // The caller keyed in digits: read them back, then end the call.
       dialogController.play("Thanks, your pin is "
                                 + ((DTMFAnswer) answer).getDigits());
       dialogController.hangup();
       finish();
    }
    else if (answer instanceof NoInputAnswer) {
       retryPin();
    }
    else if (answer instanceof Hangup) {
       finish();
    }
}

The askPin method calls the dialogController.ask method to play a TTS prompt to the caller and wait for an answer. It then processes the answer: on DTMF input it calls dialogController.play to read the digits back and hangs up, on no input it calls retryPin, and on hangup it calls finish.

This is essentially what platforms like Tropo, voicephp, and a few others provide to help develop telephony applications. This approach is very interesting for a number of reasons. For instance, it lets us use the abstraction mechanisms we are most familiar with: functions, classes, etc. And we can still use our favorite authoring tool. But more importantly, we don’t have to learn a new programming model, like VoiceXML (although the framework could itself produce VoiceXML in order to be executed on a standard VoiceXML platform, which is the approach taken in my prototype).

Dialog unit testing

An interesting feature of the prototype is its immediate support for dialog unit testing, due to its model-view-controller (MVC) architecture. Unit testing is a great technique for building robust software. (Unfortunately, the idea is not that widespread in the IVR world.)

To illustrate, here is an excerpt from a unit test for the code above:

    dialogController.send(Answer.Next);
    nextInteraction = dialogController.getInteraction(dialog);
    assertPrompts(nextInteraction, new String[]{
        "What is your pin?"
    });
    assertGrammars(nextInteraction, new String[]{ "pin.abnf" });

    dialogController.send(Answer.NoInput);
    nextInteraction = dialogController.getInteraction(dialog);
    assertPrompts(nextInteraction, new String[]{
        "Please answer the question.",
        "What is your pin?"
    });
    assertGrammars(nextInteraction, new String[]{ "pin.abnf" });

    dialogController.send(new DTMFAnswer("123456"));
    nextInteraction = dialogController.getInteraction(dialog);
    assertPrompts(nextInteraction, new String[]{
          "Thanks, your pin is 1 2 3 4 5 6"
    });
    assertGrammars(nextInteraction, new String[]{});

In this example, the dialogController.send method call simulates an answer from the caller, while the call to dialogController.getInteraction retrieves the next actions taken by the application. The result of the latter is then checked against the expected action.

At Nu Echo, we are compulsive about tests. So we have developed a practice around dialog unit testing that we try to apply whenever we can. Let me share some of my thoughts on the subject.

The “what”

There are a number of questions that arise when we start writing unit tests. The first is obviously: what do we want to test?

In the case of a dialog unit test, we’ll want to test the observable behaviour of the application, regardless of the way the code is organized. For example, we won’t want to test that the code is organized into classes and methods, that the application goes through a state X, etc. Doing so would make the tests more fragile to code reorganization (and we, as developers, do this all the time, right?). In fact, such dialog unit tests make us more confident in the application after refactoring parts of the application.

So we usually test:

  • which prompts are played,
  • which grammars are active,
  • the interaction properties (timeouts, maximum number of n-bests, etc.),
  • the attached data (when possible).
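The checks listed above only look at observable output, which keeps the assertion helpers small. The post does not show how assertPrompts and assertGrammars are implemented, so the following is just one possible sketch; the DialogAssert and Interaction classes are hypothetical and merely mirror the names used in the test excerpt.

```java
import java.util.Arrays;

/** Hypothetical assertion helpers for dialog unit tests. */
final class DialogAssert {

    /** What the application asked the platform to do on one dialog turn. */
    static final class Interaction {
        final String[] prompts;
        final String[] grammars;

        Interaction(String[] prompts, String[] grammars) {
            this.prompts = prompts.clone();
            this.grammars = grammars.clone();
        }
    }

    static void assertPrompts(Interaction actual, String[] expected) {
        assertSequence("prompts", actual.prompts, expected);
    }

    static void assertGrammars(Interaction actual, String[] expected) {
        assertSequence("grammars", actual.grammars, expected);
    }

    /** Fails with a readable message when the two sequences differ in content or order. */
    private static void assertSequence(String what, String[] actual, String[] expected) {
        if (!Arrays.equals(actual, expected)) {
            throw new AssertionError(what + ": expected " + Arrays.toString(expected)
                    + " but got " + Arrays.toString(actual));
        }
    }
}
```

Interaction properties such as timeouts could be added as further fields and checked the same way. Nothing in these helpers depends on how the dialog code is organized internally, which is what lets the tests survive refactoring.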

Stubs

Another interesting question is: what do we do with back-end calls (databases, web-services, etc.)? In principle, unit tests should be replicable, and ideally independent of the runtime environment.

Here the answer is simple: we stub everything. We completely simulate the back-end. However, in some cases this can be relatively difficult to do when there are complex relationships between the various pieces of information manipulated by the application.
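As an illustration of this kind of stubbing, the sketch below hides a back-end lookup behind an interface and gives the tests an in-memory implementation. The PinDirectory interface, the stub class, and the account data are all hypothetical; a real application would have its own service interfaces.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical back-end abstraction with an in-memory stub for unit tests. */
final class BackEndStubs {

    /** The only view of the back-end that the dialog code is allowed to see. */
    interface PinDirectory {
        boolean isValidPin(String accountId, String pin);
    }

    /** Test double: answers from a map instead of calling a database or web service. */
    static final class StubPinDirectory implements PinDirectory {
        private final Map<String, String> pinsByAccount = new HashMap<String, String>();

        /** Registers canned data; returns this so that calls can be chained in a test. */
        StubPinDirectory with(String accountId, String pin) {
            pinsByAccount.put(accountId, pin);
            return this;
        }

        @Override
        public boolean isValidPin(String accountId, String pin) {
            return pin != null && pin.equals(pinsByAccount.get(accountId));
        }
    }
}
```

Because the dialog code only depends on the interface, each test can set up exactly the data it needs, and the result no longer depends on the state of a shared environment.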

The value of unit testing

Given a good framework, dialog unit tests should be very easy to write, which encourages developers to write them. It can take a few more minutes to code a unit test than to call the application directly. But this cost is soon amortized as we run the test. Each run of the test takes a fraction of a second, much faster than picking up the phone. This means we can run hundreds of tests in a matter of seconds.

Moreover, some tests are very hard to replicate, especially when we introduce speech recognition in the equation. If the application has several thresholds for a given question, how do we test each case systematically to make sure the application behaves as intended in the specification? Unit tests are invaluable in this case.
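For example, suppose a question uses two thresholds that split confidence scores into reject, confirm, and accept bands. That three-way model is an assumption made for this sketch, and the ConfidencePolicy class is hypothetical, but it shows why such logic is easy to cover with unit tests and nearly impossible to cover by speaking into a phone.

```java
/** Hypothetical two-threshold policy for deciding what to do with a recognition result. */
final class ConfidencePolicy {

    enum Action { REJECT, CONFIRM, ACCEPT }

    private final double lowThreshold;
    private final double highThreshold;

    ConfidencePolicy(double lowThreshold, double highThreshold) {
        if (lowThreshold > highThreshold) {
            throw new IllegalArgumentException("low threshold must not exceed high threshold");
        }
        this.lowThreshold = lowThreshold;
        this.highThreshold = highThreshold;
    }

    /** Below low: reject. From low up to (but excluding) high: confirm. At or above high: accept. */
    Action decide(double confidence) {
        if (confidence < lowThreshold) {
            return Action.REJECT;
        }
        if (confidence < highThreshold) {
            return Action.CONFIRM;
        }
        return Action.ACCEPT;
    }
}
```

A unit test can feed scores just below and exactly at each boundary (say 0.29, 0.30, 0.79, and 0.80 for thresholds of 0.3 and 0.8) and check each outcome against the specification.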

Again, the use of a programming language is very helpful in creating lots of unit tests in very few lines of code. Repeated parts of some tests can be abstracted away in methods/functions used many times. For example, testing that a sequence of DTMF inputs leads to the call being transferred to a given extension with some data attached to it can be encoded in a function. This function can then be called for each path in the menu tree, like this:

  test_path(["1","3","2"],           # DTMF sequence
            "4231",                  # Extension
            {"reason" => "support",  # Data
             "product" => "nugram",
             "language" => "french"})

But the real value of a good unit test suite is that once you have it, you are no longer afraid of inadvertently introducing a bug when you implement a change request. Of course, it is not a panacea. At some point, you’ll have to pick up the phone to test some functional aspects of the application (usability, recorded prompt content, etc.). But hopefully, you won’t stumble upon trivial bugs that should have been caught much earlier in the development process by your unit tests.

So let me ask you: how do you test your application? What techniques do you employ?

Ever wanted to get access to an easy to use and affordable platform for your automated IVR tests?

Today, the Nu Echo team is delighted to announce the general availability of the NuBot Platform to the developer community, following a successful beta period that ended on November 30th.

With the NuBot Platform, you can:

  • Get a free copy of the NuBot Integrated Testing Environment (ITE), a powerful Eclipse-based environment for developing test scripts of any complexity, managing tests, and performing extensive analysis of test results.
  • Run tests, small or large, using the NuBot Hosted Service, and only have to pay for your actual use of the service.

While hosted IVR testing services have been available for some time, NuBot is unique in giving away all the tools required in order to be in complete control of the entire testing process. This provides you with the best of both worlds: Complete autonomy and access to an on-demand platform to execute your tests.

To quote Andreas Volmer, Presales Manager EMEA at Voxeo VoiceObjects:

“I found NuBot easy to master and a very powerful addition to my automated testing portfolio. I can only recommend to get your hands on it and try it; it’s about time that we take automated testing more seriously in the IVR application business.”

Sounds interesting? Make sure to visit the product page at http://www.nuecho.com/nubot.

December 9th, 2009

by Dominique Boucher

Grammar problem #1 - repeated tokens

It is quite easy to write a speech recognition grammar. After all, it’s only a text file. And with the help of a good editor, we can expect the grammar to be free of syntax errors, i.e. to conform to the SRGS specification (ABNF or XML).

The real challenge is in making sure that the grammar does not contain any “error” from the point-of-view of the set of sentences it accepts, and the semantic values it associates to these sentences. We also have to make sure that it does not over-generate (accept sentences that are not part of the usual spoken language or cannot be uttered in the context of the question asked).

This post is the first in a series that will show the most common types of errors made when developing grammars and how they can be found and fixed using the advanced features of NuGram IDE. The examples I’ll use are all variants of grammars we’ve seen in the course of our grammar development projects, either internally or for our customers.

Problem 1: Repeated Tokens

We’ll start this series with a very simple one. Suppose I’m editing a long list of ordered tokens. It is very tempting to copy one of the items, paste it as many times as needed, and edit the copies. I do this all the time. It’s such a common pattern in text editing (and programming, unfortunately…). Of course, it is very easy to forget to edit one of the copies.

For example, let’s say I want to write a number grammar. I’ll start writing something like:

public $r1To9 =
  one {out.number=1} |

and then copy/paste the first item 8 times, replace “one” by the digits “two” to “nine” and do the same for their corresponding semantic value. I’ll get something like:

public $r1To9 =
  one {out.number=1} |
  two {out.number=2} |
  three {out.number=3} |
  four {out.number=4} |
  five {out.number=5} |
  five {out.number=6} |
  seven {out.number=7} |
  eight {out.number=8} |
  nine {out.number=9}
;

Of course, you’ve already seen the error (probably much faster than I have). It’s easy here since you have the offending fragment right before your eyes. But when developing a grammar with many rules, it may not be that obvious. And even carefully reviewing the grammar may not suffice. (How many times do we miss typos in our own texts that another reviewer finds in a matter of seconds?)
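This particular mistake can also be caught by a very crude textual check, independent of any grammar tool. The sketch below is not how NuGram IDE works; it is a hypothetical helper that assumes a rule body is a flat list of alternatives separated by “|”, each with an optional tag in braces, and that tags contain no “|” character.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical lint check: finds phrases that appear in more than one alternative. */
final class RepeatedTokenCheck {

    /** Returns the repeated phrases of a flat rule body, in the order they first appear. */
    static List<String> repeatedPhrases(String ruleBody) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String alternative : ruleBody.split("\\|")) {
            // Keep only the spoken part, i.e. the text before the semantic tag.
            int brace = alternative.indexOf('{');
            String phrase = (brace >= 0 ? alternative.substring(0, brace) : alternative).trim();
            if (phrase.isEmpty()) {
                continue;
            }
            Integer seen = counts.get(phrase);
            counts.put(phrase, seen == null ? 1 : seen + 1);
        }
        List<String> repeated = new ArrayList<String>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                repeated.add(entry.getKey());
            }
        }
        return repeated;
    }
}
```

Applied to the body of the rule above, this reports “five”. It says nothing about missing values or wrong tags, which is why a generated coverage test set remains the more general tool.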

So how do we find the problem? In this case, I simply need to build a coverage test set with the sentence generator using the Tags Coverage strategy. The following video shows how to do that:

Of course, in this example, I knew how to proceed to find the problem. In practice, and this will be a recurring idea in the series, a grammar needs to be tested in many different ways. There are many techniques and tools that need to be applied. The Tags Coverage (or the All Paths) generation strategy is often the first we use. It has the advantage of exercising all the semantic actions and finding lots of potential problems very early on in the debugging process.

In my next post in this series, I’ll write about ambiguities, how they affect speech recognition performance, and how to detect and deal with them.

December 2nd, 2009

by Yves Normandin

The out-of-grammar challenge

The life of speech application developers would be so much simpler if callers were kind enough to only say things that are covered by the grammars. Unfortunately, because life was never meant to be simple, we will always have to deal with people who:

  • use all kinds of creative sentence constructions
  • stutter, correct themselves, or repeat portions of their utterance
  • find it impossible to just answer the question
  • have side conversations
  • don’t listen to the prompts
  • fumble while they look for the requested information
  • express their displeasure in a colorful way
  • say something that makes no sense
  • etc., etc., etc.

Then, of course, there are all those utterances truncated by the endpointer, all those false barge-ins caused by noise, etc.

All of this explains why so many applications that work so well in demos actually perform so poorly in the field. There’s no avoiding that we have to build applications that real people can use and, unfortunately, real people quite often don’t behave the way we would like them to. And that’s OK. It’s our job to make sure that as many callers as possible get the best possible user experience.

The out-of-grammar impact on tuning

Many of the biggest tuning challenges relate to “out-of-grammar” utterances (see previous post for a discussion of the different meanings of “out-of-grammar”), which mostly fall into two categories:

  1. Valid utterances — These are perfectly understandable utterances that provide the information that is expected by the application but which, for one reason or another, are not covered (i.e., can’t be parsed) by the grammar.
  2. Invalid utterances — These are utterances that are unusable by the application because they have no useful meaning.

Here is a list of ways in which out-of-grammar utterances can impact tuning:

  • Inflated False Accept rate — Valid utterances that are incorrectly labeled “out-of-grammar” can significantly inflate the False Accept rate and force the high threshold to be set much higher than necessary. See below for details.
  • Computing the reference semantic interpretation — In order to evaluate key performance metrics, we need to have the correct semantic interpretation for each valid utterance in our test set (the “reference semantic interpretation”) so that we can compare with the semantic interpretation obtained from the recognition result. For those utterances whose transcription can be parsed by the grammar, that’s trivial. Unfortunately, there are usually quite a few valid utterances whose transcription produces no parse.
  • Grammar coverage optimization — Careful analysis of field utterances almost always reveals grammar coverage problems that should be addressed. Without tools to suggest improvements to the grammar, though, this can be a lot of work. Moreover, optimum coverage – which is different from maximum coverage – can only be established through iterative experimentation.
  • Avoiding false accepts — Quite often, an invalid utterance will produce a recognition result with a high confidence score, leading to a false accept and, potentially, a dialogue failure. In some cases, this can be a very significant problem.
  • Prior probability considerations — Let’s say we use a speech menu in which a certain choice is used very rarely. If we assume that all choices are equally likely to falsely match out-of-grammar utterances, then the out-of-grammar impact on the rare choice will be proportionally much greater than on the other choices. This should be taken into consideration.
  • When to propose a second choice — Let’s say a user just said no to the confirmation: “I think you said ‘Austin’. Is that correct?” Should we propose the second choice in the N-best list (Boston)? That depends on the probability that this second choice is correct, which to a large extent depends on the proportion of out-of-grammar responses.
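The prior probability consideration in the list above can be made concrete with a little arithmetic. The sketch below rests on deliberately simplified assumptions: out-of-grammar utterances are spread evenly over the menu choices, every one of them is falsely matched, and every in-grammar utterance is recognized correctly. The PriorImpact class and the numbers in the usage note are hypothetical.

```java
/** Hypothetical worked example: how much of a choice's traffic is really out-of-grammar. */
final class PriorImpact {

    /**
     * Share of the results for one choice that actually come from OOG utterances.
     *
     * @param choicePrior fraction of all utterances that truly select this choice
     * @param oogRate     fraction of all utterances that are out-of-grammar
     * @param numChoices  number of menu choices the OOG utterances are spread over
     */
    static double oogShare(double choicePrior, double oogRate, int numChoices) {
        double oogPerChoice = oogRate / numChoices;
        return oogPerChoice / (choicePrior + oogPerChoice);
    }
}
```

With a 6% OOG rate and three choices, each choice attracts 2% of all utterances as false matches. For a choice that callers truly pick 2% of the time, oogShare(0.02, 0.06, 3) is about 0.5, so half of its results are wrong. For a choice picked 90% of the time, the same false matches amount to only about 2% of its results.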

In upcoming posts, I’ll discuss each of these issues in more detail. For the time being, I’ll focus on the first one.

The inflated false accept problem

Let me illustrate this problem using a simple speech menu where people can select between three choices: “correct address”, “wrong address”, and “repeat the address”. The grammar naturally supports many variations of these key phrases, with a number of appropriate prefixes and suffixes.

The problem is that, in practice, responses contain a fair proportion of disfluencies (stuttering, corrections, repeats, etc.). As a result, there are quite a few transcriptions for which the grammar produces no parse. In the graph below (showing Correct Accept vs. False Accept, see previous post for definitions), the blue curve shows what happens if these are left “out-of-grammar” while the red curve shows what happens when all valid utterances are classified “in-grammar” and labeled with the correct reference semantic interpretation.

As we can see, the difference is quite significant. Let’s suppose we want to set the high threshold so that we have a maximum false accept rate of 0.5%. In the first case (blue curve), we would need to use a high threshold of 0.98, resulting in a Correct Accept rate of around 50%, while in the second case (red curve), we could get a Correct Accept rate of 96%, using a confidence threshold of 0.05.

In other words, properly managing these OOG utterances can mean the difference between a lot of needless confirmations and almost no confirmation, which makes a huge difference in user experience.