January 12th, 2010 No Comments

by Yves Normandin

Reducing false accepts with decoys

As discussed in a previous post, one of the unfortunate consequences of out-of-grammar utterances is that they can cause many false accepts that may seriously degrade application performance and user experience. In order to illustrate this, let’s use the simple example of a small menu where callers must choose among three options: “validate”, “repeat”, and “cancel”.

We use a test set of 5042 field utterances distributed as follows:

Menu choice   Number of utterances   Proportion of test corpus
cancel        367                    7.28%
repeat        896                    17.77%
validate      3478                   68.98%
OOG           301                    5.97%

As we can see, this is a fairly clean test set with only about 6% out-of-grammar utterances. As usual, these include background speech, various noises, side conversations, some common OOG utterances (“yes”, “no”, “okay”, “options”, “oh”, etc.), as well as a wide variety of rambling responses of different kinds.

Naturally, since the grammar can only recognize one of the three keywords (and legitimate variants), most of these OOG utterances are misrecognized as one of the keywords. That wouldn’t be a problem if the corresponding confidence scores were low and we could safely reject them, but that’s not always the case. In fact, many of these have a confidence score over 0.9, resulting in damaging false accepts.

An effective way to reduce false accepts is to add decoys to the grammar. For instance, you would normally want to start by adding common OOG responses, on the grounds that it’s easier to reject an OOG utterance if you can recognize it correctly. You could also add more “general” decoys, for instance a phoneme loop, to help reject hard-to-predict OOG utterances. There are more advanced techniques that can be used to come up with “optimal” decoys for a given grammar, but I won’t go into them now.

In all cases, it is of course absolutely necessary to evaluate, on a large enough test corpus, the impact of these decoys since they could easily end up reducing recognition accuracy, sometimes significantly. In particular, one should be careful not to add decoys that could be confused with legitimate sentences or keywords.

Let’s illustrate the impact of decoys using the set of field utterances described above. The graph below compares the performance of a grammar without decoys (red curve) to that of the same grammar to which appropriate decoys were added (blue curve).

As can be seen, even for a fairly clean test corpus with a low OOG rate, the addition of decoys can significantly improve performance. For instance:

  • For a False Accept rate of 0.5%, the Correct Accept rate increases from 95% to over 97.5%, which is equivalent to reducing the error rate by more than 50%.
  • For a Correct Accept rate of 97.5%, the addition of decoys decreases the False Accept rate from 1.5% to around 0.3%. That’s one fifth the False Accept rate for the same Correct Accept rate.
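The arithmetic behind these two bullets can be checked with a small sketch. It assumes that "error rate" means the share of in-grammar utterances that are not correctly accepted (1 minus the Correct Accept rate); the function name is made up for illustration.

```python
def relative_error_reduction(ca_before, ca_after):
    """Fraction by which the error rate (assumed to be 1 - CA rate) shrinks."""
    error_before = 1.0 - ca_before
    error_after = 1.0 - ca_after
    return (error_before - error_after) / error_before

# CA rate going from 95% to 97.5% at a fixed FA rate of 0.5%:
# the error rate goes from 5% to 2.5%, i.e. it is cut in half.
reduction = relative_error_reduction(0.95, 0.975)

# FA rate going from 1.5% to 0.3% at a fixed CA rate of 97.5%:
# the new FA rate is one fifth of the old one.
fa_ratio = 0.003 / 0.015

print(f"relative error reduction: {reduction:.0%}")
print(f"FA rate ratio: {fa_ratio:.2f}")
```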

Another interesting observation is the impact of decoys on confidence thresholds. Let’s say we want to have a False Accept rate of 0.5%. Then we would need to use a confidence threshold of 0.73 for the grammar without decoys, but only 0.25 for the grammar with decoys. That’s quite a difference! This clearly shows that using “default” threshold values may sometimes produce results that are quite inadequate.

All of this once again demonstrates how important it is to pay close attention to out-of-grammar utterances in a tuning process and how decoys can provide an effective tool for containing the negative impact of such utterances on application performance.

December 9th, 2009 No Comments

by Dominique Boucher

Grammar problem #1 - repeated tokens

It is quite easy to write a speech recognition grammar. After all, it’s only a text file. And with the help of a good editor, we can expect the grammar to be free of syntax errors, i.e. to conform to the SRGS specification (ABNF or XML).

The real challenge is in making sure that the grammar does not contain any “error” from the point of view of the set of sentences it accepts and the semantic values it associates with these sentences. We also have to make sure that it does not over-generate (accept sentences that are not part of the usual spoken language or cannot be uttered in the context of the question asked).

This post is the first in a series that will show the most common types of errors made when developing grammars and how they can be found and fixed using the advanced features of NuGram IDE. The examples I’ll use are all variants of grammars we’ve seen in the course of our grammar development projects, either internally or for our customers.

Problem 1: Repeated Tokens

We’ll start this series with a very simple one. Suppose I’m editing a long list of ordered tokens. It is very tempting to copy one of the items, paste it as many times as needed, and edit the copies. I do this all the time. It’s such a common pattern in text editing (and programming, unfortunately…). Of course, it is very easy to forget to edit one of the copies.

For example, let’s say I want to write a number grammar. I’ll start writing something like:

public $r1To9 =
  one {out.number=1} |

and then copy/paste the first item 8 times, replace “one” by the digits “two” to “nine” and do the same for their corresponding semantic value. I’ll get something like:

public $r1To9 =
  one {out.number=1} |
  two {out.number=2} |
  three {out.number=3} |
  four {out.number=4} |
  five {out.number=5} |
  five {out.number=6} |
  seven {out.number=7} |
  eight {out.number=8} |
  nine {out.number=9}
;

Of course, you’ve already seen the error (probably much faster than I did). It’s easy here since you have the offending fragment right before your eyes. But when developing a grammar with many rules, it may not be that obvious. And even carefully reviewing the grammar may not suffice. (How many times do we miss typos in our own texts that another reviewer finds in a matter of seconds?)

So how do we find the problem? In this case, I simply need to build a coverage test set with the sentence generator using the Tags Coverage strategy. The following video shows how to do that:

Of course, in this example, I knew how to proceed to find the problem. In practice, and this will be a recurring idea in the series, a grammar needs to be tested in many different ways. There are many techniques and tools that need to be applied. The Tags Coverage (or the All Paths) generation strategy is often the first we use. It has the advantage of exercising all the semantic actions and finding lots of potential problems very early on in the debugging process.
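Outside the IDE, a crude text-level check can also catch this particular mistake. The sketch below is not how NuGram IDE works; it is a hypothetical helper (`repeated_alternatives` is an invented name) that assumes a flat rule whose alternatives are separated by `|`, with no parentheses or rule references, and it simply reports word sequences that occur in more than one alternative.

```python
import re
from collections import Counter

RULE = """
public $r1To9 =
  one {out.number=1} |
  two {out.number=2} |
  three {out.number=3} |
  four {out.number=4} |
  five {out.number=5} |
  five {out.number=6} |
  seven {out.number=7} |
  eight {out.number=8} |
  nine {out.number=9}
;
"""

def repeated_alternatives(rule_text):
    """Return the word sequences that appear in more than one alternative."""
    # Rule body: everything between the first '=' and the last ';'.
    body = rule_text.split("=", 1)[1].rsplit(";", 1)[0]
    # Drop semantic tags such as {out.number=1}.
    body = re.sub(r"\{[^}]*\}", "", body)
    # Alternatives are separated by '|'; normalize whitespace in each one.
    alternatives = [" ".join(alt.split()) for alt in body.split("|")]
    counts = Counter(alternatives)
    return [words for words, n in counts.items() if n > 1]

duplicates = repeated_alternatives(RULE)
print(duplicates)
```

On the rule above, the only repeated alternative reported is `five`. A check like this only finds literal repetitions; it says nothing about the semantic values, which is where a coverage test set is still needed.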

In my next post in this series, I’ll write about ambiguities, how they affect speech recognition performance, and how to detect and deal with them.

November 30th, 2009 2 Comments

by Dominique Boucher

Effective sentence generation

There has been some activity lately on the Yahoo VUIDs group about the difficulty of generating sentences from a speech recognition grammar. This is a recurrent problem in speech grammar engineering, one that really deserves a full blog post (and maybe more than one), especially since we’ve worked hard on this problem in the last year. So I’ll share my thoughts on this subject.

First, let me summarize why this problem is so difficult.

The problem

You have an ABNF (or GrXML, or GSL) grammar for which you would like to generate sentences. Unless it’s a toy grammar or a small item list, the grammar will most certainly generate thousands, millions, or even more different sentences, if not an infinite number of them. Why? That’s simple:

  • It can contain repeated items. If you have 10 words repeated 4 times, you have 10,000 sentences. When you have unbounded repeats, of course you get an infinite language.
  • It can contain all sorts of filler words to better handle disfluencies (hesitations, corrections, etc.). When you have those fillers at the start and end of every possible sentence, the number of sentences grows very rapidly. For example, just adding 9 optional filler words at the start and end of every sentence multiplies the number of sentences by a factor of 100.
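Both figures follow from simple counting, as the sketch below shows. It assumes that each repeat position can hold any of the words independently, and that each filler slot holds either exactly one of the filler words or nothing. The function names are made up for illustration.

```python
def repeated_item_count(vocabulary_size, repeats):
    """Number of sentences when one of `vocabulary_size` words fills each of `repeats` positions."""
    return vocabulary_size ** repeats

def filler_multiplier(fillers_at_start, fillers_at_end):
    """Growth factor from one optional filler slot at each end of a sentence.

    Each slot has (number of fillers + 1) choices: one of the fillers, or nothing.
    """
    return (fillers_at_start + 1) * (fillers_at_end + 1)

print(repeated_item_count(10, 4))  # 10 words repeated 4 times
print(filler_multiplier(9, 9))     # 9 optional fillers at the start and at the end
```

The two calls give 10,000 and 100, matching the numbers quoted above.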

For example, one of our VoiceXML applications has a grammar that generates 29,822,907,679,607,676,696 sentences! (after removing some fillers that would have made the language infinite.) And it’s just a grammar for collecting a building number, albeit a highly tuned one. Pretty standard stuff.

As you have certainly guessed, it is rather impractical to generate all sentences. Are you really interested in reviewing tens of thousands of sentences? Remember that most sentences will only differ in very uninteresting ways (the pre/post fillers, a “two” instead of a “three”, etc.).

What would you want to generate sentences for?

Sentence generation can be helpful in many situations:

  • Grammar coverage. Coverage test sets can be built in many different ways. But one that works very well is to start with the grammar and generate sentences. As you do so, you add some or all of them to the coverage set, either as ING (in-grammar) or OOG (out-of-grammar) sentences.
  • To detect problems. Sentence generation is often an effective way to find potential problems with a grammar. Typical problems are over-generation (which can lead to reduced recognition accuracy) and grammar problems (misplaced parentheses, missing parentheses, misplaced vertical bar, etc.).
  • Application testing. Some people use the generated sentences in manual application test scripts. Automated, text-based testing tools could also use sentence generation to “navigate” an application call flow.

Some available tools

Many, if not most, recognition engine vendors provide tools to generate sentences from a grammar. For example, Nuance 9 comes with parseTool, which lets you generate a fixed number of random sentences from either a GrXML grammar or a compiled grammar.

Those tools, however, are rarely adequate for dealing with a large number of sentences in an effective way. They either exhaustively generate all sentences, or they generate a fixed number of random sentences. As I mentioned above, the exhaustive generation strategy works well only for very simple grammars. The random strategy, on the other hand, doesn’t provide any control mechanisms that enable us to only generate the sentences we want. As a result, we typically end up with mostly redundant sentences in which some sentence patterns are grossly over-represented while others are missing.

Our approach

In NuGram IDE Pro, we have implemented a very different approach to sentence generation. I won’t go into explaining all the details here, but let me emphasize some aspects of our approach:

  • The generation strategies can be configured on a rule-by-rule basis.
  • The generation algorithm can be started on arbitrary expansions (sequences of words and rule references) that can be derived from the root rule. These expansions are usually produced by the Sentence Explorer tool.
  • The available strategies are:
    • All sentences - This is the default strategy. As one would expect, it consists in exhaustively generating all the sentences. However, all referenced rules will obey their own strategy.
    • First sentence - This strategy generates a single sentence, the “first” in the document order.
    • Random sentences - This strategy generates a configurable number of random sentences each time the rule is referenced from another rule.
    • Tags coverage - This strategy generates the smallest set of sentences that will cover all the semantic actions in the rule and its referenced rules (and recursively). This is a very effective strategy to build coverage sets to test all the semantic tags in a grammar/rule.
    • All paths - This strategy is a variant of the tags coverage strategy. It generates the smallest set of sentences to cover all the paths in a rule and its referenced rules (and recursively).
    • Rule examples - This strategy consists in using the examples in the rule’s documentation comment. This strategy is dangerous in that it can generate sentences that are not parsable by the grammar if the examples are not changed when the rule is modified.
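To make the idea of per-rule strategies concrete, here is a toy sketch. It is not NuGram's algorithm: the grammar is a plain Python dictionary invented for this example, it must not be recursive, and only two strategies are modeled. "all" expands every alternative of a rule, while "first" is simplified to keeping only the rule's first alternative. Referenced rules obey their own strategy, as described above.

```python
import itertools

# Hypothetical toy grammar: rule name -> list of alternatives,
# each alternative being a sequence of words and rule references ($name).
GRAMMAR = {
    "$root": [["$filler", "$digit"], ["$digit"]],
    "$filler": [["uh"], ["um"]],
    "$digit": [["one"], ["two"], ["three"]],
}

def generate(symbol, strategies):
    """Generate sentences (lists of words) for a word or a rule reference.

    `strategies` maps rule names to "all" (the default) or "first".
    """
    if not symbol.startswith("$"):
        return [[symbol]]  # a plain word expands to itself
    alternatives = GRAMMAR[symbol]
    if strategies.get(symbol, "all") == "first":
        alternatives = alternatives[:1]
    sentences = []
    for alternative in alternatives:
        # Expand every symbol, then combine the expansions in order.
        parts = [generate(s, strategies) for s in alternative]
        for combo in itertools.product(*parts):
            sentences.append([word for part in combo for word in part])
    return sentences

everything = generate("$root", {})
trimmed = generate("$root", {"$filler": "first", "$digit": "first"})
print(len(everything), "sentences with the default strategy")
print([" ".join(s) for s in trimmed])
```

With the default strategy everywhere, this toy grammar yields 9 sentences. Switching `$filler` and `$digit` to "first" leaves only two, one per path through `$root`, which is the kind of pruning that makes large grammars reviewable.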

Here is a screencast showing these concepts in action:

That’s it for now. My next post will be about how to use the sentence generation tool to effectively find common problems with grammars and how to fix them.

And I’d be very interested in knowing how you deal with this problem yourself. So leave me a comment!

November 23rd, 2009 No Comments

by Yves Normandin

You can only tune what you can measure

This is the first in a series of posts I’ll do on speech application tuning over the coming weeks. Hopefully, this will provoke interesting feedback and, who knows, even spark some lively discussions.

I’m starting with the very important topic of speech recognition metrics because that’s necessary in order to set the stage for most of what I’ll talk about next. Although this may not be the most exciting topic, it is clearly a very important one.

Some terminology

Tuning a speech application involves attempts to optimize a certain number of key performance metrics. Improvement or deterioration in these metrics is what tells us whether or not we’re making progress. Although that should be intuitively obvious to most people, what’s perhaps less obvious is how to select metrics that correlate best with the application’s success rate and user experience in the field.

Let me start by defining some terminology:

  • In-grammar / out-of-grammar —This determines whether an utterance is covered or not by the grammar. Later on in this post, I’ll talk at length about the different ways the word “covered” may be interpreted.
  • Accepted / rejected —This determines whether the recognition result’s confidence score is greater (accepted) or smaller (rejected) than the given confidence threshold. Note that there may be more than one confidence threshold for a given recognition context. If we’re talking about the high threshold, then “accepted” usually means that no confirmation is required. If we’re talking about the low threshold, then “accepted” means a confirmation will be required and “rejected” means that the user needs to be re-prompted.
  • Correct / incorrect —This determines whether or not a recognition result is correct. Although the definition of “correct” may vary, it is often interpreted to mean that the top recognition result in the N-best list has the correct semantic result (i.e., it doesn’t matter that not all words were correctly recognized as long as the semantic result is correct). Note that we assume here that only in-grammar utterances can be classified as either correct or incorrect.

When we perform a recognition test for a grammar using utterances collected in the field, we compute a set of 6 counters for each confidence threshold value in a range from 0.0 to 1.0. These counters are:

  • AC — Number of in-grammar utterances that are accepted and correct (often called CA-in in the industry)
  • AI — Number of in-grammar utterances that are accepted and incorrect (often called FA-in)
  • RC — Number of in-grammar utterances that are rejected and correct
  • RI — Number of in-grammar utterances that are rejected and incorrect
  • Aoog — Number of out-of-grammar utterances that are accepted (often called FA-out)
  • Roog — Number of out-of-grammar utterances that are rejected (often called CR-out)

Note that the value FR-in, often seen in the industry, is equal to RC+RI. We like to keep these two values separate since they allow us to distinguish recognition errors from rejection errors. We add two important variables that are computed from the above counters:

  • ing — Number of in-grammar utterances (= AC+AI+RC+RI)
  • oog — Number of out-of-grammar utterances (= Aoog+Roog)
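Here is a minimal sketch of how these counters could be tallied for one threshold value. The `Result` record and the sample data are invented for illustration, and the sketch assumes that a confidence score exactly equal to the threshold counts as accepted, which the definitions above leave open.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
    confidence: float
    in_grammar: bool
    correct: Optional[bool] = None  # only meaningful for in-grammar utterances

def count_outcomes(results, threshold):
    """Tally the six counters for a single confidence threshold."""
    counters = {"AC": 0, "AI": 0, "RC": 0, "RI": 0, "Aoog": 0, "Roog": 0}
    for r in results:
        accepted = r.confidence >= threshold  # assumption: ties are accepted
        if not r.in_grammar:
            key = "Aoog" if accepted else "Roog"
        else:
            key = ("A" if accepted else "R") + ("C" if r.correct else "I")
        counters[key] += 1
    return counters

# Made-up results, one per counter at a threshold of 0.5.
RESULTS = [
    Result(0.95, True, True),   # accepted, correct
    Result(0.80, True, False),  # accepted, incorrect
    Result(0.30, True, True),   # rejected, correct
    Result(0.20, True, False),  # rejected, incorrect
    Result(0.90, False),        # out-of-grammar, accepted
    Result(0.10, False),        # out-of-grammar, rejected
]

counters = count_outcomes(RESULTS, 0.5)
ing = counters["AC"] + counters["AI"] + counters["RC"] + counters["RI"]
oog = counters["Aoog"] + counters["Roog"]
print(counters, ing, oog)
```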

With these, we define the two key metrics that we’ll use constantly:

  • Correct accept rate (CA-rate) —This is the percentage of in-grammar utterances that are accepted with a correct result. It is computed as CA-rate = AC/ing.
  • False accept rate (FA-rate) — We use two versions of this metric:
    1. The percentage of all utterances that are incorrectly accepted. It is computed as: FA-rate = (AI+Aoog)/(ing+oog) = (AI+Aoog)/all
    2. The percentage of accepted utterances that are incorrect. It is computed as FA-rate = (AI+Aoog)/(AC+AI+Aoog) = (AI+Aoog)/A
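As a sketch, the metrics can be computed from the six counters as follows. The counter values are made up for illustration, and the second FA-rate takes the number of accepted utterances A to be AC+AI+Aoog, i.e. every accepted utterance, in-grammar or not.

```python
def ca_rate(c):
    """Correct accept rate: AC / ing."""
    ing = c["AC"] + c["AI"] + c["RC"] + c["RI"]
    return c["AC"] / ing

def fa_rate_all(c):
    """False accepts as a share of all utterances: (AI+Aoog) / (ing+oog)."""
    total = sum(c.values())
    return (c["AI"] + c["Aoog"]) / total

def fa_rate_accepted(c):
    """False accepts as a share of accepted utterances (assumed A = AC+AI+Aoog)."""
    accepted = c["AC"] + c["AI"] + c["Aoog"]
    return (c["AI"] + c["Aoog"]) / accepted

# Hypothetical counters for one threshold value.
COUNTERS = {"AC": 900, "AI": 20, "RC": 50, "RI": 30, "Aoog": 10, "Roog": 90}

print(f"CA-rate:            {ca_rate(COUNTERS):.3f}")
print(f"FA-rate (all):      {fa_rate_all(COUNTERS):.4f}")
print(f"FA-rate (accepted): {fa_rate_accepted(COUNTERS):.4f}")
```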

Here’s an example that will hopefully help clarify all this. The following graph plots the CA-rate as a function of the FA-rate for a phone number recognition experiment. I’ll use this type of graph constantly, so you might want to familiarize yourself with it. Note that, in order to avoid any confusion, the axes are labeled with the metric’s definition, not its name.

In the graph, the hidden variable is the confidence threshold. As the confidence threshold decreases from 1.0 to 0.0, both the CA-rate and the FA-rate increase. If we are using two thresholds then we would want to set the high threshold so that the FA-rate is very low (less than 1%, say), while the low threshold would be set in order to have an appropriate balance between confirming too many incorrect results and rejecting too many correct results.
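The sweep over the hidden variable can be sketched as below. The results are invented, each one labeled "correct", "incorrect" (both in-grammar) or "oog". The FA-rate used is the first version (false accepts over all utterances), and a score equal to the threshold is assumed to be accepted.

```python
# Made-up (confidence, label) pairs.
RESULTS = [
    (0.95, "correct"), (0.90, "correct"), (0.85, "oog"), (0.70, "correct"),
    (0.60, "incorrect"), (0.40, "correct"), (0.30, "oog"), (0.20, "correct"),
]
THRESHOLDS = [i / 10 for i in range(10, -1, -1)]  # 1.0 down to 0.0

def operating_points(results, thresholds):
    """For each threshold, return (threshold, CA-rate, FA-rate over all utterances)."""
    ing = sum(1 for _, label in results if label != "oog")
    total = len(results)
    points = []
    for t in thresholds:
        accepted = [label for conf, label in results if conf >= t]
        ac = accepted.count("correct")
        false_accepts = len(accepted) - ac  # AI + Aoog
        points.append((t, ac / ing, false_accepts / total))
    return points

def lowest_threshold_for(points, max_fa_rate):
    """Lowest threshold whose FA-rate stays within the target, or None."""
    candidates = [t for t, _, fa in points if fa <= max_fa_rate]
    return min(candidates) if candidates else None

points = operating_points(RESULTS, THRESHOLDS)
for t, ca, fa in points:
    print(f"threshold={t:.1f}  CA-rate={ca:.2f}  FA-rate={fa:.3f}")
best = lowest_threshold_for(points, 0.13)
```

Since lowering the threshold can only add accepted utterances, both rates are non-decreasing along the sweep, and picking the lowest threshold that meets the FA-rate target gives the highest CA-rate available at that target.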

Using a graphical representation of results has several advantages. One advantage is that it provides a visually clear view of how effectively we avoid false accepts. A curve that grows slowly from left to right is a clear indication that we’re not effectively rejecting out-of-grammar utterances. We’d like the curve to initially have a very steep slope and then to taper off when the CA-rate gets close to the maximum value.

Another very important advantage is that it makes it easy to compare results from different experiments. A curve that’s above another immediately tells us that it’s a better result (for a given FA-rate, we have a better CA-rate). That, however, assumes that the results truly are comparable, which brings us back to an issue that we had earlier postponed: the definition of “in-grammar”.

The importance of correctly defining “in-grammar”

I had mentioned earlier that an in-grammar utterance is an utterance that is “covered” by the grammar. But what does “covered” mean? For much of the industry, this simply means that the sentence transcription can be parsed by the grammar.

This definition turns out to be quite problematic. Fundamentally, the main problem is that the in-grammar utterances are those we consider valid and therefore those that we should recognize as well as possible, while out-of-grammar utterances are those the application should be rejecting. However, the definition of “valid” should be based on what application users perceive, not on what we have decided that the grammar should cover. If, for speech recognition accuracy considerations, we decide not to cover certain forms of user responses that are not used very often, this doesn’t make them any less valid. It just makes them less frequent.

Let me illustrate this with a simple example. The graph below shows the results from three date recognition experiments. The blue curve shows the result obtained using a grammar that supports two main forms: <month><day>[<year>] (“January fifth”) and <day><month>[<year>] (“Fifth of January”). Let’s say we want to see what happens if we decide not to support the second, rarer form. The result is the red curve which, as we can see, seems to indicate better performance.

But that’s an illusion. The problem is that the red curve considers all dates of the form “Fifth of January” as out-of-grammar. We’re of course doing a better job of recognizing the now reduced set of in-grammar utterances, but the problem is that a larger proportion of user utterances are considered invalid. In other words, we’re now correctly handling a smaller proportion of valid user utterances. From the user’s perspective, that’s certainly not an improvement. In fact, if we consider both forms as in-grammar (i.e., valid), then we get the green curve, which clearly tells us that we indeed have considerably deteriorated results.

In order to get meaningful results, we need to determine which utterances are in-grammar and which are out-of-grammar based on what should be considered a valid response, regardless of what we decide to include in our grammar. This has several important benefits:

  • It makes it possible to get meaningful comparisons between results obtained with grammars having different coverages since the sets of in-grammar and out-of-grammar utterances are the same in all cases.
  • We get results that are much more representative of user experience. To make sure that this is the case, we normally decide that an utterance is “in-grammar” if a human listener would consider it to be a valid and unambiguous response to the question (within an acceptable “domain” of responses).
  • Last, but not least, we get applications that deliver similar or better success rates with fewer confirmations, and therefore better user experience. The reason is that, very often, many sentences that can’t be parsed by the grammar nonetheless give a high-confidence, correct semantic result. If these were considered out-of-grammar, then they would become false accepts. It takes very few of these to have a significant impact on the FA-rate, with the unfortunate consequence that we would end up using confidence thresholds much higher than necessary, resulting in many unnecessary confirmations.
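The last point can be illustrated with made-up numbers. The sketch below assumes a test set of 1000 utterances in which 15 valid responses cannot be parsed by the grammar but are nonetheless accepted with the correct semantic result. Under a parse-based definition they count as accepted out-of-grammar utterances; under a listener-based definition they count as correct accepts. All figures are hypothetical.

```python
def fa_rate(ai, aoog, total):
    """False accepts as a share of all utterances: (AI+Aoog) / all."""
    return (ai + aoog) / total

TOTAL = 1000                  # all utterances in the (hypothetical) test set
AI = 5                        # in-grammar, accepted, incorrect
AOOG_INVALID = 5              # truly invalid utterances that were accepted
UNPARSABLE_BUT_CORRECT = 15   # valid, unparsable, accepted with correct semantics

# Parse-based labeling: the 15 unparsable utterances become false accepts.
fa_parse_based = fa_rate(AI, AOOG_INVALID + UNPARSABLE_BUT_CORRECT, TOTAL)

# Listener-based labeling: they are in-grammar and correct, so not false accepts.
fa_listener_based = fa_rate(AI, AOOG_INVALID, TOTAL)

print(f"parse-based FA-rate:    {fa_parse_based:.1%}")
print(f"listener-based FA-rate: {fa_listener_based:.1%}")
```

With these numbers, the same recognizer output gives an FA-rate of 2.5% under the parse-based definition and 1.0% under the listener-based one, so a threshold chosen to hit a fixed FA-rate target would be set higher than necessary in the first case.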

So this pretty much concludes what I wanted to talk about today. In the next post, I’ll talk about tuning challenges related to out-of-grammar utterances.

November 10th, 2009 No Comments

by Dominique Boucher

Learn grammar development from the grammar experts!

In response to many requests from NuGram users, Nu Echo is pleased to announce that it’s now offering a two-day, on-site grammar development course.

This course – Effective Grammar Development with NuGram IDE – teaches participants how to systematically deliver high-quality, high-performance grammars by fully leveraging the features and tools available in NuGram IDE. Using hands-on exercises and numerous examples, the course provides a breadth of knowledge, best practices, and tips and tricks that have shown their effectiveness at addressing the main challenges of grammar development and at delivering better grammars faster.

Topics covered include:

  • Fundamental speech recognition and grammar concepts
  • The ABNF Grammar Syntax
  • Semantic tags – SISR, pre-SISR, swi-semantics, GSL, Nuance extensions.
  • NuGram IDE Tools – ABNF editor, Coverage tool, sentence interpreter, sentence generation, sentence explorer, semantics stepper, grammar conversion tools, etc.
  • The Grammar Development Process – Importance of a rigorous and systematic process, how NuGram IDE supports it, integration into a build process, etc.
  • Tips and Tricks – Style issues, guidelines for writing semantic tags, common sources of errors and how to detect and fix them
  • Dynamic Grammars – Use cases, traditional approaches, NuGram support (dynamic grammar language directives, testing/debugging tools, NuGram Server)
  • Managing phonetic pronunciations
  • Special Topics – Ambiguities, compound words, decoys, disfluencies (voiced pauses, false starts, corrections, etc.), grammar weights, Nuance-specific features.

You have special topics that you’d like us to cover? No problem. We can customize the course to fit your specific requirements. Contact us for details.