Author Archives: Yves Normandin

What’s the point of having 98% in-grammar accuracy if 40% of user utterances are out-of-grammar?

How many times have you heard people say that they “achieve 95% speech recognition accuracy” (or more)? That sounds really impressive, doesn’t it?

It shouldn’t. What they don’t tell you is that they actually measure “in-grammar accuracy”, which means that accuracy is measured only on utterances that are perfectly covered by the grammar. For instance, for a date grammar, an utterance such as “well, uh, january fourth” would be considered out-of-grammar (and therefore excluded from the accuracy calculation) if “well, uh” is not covered by the grammar.

Unfortunately, in the real world there’s no way to force users to stick to in-grammar utterances. In fact, users usually have no way of even knowing what the grammar covers other than through hints provided by the prompts. Even well-behaved users can hesitate, correct themselves, or use an unexpected formulation (which sounds perfectly natural to them), all of which are likely to be out-of-grammar. They can even say things that they believe will help the machine understand them (for instance using “victor” instead of “v” when spelling).

As a result, it’s not unusual to have between 30% and 50% of user utterances that are considered out-of-grammar, many of which are perfectly legitimate responses to the application prompt. So what’s the point of reporting in-grammar accuracy if this ignores a large chunk of legitimate user utterances? You tell me.

Just to illustrate, you want to know one of the most effective ways of improving in-grammar accuracy? Just reduce grammar coverage. Sure, your out-of-grammar rate will increase but, hey, you’ll improve in-grammar accuracy! Isn’t that great? This tells you how useless in-grammar accuracy is at telling you whether you improved the grammar.

This is why we always report accuracy by considering every legitimate user utterance (i.e., the ones that contain a valid response to the prompt, regardless of wording or extraneous speech). This way, we make sure that we don’t conveniently ignore the utterances that happen to be the more challenging ones, and we get results that accurately represent the real recognition performance (not some imaginary performance calculated on an idealized set of clean utterances).

But the best reason for doing it our way is that it enables us to truly measure improvements when we tune grammars. The reason is simple. Changing the coverage of a grammar always involves a trade-off. We can improve accuracy by covering more user utterances, but this can reduce overall accuracy if the new grammar paths introduce new speech recognition errors. The only way we can measure improvement is if we measure accuracy on a fixed set of valid utterances that doesn’t depend on the actual grammar coverage.
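To make the difference between the two measurements concrete, here is a minimal Python sketch. The `Utterance` fields, the function names, and the corpus counts are all made up for illustration; they are not taken from any real test set or tool.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    valid: bool        # contains a valid response to the prompt
    in_grammar: bool   # transcription is fully covered by the grammar
    correct: bool      # recognition result has the right interpretation


def in_grammar_accuracy(utterances):
    """Accuracy measured only on utterances the grammar covers."""
    covered = [u for u in utterances if u.in_grammar]
    return sum(u.correct for u in covered) / len(covered)


def overall_accuracy(utterances):
    """Accuracy measured on every valid utterance, covered or not."""
    valid = [u for u in utterances if u.valid]
    return sum(u.correct for u in valid) / len(valid)


# Hypothetical corpus: 100 valid utterances, 60 of them in-grammar.
corpus = (
    [Utterance(True, True, True)] * 59      # in-grammar, recognized correctly
    + [Utterance(True, True, False)] * 1    # in-grammar, misrecognized
    + [Utterance(True, False, True)] * 10   # out-of-grammar, still understood
    + [Utterance(True, False, False)] * 30  # out-of-grammar, misunderstood
)

print(f"in-grammar accuracy: {in_grammar_accuracy(corpus):.1%}")
print(f"overall accuracy:    {overall_accuracy(corpus):.1%}")
```

With these invented counts, in-grammar accuracy comes out at 59/60 while accuracy over all valid utterances is only 69/100. Shrinking the grammar would change the first denominator but not the second, which is why only the second can be used to compare two versions of a grammar.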

How a great speech application may appear to perform poorly

One of our products, a Canadian address capture VoiceXML module, has been deployed with great success by several of our customers. One of these deployments was done in the context of a change of address application, where the module has to capture the new address, the date when the new address becomes effective, and the new telephone number. Note that all information is entirely obtained through speech recognition.

In this deployment, the contract specified that the application had to achieve a minimum success rate. In order to track performance, two success metrics were jointly defined with the customer:

  • The Raw Success Rate. This is calculated simply by dividing the number of calls for which the change of address was successfully completed (with all collected information confirmed by the caller) by the total number of calls for which the change of address module was used.
  • The Real Success Rate. This is calculated similarly, with the exception that certain calls were excluded from consideration, namely calls where the caller provided no input whatsoever and calls where the caller hung up within the first two interactions.
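The two definitions above can be sketched in a few lines of Python. The `Call` record, its field names, and the call counts are hypothetical; in particular, reading “within the first two interactions” as `interactions <= 2` is my assumption about how the exclusion would be coded.

```python
from dataclasses import dataclass


@dataclass
class Call:
    completed: bool     # change of address completed and confirmed
    any_input: bool     # caller provided at least one input
    interactions: int   # interactions that took place before the call ended
    hung_up: bool       # caller hung up before completion


def raw_success_rate(calls):
    """Completed calls divided by all calls that entered the module."""
    return sum(c.completed for c in calls) / len(calls)


def is_excluded(call):
    """Calls left out of the Real Success Rate."""
    no_input = not call.any_input
    early_hangup = call.hung_up and call.interactions <= 2
    return no_input or early_hangup


def real_success_rate(calls):
    """Same ratio, computed only on calls that are not excluded."""
    counted = [c for c in calls if not is_excluded(c)]
    return sum(c.completed for c in counted) / len(counted)


# Invented traffic: 100 calls entering the module.
calls = (
    [Call(True, True, 12, False)] * 60    # completed successfully
    + [Call(False, False, 1, True)] * 10  # no input whatsoever
    + [Call(False, True, 2, True)] * 20   # hung up within two interactions
    + [Call(False, True, 9, False)] * 10  # genuine failures
)

print(f"raw:  {raw_success_rate(calls):.1%}")
print(f"real: {real_success_rate(calls):.1%}")
```

With these invented numbers the raw rate is 60/100 and the real rate is 60/70, so a jump in early hang-ups moves the first figure without touching the second.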

The customer specified that the application had to achieve a Real Success Rate of 75% or more. The rationale for the Real Success Rate is to exclude callers who either don’t want to use the application (for instance because they ended up in the application by mistake) or don’t have the requested information. As a matter of fact, the initial deployment revealed a fairly high hang-up rate early in the change of address call flow, so the customer contacted a number of those callers to find out why they had hung up. Most of them admitted that they had no intention of changing their address; they had simply selected this option in the hope of getting connected to an agent faster.

It’s nonetheless interesting to track both metrics since a large difference between them can indicate problems that occurred earlier in the call (that is, before going into the change of address application).

For instance, at the end of 2008, the customer made some changes in the front menus, which significantly increased the number of callers that incorrectly found themselves in the change of address application. As shown in the graph below, this created a big drop in the Raw Success Rate while the Real Success Rate remained relatively constant. The customer implemented various changes to the front menu throughout 2009 (while the change of address application remained unchanged), with the result that the Raw Success Rate was finally stabilized at around 75% (and the Real Success Rate at 85%).

This shows that, when trying to evaluate the performance of an application, it’s important to focus on the correct metrics. Otherwise, we may end up not only with an incorrect assessment of its real performance, but also with wild variations that have nothing to do with the application itself.

Comparing different speech recognition engines

We’re sometimes asked to compare the performance of different speech recognition engines on an identical task (same grammar, same set of test utterances). To do so in an effective way, we rely on three important features of NuGram Server (on-the-fly conversion of grammars to any format, semantic interpretation of textual sentences, and a NuGram-specific meta value that removes all semantic tags from generated grammars), which we use extensively in our tuning environment.

One powerful aspect of our tuning environment is that, no matter what recognition engine we use, there is no difference in the way we perform speech recognition experiments and then score and analyze results. It’s all completely transparent. This makes it easy to run the exact same experiment using different recognition engines and then compare results using metrics, graphs, and other tools that are used consistently across all engines.

A big challenge when comparing different engines is that we usually can’t use the same grammar since different engines often use incompatible grammar tag formats. For instance, let’s say we have a recognition grammar for the Loquendo LASR speech recognition engine and we would like to compare the performance we get with this grammar using three different engines: Loquendo LASR, OSR 3.0, and Nuance 8.5. In that case, we have three different tag formats: Loquendo uses SISR, OSR 3.0 uses swi-semantics and Nuance 8.5 uses the Nuance GSL proprietary tag format. So in principle, we would need to convert the grammars for each recognition engine, which can be a significant effort for complex grammars.

No need for manual grammar conversion

It is, however, possible to compare different recognition engines without having to manually convert the grammars. The approach we use is quite simple: With each engine that is not compatible with the original grammar’s tag format, we perform speech recognition using a grammar from which semantic tags have been removed and we then add semantic information back to the recognition result as a post-processing step.

This is all done using NuGram Server, as follows.

We start with the original grammar in ABNF format (here credit-card.abnf), which we use for the recognition test using Loquendo ASR. Then, we add a special-purpose NuGram meta directive to the grammar, which tells NuGram Server to omit the semantic tags when generating the grammar:

#ABNF 1.0 ISO-8859-1;
language en-US;
mode voice;
tag-format <semantics/1.0>;

meta "com.nuecho.generation.omit-tags" is "true";

root $main;

When we perform the recognition test with OSR, we tell NuGram Server that we want to use credit-card.grxml (note the extension). NuGram Server then automatically converts credit-card.abnf to the SRGS XML format, while omitting the semantic tags from the grammar. Recognition then proceeds without a hitch, but results are returned without any semantic slots.

Similarly, when we perform the recognition test with Nuance 8.5, we tell NuGram Server that we want to use credit-card.gsl, which tells NuGram Server to automatically convert credit-card.abnf into a GSL grammar (still without semantic tags). Recognition once again proceeds without a hitch and results are returned without semantic slots.

Finally, in order to get recognition results with semantic slots, we simply send the original credit-card.abnf grammar and the recognition results to NuGram Server in order to add semantic slots to the recognition results. In other words, semantic interpretation is done as a post-processing step by NuGram Server based on the SISR tags in the original grammar.
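The idea of recognizing with a tagless grammar and adding semantics afterwards can be illustrated with a toy Python sketch. This is not NuGram Server’s API or grammar format: the dictionary below is a stand-in for a tagged grammar, and `omit_tags` and `interpret` are hypothetical helper names.

```python
# Toy stand-in for a tagged grammar: each word carries a semantic value.
# (Hypothetical data structure, not a real grammar format.)
TAGGED_GRAMMAR = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}


def omit_tags(tagged_grammar):
    """Return the word list only, i.e. the grammar without semantics."""
    return sorted(tagged_grammar)


def interpret(tagged_grammar, recognized_text):
    """Re-attach semantic slots to a raw recognition string."""
    words = recognized_text.split()
    if any(word not in tagged_grammar for word in words):
        return None  # the text cannot be parsed by the original grammar
    return {"number": "".join(tagged_grammar[word] for word in words)}


# An engine that was only given omit_tags(TAGGED_GRAMMAR) returns plain text.
raw_result = "four oh one two"
print(interpret(TAGGED_GRAMMAR, raw_result))
```

The point of the design is that the engine never sees the tags, so the same source grammar can drive any engine, and every engine’s output is interpreted by the same code.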

Note that if the original grammar had been a GSL grammar or an OSR grammar, NuGram Server could still have computed the semantic interpretation based on the semantic tags in the original grammar (NuGram understands many different tag formats).

Dealing with engine-proprietary features

Some engine-proprietary features might make results more difficult to compare. For instance, OSR and Nuance 9 provide the special-purpose SWI_disallow key, which can be used to remove hypotheses from the N-best list of recognition hypotheses returned by the engine. This could for instance be used to remove credit card numbers that don’t have a valid checksum, thereby improving recognition accuracy.

This useful feature could make recognition results difficult to compare if some engines have it and others don’t (in which case an equivalent result could be obtained by removing invalid hypotheses in the application). Fortunately, in our recognition tests we have the ability to tell NuGram Server to remove, from the N-best list, those hypotheses that match a specified slot pattern (e.g., SWI_disallow=1). This once again makes it possible to make fair and accurate comparisons between OSR or Nuance 9 and other engines.
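Here is a rough Python sketch of that kind of N-best filtering, using the credit card checksum as the example. The N-best layout (a list of dictionaries with a `slots` entry) and the function names are assumptions made for illustration, not the format used by any particular engine or by NuGram Server. The checksum is the standard Luhn algorithm.

```python
def luhn_valid(number):
    """Return True if a digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:   # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def flag_invalid_cards(nbest):
    """Emulate the engine-side flag for engines that lack the feature."""
    for hypothesis in nbest:
        if not luhn_valid(hypothesis["slots"]["number"]):
            hypothesis["slots"]["SWI_disallow"] = "1"
    return nbest


def remove_matching(nbest, slot, value):
    """Drop every hypothesis whose slots contain slot == value."""
    return [h for h in nbest if h["slots"].get(slot) != value]


# Hypothetical N-best list: the top hypothesis has one wrong digit.
nbest = [
    {"confidence": 0.62, "slots": {"number": "4111111111111112"}},
    {"confidence": 0.58, "slots": {"number": "4111111111111111"}},
]

filtered = remove_matching(flag_invalid_cards(nbest), "SWI_disallow", "1")
print(filtered[0]["slots"]["number"])
```

Applying the same filter to every engine’s output, whether the flag was set by the engine or by post-processing, is what keeps the comparison fair.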

Reducing false accepts with decoys

As discussed in a previous post, one of the unfortunate consequences of out-of-grammar utterances is that they can cause many false accepts that may seriously degrade application performance and user experience. In order to illustrate this, let’s use the simple example of a small menu where callers must choose among three options: “validate”, “repeat”, and “cancel”.

We use a test set of 5042 field utterances distributed as follows:

Menu choice | Number of utterances | Proportion of test corpus
cancel      | 367                  | 7.28%
repeat      | 896                  | 17.77%
validate    | 3478                 | 68.98%
OOG         | 301                  | 5.97%

As we can see, this is a fairly clean test set with only about 6% of out-of-grammar utterances. As usual, these include background speech, various noises, side conversations, some common OOG utterances (“yes”, “no”, “okay”, “options”, “oh”, etc.), as well as a wide variety of rambling responses of different kinds.

Naturally, since the grammar can only recognize one of the three keywords (and legitimate variants), most of these OOG utterances are misrecognized as one of the keywords. That wouldn’t be a problem if the corresponding confidence scores were low and we could safely reject them, but that’s not always the case. In fact, many of these have a confidence score over 0.9, resulting in damaging false accepts.

An effective way to reduce false accepts is to add decoys to the grammar. For instance, you would normally want to start by adding common OOG responses, on the grounds that it’s easier to reject an OOG utterance if you can recognize it correctly. You could also add more “general” decoys, for instance a phoneme loop, to help reject hard-to-predict OOG utterances. There are more advanced techniques that can be used in order to come up with “optimal” decoys for a given grammar, but I won’t go into them now.

In all cases, it is of course absolutely necessary to evaluate, on a large enough test corpus, the impact of these decoys since they could easily end up reducing recognition accuracy, sometimes significantly. In particular, one should be careful not to add decoys that could be confused with legitimate sentences or keywords.
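On the application side, the way decoys are used can be sketched as follows. This is only an illustration in Python: the decoy list is taken from the common OOG responses mentioned earlier, while the `decide` function and the 0.5 threshold are hypothetical.

```python
KEYWORDS = {"validate", "repeat", "cancel"}

# Decoys: common OOG responses added to the grammar so that they can be
# recognized as themselves instead of being forced onto a keyword.
DECOYS = {"yes", "no", "okay", "options", "oh"}


def decide(recognized, confidence, threshold):
    """Accept only a real keyword recognized with enough confidence."""
    if recognized in DECOYS:
        return "reject"   # OOG utterance absorbed by a decoy
    if recognized in KEYWORDS and confidence >= threshold:
        return "accept"
    return "reject"


# Without decoys, "okay" might come back as "cancel" with a high score.
print(decide("cancel", 0.92, threshold=0.5))  # accepted: a false accept
# With decoys, the same utterance can be recognized as "okay" and rejected.
print(decide("okay", 0.95, threshold=0.5))
```

Note that a decoy match is rejected whatever its confidence score, which is why a decoy that sounds too much like a real keyword can hurt accuracy.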

Let’s illustrate the impact of decoys using the set of field utterances described above. The graph below compares the performance of a grammar without decoys (red curve) to that of the same grammar to which appropriate decoys were added (blue curve).

As can be seen, even for a fairly clean test corpus with a low OOG rate, the addition of decoys can significantly improve performance. For instance:

  • For a False Accept rate of 0.5%, the Correct Accept rate increases from 95% to over 97.5%, which is equivalent to reducing the error rate by more than 50%.
  • For a Correct Accept rate of 97.5%, the addition of decoys decreases the False Accept rate from 1.5% to around 0.3%. That’s one fifth the False Accept rate for the same Correct Accept rate.
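Spelling out the arithmetic behind these two bullets, and reading “error rate” as 100% minus the Correct Accept rate at a fixed False Accept rate (my reading of the sentence, not a definition given here):

```latex
\[
\frac{(100\% - 95\%) - (100\% - 97.5\%)}{100\% - 95\%}
  = \frac{5\% - 2.5\%}{5\%} = 50\%,
\qquad
\frac{0.3\%}{1.5\%} = \frac{1}{5}.
\]
```

Since the Correct Accept rate with decoys is over 97.5%, the relative reduction in the first case is slightly more than 50%.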

Another interesting observation is the impact of decoys on confidence thresholds. Let’s say we want to have a False Accept rate of 0.5%. Then, we would need to use a confidence threshold of 0.73 for the grammar without decoys, but only 0.25 for the grammar with decoys. That’s quite a difference! This clearly shows that using “default” threshold values may sometimes produce results that are quite inadequate.
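Picking a threshold for a target False Accept rate can be sketched like this in Python. Everything here is illustrative: the ten results are made up, the function names are hypothetical, and I am assuming that the Correct Accept rate is measured over valid utterances and the False Accept rate over all utterances.

```python
def is_accepted(result, threshold):
    """Accepted unless a decoy matched or the confidence is too low."""
    return not result["decoy"] and result["confidence"] >= threshold


def rates_at(results, threshold):
    """Return (correct_accept_rate, false_accept_rate) at a threshold."""
    valid = [r for r in results if r["valid"]]
    correct_accepts = sum(
        1 for r in valid if r["correct"] and is_accepted(r, threshold)
    )
    false_accepts = sum(
        1 for r in results if not r["correct"] and is_accepted(r, threshold)
    )
    return correct_accepts / len(valid), false_accepts / len(results)


def lowest_threshold(results, max_false_accept_rate):
    """Lowest threshold (in 0.01 steps) that meets the FA target."""
    for step in range(101):
        threshold = step / 100
        _, false_accept_rate = rates_at(results, threshold)
        if false_accept_rate <= max_false_accept_rate:
            return threshold
    return None


def result(valid, correct, confidence, decoy=False):
    return {"valid": valid, "correct": correct,
            "confidence": confidence, "decoy": decoy}


# Made-up results for ten utterances (7 valid, 3 OOG).
valid_results = [
    result(True, True, c) for c in (0.95, 0.90, 0.85, 0.60, 0.40, 0.30)
]
valid_results.append(result(True, False, 0.35))  # misrecognized keyword

no_decoys = valid_results + [
    result(False, False, 0.92),
    result(False, False, 0.70),
    result(False, False, 0.20),
]
with_decoys = valid_results + [
    result(False, False, 0.92, decoy=True),  # now matched by a decoy
    result(False, False, 0.70, decoy=True),
    result(False, False, 0.20),
]

for name, results in (("no decoys", no_decoys), ("with decoys", with_decoys)):
    threshold = lowest_threshold(results, max_false_accept_rate=0.10)
    correct_accept_rate, _ = rates_at(results, threshold)
    print(f"{name}: threshold {threshold:.2f}, CA {correct_accept_rate:.1%}")
```

On this toy data the threshold needed for the same False Accept target drops from 0.71 to 0.21 once the two high-confidence OOG utterances are absorbed by decoys, and more of the valid utterances are accepted as a result.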

All of this once again demonstrates how important it is to pay close attention to out-of-grammar utterances in a tuning process and how decoys can provide an effective tool for containing the negative impact of such utterances on application performance.

The out-of-grammar challenge

The life of speech application developers would be so much simpler if callers were kind enough to only say things that are covered by the grammars. Unfortunately, because life was never meant to be simple, we will always have to deal with people who:

  • use all kinds of creative sentence constructions
  • stutter, correct themselves, or repeat portions of their utterance
  • find it impossible to just answer the question
  • have side conversations
  • don’t listen to the prompts
  • fumble while they look for the requested information
  • express their displeasure in a colorful way
  • say something that makes no sense
  • etc., etc., etc.

Then, of course, there are all the utterances truncated by the endpointer, all the false barge-ins caused by noises, etc.

All of this explains why so many applications that work so well in demos actually perform so poorly in the field. There’s no avoiding that we have to build applications that real people can use and, unfortunately, real people quite often don’t behave the way we would like them to. And that’s OK. It’s our job to make sure that as many callers as possible get the best possible user experience.

The out-of-grammar impact on tuning

Many of the biggest tuning challenges relate to “out-of-grammar” utterances (see previous post for a discussion on the different meanings of “out-of-grammar”), which mostly fall into two categories:

  1. Valid utterances —These are perfectly understandable utterances that provide the information that is expected by the application but which, for one reason or another, are not covered (i.e., can’t be parsed) by the grammar.
  2. Invalid utterances —These are utterances that are unusable by the application because they have no useful meaning.

Here is a list of ways in which out-of-grammar utterances can impact tuning:

  • Inflated False Accept rate — Valid utterances that are incorrectly labeled “out-of-grammar” can significantly inflate the False Accept rate and force the use of a high threshold that is much higher than necessary. See below for details.
  • Computing the reference semantic interpretation — In order to evaluate key performance metrics, we need to have the correct semantic interpretation for each valid utterance in our test set (the “reference semantic interpretation”) so that we can compare with the semantic interpretation obtained from the recognition result. For those utterances whose transcription can be parsed by the grammar, that’s trivial. Unfortunately, there are usually quite a few valid utterances whose transcription produces no parse.
  • Grammar coverage optimization — Careful analysis of field utterances almost always reveals grammar coverage problems that should be addressed. Without tools to suggest improvements to the grammar, though, this can be a lot of work. Moreover, optimum coverage – which is different from maximum coverage – can only be established through iterative experimentation.
  • Avoiding false accepts — Quite often, an invalid utterance will produce a recognition result with a high confidence score, leading to a false accept and, potentially, a dialogue failure. In some cases, this can be a very significant problem.
  • Prior probability considerations — Let’s say we use a speech menu in which a certain choice is used very rarely. If we assume that all choices are equally likely to falsely match out-of-grammar utterances, then the out-of-grammar impact on the rare choice will be proportionally much greater than on the other choices. This should be taken into consideration.
  • When to propose a second choice — Let’s say a user just said no to the confirmation: “I think you said ‘Austin’. Is that correct?” Should we propose the second choice in the N-best list (Boston)? That depends on the probability that this second choice is correct, which to a large extent depends on the proportion of out-of-grammar responses.
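The prior probability point in the list above is easy to check with a few lines of Python. The traffic figures and the function name are invented, and the sketch deliberately assumes that every valid utterance is recognized correctly and that out-of-grammar utterances are spread evenly over the menu choices.

```python
def oog_share_per_choice(valid_counts, oog_count):
    """Fraction of each choice's recognitions that are really OOG.

    Simplifying assumptions: valid utterances are always recognized
    correctly, and OOG utterances fall evenly on all choices.
    """
    oog_per_choice = oog_count / len(valid_counts)
    return {
        choice: oog_per_choice / (count + oog_per_choice)
        for choice, count in valid_counts.items()
    }


# Hypothetical menu: one common choice, one rare choice, 60 OOG utterances.
shares = oog_share_per_choice({"common": 900, "rare": 40}, oog_count=60)
for choice, share in shares.items():
    print(f"{choice}: {share:.1%} of recognitions come from OOG utterances")
```

Under these assumptions, about 3% of the “common” results are really out-of-grammar, against roughly 43% of the “rare” ones, which is why a rare choice may deserve a stricter threshold or a confirmation.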

In upcoming posts, I’ll discuss each of these issues in more detail. For the time being, I’ll focus on the first one.

The inflated false accept problem

Let me illustrate this problem using a simple speech menu where people can select between three choices: “correct address”, “wrong address”, and “repeat the address”. The grammar naturally supports many variations of these key phrases, with a number of appropriate prefixes and suffixes.

The problem is that, in practice, responses contain a fair proportion of disfluencies (stuttering, corrections, repeats, etc.). As a result, there are quite a few transcriptions for which the grammar produces no parse. In the graph below (showing Correct Accept vs. False Accept, see previous post for definitions), the blue curve shows what happens if these are left “out-of-grammar” while the red curve shows what happens when all valid utterances are classified “in-grammar” and labeled with the correct reference semantic interpretation.

As we can see, the difference is quite significant. Let’s suppose we want to set the high threshold so that we have a maximum false accept rate of 0.5%. In the first case (blue curve), we would need to use a high threshold of 0.98, resulting in a Correct Accept rate of around 50%, while in the second case (red curve), we could get a Correct Accept rate of 96%, using a confidence threshold of 0.05.
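The mechanism behind the gap between the two curves can be reproduced on a handful of made-up utterances. In this Python sketch, the data, the field names, and the `rescue_valid` switch are all hypothetical; the False Accept rate is taken over all utterances, which is my assumption about the definition.

```python
def reference(utt, rescue_valid):
    """Reference interpretation used for scoring (None means OOG)."""
    if utt["parses"] or rescue_valid:
        return utt["meaning"]
    return None


def false_accept_rate(utterances, threshold, rescue_valid):
    """Share of utterances accepted with a wrong interpretation."""
    false_accepts = sum(
        1 for utt in utterances
        if utt["confidence"] >= threshold
        and utt["hypothesis"] != reference(utt, rescue_valid)
    )
    return false_accepts / len(utterances)


# "meaning" is what a human annotator understood (None if unusable).
utterances = [
    {"transcript": "correct address", "parses": True,
     "meaning": "correct", "hypothesis": "correct", "confidence": 0.98},
    {"transcript": "uh that's that's correct", "parses": False,
     "meaning": "correct", "hypothesis": "correct", "confidence": 0.95},
    {"transcript": "wrong no wait wrong address", "parses": False,
     "meaning": "wrong", "hypothesis": "wrong", "confidence": 0.91},
    {"transcript": "repeat the address", "parses": True,
     "meaning": "repeat", "hypothesis": "repeat", "confidence": 0.97},
    {"transcript": "[side speech]", "parses": False,
     "meaning": None, "hypothesis": "repeat", "confidence": 0.40},
]

print(false_accept_rate(utterances, 0.5, rescue_valid=False))
print(false_accept_rate(utterances, 0.5, rescue_valid=True))
```

When the two disfluent but perfectly clear utterances are left out-of-grammar, their correct, high-confidence results are scored as false accepts, and the only way to bring the rate down is to raise the threshold above their scores. Once they carry the right reference interpretation, the same results count as correct accepts.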

In other words, properly managing these OOG utterances can mean the difference between a lot of needless confirmations and almost no confirmation, which makes a huge difference in user experience.