The out-of-grammar challenge

The life of speech application developers would be so much simpler if callers were kind enough to say only things that are covered by the grammars. Unfortunately, because life was never meant to be simple, we will always have to deal with people who:

  • use all kinds of creative sentence constructions
  • stutter, correct themselves, or repeat portions of their utterance
  • find it impossible to just answer the question
  • have side conversations
  • don’t listen to the prompts
  • fumble while they look for the requested information
  • express their displeasure in a colorful way
  • say something that makes no sense
  • etc., etc., etc.

Then, of course, there are all the utterances truncated by the endpointer, all the false barge-ins caused by noise, and so on.

All of this explains why so many applications that work so well in demos actually perform so poorly in the field. There’s no avoiding that we have to build applications that real people can use and, unfortunately, real people quite often don’t behave the way we would like them to. And that’s OK. It’s our job to make sure that as many callers as possible get the best possible user experience.

The out-of-grammar impact on tuning

Many of the biggest tuning challenges relate to “out-of-grammar” utterances (see the previous post for a discussion of the different meanings of “out-of-grammar”), which mostly fall into two categories:

  1. Valid utterances — These are perfectly understandable utterances that provide the information expected by the application but which, for one reason or another, are not covered (i.e., can’t be parsed) by the grammar.
  2. Invalid utterances — These are utterances that are unusable by the application because they have no useful meaning.

Here is a list of ways in which out-of-grammar utterances can impact tuning:

  • Inflated False Accept rate — Valid utterances that are incorrectly labeled “out-of-grammar” can significantly inflate the False Accept rate and force the use of a high threshold that is much higher than necessary. See below for details.
  • Computing the reference semantic interpretation — In order to evaluate key performance metrics, we need to have the correct semantic interpretation for each valid utterance in our test set (the “reference semantic interpretation”) so that we can compare with the semantic interpretation obtained from the recognition result. For those utterances whose transcription can be parsed by the grammar, that’s trivial. Unfortunately, there are usually quite a few valid utterances whose transcription produces no parse.
  • Grammar coverage optimization — Careful analysis of field utterances almost always reveals grammar coverage problems that should be addressed. Without tools to suggest improvements to the grammar, though, this can be a lot of work. Moreover, optimum coverage – which is different from maximum coverage – can only be established through iterative experimentation.
  • Avoiding false accepts — Quite often, an invalid utterance will produce a recognition result with a high confidence score, leading to a false accept and, potentially, a dialogue failure. In some cases, this can be a very significant problem.
  • Prior probability considerations — Let’s say we use a speech menu in which a certain choice is used very rarely. If we assume that all choices are equally likely to falsely match out-of-grammar utterances, then the out-of-grammar impact on the rare choice will be proportionally much greater than on the other choices. This should be taken into consideration.
  • When to propose a second choice — Let’s say a user just said no to the confirmation: “I think you said ‘Austin’. Is that correct?” Should we propose the second choice in the N-best list (Boston)? That depends on the probability that this second choice is correct, which to a large extent depends on the proportion of out-of-grammar responses.
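
To put rough numbers on the prior probability point, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the choice names, the usage frequencies, the out-of-grammar rate, and the simplification that in-grammar utterances are always accepted correctly. It only shows how an even spread of false matches weighs much more heavily on a rarely used choice.

```python
def false_match_share(choice_priors, oog_rate, oog_false_accept_rate):
    """For each choice, return the fraction of accepted results that
    actually came from out-of-grammar (OOG) utterances.

    Illustrative assumptions, not measurements:
      - in-grammar utterances are always recognized and accepted correctly;
      - falsely accepted OOG utterances are spread evenly over the choices.
    """
    n_choices = len(choice_priors)
    in_grammar_rate = 1.0 - oog_rate
    # Share of all utterances that are OOG and falsely matched to one choice.
    oog_per_choice = oog_rate * oog_false_accept_rate / n_choices

    shares = {}
    for choice, prior in choice_priors.items():
        genuine = in_grammar_rate * prior  # real requests for this choice
        shares[choice] = oog_per_choice / (genuine + oog_per_choice)
    return shares


# Hypothetical menu: one choice is used only 2% of the time.
priors = {"choice_a": 0.60, "choice_b": 0.38, "rare_choice": 0.02}
shares = false_match_share(priors, oog_rate=0.10, oog_false_accept_rate=0.30)

for choice, share in shares.items():
    print(f"{choice}: {share:.1%} of accepted results are false matches")
```

With these made-up figures, fewer than 2% of the results accepted for the most popular choice are false matches, against roughly 36% for the rare choice.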

In upcoming posts, I’ll discuss each of these issues in more detail. For the time being, I’ll focus on the first one.

The inflated false accept problem

Let me illustrate this problem using a simple speech menu where people can select between three choices: “correct address”, “wrong address”, and “repeat the address”. The grammar naturally supports many variations of these key phrases, with a number of appropriate prefixes and suffixes.

The problem is that, in practice, responses contain a fair proportion of disfluencies (stuttering, corrections, repeats, etc.). As a result, there are quite a few transcriptions for which the grammar produces no parse. In the graph below, which plots Correct Accept against False Accept (see previous post for definitions), the blue curve shows what happens if these utterances are left “out-of-grammar”, while the red curve shows what happens when all valid utterances are classified “in-grammar” and labeled with the correct reference semantic interpretation.

As we can see, the difference is quite significant. Let’s suppose we want to set the high threshold so that we have a maximum false accept rate of 0.5%. In the first case (blue curve), we would need to use a high threshold of 0.98, resulting in a Correct Accept rate of around 50%, while in the second case (red curve), we could get a Correct Accept rate of 96%, using a confidence threshold of 0.05.
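
The mechanics behind this comparison can be sketched in a few lines of Python. This is a toy illustration, not the tooling used to produce the graph: the `Utterance` record, the function names, and the six data points are all hypothetical, and I am assuming that both rates are computed as a fraction of all utterances in the test set (the exact definitions are in the previous post). The toy set is far too small for a 0.5% target, so the example uses a target of zero false accepts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    confidence: float         # confidence score of the top recognition result
    hypothesis: str           # semantic interpretation from the recognizer
    reference: Optional[str]  # reference interpretation; None = out-of-grammar


def rates(utterances, threshold):
    """Return (correct_accept_rate, false_accept_rate) at this threshold.

    Assumption: both rates are fractions of the whole test set.
    """
    accepted = [u for u in utterances if u.confidence >= threshold]
    correct = sum(1 for u in accepted if u.hypothesis == u.reference)
    total = len(utterances)
    return correct / total, (len(accepted) - correct) / total


def lowest_threshold(utterances, max_false_accept):
    """Lowest observed confidence that keeps false accepts within the limit."""
    for t in sorted({u.confidence for u in utterances}):
        ca, fa = rates(utterances, t)
        if fa <= max_false_accept:
            return t, ca, fa
    return None  # no observed score meets the limit


# Valid but disfluent responses whose transcription does not parse are
# left as out-of-grammar here (reference=None) ...
left_oog = [
    Utterance(0.95, "correct", "correct"),
    Utterance(0.90, "wrong", None),    # e.g. "uh, wrong, wrong address"
    Utterance(0.80, "repeat", "repeat"),
    Utterance(0.70, "correct", None),  # e.g. "that's cor- that's correct"
    Utterance(0.40, "wrong", "wrong"),
    Utterance(0.20, "repeat", None),   # side conversation: truly invalid
]

# ... and given their correct reference interpretation here.
relabeled = [
    Utterance(0.95, "correct", "correct"),
    Utterance(0.90, "wrong", "wrong"),
    Utterance(0.80, "repeat", "repeat"),
    Utterance(0.70, "correct", "correct"),
    Utterance(0.40, "wrong", "wrong"),
    Utterance(0.20, "repeat", None),
]

before = lowest_threshold(left_oog, max_false_accept=0.0)
after = lowest_threshold(relabeled, max_false_accept=0.0)
print("left out-of-grammar:", before)
print("relabeled:          ", after)
```

The recognition results are identical in both lists; only the reference labels differ. That alone moves the required threshold in this toy set from 0.95 down to 0.4 and lets five of the six utterances be accepted correctly instead of one.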

In other words, properly managing these out-of-grammar (OOG) utterances can mean the difference between a lot of needless confirmations and almost no confirmations, which makes a huge difference in user experience.
