Tag Archives: tuning

Grammar conversion : lessons learned

Lately, I have been involved in a number of grammar conversion projects. This has been a great opportunity to put our process and tools to the test once again. And since every project has its peculiarities, we learn constantly.

The process we outlined about a year ago omitted a number of small details. That was OK for small-scale conversion projects. But when you have to deal with much larger projects (with thousands of grammars to convert), these details add up significantly. Let me share some of the issues we face daily.

It’s not just semantic tags

When you have tools to automatically convert semantic tags from one format to another, grammar conversion can seem to be a no-brainer. But reality is not that simple. Grammars are not written for an abstract specification; they are written for a very specific recognition engine. They often contain:

  • Words (tokens) that map to very specific pronunciations or that try to model some disfluencies (like hesitations, for instance), but for which the SRGS $GARBAGE rule is more appropriate.
  • Multiword duplicates, with one sequence of space-separated words, and a similar sequence of underscore-separated words to allow cross-word phonetization (like “thirty one” and “thirty_one”).
  • Words that map to very specific, tuned pronunciations. Such words often have an unusual orthography to make sure they are not confused with real words.

All this means that a number of transformations must be applied, either to the original grammars or to the converted ones. This can be done with regular expression search and replace, or by manually inspecting the grammars.
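As a rough illustration of the kind of clean-up involved, here is a minimal Python sketch. The helper names, the list-of-alternatives representation, and the hesitation tokens are assumptions made for the example; they are not taken from our actual tools.

```python
import re

def drop_underscore_duplicates(alternatives):
    """Drop underscore-joined variants when the space-separated form is also listed."""
    seen = set(alternatives)
    kept = []
    for alt in alternatives:
        spaced = alt.replace("_", " ")
        if "_" in alt and spaced in seen:
            continue  # e.g. drop "thirty_one" because "thirty one" is already there
        kept.append(alt)
    return kept

# Hypothetical engine-specific hesitation tokens that we would rather
# hand over to the SRGS $GARBAGE rule.
HESITATION = re.compile(r"\b(?:uh|um|hmm)\b")

def replace_hesitations(rule_body):
    """Replace each hesitation token in a rule body with $GARBAGE."""
    return HESITATION.sub("$GARBAGE", rule_body)

print(drop_underscore_duplicates(["thirty one", "thirty_one", "new_york"]))
print(replace_hesitations("uh january | um february"))
```

Underscore variants with no space-separated twin (like "new_york" above) are kept, since they are the ones that still need a manual look.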

Generation of coverage sets

When dealing with hundreds (if not thousands) of grammars, it is not feasible to create initial coverage test sets manually. This is way too time consuming. That means you have to find a way to generate those initial coverage test sets automatically in batch. But how do you do that?

Fortunately, NuGram IDE already provides sophisticated tools to analyze grammars and generate sentences from them. On this foundation, we built a tool that automatically generates coverage test sets for a set of ABNF grammars. The tool also reports problems found in the grammars, like the use of digits in voice grammars, or words in DTMF grammars.

The coverage set generation tool uses a combination of configuration and sophisticated analyses to determine how to generate sentences and how many sentences to generate. For example, it’s not possible to generate all sentences from a grammar that covers an infinite number of sentences. When that’s the case (or when the number of sentences covered by the grammar is above a certain threshold), the tool reverts to other generation strategies.
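To make the idea concrete, here is a minimal Python sketch of a threshold-based strategy: enumerate everything when the grammar is small, and fall back to random sampling otherwise. The tuple-based grammar representation, the function names, and the default values are all assumptions made for the example; the actual tool works on ABNF grammars and uses more elaborate analyses.

```python
import itertools
import math
import random

# Hypothetical grammar representation: a str is a token; other nodes are
# ("seq", [parts]), ("alt", [choices]), ("opt", part) or ("star", part).

def count(node):
    """Upper bound on the number of sentences covered (math.inf for repetitions)."""
    if isinstance(node, str):
        return 1
    kind, arg = node
    if kind == "seq":
        return math.prod(count(p) for p in arg)
    if kind == "alt":
        return sum(count(c) for c in arg)
    if kind == "opt":
        return 1 + count(arg)
    if kind == "star":
        return math.inf
    raise ValueError(kind)

def expand(node):
    """Yield every sentence (as a list of tokens) of a finite grammar."""
    if isinstance(node, str):
        yield [node]
        return
    kind, arg = node
    if kind == "seq":
        for combo in itertools.product(*(list(expand(p)) for p in arg)):
            yield [tok for part in combo for tok in part]
    elif kind == "alt":
        for choice in arg:
            yield from expand(choice)
    elif kind == "opt":
        yield []
        yield from expand(arg)
    else:
        raise ValueError(f"cannot enumerate {kind}")

def sample(node, rng):
    """Draw one random sentence; repetitions are unrolled 0 to 2 times."""
    if isinstance(node, str):
        return [node]
    kind, arg = node
    if kind == "seq":
        return [tok for p in arg for tok in sample(p, rng)]
    if kind == "alt":
        return sample(rng.choice(arg), rng)
    if kind == "opt":
        return sample(arg, rng) if rng.random() < 0.5 else []
    if kind == "star":
        return [tok for _ in range(rng.randint(0, 2)) for tok in sample(arg, rng)]
    raise ValueError(kind)

def coverage_set(grammar, threshold=1000, n_samples=50, seed=0):
    """Exhaustive set below the threshold, random sample above it."""
    if count(grammar) <= threshold:
        return sorted({" ".join(s) for s in expand(grammar)})
    rng = random.Random(seed)
    return sorted({" ".join(sample(grammar, rng)) for _ in range(n_samples)})

menu = ("seq", [("opt", "please"), ("alt", ["validate", "repeat", "cancel"])])
print(coverage_set(menu))
```

Random sampling is only one possible fallback; it is used here because it is the simplest one to show.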

Recognition tests as part of the QA process

Finally, even a syntactically valid grammar may fail to load in the ASR for a variety of reasons, the most common one being a limitation or constraint from the ASR itself. For this reason, we came to the conclusion that doing recognition tests (ideally benchmarking of the converted grammars) is a very useful addition to the QA process. Of course, simply compiling grammars may catch a number of problems. But doing a “before and after” comparison can detect conversion problems that the coverage tests missed because they were not exhaustive.

Another benefit of doing recognition tests is the ability to check the performance of the converted grammars to identify those needing additional work. Some converted grammars may have words that prove difficult to recognize with the new engine because they are not properly phonetized, thus calling for application-specific (or even grammar-specific) phonetic dictionaries.

What about DTMF?

In the specific case of converting GSL grammars to GrXML or ABNF, a complication arises with the presence, in the same grammar, of both DTMF sequences and words. I will discuss this issue in a separate post.

What’s the point of having 98% in-grammar accuracy if 40% of user utterances are out-of-grammar?

How many times have you heard people say that they “achieve 95% speech recognition accuracy” (or more)? That sounds really impressive, doesn’t it?

It shouldn’t. What they don’t tell you is that they actually measure “in-grammar accuracy”, which means that accuracy is measured only on utterances that are perfectly covered by the grammar. For instance, for a date grammar, an utterance such as “well, uh, january fourth” would be considered out-of-grammar (and therefore excluded from the accuracy calculation) if “well, uh” is not covered by the grammar.

Unfortunately, in the real world there’s no way to force users to stick to in-grammar utterances. In fact, users usually have no way of even knowing what the grammar covers other than through hints provided by the prompts. Even well-behaved users can hesitate, correct themselves, or use an unexpected formulation (which sounds perfectly natural to them), all of which are likely to be out-of-grammar. They can even say things that they believe will help the machine understand them (for instance using “victor” instead of “v” when spelling).

As a result, it’s not unusual to have between 30% and 50% of user utterances that are considered out-of-grammar, many of which are perfectly legitimate responses to the application prompt. So what’s the point of reporting in-grammar accuracy if this ignores a large chunk of legitimate user utterances? You tell me.

Just to illustrate, you want to know one of the most effective ways of improving in-grammar accuracy? Just reduce grammar coverage. Sure, your out-of-grammar rate will increase but, hey, you’ll improve in-grammar accuracy! Isn’t that great? This tells you how useless in-grammar accuracy is at telling you whether you improved the grammar.

This is why we always report accuracy by considering every legitimate user utterance (i.e., the ones that contain a valid response to the prompt, regardless of wording or extraneous speech). This way, we make sure that we don’t conveniently ignore the utterances that happen to be the more challenging ones, and we get results that accurately represent the real recognition performance (not some imaginary performance calculated on an idealized set of clean utterances).

But the best reason for doing it our way is that it enables us to truly measure improvements when we tune grammars. The reason is simple. Changing the coverage of a grammar always involves a trade-off. We can improve accuracy by covering more user utterances, but this can reduce overall accuracy if the new grammar paths introduce new speech recognition errors. The only way we can measure improvement is if we measure accuracy on a fixed set of valid utterances that doesn’t depend on the actual grammar coverage.
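The following Python sketch shows how the two metrics can move in opposite directions. The record format, the function names, and all the counts are made up for the illustration; they are not measurements from a real application.

```python
def accuracy_report(utterances):
    """utterances: dicts with boolean keys 'in_grammar' and 'correct'.

    Every utterance in the list is assumed to be a legitimate response
    to the prompt, whether or not the grammar covers it.
    """
    total = len(utterances)
    in_grammar = [u for u in utterances if u["in_grammar"]]
    return {
        "in_grammar_accuracy": sum(u["correct"] for u in in_grammar) / len(in_grammar),
        "overall_accuracy": sum(u["correct"] for u in utterances) / total,
        "oog_rate": 1 - len(in_grammar) / total,
    }

def make_set(ig_total, ig_correct, oog_total, oog_correct):
    """Build a synthetic test set from counts (illustrative data only)."""
    rows = [{"in_grammar": True, "correct": i < ig_correct} for i in range(ig_total)]
    rows += [{"in_grammar": False, "correct": i < oog_correct} for i in range(oog_total)]
    return rows

# The same 100 legitimate utterances scored against two hypothetical grammars.
wide = accuracy_report(make_set(80, 72, 20, 0))    # broader coverage
narrow = accuracy_report(make_set(60, 57, 40, 5))  # reduced coverage

print(wide)
print(narrow)
```

With these invented counts, the narrower grammar has the higher in-grammar accuracy (57/60 against 72/80) and the lower accuracy over the fixed set of 100 utterances (62 correct against 72), which is exactly why only the second number is useful for comparing grammars.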

How a great speech application may appear to perform poorly

One of our products, a Canadian address capture VoiceXML module, has been deployed with great success by several of our customers. One of these deployments was done in the context of a change of address application, where the module has to capture the new address, the date when the new address becomes effective, and the new telephone number. Note that all information is entirely obtained through speech recognition.

In this deployment, the contract specified that the application had to achieve a minimum success rate. In order to track performance, two success metrics were jointly defined with the customer:

  • The Raw Success Rate. This is calculated simply by dividing the number of calls for which the change of address was successfully completed (with all collected information confirmed by the caller) by the total number of calls for which the change of address module was used.
  • The Real Success Rate. This is calculated similarly, with the exception that certain calls were excluded from consideration, namely calls where the caller provided no input whatsoever and calls where the caller hung up within the first two interactions.
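As a minimal sketch, the two metrics could be computed as follows in Python. The `Call` record and its field names are hypothetical, and reading “within the first two interactions” as a hang-up at interaction 1 or 2 is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Call:
    completed: bool                   # change of address confirmed by the caller
    no_input: bool = False            # caller never provided any input
    hangup_at: Optional[int] = None   # 1-based interaction where the caller hung up

def raw_success_rate(calls):
    """Completed calls over all calls that entered the module."""
    return sum(c.completed for c in calls) / len(calls)

def real_success_rate(calls):
    """Same ratio after excluding no-input calls and early hang-ups."""
    kept = [
        c for c in calls
        if not c.no_input and not (c.hangup_at is not None and c.hangup_at <= 2)
    ]
    return sum(c.completed for c in kept) / len(kept)

# Invented sample: 6 successes, 1 late failure, 2 early hang-ups, 1 silent caller.
calls = (
    [Call(completed=True)] * 6
    + [Call(completed=False, hangup_at=5)]
    + [Call(completed=False, hangup_at=1), Call(completed=False, hangup_at=2)]
    + [Call(completed=False, no_input=True)]
)
print(raw_success_rate(calls), real_success_rate(calls))
```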

The customer specified that the application had to achieve a Real Success Rate of 75% or more. The rationale for the Real Success Rate is to exclude callers that either don’t want to use the application (for instance because they ended up in the application by mistake) or don’t have the requested information. As a matter of fact, after the initial deployment revealed a fairly high hang-up rate early in the change of address call flow, the customer contacted a number of those callers in order to find out why they had decided to hang up and it turns out that most of them admitted that they had no intention of changing their address; they had simply selected this option in the hope of getting connected to an agent faster.

It’s nonetheless interesting to track both metrics since a large difference between them can indicate problems that occurred earlier in the call (that is, before going into the change of address application).

For instance, at the end of 2008, the customer made some changes in the front menus, which significantly increased the number of callers that incorrectly found themselves in the change of address application. As shown in the graph below, this created a big drop in the Raw Success Rate while the Real Success Rate remained relatively constant. The customer implemented various changes to the front menu throughout 2009 (while the change of address application remained unchanged), with the result that the Raw Success Rate was finally stabilized at around 75% (and the Real Success Rate at 85%).

This shows that, when trying to evaluate the performance of an application, it’s important to focus on the correct metrics. Otherwise, we may end up not only with an incorrect assessment of its real performance, but also with wild variations that have nothing to do with the application itself.

Comparing different speech recognition engines

We’re sometimes asked to compare the performance of different speech recognition engines on an identical task (same grammar, same set of test utterances). To do so in an effective way, we rely on three important features of NuGram Server (on-the-fly conversion of grammars to any format, semantic interpretation of textual sentences, and a NuGram-specific meta value that removes all semantic tags from generated grammars), which we use extensively in our tuning environment.

One powerful aspect of our tuning environment is that, no matter what recognition engine we use, there is no difference in the way we perform speech recognition experiments and then score and analyze results. It’s all completely transparent. This makes it easy to run the exact same experiment using different recognition engines and then compare results using metrics, graphs, and other tools that are used consistently across all engines.

A big challenge when comparing different engines is that we usually can’t use the same grammar since different engines often use incompatible grammar tag formats. For instance, let’s say we have a recognition grammar for the Loquendo LASR speech recognition engine and we would like to compare the performance we get with this grammar using three different engines: Loquendo LASR, OSR 3.0, and Nuance 8.5. In that case, we have three different tag formats: Loquendo uses SISR, OSR 3.0 uses swi-semantics and Nuance 8.5 uses the Nuance GSL proprietary tag format. So in principle, we would need to convert the grammars for each recognition engine, which can be a significant effort for complex grammars.

No need for manual grammar conversion

It is, however, possible to compare different recognition engines without having to manually convert the grammars. The approach we use is quite simple: With each engine that is not compatible with the original grammar’s tag format, we perform speech recognition using a grammar from which semantic tags have been removed and we then add semantic information back to the recognition result as a post-processing step.

This is all done using NuGram Server, as follows.

We start with the original grammar in ABNF format (here credit-card.abnf), which we use for the recognition test using Loquendo ASR. Then, we add a special-purpose NuGram meta directive to the grammar, which tells NuGram Server to omit the semantic tags when generating the grammar:

#ABNF 1.0 ISO-8859-1;
language en-US;
mode voice;
tag-format <semantics/1.0>;

meta "com.nuecho.generation.omit-tags" is "true";

root $main;

When we perform the recognition test with OSR, we tell NuGram Server that we want to use credit-card.grxml (note the extension). NuGram Server then automatically converts credit-card.abnf to the SRGS XML format, while omitting the semantic tags from the grammar. Recognition then proceeds without a hitch, but results are returned without any semantic slots.

Similarly, when we perform the recognition test with Nuance 8.5, we tell NuGram Server that we want to use credit-card.gsl, which tells NuGram Server to automatically convert credit-card.abnf into a GSL grammar (still without semantic tags). Recognition once again proceeds without a hitch and results are returned without semantic slots.

Finally, in order to get recognition results with semantic slots, we simply send the original credit-card.abnf grammar and the recognition results to NuGram Server in order to add semantic slots to the recognition results. In other words, semantic interpretation is done as a post-processing step by NuGram Server based on the SISR tags in the original grammar.

Note that if the original grammar had been a GSL grammar or an OSR grammar, NuGram Server could still have computed the semantic interpretation based on the semantic tags in the original grammar (NuGram understands many different tag formats).

Dealing with engine-proprietary features

Some engine-proprietary features might make results more difficult to compare. For instance, OSR and Nuance 9 provide the special-purpose SWI_disallow key, which can be used to remove hypotheses from the N-best list of recognition hypotheses returned by the engine. This could for instance be used to remove credit card numbers that don’t have a valid checksum, therefore improving recognition accuracy as a result.

This useful feature could make recognition results difficult to compare if some engines have it and others don’t (in which case an equivalent result could be obtained by removing invalid hypotheses in the application). Fortunately, in our recognition tests we have the ability to tell NuGram Server to remove, from the N-best list, those hypotheses that match a specified slot pattern (e.g., SWI_disallow=1). This once again makes it possible to make fair and accurate comparisons between OSR or Nuance 9 and other engines.
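The filtering step itself is simple. Here is a minimal Python sketch of it; the N-best structure (a list of dictionaries with a `slots` entry) and the function name are assumptions made for the example, not the NuGram Server interface.

```python
def filter_nbest(nbest, slot, value):
    """Return the N-best list without the hypotheses where slots[slot] == value."""
    return [h for h in nbest if h.get("slots", {}).get(slot) != value]

# Hypothetical recognition result, best hypothesis first.
nbest = [
    {"text": "four five three nine", "confidence": 0.81,
     "slots": {"number": "4539", "SWI_disallow": "1"}},
    {"text": "four five three five", "confidence": 0.77,
     "slots": {"number": "4535"}},
]

kept = filter_nbest(nbest, "SWI_disallow", "1")
print([h["text"] for h in kept])
```

Applying the same filter to the output of every engine puts them all on the same footing, whether or not the engine supports the key natively.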

Reducing false accepts with decoys

As discussed in a previous post, one of the unfortunate consequences of out-of-grammar utterances is that they can cause many false accepts that may seriously degrade application performance and user experience. In order to illustrate this, let’s use the simple example of a small menu where callers must choose among three options: “validate”, “repeat”, and “cancel”.

We use a test set of 5042 field utterances distributed as follows:

Menu choice    Number of utterances    Proportion of test corpus
cancel         367                     7.28%
repeat         896                     17.77%
validate       3478                    68.98%
OOG            301                     5.97%

As we can see, this is a fairly clean test set with only about 6% of out-of-grammar utterances. As usual, these include background speech, various noises, side conversations, some common OOG utterances (“yes”, “no”, “okay”, “options”, “oh”, etc.), as well as a wide variety of rambling responses of different kinds.

Naturally, since the grammar can only recognize one of the three keywords (and legitimate variants), most of these OOG utterances are misrecognized as one of the keywords. That wouldn’t be a problem if the corresponding confidence scores were low and we could safely reject them, but that’s not always the case. In fact, many of these have a confidence score over 0.9, resulting in damaging false accepts.

An effective way to reduce false accepts is to add decoys to the grammar. For instance, you would normally want to start by adding common OOG responses, on the grounds that it’s easier to reject an OOG utterance if you can recognize it correctly. You could also add more “general” decoys, for instance a phoneme loop, to help reject hard-to-predict OOG utterances. There are more advanced techniques that can be used in order to come up with “optimal” decoys for a given grammar, but I won’t go into them now.

In all cases, it is of course absolutely necessary to evaluate, on a large enough test corpus, the impact of these decoys since they could easily end up reducing recognition accuracy, sometimes significantly. In particular, one should be careful not to add decoys that could be confused with legitimate sentences or keywords.

Let’s illustrate the impact of decoys using the set of field utterances described above. The graph below compares the performance of a grammar without decoys (red curve) to that of the same grammar to which appropriate decoys were added (blue curve).

As can be seen, even for a fairly clean test corpus with a low OOG rate, the addition of decoys can significantly improve performance. For instance:

  • For a False Accept rate of 0.5%, the Correct Accept rate increases from 95% to over 97.5%, which is equivalent to reducing the error rate by more than 50%.
  • For a Correct Accept rate of 97.5%, the addition of decoys decreases the False Accept rate from 1.5% to around 0.3%. That’s one fifth the False Accept rate for the same Correct Accept rate.

Another interesting observation is the impact of decoys on confidence thresholds. Let’s say we want to have a False Accept rate of 0.5%. Then, we would need to use a confidence threshold of 0.73 for the grammar without decoys, but only 0.25 for the grammar with decoys. That’s quite a difference! This clearly shows that using “default” threshold values may sometimes produce results that are quite inadequate.
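For readers who want to find such thresholds on their own test sets, here is a minimal Python sketch. It assumes each test utterance has been reduced to a (confidence, is_correct) pair and that both rates are fractions of the whole test set; these definitions, the function names, and the sample data are assumptions made for the example.

```python
def rates(results, threshold):
    """results: list of (confidence, is_correct) pairs.

    Returns (Correct Accept rate, False Accept rate) at the given threshold,
    both as fractions of the whole test set. OOG utterances are never correct.
    """
    n = len(results)
    ca = sum(1 for conf, ok in results if conf >= threshold and ok)
    fa = sum(1 for conf, ok in results if conf >= threshold and not ok)
    return ca / n, fa / n

def lowest_threshold_for_fa(results, max_fa):
    """Smallest observed confidence whose False Accept rate is at most max_fa.

    The False Accept rate can only go down as the threshold goes up, so the
    first match in increasing order is the lowest one. Returns None when even
    the highest observed confidence is not strict enough.
    """
    for t in sorted({conf for conf, _ in results}):
        if rates(results, t)[1] <= max_fa:
            return t
    return None

# Invented results, for illustration only.
results = [(0.95, True), (0.9, False), (0.8, True), (0.6, True),
           (0.4, False), (0.3, True), (0.2, False), (0.1, False)]

print(rates(results, 0.5))
print(lowest_threshold_for_fa(results, 0.125))
```

Running the same search on the results obtained with and without decoys gives the two thresholds to compare.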

All of this once again demonstrates how important it is to pay close attention to out-of-grammar utterances in a tuning process and how decoys can provide an effective tool for containing the negative impact of such utterances on application performance.