July 15th, 2010 1 Comment

by Dominique Boucher

An ABNF primer

Interestingly, a lot of hits on our NuGram web site come from people searching for “ABNF tutorial” on one of the major search engines. And although we provide great tools for working with ABNF grammars, we don’t provide any introductory text on the ABNF syntax. That’s a shame!

To remedy this situation, I just put on Slideshare a presentation extracted from our training material that covers the basic concepts of ABNF grammars.

Remember that ABNF is the native syntax for many speech recognition (ASR) engines. And if your ASR doesn’t support it, let NuGram IDE handle the conversion to XML for you!

Tip #2: In SISR semantic tags, return key/value pairs whenever possible.

Strings all over the place

It is fairly common for new SRGS grammar writers to write SISR semantic tags that only return string values to calling rules or to the voice application, even when the data has some structure. For example, a dollar amount rule could return a string like this (in ABNF):

public $amount =
  $dollars {out = rules.dollars + ".00";}
  [and $cents {out = out.substring(0, out.length - 3)
                        + "." + rules.cents; }]
;

...

One obvious disadvantage of this approach is that the application has to extract the dollars and the cents from the returned string. Of course, a simple string-to-number conversion can be done. But because of possible rounding errors, it is best to extract both values separately and convert the two substrings to integers. This may not be that bad; machines are so fast these days.
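
To make the rounding concern concrete, here is a minimal sketch in Python. The language and the two function names are only assumptions for illustration; nothing here is tied to a particular voice platform. It also assumes the returned string always has exactly two digits after the dot.

```python
def cents_via_float(amount: str) -> int:
    """Total cents via a floating-point conversion; can be off by one cent."""
    return int(float(amount) * 100)


def cents_via_split(amount: str) -> int:
    """Total cents by parsing the dollars and cents substrings as integers."""
    dollars, cents = amount.split(".")
    return int(dollars) * 100 + int(cents)


# 4.35 has no exact binary representation, so the product falls just below 435.
print(cents_via_float("4.35"))  # 434
print(cents_via_split("4.35"))  # 435
```

The second function is exact, but it is also the kind of string surgery the application should not have to do in the first place.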

A less obvious reason why this is not recommended relates to the fact that the computations made by the semantic tags can only begin once the engine has finished recognizing the utterance. In other words, the corresponding computation time directly adds to the application’s response time. The ECMAScript interpreter typically compiles the script (the semantic tag) to an intermediate representation before executing it. Unless the ASR properly caches the result of this compilation process, the script is compiled again and again. The more complicated the script is, the more processing power it takes to parse it, compile it, and execute it.

On top of that, string concatenation and substring extraction create a lot of unnecessary temporary objects, thus putting a bigger burden on the garbage collector (or any other memory management algorithm employed by the ECMAScript interpreter).

Finally, since semantic tags are compiled and executed for every hypothesis in the N-best list, the computation time and the number of objects created grow in proportion to the number of hypotheses requested by the application. If we add all this up, we end up with a grammar that requires unnecessary processing power from the ASR engine, which can cause significant delays in the recognition process. This may even result in noticeable latency at the application level (i.e., some dead air).

Use semantic keys instead

A better way to write the above grammar would be:

public $amount =
  $number {out.dollars = rules.number;
           out.cents   = 0; }
  [and $cents {out.cents = rules.cents; }]
;

...

Using explicit semantic keys has many advantages:

  • Documentation. This self-documents the type/purpose of the returned values.
  • Maintenance/evolution. The scripts are much simpler, and thus easier to follow for someone trying to understand the grammar. It is also easier to add other keys later if need be.
  • Analytics. The presence of distinct semantic keys facilitates the analysis of field data. For example, we can be interested in performing a recognition performance test for only a subset of our collected utterances, i.e. those utterances whose value for the cents semantic key is 0.
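
As a rough illustration of the analytics point, the sketch below selects the logged utterances whose cents key is 0. The log records, their layout, and the `select` helper are all made up for this example; a real tuning environment would read its own data format.

```python
# Hypothetical log records: each keeps the interpretation as semantic keys.
logged = [
    {"utterance": "twenty dollars", "slots": {"dollars": 20, "cents": 0}},
    {"utterance": "five dollars and ten cents", "slots": {"dollars": 5, "cents": 10}},
    {"utterance": "three dollars", "slots": {"dollars": 3, "cents": 0}},
]


def select(records, key, value):
    """Keep the records whose semantic key has the given value."""
    return [r for r in records if r["slots"].get(key) == value]


whole_dollar = select(logged, "cents", 0)
print([r["utterance"] for r in whole_dollar])
# ['twenty dollars', 'three dollars']
```

With a single string such as "20.00", the same selection would require parsing every returned value first.
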

We’re sometimes asked to compare the performance of different speech recognition engines on an identical task (same grammar, same set of test utterances). To do so in an effective way, we rely on three important features of NuGram Server (on-the-fly conversion of grammars to any format, semantic interpretation of textual sentences, and a NuGram-specific meta value that removes all semantic tags from generated grammars), which we use extensively in our tuning environment.

One powerful aspect of our tuning environment is that, no matter what recognition engine we use, there is no difference in the way we perform speech recognition experiments and then score and analyze results. It’s all completely transparent. This makes it easy to run the exact same experiment using different recognition engines and then compare results using metrics, graphs, and other tools that are used consistently across all engines.

A big challenge when comparing different engines is that we usually can’t use the same grammar since different engines often use incompatible grammar tag formats. For instance, let’s say we have a recognition grammar for the Loquendo LASR speech recognition engine and we would like to compare the performance we get with this grammar using three different engines: Loquendo LASR, OSR 3.0, and Nuance 8.5. In that case, we have three different tag formats: Loquendo uses SISR, OSR 3.0 uses swi-semantics and Nuance 8.5 uses the Nuance GSL proprietary tag format. So in principle, we would need to convert the grammars for each recognition engine, which can be a significant effort for complex grammars.

No need for manual grammar conversion

It is, however, possible to compare different recognition engines without having to manually convert the grammars. The approach we use is quite simple: with each engine that is not compatible with the original grammar’s tag format, we perform speech recognition using a grammar from which the semantic tags have been removed, and we then add semantic information back to the recognition result as a post-processing step.

This is all done using NuGram Server, as follows.

We start with the original grammar in ABNF format (here credit-card.abnf), which we use for the recognition test using Loquendo LASR. Then, we add a special-purpose NuGram meta directive to the grammar, which tells NuGram Server to omit the semantic tags when generating the grammar:

#ABNF 1.0 ISO-8859-1;
language en-US;
mode voice;
tag-format <semantics/1.0>;

meta "com.nuecho.generation.omit-tags" is "true";

root $main;

When we perform the recognition test with OSR, we tell NuGram Server that we want to use credit-card.grxml (note the extension). NuGram Server then automatically converts credit-card.abnf to the SRGS XML format, while omitting the semantic tags from the grammar. Recognition then proceeds without a hitch, but results are returned without any semantic slots.

Similarly, when we perform the recognition test with Nuance 8.5, we tell NuGram Server that we want to use credit-card.gsl, which tells NuGram Server to automatically convert credit-card.abnf into a GSL grammar (still without semantic tags). Recognition once again proceeds without a hitch and results are returned without semantic slots.
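
The naming convention used in these two steps (the requested file extension selects the output format) can be pictured with the small Python sketch below. This is only an illustration of the idea, not NuGram Server’s actual implementation or API; the table and the `requested_format` function are hypothetical, and only the three extensions mentioned in this post are listed.

```python
from pathlib import PurePosixPath

# Hypothetical lookup table covering the extensions mentioned in this post.
FORMAT_BY_EXTENSION = {
    ".abnf": "SRGS ABNF",
    ".grxml": "SRGS XML",
    ".gsl": "Nuance GSL",
}


def requested_format(grammar_name: str) -> str:
    """Return the grammar format implied by the requested file extension."""
    extension = PurePosixPath(grammar_name).suffix.lower()
    if extension not in FORMAT_BY_EXTENSION:
        raise ValueError(f"unsupported grammar extension: {extension!r}")
    return FORMAT_BY_EXTENSION[extension]


print(requested_format("credit-card.grxml"))  # SRGS XML
print(requested_format("credit-card.gsl"))    # Nuance GSL
```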

Finally, in order to get recognition results with semantic slots, we simply send the original credit-card.abnf grammar and the recognition results to NuGram Server in order to add semantic slots to the recognition results. In other words, semantic interpretation is done as a post-processing step by NuGram Server based on the SISR tags in the original grammar.

Note that if the original grammar had been a GSL grammar or an OSR grammar, NuGram Server could still have computed the semantic interpretation based on the semantic tags in the original grammar (NuGram understands many different tag formats).

Dealing with engine-proprietary features

Some engine-proprietary features might make results more difficult to compare. For instance, OSR and Nuance 9 provide the special-purpose SWI_disallow key, which can be used to remove hypotheses from the N-best list of recognition hypotheses returned by the engine. This could for instance be used to remove credit card numbers that don’t have a valid checksum, thereby improving recognition accuracy.

This useful feature could make recognition results difficult to compare if some engines have it and others don’t (in which case an equivalent result could be obtained by removing invalid hypotheses in the application). Fortunately, in our recognition tests we have the ability to tell NuGram Server to remove, from the N-best list, those hypotheses that match a specified slot pattern (e.g., SWI_disallow=1). This once again makes it possible to make fair and accurate comparisons between OSR or Nuance 9 and other engines.
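
Here is a minimal Python sketch of that kind of filtering, for an engine that does not honor SWI_disallow itself. It is not NuGram Server’s code: the hypothesis layout, the `remove_matching` helper, and the card numbers are invented for the example, and the checksum is assumed to be the Luhn check used by payment card numbers. In a real grammar the SWI_disallow slot would be set by a semantic tag; here the loop sets it only to keep the example self-contained.

```python
def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    total = 0
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def remove_matching(n_best, key, value):
    """Drop the hypotheses whose slot `key` equals `value`."""
    return [h for h in n_best if h["slots"].get(key) != value]


# Made-up N-best list; the first entry has an invalid checksum.
n_best = [
    {"text": "4111111111111112", "confidence": 0.81},
    {"text": "4111111111111111", "confidence": 0.78},
]
for hypothesis in n_best:
    flag = "0" if luhn_valid(hypothesis["text"]) else "1"
    hypothesis["slots"] = {"SWI_disallow": flag}

kept = remove_matching(n_best, "SWI_disallow", "1")
print([h["text"] for h in kept])  # ['4111111111111111']
```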

May 6th, 2010 No Comments

by Dominique Boucher

Grammar tips & tricks #1 - rules naming

[This post is the first in a series of short posts giving tips and tricks on speech grammar writing.]

Tip #1: make sure that your rule names are always ECMAScript identifiers.

In SRGS grammars, rule names must be valid XML names and may not contain the following characters: “.”, “:”, and “-”. For people new to speech grammar writing, it is not always obvious why there is such a restriction.

When you start writing your first semantic tags, you understand why. When using semantics/1.0 tags, values returned by referenced rules are exposed as properties of the rules and meta objects, while with swi-semantics/1.0 (the Nuance OSR tag format), those values are exposed as variables. In other words, in both cases rule names must be valid ECMAScript identifiers. In ECMAScript civic-number is not an identifier, it’s an arithmetic operation!
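
If you want to check a list of rule names outside the IDE, a quick script along these lines can do it. This is a simplified Python sketch, not what NuGram IDE does: it only accepts ASCII identifiers and does not reject ECMAScript reserved words, and the function name is made up.

```python
import re

# Simplified pattern: a letter, "_" or "$", followed by letters, digits, "_" or "$".
ASCII_IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def is_ascii_identifier(name: str) -> bool:
    """True if `name` could be used as a plain ECMAScript variable name."""
    return ASCII_IDENTIFIER.fullmatch(name) is not None


for rule_name in ["civicNumber", "civic_number", "civic-number", "2ndLine"]:
    print(rule_name, is_ascii_identifier(rule_name))
# civicNumber True
# civic_number True
# civic-number False
# 2ndLine False
```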

Of course, NuGram IDE always enforces this restriction; any mistake will be reported as you type.

A related OSR-specific pitfall

With swi-semantics/1.0, you need to be even more cautious. It is always a bad idea to have a variable whose name can conflict with the name of a referenced rule. If the variable is already defined, the value of the referenced rule will become inaccessible.

$someRule =
    [$prefix { type = 'default' }]
    $<types.abnf#type> { type = type.value; }
    $<values.abnf#value> { value = value.value; }
;

This grammar won’t work if something from $prefix is uttered. This will cause the slot (variable) type to be set to "default" and prevent the value returned by the reference $<types.abnf#type> from being bound to the type variable. When the second semantic tag is executed, the value of the variable type will still be "default", which is not an object with a property value, thus causing an execution error.
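
The sequence of events can be mimicked with the toy Python model below. It is only a sketch of the behavior described above, not of how OSR is implemented: the `RuleResult` class, the `interpret` function, and the "visa" value are invented, and the way Python reports the failure (an AttributeError) is specific to this model.

```python
class RuleResult:
    """Stand-in for the object returned by a referenced rule."""

    def __init__(self, value):
        self.value = value


def interpret(prefix_uttered: bool) -> str:
    scope = {}
    if prefix_uttered:
        scope["type"] = "default"  # first tag: { type = 'default' }
    # A rule result is bound to its name only if that name is still free.
    if "type" not in scope:
        scope["type"] = RuleResult("visa")
    try:
        scope["type"] = scope["type"].value  # second tag: { type = type.value; }
    except AttributeError:
        return "execution error"
    return scope["type"]


print(interpret(prefix_uttered=False))  # visa
print(interpret(prefix_uttered=True))   # execution error
```

Using a variable name that differs from every referenced rule name avoids the clash altogether.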

January 12th, 2010 No Comments

by Yves Normandin

Reducing false accepts with decoys

As discussed in a previous post, one of the unfortunate consequences of out-of-grammar utterances is that they can cause many false accepts that may seriously degrade application performance and user experience. In order to illustrate this, let’s use the simple example of a small menu where callers must choose among three options: “validate”, “repeat”, and “cancel”.

We use a test set of 5042 field utterances distributed as follows:

Menu choice    Number of utterances    Proportion of test corpus
cancel                          367                        7.28%
repeat                          896                       17.77%
validate                       3478                       68.98%
OOG                             301                        5.97%

As we can see, this is a fairly clean test set with only about 6% of out-of-grammar utterances. As usual, these include background speech, various noises, side conversations, some common OOG utterances (“yes”, “no”, “okay”, “options”, “oh”, etc.), as well as a wide variety of rambling responses of different kinds.

Naturally, since the grammar can only recognize one of the three keywords (and legitimate variants), most of these OOG utterances are misrecognized as one of the keywords. That wouldn’t be a problem if the corresponding confidence scores were low and we could safely reject them, but that’s not always the case. In fact, many of these have a confidence score over 0.9, resulting in damaging false accepts.

An effective way to reduce false accepts is to add decoys to the grammar. For instance, you would normally want to start by adding common OOG responses, on the grounds that it’s easier to reject an OOG utterance if you can recognize it correctly. You could also add more “general” decoys, for instance a phoneme loop, to help reject hard-to-predict OOG utterances. There are more advanced techniques that can be used in order to come up with “optimal” decoys for a given grammar, but I won’t go into them now.

In all cases, it is of course absolutely necessary to evaluate, on a large enough test corpus, the impact of these decoys since they could easily end up reducing recognition accuracy, sometimes significantly. In particular, one should be careful not to add decoys that could be confused with legitimate sentences or keywords.

Let’s illustrate the impact of decoys using the set of field utterances described above. The graph below compares the performance of a grammar without decoys (red curve) to that of the same grammar to which appropriate decoys were added (blue curve).

As can be seen, even for a fairly clean test corpus with a low OOG rate, the addition of decoys can significantly improve performance. For instance:

  • For a False Accept rate of 0.5%, the Correct Accept rate increases from 95% to over 97.5%, which is equivalent to reducing the error rate by more than 50%.
  • For a Correct Accept rate of 97.5%, the addition of decoys decreases the False Accept rate from 1.5% to around 0.3%. That’s one fifth the False Accept rate for the same Correct Accept rate.

Another interesting observation is the impact of decoys on confidence thresholds. Let’s say we want to have a False Accept Rate of 0.5%. Then, we would need to use a confidence threshold of 0.73 for the grammar without decoys, but only 0.25 for the grammar with decoys. That’s quite a difference! This clearly shows that using “default” threshold values may sometimes produce results that are quite inadequate.
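
For readers who want to run this kind of analysis on their own data, the Python sketch below shows one way to look up the threshold for a target False Accept rate. It rests on assumptions the post does not spell out: both rates are taken as fractions of all test utterances, and a result that matches a decoy is always rejected, whatever its confidence. The `Result` type, the function names, and the four-utterance data sets are invented; they are not the field data discussed above.

```python
from typing import NamedTuple, Optional


class Result(NamedTuple):
    confidence: float
    correct: bool  # top hypothesis matches the reference transcription
    decoy: bool    # top hypothesis matched a decoy (always rejected)


def rates(results, threshold):
    """Return (Correct Accept rate, False Accept rate) at a threshold."""
    accepted = [r for r in results if not r.decoy and r.confidence >= threshold]
    correct = sum(1 for r in accepted if r.correct)
    return correct / len(results), (len(accepted) - correct) / len(results)


def lowest_threshold(results, max_false_accept_rate) -> Optional[float]:
    """Lowest observed confidence at which the False Accept target is met."""
    for threshold in sorted({r.confidence for r in results}):
        if rates(results, threshold)[1] <= max_false_accept_rate:
            return threshold
    return None


# Made-up results: the third utterance is OOG.
without_decoys = [
    Result(0.95, True, False),
    Result(0.90, True, False),
    Result(0.92, False, False),  # misrecognized as a keyword
    Result(0.40, True, False),
]
with_decoys = [
    Result(0.95, True, False),
    Result(0.90, True, False),
    Result(0.92, False, True),   # now absorbed by a decoy
    Result(0.40, True, False),
]

print(lowest_threshold(without_decoys, 0.0))  # 0.95
print(lowest_threshold(with_decoys, 0.0))     # 0.4
print(rates(with_decoys, 0.40))               # (0.75, 0.0)
```

Even on this tiny example, the grammar with decoys reaches the same False Accept target at a much lower threshold and keeps more correct results.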

All of this once again demonstrates how important it is to pay close attention to out-of-grammar utterances in a tuning process and how decoys can provide an effective tool for containing the negative impact of such utterances on application performance.