One of our products, a Canadian address capture VoiceXML module, has been deployed with great success by several of our customers. One of these deployments was done in the context of a change of address application, where the module has to capture the new address, the date when the new address becomes effective, and the new telephone number. Note that all information is entirely obtained through speech recognition.

In this deployment, the contract specified that the application had to achieve a minimum success rate. In order to track performance, two success metrics were jointly defined with the customer:

  • The Raw Success Rate. This is calculated by dividing the number of calls for which the change of address was successfully completed (with all collected information confirmed by the caller) by the total number of calls for which the change of address module was used.
  • The Real Success Rate. This is calculated similarly, with the exception that certain calls were excluded from consideration, namely calls where the caller provided no input whatsoever and calls where the caller hung up within the first two interactions.

The customer specified that the application had to achieve a Real Success Rate of 75% or more. The rationale for the Real Success Rate is to exclude callers who either don’t want to use the application (for instance because they ended up in it by mistake) or don’t have the requested information. As a matter of fact, after the initial deployment revealed a fairly high hang-up rate early in the change of address call flow, the customer contacted a number of those callers to find out why they had hung up. Most of them admitted that they had no intention of changing their address; they had simply selected this option in the hope of getting connected to an agent faster.
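To make the two definitions concrete, here is a minimal sketch of how both rates could be computed from call records. The record fields (`completed`, `inputs`, `interactions`, `hungUp`) and the sample data are assumptions made for illustration; they are not the format used in the actual deployment.

```javascript
// Hypothetical call record shape (an assumption, not the deployment's format):
//   completed:    true if the change of address was confirmed by the caller
//   inputs:       number of caller inputs received during the call
//   interactions: number of interactions reached before the call ended
//   hungUp:       true if the caller hung up
function successRates(calls) {
  const completed = calls.filter(c => c.completed).length;

  // The Real Success Rate ignores calls with no input at all and calls
  // where the caller hung up within the first two interactions.
  const eligible = calls.filter(c => {
    const noInput = c.inputs === 0;
    const earlyHangUp = c.hungUp && c.interactions <= 2;
    return !(noInput || earlyHangUp);
  });
  const completedEligible = eligible.filter(c => c.completed).length;

  return {
    raw: calls.length ? completed / calls.length : 0,
    real: eligible.length ? completedEligible / eligible.length : 0,
  };
}

// Made-up sample: 3 successes, 1 genuine failure, 2 excluded calls.
const sample = [
  { completed: true,  inputs: 6, interactions: 8, hungUp: false },
  { completed: true,  inputs: 7, interactions: 9, hungUp: false },
  { completed: true,  inputs: 6, interactions: 8, hungUp: false },
  { completed: false, inputs: 5, interactions: 7, hungUp: false },
  { completed: false, inputs: 0, interactions: 1, hungUp: false }, // no input
  { completed: false, inputs: 1, interactions: 2, hungUp: true },  // early hang-up
];
const rates = successRates(sample);
console.log(rates.raw, rates.real);
```

With this made-up sample, the two excluded calls are enough to separate the two metrics by a wide margin, even though the application behaved identically for every caller who actually tried to use it.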

It’s nonetheless interesting to track both metrics since a large difference between them can indicate problems that occurred earlier in the call (that is, before going into the change of address application).

For instance, at the end of 2008, the customer made some changes in the front menus, which significantly increased the number of callers that incorrectly found themselves in the change of address application. As shown in the graph below, this created a big drop in the Raw Success Rate while the Real Success Rate remained relatively constant. The customer implemented various changes to the front menu throughout 2009 (while the change of address application remained unchanged), with the result that the Raw Success Rate was finally stabilized at around 75% (and the Real Success Rate at 85%).

This shows that, when trying to evaluate the performance of an application, it’s important to focus on the correct metrics. Otherwise, we may end up not only with an incorrect assessment of its real performance, but also with wild variations that have nothing to do with the application itself.

Tip #2: In SISR semantic tags, return key/value pairs whenever possible.

Strings all over the place

It is fairly common for new SRGS grammar writers to write SISR semantic tags that only return string values to calling rules or to the voice application, even when the data has some structure. For example, a dollar amount rule could return a string like this (in ABNF):

public $amount =
  $dollars {out = rules.dollars + ".00";}
  [and $cents {out = out.substring(0, out.length - 3)
                        + "." + rules.cents; }]
;

...

One obvious disadvantage of this approach is that the application has to extract the dollars and the cents from the returned string. Of course, a simple string-to-number conversion can be done. But due to possible rounding errors, it is best to extract both values separately and convert the two substrings to integers. This may not be that bad, since machines are so fast these days.
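As a rough illustration of the extra work pushed onto the application, here is a sketch of the extraction in ECMAScript. The function name `parseAmount` is hypothetical, and the sketch assumes the grammar always returns exactly two digits after the decimal point, which depends on rules not shown here.

```javascript
// Split an amount string such as "12.05" into integer dollars and cents,
// without ever going through a floating-point number.
// Assumption: the string always has the form <digits>.<two digits>.
function parseAmount(amount) {
  const match = /^(\d+)\.(\d{2})$/.exec(amount);
  if (match === null) {
    throw new Error("Unexpected amount format: " + amount);
  }
  return {
    dollars: parseInt(match[1], 10),
    cents: parseInt(match[2], 10),
  };
}

const parsed = parseAmount("12.05");
console.log(parsed.dollars, parsed.cents);
```

Every application that uses the grammar needs some variant of this code, including the error handling for strings that don't match the expected shape.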

A less obvious reason why this is not recommended relates to the fact that the computations made by the semantic tags can only begin once the engine has finished recognizing the utterance. In other words, the corresponding computation time directly adds to the application’s response time. The ECMAScript interpreter typically compiles the script (the semantic tag) to an intermediate representation before executing it. Unless the ASR properly caches the result of this compilation process, the script is compiled again and again. The more complicated the script is, the more processing power it takes to parse it, compile it, and execute it.

On top of that, string concatenation and substring extraction create a lot of unnecessary temporary objects, thus putting a bigger burden on the garbage collector (or any other memory management algorithm employed by the ECMAScript interpreter).

Finally, since semantic tags are compiled and executed for every hypothesis in the N-best list, the computation time and the number of objects created grow in proportion to the number of hypotheses requested by the application. Taken together, this gives us a grammar that requires unnecessary processing power from the ASR engine, which can cause significant delays in the recognition process. This may even result in noticeable latency at the application level (i.e., some dead air).

Use semantic keys instead

A better way to write the above grammar would be:

public $amount =
  $number {out.dollars = rules.number;
           out.cents   = 0; }
  [and $cents {out.cents = rules.cents; }]
;

...

Using explicit semantic keys has many advantages:

  • Documentation. This self-documents the type/purpose of the returned values.
  • Maintenance/evolution. The scripts are much simpler, and thus easier to follow for someone trying to understand the grammar. It is also easier to add other keys later if need be.
  • Analytics. The presence of distinct semantic keys facilitates the analysis of field data. For example, we can be interested in performing a recognition performance test for only a subset of our collected utterances, i.e. those utterances whose value for the cents semantic key is 0.
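
To illustrate the analytics point, here is a small sketch of what consuming key/value results might look like. The `utterances` array, its field names, and the sample sentences are made up for illustration; only the `dollars` and `cents` keys come from the grammar above.

```javascript
// Hypothetical field data: each entry pairs a transcription with the
// semantic result returned for it (keys as in the $amount rule above).
const utterances = [
  { text: "twenty", slots: { dollars: 20, cents: 0 } },
  { text: "five and ten", slots: { dollars: 5, cents: 10 } },
  { text: "three", slots: { dollars: 3, cents: 0 } },
];

// Total amount in cents, using integer arithmetic only.
function totalCents(slots) {
  return slots.dollars * 100 + slots.cents;
}

// Subset for a targeted recognition test: the cents key equals 0.
const wholeDollarUtterances = utterances.filter(u => u.slots.cents === 0);

console.log(wholeDollarUtterances.map(u => u.text));
console.log(totalCents(utterances[1].slots));
```

No string parsing is involved: selecting a subset of the data is a one-line filter on a named key.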

June 7th, 2010

by Dominique Boucher

An alternative way to automate IVR tests

A few weeks ago, I posted an article describing real hands-on experience implementing IVR unit tests in a CVP Studio application. But programmatic unit tests are not the only way to automate IVR application tests and provide a repeatable and reliable way of testing large portions of an application easily, in a matter of minutes. There are other ways to get most of the benefits of unit testing without even having to pick up the phone (you have better things to do than make hundreds of phone calls a day, right?).

One of them is NuBot, Nu Echo’s hosted IVR application testing platform. With NuBot, there are basically three steps involved:

  1. You instrument your application with DTMF sequences at specific places in the application’s call-flow. These sequences are used to synchronize the application with your test scenarios and are only played when the application is in test mode. (We also support speech-recognition-based synchronization, but only through our professional services at the moment.)
  2. You program your test scenarios using the free NuBot integrated testing environment (ITE), an Eclipse plugin that can co-exist alongside the rest of your programming environment.
  3. You schedule and launch your test on our hosted platform from the NuBot ITE, specifying which scenarios to use, how many ports are needed, how many runs of each scenario to do, etc.

Now as you modify your application, you simply keep your tests up-to-date and re-run them as needed. They can even be incorporated into an automated continuous integration process running every night. So if you break something in the application, you will know it fast.

Of course, in contrast to unit tests which are run on the developer’s machine, automated tests using NuBot require that the application is deployed on a server first. This requires some extra work. But you would have to do that anyway if you were to do your tests manually. And it’s worth it considering that you are doing end-to-end testing of your application, not just running some Java code.

Load-testing ready

Another advantage of using NuBot is that once your application is instrumented and you have all your test scenarios, they can be readily used for load testing. This way, you won’t have to start planning for the development of load testing scripts only after the application is fully implemented.

And of course, you’ll do the load testing at your own convenience, all by yourself. This way, you stay in control of your testing process. (We do offer professional services if you prefer, but they are completely optional.)

Try it now!

Want to cut your testing costs while delivering more reliable applications? Give NuBot a try.

We also have an on-premise version if using a hosted platform is not an option. Contact us for more details.

June 1st, 2010

by Dominique Boucher

Converting GSL tags to SISR - conflicting goals

NuGram IDE provides a tool to convert Nuance GSL grammars to SRGS ABNF, which can then be converted to XML form. But the tool does not convert the semantic tags. So lately we’ve started working on the conversion of GSL semantic tags to SISR, and what initially seemed like a simple project provoked a heated debate internally (well, I may be exaggerating a bit… ;-). I soon realized that this was because there are really two competing forces driving the design of such a tool:

  • Correctness. The automatically generated SISR tags faithfully implement the behaviour of the corresponding Nuance GSL tags. In other words, the resulting grammar needs no manual intervention and the semantic results obtained using the generated grammar are always the same as if the original GSL grammar had been used.
  • Maintenance. The automatically generated SISR tags are easy to understand, and thus to modify. They are close to what an SISR developer would have written from the start.

To see why these two goals conflict, simply consider how calls to predefined GSL functions can be translated. The GSL tag language provides predefined functions for things as simple as arithmetic operations: add, sub, etc. Converting a GSL tag of the form:

{return (add($n $m))}

could generate a SISR tag like

{out = $add(n,m);}

if we want to preserve correctness. Here $add would be a function defined in a generated grammar header tag that implements an ECMAScript equivalent of the GSL add function, with proper handling of undefined values:

{!{
  function $add(x, y) {
    if (x == undefined || typeof x != "number") x = 0;
    if (y == undefined || typeof y != "number") y = 0;
    return x + y;
  }
}!};

But if the converter inlines the call and adds some code to check for undefined values, it could produce something like:

{out = ((n == undefined || typeof n != "number") ? 0 : n)
        + ((m == undefined || typeof m != "number") ? 0 : m);}

when one would have simply written:

{out = n + m;}

Unfortunately, this last version may produce unexpected results at runtime. If n is undefined and m is 3, the sum will be NaN (not a number) instead of 3, as the original GSL tag would have produced. In that case, it is essential to complement the tool with a rigorous testing process, ideally one that ensures that each and every semantic tag in the grammar is executed at least once.
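
The difference is easy to reproduce outside of a grammar. The following sketch runs the two variants side by side in plain ECMAScript; the variable names `naive` and `safe` are only there for illustration.

```javascript
// ECMAScript equivalent of the GSL add function, as in the header tag above.
function $add(x, y) {
  if (x == undefined || typeof x != "number") x = 0;
  if (y == undefined || typeof y != "number") y = 0;
  return x + y;
}

let n;        // left undefined, e.g. an optional rule that did not match
const m = 3;

const naive = n + m;     // undefined + 3 evaluates to NaN
const safe = $add(n, m); // undefined is treated as 0, so the result is 3

console.log(naive, safe);
```

A test set that never exercises the case where n is missing would not reveal the difference between the two versions.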

So which one is better?

The answer is: that depends. In some cases correctness is preferable, especially if the grammar requires little to no maintenance at all. That may be true of simple grammars. But most of the time, grammars change over time. New sentence patterns are added, rules are extracted, etc. So maybe it’s best to leave the choice to the developer by offering a flexible tool.

And you, what would you prefer: a conversion tool that ensures full correctness at all costs, or a tool that sometimes produces grammars that are potentially not equivalent to the original one but are more maintainable?

We’re sometimes asked to compare the performance of different speech recognition engines on an identical task (same grammar, same set of test utterances). To do so in an effective way, we rely on three important features of NuGram Server (on-the-fly conversion of grammars to any format, semantic interpretation of textual sentences, and a NuGram-specific meta value that removes all semantic tags from generated grammars), which we use extensively in our tuning environment.

One powerful aspect of our tuning environment is that, no matter what recognition engine we use, there is no difference in the way we perform speech recognition experiments and then score and analyze results. It’s all completely transparent. This makes it easy to run the exact same experiment using different recognition engines and then compare results using metrics, graphs, and other tools that are used consistently across all engines.

A big challenge when comparing different engines is that we usually can’t use the same grammar since different engines often use incompatible grammar tag formats. For instance, let’s say we have a recognition grammar for the Loquendo LASR speech recognition engine and we would like to compare the performance we get with this grammar using three different engines: Loquendo LASR, OSR 3.0, and Nuance 8.5. In that case, we have three different tag formats: Loquendo uses SISR, OSR 3.0 uses swi-semantics and Nuance 8.5 uses the Nuance GSL proprietary tag format. So in principle, we would need to convert the grammars for each recognition engine, which can be a significant effort for complex grammars.

No need for manual grammar conversion

It is, however, possible to compare different recognition engines without having to manually convert the grammars. The approach we use is quite simple: With each engine that is not compatible with the original grammar’s tag format, we perform speech recognition using a grammar from which semantic tags have been removed and we then add semantic information back to the recognition result as a post-processing step.

This is all done using NuGram Server, as follows.

We start with the original grammar in ABNF format (here credit-card.abnf), which we use for the recognition test using Loquendo ASR. Then, we add a special-purpose NuGram meta directive to the grammar, which tells NuGram Server to omit the semantic tags when generating the grammar:

#ABNF 1.0 ISO-8859-1;
language en-US;
mode voice;
tag-format <semantics/1.0>;

meta "com.nuecho.generation.omit-tags" is "true";

root $main;

When we perform the recognition test with OSR, we tell NuGram Server that we want to use credit-card.grxml (note the extension). NuGram Server then automatically converts credit-card.abnf to the SRGS XML format, while omitting the semantic tags from the grammar. Recognition then proceeds without a hitch, but results are returned without any semantic slots.

Similarly, when we perform the recognition test with Nuance 8.5, we tell NuGram Server that we want to use credit-card.gsl, which tells NuGram Server to automatically convert credit-card.abnf into a GSL grammar (still without semantic tags). Recognition once again proceeds without a hitch and results are returned without semantic slots.

Finally, in order to get recognition results with semantic slots, we simply send the original credit-card.abnf grammar and the recognition results to NuGram Server in order to add semantic slots to the recognition results. In other words, semantic interpretation is done as a post-processing step by NuGram Server based on the SISR tags in the original grammar.
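
Conceptually, the post-processing step looks like the sketch below. This is not the NuGram Server API: `addSemantics` is a hypothetical helper, and `toyInterpret` is a stand-in interpreter that only collects spoken digits, used here in place of a real interpretation of the SISR tags.

```javascript
// Attach semantic slots to tag-free recognition results.
// `interpret` stands in for the semantic interpretation service; it is a
// hypothetical callback, not an actual NuGram Server call.
function addSemantics(nBest, interpret) {
  return nBest.map(hyp => ({ ...hyp, slots: interpret(hyp.text) }));
}

// Toy interpreter for illustration only: concatenates the spoken digits.
const digitWords = {
  zero: "0", one: "1", two: "2", three: "3", four: "4",
  five: "5", six: "6", seven: "7", eight: "8", nine: "9",
};
function toyInterpret(text) {
  const digits = text
    .split(/\s+/)
    .map(word => digitWords[word])
    .filter(d => d !== undefined)
    .join("");
  return { number: digits };
}

// Made-up N-best list, as an engine might return it without semantic slots.
const nBest = [
  { text: "four one two", confidence: 0.82 },
  { text: "four nine two", confidence: 0.11 },
];
const withSlots = addSemantics(nBest, toyInterpret);
console.log(withSlots);
```

The point of the design is that the engine-specific part (recognition) and the engine-independent part (interpretation) are decoupled, so the scoring code only ever sees results in one shape.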

Note that if the original grammar had been a GSL grammar or an OSR grammar, NuGram Server could still have computed the semantic interpretation based on the semantic tags in the original grammar (NuGram understands many different tag formats).

Dealing with engine-proprietary features

Some engine-proprietary features might make results more difficult to compare. For instance, OSR and Nuance 9 provide the special-purpose SWI_disallow key, which can be used to remove hypotheses from the N-best list of recognition hypotheses returned by the engine. This could for instance be used to remove credit card numbers that don’t have a valid checksum, thereby improving recognition accuracy.

This useful feature could make recognition results difficult to compare if some engines have it and others don’t (in which case an equivalent result could be obtained by removing invalid hypotheses in the application). Fortunately, in our recognition tests we have the ability to tell NuGram Server to remove, from the N-best list, those hypotheses that match a specified slot pattern (e.g., SWI_disallow=1). This once again makes it possible to make fair and accurate comparisons between OSR or Nuance 9 and other engines.
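
As an illustration, here is a sketch of that kind of filtering. The helper names (`luhnValid`, `removeMatching`) and the N-best structure are assumptions, and the Luhn algorithm is used as the checksum because it is the usual check-digit scheme for payment card numbers; the actual grammar may validate numbers differently.

```javascript
// Luhn checksum over a string of digits (assumed to contain digits only).
function luhnValid(number) {
  let sum = 0;
  let doubleIt = false;
  for (let i = number.length - 1; i >= 0; i--) {
    let d = number.charCodeAt(i) - 48; // '0' has character code 48
    if (doubleIt) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleIt = !doubleIt;
  }
  return number.length > 0 && sum % 10 === 0;
}

// Remove from the N-best list the hypotheses whose slot `key` equals `value`.
function removeMatching(nBest, key, value) {
  return nBest.filter(hyp => String(hyp.slots[key]) !== String(value));
}

// Made-up hypotheses; SWI_disallow is set to 1 when the checksum fails.
const hypotheses = ["4111111111111112", "4111111111111111"].map(number => ({
  slots: { number: number, SWI_disallow: luhnValid(number) ? 0 : 1 },
}));
const filtered = removeMatching(hypotheses, "SWI_disallow", 1);
console.log(filtered.map(hyp => hyp.slots.number));
```

Applying the same filter to the results of every engine, whether or not it supports SWI_disallow natively, keeps the comparison on an equal footing.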