Monthly Archives: May 2010

Comparing different speech recognition engines

We’re sometimes asked to compare the performance of different speech recognition engines on an identical task (same grammar, same set of test utterances). To do so in an effective way, we rely on three important features of NuGram Server (on-the-fly conversion of grammars to any format, semantic interpretation of textual sentences, and a NuGram-specific meta value that removes all semantic tags from generated grammars), which we use extensively in our tuning environment.

One powerful aspect of our tuning environment is that, no matter what recognition engine we use, there is no difference in the way we perform speech recognition experiments and then score and analyze results. It’s all completely transparent. This makes it easy to run the exact same experiment using different recognition engines and then compare results using metrics, graphs, and other tools that are used consistently across all engines.

A big challenge when comparing different engines is that we usually can’t use the same grammar since different engines often use incompatible grammar tag formats. For instance, let’s say we have a recognition grammar for the Loquendo LASR speech recognition engine and we would like to compare the performance we get with this grammar using three different engines: Loquendo LASR, OSR 3.0, and Nuance 8.5. In that case, we have three different tag formats: Loquendo uses SISR, OSR 3.0 uses swi-semantics and Nuance 8.5 uses the Nuance GSL proprietary tag format. So in principle, we would need to convert the grammars for each recognition engine, which can be a significant effort for complex grammars.

No need for manual grammar conversion

It is, however, possible to compare different recognition engines without having to manually convert the grammars. The approach we use is quite simple: With each engine that is not compatible with the original grammar’s tag format, we perform speech recognition using a grammar from which semantic tags have been removed and we then add semantic information back to the recognition result as a post-processing step.

This is all done using NuGram Server, as follows.

We start with the original grammar in ABNF format (here credit-card.abnf), which we use for the recognition test using Loquendo LASR. Then, we add a special-purpose NuGram meta directive to the grammar, which tells NuGram Server to omit the semantic tags when generating the grammar:

#ABNF 1.0 ISO-8859-1;
language en-US;
mode voice;
tag-format <semantics/1.0>;

meta "com.nuecho.generation.omit-tags" is "true";

root $main;

When we perform the recognition test with OSR, we tell NuGram Server that we want to use credit-card.grxml (note the extension). NuGram Server then automatically converts credit-card.abnf to the SRGS XML format, while omitting the semantic tags from the grammar. Recognition then proceeds without a hitch, but results are returned without any semantic slots.

Similarly, when we perform the recognition test with Nuance 8.5, we tell NuGram Server that we want to use credit-card.gsl, which tells NuGram Server to automatically convert credit-card.abnf into a GSL grammar (still without semantic tags). Recognition once again proceeds without a hitch and results are returned without semantic slots.

Finally, in order to get recognition results with semantic slots, we simply send the original credit-card.abnf grammar and the recognition results to NuGram Server in order to add semantic slots to the recognition results. In other words, semantic interpretation is done as a post-processing step by NuGram Server based on the SISR tags in the original grammar.

Note that if the original grammar had been a GSL grammar or an OSR grammar, NuGram Server could still have computed the semantic interpretation based on the semantic tags in the original grammar (NuGram understands many different tag formats).
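
The naming convention described above, where the extension of the requested grammar selects the format NuGram Server generates, can be summarized in a few lines of Java. This is only an illustrative sketch: the GrammarNames class and its grammarFor method are hypothetical names of our own, not part of any NuGram API, and the engine-to-extension table simply restates the three examples from this post.

```java
import java.util.Map;

// Hypothetical helper (not a NuGram API): picks the grammar name to request
// for each engine, following the examples given in this post.
final class GrammarNames {
    private static final Map<String, String> EXTENSION_BY_ENGINE = Map.of(
            "Loquendo LASR", "abnf",  // the original ABNF grammar
            "OSR 3.0", "grxml",       // converted to SRGS XML
            "Nuance 8.5", "gsl");     // converted to Nuance GSL

    /** Returns the grammar name to request, e.g. "credit-card.grxml" for OSR 3.0. */
    static String grammarFor(String engine, String baseName) {
        String extension = EXTENSION_BY_ENGINE.get(engine);
        if (extension == null) {
            throw new IllegalArgumentException("Unknown engine: " + engine);
        }
        return baseName + "." + extension;
    }

    public static void main(String[] args) {
        for (String engine : new String[] {"Loquendo LASR", "OSR 3.0", "Nuance 8.5"}) {
            System.out.println(engine + " -> " + grammarFor(engine, "credit-card"));
        }
    }
}
```

Whatever name is requested, the semantic interpretation is still computed afterwards from the original credit-card.abnf grammar.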

Dealing with engine-proprietary features

Some engine-proprietary features might make results more difficult to compare. For instance, OSR and Nuance 9 provide the special-purpose SWI_disallow key, which can be used to remove hypotheses from the N-best list of recognition hypotheses returned by the engine. This could, for instance, be used to remove credit card numbers that don’t have a valid checksum, thereby improving recognition accuracy.

This useful feature could make recognition results difficult to compare if some engines have it and others don’t (in which case an equivalent result could be obtained by removing invalid hypotheses in the application). Fortunately, in our recognition tests we have the ability to tell NuGram Server to remove, from the N-best list, those hypotheses that match a specified slot pattern (e.g., SWI_disallow=1). This once again makes it possible to make fair and accurate comparisons between OSR or Nuance 9 and other engines.
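
To make the idea concrete, here is a minimal sketch of what removing hypotheses that match a slot pattern amounts to. It does not show how NuGram Server implements or exposes this feature: the NBestFilter and Hypothesis classes are hypothetical, and we assume for illustration that slots are simple string key/value pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not NuGram Server code: drops from an N-best list
// every hypothesis whose slots contain a given key/value pair.
final class NBestFilter {
    /** One recognition hypothesis: the recognized text plus its semantic slots. */
    static final class Hypothesis {
        final String text;
        final Map<String, String> slots;

        Hypothesis(String text, Map<String, String> slots) {
            this.text = text;
            this.slots = slots;
        }
    }

    /** Keeps, in their original order, the hypotheses where slotName is not slotValue. */
    static List<Hypothesis> removeMatching(
            List<Hypothesis> nBest, String slotName, String slotValue) {
        List<Hypothesis> kept = new ArrayList<>();
        for (Hypothesis hypothesis : nBest) {
            if (!slotValue.equals(hypothesis.slots.get(slotName))) {
                kept.add(hypothesis);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Hypothesis> nBest = List.of(
                new Hypothesis("first hypothesis", Map.of("SWI_disallow", "1")),
                new Hypothesis("second hypothesis", Map.of("number", "1234")));
        for (Hypothesis hypothesis : removeMatching(nBest, "SWI_disallow", "1")) {
            System.out.println(hypothesis.text); // only "second hypothesis" remains
        }
    }
}
```

Applying the same filter to the results of every engine puts them all on an equal footing, whether or not the engine honours SWI_disallow natively.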

The NuGram approach to dynamic grammars

I have just uploaded to Slideshare a short presentation about the Nu Echo approach to dynamic grammars.

For text-based applications too!

Remember that NuGram Server is not only for speech-enabled applications. You can use it to parse text-based sentences, too. So it is an ideal complement to your preferred cloud-based SMS or IM application platform, such as Tropo, Twilio, or Teleku, to name a few.

Try it now!

It’s free for development use, so don’t be shy. Give it a try! You simply need to register, upload your grammars, and use one of the many APIs we provide.

IVR unit testing in CVP Studio

In one of my previous posts, I presented the concept of IVR unit testing. Although a very nice concept in theory, I am sure many of you said to yourselves: “Great, but I can’t do that since I use a graphical service creation environment (SCE)”. This may be true for some SCEs, but certainly not for all.

There are a few SCEs, like Cisco Unified Call Studio (formerly CVP Studio), that let you extend the environment with Java code. That’s what we did for one of our professional services projects. Let me briefly explain what we did in this project and present some of the benefits as well as a few lessons learned.

The application

The application was a very typical IVR DTMF-only hierarchical menu: lots of options, many optional messages at various places in the menu tree triggered by dynamic configuration options, information messages, etc. Each menu had to support a number of common navigation commands, like * to repeat, # to go back to the previous menu, and so on. The difficulty with such an application is that duplicating the dialog for each menu is quite time consuming and highly error-prone in the presence of constantly evolving customer requirements.

At least, CVP Studio provides a way to define reusable dialog patterns, but unfortunately once the pattern is copied at various places in your application, it cannot be modified in such a way that all its uses are automatically updated. You have to modify each use of the pattern manually.

In addition to those reusable dialog patterns, CVP Studio provides a programmatic API to implement custom elements. These elements can then be added to the SCE’s palette and used to implement nodes at various places in the application, each with its own configuration. Typically, such custom elements simply add some elements to a VoiceXML page (when they implement the VoiceElementBase interface).

For our application, we implemented all the menu nodes using a custom element. The element encapsulates the common behaviors shared by all menus, like how to handle no input, errors, the repeat key, etc. A key advantage of doing this is that when we need to change one of those behaviors, all the nodes in the application are updated at once (saving us a lot of maintenance headaches). Another advantage is that this custom element can be easily reused in other applications as well.

Of course, some will note that we could have used VoiceXML subdialogs rather than custom elements to implement our reusable dialogs. However, due to the design of our Java-based management console interface to configure all the dynamic elements of the application, it was more natural for us to build custom elements also in Java.

But the coolest thing about this approach is that the whole dialog for a single menu is driven by a small state machine that generates objects representing interactions with the caller (instead of plain XML elements) and accepts objects representing interaction results (like a no input, a no match, a recognition result, a DTMF input, etc.). And we ensured that it is possible to interact with the state machine without having to execute the dialog at run time. The state machines are completely decoupled from CVP Studio’s programmatic API!

Guess what? We could unit test all the menus very easily. We just wrote a test controller that injects interaction results programmatically into the state machine, retrieves the next interaction, and asserts some properties (which prompts are played, what options are available, etc.). It is thus very easy to test all the different situations that can be encountered at run time (call center open or closed, optional message activated or not, and so on).

Here is an example of a simple unit test:

The most interesting lines in this method are the ones near the end, beginning with testCase.addInputAssertion. They tell the test case which user answers (interactions) to simulate, as well as the assertions that must hold after the application has processed each interaction. For example, the first call simulates a no match event, and the test case makes sure that the application’s next step is to play the prompt (message) identified by the constant MenuConfiguration.NO_MATCH_1. The next one simulates a no input event and asserts that the generic options are enabled and that the menu prompts are played. Finally, the third one simulates another no input event and ensures that the call is transferred.

This example only illustrates the testing of a simple generic behavior. The more interesting test cases involve specific nodes in the call-flow that depend heavily on the dynamic configuration of the application. By stubbing the clock, for instance, we can make sure that the messages announcing that the contact center is closed are properly played outside of business hours and that the option to transfer the call to an agent is disabled. Other tests ensure that during business hours, the proper transfer reasons are set before transferring to the call center queue manager.
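
For readers who want to experiment with the general pattern, here is a minimal, self-contained sketch of a menu state machine that can be driven without any VoiceXML platform. It is not the code from our project and does not use CVP Studio’s API: every name in it (MenuTestSketch, MenuStateMachine, Result, run) is hypothetical, and interactions are reduced to plain strings to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these types are illustrative and are not part of
// CVP Studio's API or of the application described in this post.
final class MenuTestSketch {
    /** Interaction results a test can inject (a real menu would also accept DTMF input). */
    enum Result { NO_INPUT, NO_MATCH }

    /** Tiny menu state machine: reprompts on errors, transfers after too many. */
    static final class MenuStateMachine {
        private final int maxErrors;
        private int errors = 0;

        MenuStateMachine(int maxErrors) {
            this.maxErrors = maxErrors;
        }

        /** Consumes one interaction result and returns the next interaction as text. */
        String next(Result result) {
            errors++;
            if (errors >= maxErrors) {
                return "TRANSFER";
            }
            String prompt = (result == Result.NO_MATCH) ? "no_match_" : "no_input_";
            return "PLAY " + prompt + errors;
        }
    }

    /** Test controller: injects each result and records what the menu does next. */
    static List<String> run(MenuStateMachine menu, Result... results) {
        List<String> interactions = new ArrayList<>();
        for (Result result : results) {
            interactions.add(menu.next(result));
        }
        return interactions;
    }

    public static void main(String[] args) {
        List<String> actual = run(new MenuStateMachine(3),
                Result.NO_MATCH, Result.NO_INPUT, Result.NO_INPUT);
        List<String> expected = List.of("PLAY no_match_1", "PLAY no_input_2", "TRANSFER");
        if (!actual.equals(expected)) {
            throw new AssertionError("Unexpected interactions: " + actual);
        }
        System.out.println(actual);
    }
}
```

Because the state machine only consumes results and returns interactions, the same class could be wrapped by a custom element at run time and exercised directly by a test runner at build time.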

Some lessons learned

Note that there are also some drawbacks to this approach. First, this technique does not make it possible to test the sequencing of these custom nodes in the application. For that, we had to rely on manual testing. But that was not such a big deal after all. The way those custom nodes are connected in CVP makes the validation process quite trivial: we simply compare the call-flow design document with the call-flow in CVP Studio. For instance, if the application goes from A to B when DTMF 8 is pressed in the former, there is a transition labelled “8” from custom node A to custom node B in the latter.

Reporting was another issue we faced due to this approach. CVP already provides extensive reporting capabilities when the application uses only predefined elements. When using custom elements, you have to carefully log events in a special table of the Informix DB, and this greatly complicates the consolidation of information to get a precise understanding of what’s happening with the calls.

Was it worth it?

All in all, the advantages of this approach far outweighed the issues just mentioned, at least for this project, in which a large part of the configuration is dynamic and requires a fair amount of Java code anyway. We still maintain the application and it evolves quite rapidly, even several years after its initial deployment.

Hey, I don’t use CVP!

You don’t use CVP? There are other approaches giving you some of these benefits. I’ll outline some of them in upcoming posts. Stay tuned!

And of course, if you have experience implementing unit testing using graphical service creation environments, please share it with us!

Grammar tips & tricks #1 – rules naming

[This post is the first in a series of short posts giving tips and tricks on speech grammar writing.]

Tip #1: make sure that your rule names are always ECMAScript identifiers.

In SRGS grammars, rule names must be valid XML names and may not contain the following characters: ., :, and -. For people new to speech grammar writing, it is not always obvious why there is such a restriction.

When you start writing your first semantic tags, you understand why. When using semantics/1.0 tags, values returned by referenced rules are exposed as properties of the rules and meta objects, while with swi-semantics/1.0 (the Nuance OSR tag format), those values are exposed as variables. In other words, in both cases rule names must be valid ECMAScript identifiers. In ECMAScript, civic-number is not an identifier; it’s an arithmetic operation (civic minus number)!
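
If you generate or review grammars outside of an IDE, a quick check along these lines can catch problematic rule names early. This is a deliberately conservative sketch of our own, not NuGram code: the RuleNameCheck class is hypothetical, and the pattern only accepts ASCII letters, digits, and underscores, which is a subset of what both XML names and ECMAScript identifiers allow (it does not check ECMAScript reserved words).

```java
import java.util.regex.Pattern;

// Hypothetical helper: accepts only rule names made of ASCII letters, digits
// and underscores that do not start with a digit. This is stricter than the
// full XML name and ECMAScript identifier rules, and ignores reserved words.
final class RuleNameCheck {
    private static final Pattern SAFE_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static boolean isSafeRuleName(String name) {
        return SAFE_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"civic-number", "civicNumber", "civic_number"}) {
            System.out.println(name + " -> " + isSafeRuleName(name));
        }
    }
}
```

With this check, civic-number is rejected while civicNumber and civic_number are accepted.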

Of course, NuGram IDE always enforces this restriction: any mistake is reported as you type.

A related OSR-specific pitfall

With swi-semantics/1.0, you need to be even more cautious. It is always a bad idea to have a variable whose name can conflict with the name of a referenced rule. If the variable is already defined, the value of the referenced rule will become inaccessible.

$someRule =
    [$prefix { type = 'default' }]
    $<types.abnf#type> { type = type.value; }
    $<values.abnf#value> { value = value.value; }
;

This grammar won’t work if something from $prefix is uttered. This will cause the slot (variable) type to be set to "default" and prevent the value returned by the reference $<types.abnf#type> from being bound to the type variable. When the second semantic tag is executed, the value of the variable type will still be "default", which is not an object with a property value, thus causing an execution error.