Category Archives: application

More robust automated test scripts: wraparound mode

Lately, I have been involved in the development of a new reusable VoiceXML dialog module. The module is invoked via a <subdialog> call with a number of parameters, one of which affects the order of the questions asked by the module.

Writing automated test scripts for such parameterized applications or modules is too often a very time-consuming task. One has to take the order of questions into account, leading to an explosion in the number of scenarios and lots of duplication. In such cases, you often end up testing a single configuration, assuming that all others will be only small variations that need not be tested. But is it really safe to do that?

One of the nice features of NuBot is the ability to write test scenarios that are robust to the order in which questions are asked. To do that, test scenarios need only be created in wraparound mode. Each scenario is composed of action groups, each of which associates a state in the application with an answer to give to the tested application.

In wraparound mode, when NuBot receives feedback from the application, it looks at its next group. If the feedback does not match the expected action group, instead of generating an error, it simply skips that group and considers the next one, and so on. If it reaches the end of the scenario’s groups, it “wraps around” (thus the mode name) and considers the groups from the start of the scenario in turn. Only if no group in the scenario matches the feedback does it generate an error.
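
The matching rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not NuBot's actual implementation: the names `ActionGroup`, `next_answer` and `run_scenario` are hypothetical, and a state is assumed to match when its name equals the feedback string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionGroup:
    state: str   # application state (hypothetical: the question being asked)
    answer: str  # answer to give when the application is in that state


class NoMatchingGroup(Exception):
    """Raised when no group in the whole scenario matches the feedback."""


def next_answer(groups, start, feedback):
    """Scan the groups from `start`, wrapping around once.

    Returns (answer, index of the group to consider next).
    """
    count = len(groups)
    for offset in range(count):
        index = (start + offset) % count
        if groups[index].state == feedback:  # assumed matching rule
            return groups[index].answer, (index + 1) % count
    raise NoMatchingGroup(feedback)


def run_scenario(groups, feedbacks):
    """Answer each feedback in turn, whatever the order of the questions."""
    position = 0
    answers = []
    for feedback in feedbacks:
        answer, position = next_answer(groups, position, feedback)
        answers.append(answer)
    return answers


scenario = [
    ActionGroup("ask_address", "10 main street"),
    ActionGroup("ask_date", "january fourth"),
    ActionGroup("ask_phone", "5145550100"),
]
# The module asks for the phone number first: the scenario still passes.
print(run_scenario(scenario, ["ask_phone", "ask_address", "ask_date"]))
```

With this rule, a single scenario covers every ordering of the three questions, and an error is raised only for a state that appears nowhere in the scenario.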

What’s the point of having 98% in-grammar accuracy if 40% of user utterances are out-of-grammar?

How many times have you heard people say that they “achieve 95% speech recognition accuracy” (or more)? That sounds really impressive, doesn’t it?

It shouldn’t. What they don’t tell you is that they actually measure “in-grammar accuracy”, which means that accuracy is measured only on utterances that are perfectly covered by the grammar. For instance, for a date grammar, an utterance such as “well, uh, january fourth” would be considered out-of-grammar (and therefore excluded from the accuracy calculation) if “well, uh” is not covered by the grammar.

Unfortunately, in the real world there’s no way to force users to stick to in-grammar utterances. In fact, users usually have no way of even knowing what the grammar covers other than through hints provided by the prompts. Even well-behaved users can hesitate, correct themselves, or use an unexpected formulation (which sounds perfectly natural to them), all of which are likely to be out-of-grammar. They can even say things that they believe will help the machine understand them (for instance using “victor” instead of “v” when spelling).

As a result, it’s not unusual to have between 30% and 50% of user utterances that are considered out-of-grammar, many of which are perfectly legitimate responses to the application prompt. So what’s the point of reporting in-grammar accuracy if this ignores a large chunk of legitimate user utterances? You tell me.

Just to illustrate, you want to know one of the most effective ways of improving in-grammar accuracy? Just reduce grammar coverage. Sure, your out-of-grammar rate will increase but, hey, you’ll improve in-grammar accuracy! Isn’t that great? This tells you how useless in-grammar accuracy is at telling you whether you improved the grammar.

This is why we always report accuracy by considering every legitimate user utterance (i.e., the ones that contain a valid response to the prompt, regardless of wording or extraneous speech). This way, we make sure that we don’t conveniently ignore the utterances that happen to be the most challenging and we get results that accurately represent the real recognition performance (not some imaginary performance calculated on an idealized set of clean utterances).

But the best reason for doing it our way is that it enables us to truly measure improvements when we tune grammars. The reason is simple. Changing the coverage of a grammar always involves a trade-off. We can improve accuracy by covering more user utterances, but this can reduce overall accuracy if the new grammar paths introduce new speech recognition errors. The only way we can measure improvement is if we measure accuracy on a fixed set of valid utterances that doesn’t depend on the actual grammar coverage.
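
To make the difference between the two measures concrete, here is a minimal Python sketch. All names and figures in it are invented for illustration (they are not measurements), and it assumes for simplicity that an out-of-grammar utterance is never recognized correctly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    legitimate: bool  # contains a valid response to the prompt
    in_grammar: bool  # wording is fully covered by the grammar
    correct: bool     # the recognizer returned the right interpretation


def in_grammar_accuracy(utterances):
    """Accuracy measured only on utterances covered by the grammar."""
    covered = [u for u in utterances if u.in_grammar]
    return sum(u.correct for u in covered) / len(covered)


def overall_accuracy(utterances):
    """Accuracy measured on every legitimate utterance."""
    legitimate = [u for u in utterances if u.legitimate]
    return sum(u.correct for u in legitimate) / len(legitimate)


def make_set(total, in_grammar, correct):
    """Build a hypothetical set of legitimate utterances (made-up counts)."""
    return ([Utterance(True, True, True)] * correct
            + [Utterance(True, True, False)] * (in_grammar - correct)
            + [Utterance(True, False, False)] * (total - in_grammar))


# Same 100 legitimate utterances, two hypothetical grammars.
wide = make_set(total=100, in_grammar=70, correct=63)
narrow = make_set(total=100, in_grammar=50, correct=49)

for name, utterances in (("wide", wide), ("narrow", narrow)):
    print(name,
          f"in-grammar={in_grammar_accuracy(utterances):.2f}",
          f"overall={overall_accuracy(utterances):.2f}")
```

With these made-up counts, narrowing the grammar raises in-grammar accuracy from 0.90 to 0.98 while overall accuracy drops from 0.63 to 0.49. Only the second measure is computed on a set that stays the same when the grammar changes.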

Testing an Intervoice InVision app with Voxeo Prophecy

I’ve just started working on a DTMF-only VoiceXML application for one of our customers. The application is developed using Intervoice InVision Studio 3.1 (the native Windows version) and will be deployed on the Intervoice Voice Portal 5. The challenge in this project is three-fold:

  • Development is done on Nu Echo’s premises.
  • Nu Echo does not have IVP5 in its lab.
  • The only way to test the application is to connect to the customer’s network using VPN/pcAnywhere, deploy the application there and test using a local phone number.

Fortunately, except for all the VoiceXML code that handles attached data and transfers to the PBX, everything else can be easily tested on my own machine using only freely available tools.

The VoiceXML platform

InVision Studio is a tool that provides a graphical editor that maps an IVR call-flow to completely static, standards-compliant VoiceXML code (at least it’s the case for the application I have to develop). Once the application successfully passes the validation tests, it can be exported to VoiceXML code that can then be deployed on any web server.

InVision Studio

Since the resulting code does not depend on any proprietary extension, I decided to use Voxeo Prophecy to test it. It comes with a really decent ASR engine as well as a good TTS engine, both only for US English. The application is DTMF-only, so the ASR is not needed in my case, but TTS is handy when you don’t want to record all the application prompts (with InVision Studio, you have to specify a text for every prompt you define).

After installing Prophecy, I had to use Prophecy Commander, the web-based management console, to configure the application and the route to reach the application. The route is used to associate a number to call with the application. In my case, the app is CustomerApp and the route is test-customer-app:

Routing rules in Prophecy Commander

To call the application, I simply use the SIP phone that comes with Prophecy and dial test-customer-app.

Prophecy SIP phone

The Web server

For the web server, I use Yaws, a web server written in Erlang. It could just as well have been Apache, Tomcat, Jetty, IIS, or any other web server. I chose Yaws mainly because I do some Erlang programming in my spare time and happen to know Yaws a bit better than the alternatives.

I configured Yaws to serve static files on port 8080 from the Runtime directory of my InVision project. So whenever I export the VoiceXML code for the project, I just take the SIP phone and make a call to test the application. The Yaws configuration for the virtual server is:

<server localhost>
        port = 8080
        listen = 0.0.0.0
        docroot = "C:/InvisionProjects/CustomerApp/Runtime"
</server>

Extensive logging

First off, let me say that when it comes to debugging an app, the Prophecy logviewer is of tremendous help. I was at first a bit overwhelmed by the vast quantity of information logged by the various parts that compose Prophecy, but the filtering capabilities make it easy to focus on only a fraction of it. (I have seen the logs of many VoiceXML platforms, and these are certainly among the most comprehensible.)

I’m writing this because I had to use the logviewer the minute I started testing the application interactively. Why don’t I just listen to the prompts? Well, the problem is that the prompt texts are in French, while the TTS is in English. The result is plainly and simply incomprehensible, and trying to figure out where I am in the application is really painful and annoying. So I decided to add VoiceXML log elements extensively in the application, all starting with a very specific pattern: [CustomerApp].

Logging elements in application

It is then very easy to filter the logs based on this pattern and see only the progress of the application:

Prophecy Logviewer
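
The same kind of filtering can be done outside the logviewer, for instance on log lines saved to a file. The Python sketch below only assumes that each application message contains the [CustomerApp] pattern. The function name and the sample lines are hypothetical and do not reproduce Prophecy's actual log format.

```python
TAG = "[CustomerApp]"


def application_lines(lines, tag=TAG):
    """Yield only the log lines that contain the application's tag."""
    for line in lines:
        if tag in line:
            yield line.rstrip("\n")


# Hypothetical log excerpt: the layout of these lines is invented.
sample = [
    "12:00:01 platform: fetching document\n",
    "12:00:02 [CustomerApp] entering main menu\n",
    "12:00:03 platform: playing prompt\n",
    "12:00:05 [CustomerApp] caller pressed 2\n",
]

for line in application_lines(sample):
    print(line)
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly.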

A final remark

Yes, I could use the debugger that comes with InVision Studio. But frankly, I do not find it very intuitive to use. I prefer making calls and testing the user experience at the same time.

How a great speech application may appear to perform poorly

One of our products, a Canadian address capture VoiceXML module, has been deployed with great success by several of our customers. One of these deployments was done in the context of a change of address application, where the module has to capture the new address, the date when the new address becomes effective, and the new telephone number. Note that all information is entirely obtained through speech recognition.

In this deployment, the contract specified that the application had to achieve a minimum success rate. In order to track performance, two success metrics were jointly defined with the customer:

  • The Raw Success Rate. This is calculated simply by dividing the number of calls for which the change of address was successfully completed (with all collected information confirmed by the caller) by the total number of calls for which the change of address module was used.
  • The Real Success Rate. This is calculated similarly, with the exception that certain calls were excluded from consideration, namely calls where the caller provided no input whatsoever and calls where the caller hung up within the first two interactions.
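
The two definitions above can be written down as a short Python sketch. The `Call` record and its field names are hypothetical, and the call counts are invented purely to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    completed: bool      # change of address completed and confirmed
    gave_input: bool     # caller provided at least one input
    hung_up_early: bool  # caller hung up within the first two interactions


def raw_success_rate(calls):
    """Completed calls divided by all calls that used the module."""
    return sum(c.completed for c in calls) / len(calls)


def real_success_rate(calls):
    """Same ratio, excluding no-input calls and early hang-ups."""
    kept = [c for c in calls if c.gave_input and not c.hung_up_early]
    return sum(c.completed for c in kept) / len(kept)


# Invented example: 100 calls that reached the module.
calls = ([Call(True, True, False)] * 60     # completed
         + [Call(False, True, False)] * 20  # engaged but did not complete
         + [Call(False, True, True)] * 15   # hung up early
         + [Call(False, False, False)] * 5)  # no input at all

print(f"raw={raw_success_rate(calls):.2f} real={real_success_rate(calls):.2f}")
```

With these invented counts the Raw Success Rate is 0.60 and the Real Success Rate is 0.75, since the 20 excluded calls are removed from the denominator only.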

The customer specified that the application had to achieve a Real Success Rate of 75% or more. The rationale for the Real Success Rate is to exclude callers that either don’t want to use the application (for instance because they ended up in the application by mistake) or don’t have the requested information. As a matter of fact, after the initial deployment revealed a fairly high hang-up rate early in the change of address call flow, the customer contacted a number of those callers in order to find out why they had decided to hang up. Most of them admitted that they had no intention of changing their address; they had simply selected this option in the hope of getting connected to an agent faster.

It’s nonetheless interesting to track both metrics since a large difference between them can indicate problems that occurred earlier in the call (that is, before going into the change of address application).

For instance, at the end of 2008, the customer made some changes in the front menus, which significantly increased the number of callers that incorrectly found themselves in the change of address application. As shown in the graph below, this created a big drop in the Raw Success Rate while the Real Success Rate remained relatively constant. The customer implemented various changes to the front menu throughout 2009 (while the change of address application remained unchanged), with the result that the Raw Success Rate was finally stabilized at around 75% (and the Real Success Rate at 85%).

This shows that, when trying to evaluate the performance of an application, it’s important to focus on the correct metrics. Otherwise, we may end up not only with an incorrect assessment of its real performance, but also with wild variations that have nothing to do with the application itself.