Category Archives: Products

More robust automated test scripts: wraparound mode

Lately, I have been involved in the development of a new reusable VoiceXML dialog module. The module is invoked via a <subdialog> call with a number of parameters, one of which affects the order of the questions asked by the module.

Writing automated test scripts for such parameterized applications or modules is too often a very time-consuming task. One has to take the order of questions into account, leading to an explosion in the number of scenarios and lots of duplication. In such cases, you often end up testing a single configuration, assuming that all others will be only small variations that need not be tested. But is it really safe to do that?

One of the nice features of NuBot is the ability to write test scenarios that are robust to the order in which questions are asked. To do that, test scenarios need only be created in wraparound mode. Each scenario is composed of action groups, each of which associates a state in the application with an answer to give to the tested application.

In wraparound mode, when NuBot receives feedback from the application, it looks at its next group. If the feedback does not match that group, instead of generating an error, NuBot simply skips it and considers the next one, and so on. If it reaches the end of the scenario’s groups, it “wraps around” (hence the mode name) and considers the groups from the start of the scenario in turn. Only if it cannot match any group in the scenario will it generate an error.
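The matching loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not NuBot's implementation or API: the `ActionGroup` class, the state names, and the assumption that a matched group stays available for later turns are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ActionGroup:
    """Hypothetical stand-in for a scenario action group."""
    expected_state: str  # application state this group reacts to
    answer: str          # answer to give when that state is seen


def next_answer(groups, cursor, state):
    """Return (answer, new_cursor) for the first group matching `state`.

    The search starts at `cursor`, skips non-matching groups, and wraps
    around to the start of the scenario. It fails only when no group in
    the whole scenario matches.
    """
    for offset in range(len(groups)):
        index = (cursor + offset) % len(groups)
        if groups[index].expected_state == state:
            return groups[index].answer, (index + 1) % len(groups)
    raise LookupError(f"no action group matches state {state!r}")


def run_scenario(groups, observed_states):
    """Answer each observed state in turn, whatever order it arrives in."""
    cursor, answers = 0, []
    for state in observed_states:
        answer, cursor = next_answer(groups, cursor, state)
        answers.append(answer)
    return answers


scenario = [
    ActionGroup("ask_date", "january fourth"),
    ActionGroup("ask_city", "montreal"),
    ActionGroup("ask_confirm", "yes"),
]
```

With this sketch, `run_scenario(scenario, ["ask_city", "ask_date", "ask_confirm"])` gives the right answers even though the application asked for the city first, so one scenario covers several question orders.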

What’s the point of having 98% in-grammar accuracy if 40% of user utterances are out-of-grammar?

How many times have you heard people say that they “achieve 95% speech recognition accuracy” (or more)? That sounds really impressive, doesn’t it?

It shouldn’t. What they don’t tell you is that they actually measure “in-grammar accuracy”, which means that accuracy is measured only on utterances that are perfectly covered by the grammar. For instance, for a date grammar, an utterance such as “well, uh, january fourth” would be considered out-of-grammar (and therefore excluded from the accuracy calculation) if “well, uh” is not covered by the grammar.

Unfortunately, in the real world there’s no way to force users to stick to in-grammar utterances. In fact, users usually have no way of even knowing what the grammar covers other than through hints provided by the prompts. Even well-behaved users can hesitate, correct themselves, or use an unexpected formulation (which sounds perfectly natural to them), all of which are likely to be out-of-grammar. They can even say things that they believe will help the machine understand them (for instance using “victor” instead of “v” when spelling).

As a result, it’s not unusual for 30% to 50% of user utterances to be considered out-of-grammar, many of which are perfectly legitimate responses to the application prompt. So what’s the point of reporting in-grammar accuracy if this ignores a large chunk of legitimate user utterances? You tell me.
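To make the gap concrete, here is a small Python sketch that computes both figures on the same set of utterances. The field names and the counts are invented for illustration (500 legitimate utterances, 40% of them out-of-grammar); they are not measurements from a real application.

```python
def accuracy_report(utterances):
    """Compute in-grammar accuracy and accuracy over all legitimate utterances.

    Each utterance is a dict with three booleans (hypothetical field names):
      legitimate: contains a valid response to the prompt
      in_grammar: the exact wording is covered by the grammar
      correct:    the recognizer returned the right interpretation
    """
    in_grammar = [u for u in utterances if u["in_grammar"]]
    legitimate = [u for u in utterances if u["legitimate"]]
    return {
        "in_grammar_accuracy": sum(u["correct"] for u in in_grammar) / len(in_grammar),
        "overall_accuracy": sum(u["correct"] for u in legitimate) / len(legitimate),
        "out_of_grammar_rate": 1 - len(in_grammar) / len(utterances),
    }


# Invented counts: 300 in-grammar utterances (294 recognized correctly)
# and 200 legitimate out-of-grammar utterances (40 recognized correctly).
sample = (
    [{"legitimate": True, "in_grammar": True, "correct": True}] * 294
    + [{"legitimate": True, "in_grammar": True, "correct": False}] * 6
    + [{"legitimate": True, "in_grammar": False, "correct": True}] * 40
    + [{"legitimate": True, "in_grammar": False, "correct": False}] * 160
)
report = accuracy_report(sample)
```

On these made-up counts the in-grammar accuracy is 294/300, or 98%, while accuracy over every legitimate utterance is 334/500, or about 67%.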

Just to illustrate, you want to know one of the most effective ways of improving in-grammar accuracy? Just reduce grammar coverage. Sure, your out-of-grammar rate will increase but, hey, you’ll improve in-grammar accuracy! Isn’t that great? This tells you how useless in-grammar accuracy is at telling you whether you improved the grammar.

This is why we always report accuracy by considering every legitimate user utterance (i.e., the ones that contain a valid response to the prompt, regardless of wording or extraneous speech). This way, we make sure that we don’t conveniently ignore the utterances that happen to be the most challenging and we get results that accurately represent the real recognition performance (not some imaginary performance calculated on an idealized set of clean utterances).

But the best reason for doing it our way is that it enables us to truly measure improvements when we tune grammars. The reason is simple. Changing the coverage of a grammar always involves a trade-off. We can improve accuracy by covering more user utterances, but this can reduce overall accuracy if the new grammar paths introduce new speech recognition errors. The only way we can measure improvement is if we measure accuracy on a fixed set of valid utterances that doesn’t depend on the actual grammar coverage.
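A toy comparison shows why the fixed set matters. In the sketch below, two hypothetical grammar versions are scored against the same four reference utterances; the data layout, the utterance ids, and the interpretations are assumptions made up for the example.

```python
def score(results, reference):
    """Score one grammar version against a fixed reference set.

    `reference` maps utterance id -> expected interpretation.
    `results` maps utterance id -> (in_grammar, recognized interpretation).
    """
    covered = [uid for uid in reference if results[uid][0]]
    correct_covered = sum(results[uid][1] == reference[uid] for uid in covered)
    correct_all = sum(results[uid][1] == reference[uid] for uid in reference)
    return {
        "in_grammar_accuracy": correct_covered / len(covered),
        "overall_accuracy": correct_all / len(reference),
    }


# The reference set never changes, whatever the grammar covers.
reference = {"u1": "jan-04", "u2": "jan-04", "u3": "mar-12", "u4": "mar-12"}

# Broad grammar: covers all four utterances, misrecognizes one of them.
broad = {
    "u1": (True, "jan-04"), "u2": (True, "jan-04"),
    "u3": (True, "mar-12"), "u4": (True, "mar-02"),
}

# Narrow grammar: covers only the two cleanest utterances, both recognized.
narrow = {
    "u1": (True, "jan-04"), "u2": (False, None),
    "u3": (True, "mar-12"), "u4": (False, None),
}

broad_score = score(broad, reference)
narrow_score = score(narrow, reference)
```

Here the narrow grammar looks better on in-grammar accuracy (2/2 against 3/4) while it actually handles fewer callers correctly (2/4 against 3/4). Only the figure computed over the fixed set ranks the two versions the right way round.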

IVR Application Monitoring: What For?

I’ve recently blogged about what any good IVR monitoring service should provide in terms of reports. Now, let me take a step back, and address some of the reasons why you might want to consider monitoring your IVR application in the first place.

Customer Satisfaction

[Image: Satisfaction. Photo credit: Sanja Gjenero]

Customer satisfaction should always be one of your primary objectives. As part of any good customer satisfaction strategy, you certainly want the overall user experience to be as smooth as possible. A smooth experience relies on efficient dialog design that follows industry best practices, on continuous tuning, and on a high level of testing. Yet all these efforts become irrelevant the minute your customers call your application and get a busy tone, dead air, latency, or an outright transaction failure.

Our Mirador service continuously monitors your application, calling in and performing transactions with your IVR system, exactly as a real customer would. As soon as one of the selected performance metrics does not meet the configured level, a real-time alarm notification is sent to your operations team, which can instantly take action and therefore limit any potential negative consequences. Instant notification, instant reaction.
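The check behind such an alarm can be pictured as a comparison between the measurements of one test call and a set of configured limits. The Python below is a generic sketch of that idea, not Mirador's implementation: the metric names, the limit values, and the message format are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Limit:
    """Highest acceptable value for one metric (names are hypothetical)."""
    metric: str
    max_value: float


def find_violations(measurements, limits):
    """Return one alarm message per metric that breaks its configured limit.

    A metric with no measurement also raises an alarm: a call that never
    connected produces no timings at all.
    """
    alarms = []
    for limit in limits:
        value = measurements.get(limit.metric)
        if value is None:
            alarms.append(f"{limit.metric}: no measurement")
        elif value > limit.max_value:
            alarms.append(f"{limit.metric}: {value} exceeds {limit.max_value}")
    return alarms


limits = [
    Limit("call_setup_seconds", 3.0),
    Limit("response_delay_seconds", 2.0),
]
test_call = {"call_setup_seconds": 1.4, "response_delay_seconds": 4.8}
alarms = find_violations(test_call, limits)
```

In this example the call connects quickly but the application is slow to respond, so only the response delay triggers a notification.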

Service Level Agreement (SLA)

[Image: Contract. Photo credit: shho]

To quote Wikipedia,

“A service level agreement (frequently abbreviated as SLA) is a part of a service contract where the level of service is formally defined. In practice, the term SLA is sometimes used to refer to the contracted delivery time (of the service) or performance.” – Wikipedia

Here, performance is the key. And performance is tied not just to the IVR application itself, but to the overall infrastructure. That is, you not only have to make sure that your application meets the performance requirements set by your internal stakeholders, but you also need to be able to assess whether your infrastructure providers meet their respective SLAs. When you think of infrastructure providers, think about the whole solution in terms of points of failure from a customer perspective, including telecommunication pipes, toll-free and local numbers (all of them!), telephony hardware and software, IVR and CTI platforms, databases, etc.

But then again, how can you assess whether such requirements are met? Metrics to the rescue! Continuously gathering metrics about the health of the overall application is key to accepting such SLAs, and even to backing your claims about unacceptable performance from a provider.

Proactive Provisioning

[Image: Grocery Cart. Photo credit: akaak19]

You launched your IVR service a while ago and you’re really satisfied with the ROI. However, in the last few months, you have heard customers starting to complain about latency. How odd. The situation is easily explained: your IVR service has gained popularity. While your overall infrastructure was initially provisioned to comfortably handle a maximum of 10K calls during peak periods, your system is now handling 25K calls, thanks to an efficient marketing and communication department! Even if you analyzed Erlang tables for your specific requirements at launch, there is unfortunately no way to achieve prescience and predict your system’s overall popularity and success.
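For readers who have never used one, an Erlang table is a tabulation of the Erlang B formula, which gives the probability that a call is blocked for a given offered load (in erlangs, that is, call arrival rate times average call duration) and number of lines. The Python sketch below uses the standard recurrence for that formula; the load values and the 1% blocking target are illustrative assumptions, not figures from the scenario above.

```python
def erlang_b(offered_load, lines):
    """Blocking probability for `offered_load` erlangs offered to `lines` lines.

    Uses the standard recurrence B(0) = 1,
    B(n) = A * B(n-1) / (n + A * B(n-1)), where A is the offered load.
    """
    blocking = 1.0
    for n in range(1, lines + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


def lines_needed(offered_load, target_blocking):
    """Smallest number of lines that keeps blocking at or below the target."""
    lines = 0
    while erlang_b(offered_load, lines) > target_blocking:
        lines += 1
    return lines


# Illustrative loads only: the same 1% blocking target, before and after
# traffic grows by a factor of 2.5.
lines_at_launch = lines_needed(10.0, 0.01)
lines_after_growth = lines_needed(25.0, 0.01)
```

The formula answers "how many lines for this load", but the load itself is an input: if it turns out to be 2.5 times the launch estimate, the original provisioning no longer holds.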

The only way to avoid user experience degradation due to usage growth is to rely on proactive provisioning. Continuously monitoring your application and gathering performance metrics over time gives you a powerful way to see trends and anticipate problems that will unavoidably occur if the infrastructure is not appropriately expanded. A steady increase in response delay is a good indicator of degrading call handling performance, while an increase in both call setup time and failures might point to a telecommunication capacity problem.
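One simple way to turn gathered metrics into a trend is to fit a straight line to them. The Python sketch below computes a least-squares slope over equally spaced measurements and estimates how long until an acceptable limit is reached. The weekly figures and the 3-second limit are invented for illustration, and a straight line is a deliberate simplification of how delays really grow under load.

```python
def slope(values):
    """Least-squares slope of `values` against their index (0, 1, 2, ...).

    Needs at least two values.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def periods_until(values, limit):
    """Estimate how many periods remain before the metric reaches `limit`.

    Extrapolates from the last observed value at the fitted slope.
    Returns None when the metric is flat or improving.
    """
    rate = slope(values)
    if rate <= 0:
        return None
    return max(0.0, (limit - values[-1]) / rate)


# Invented weekly averages of response delay, in seconds.
weekly_delay = [1.0, 1.2, 1.4, 1.6, 1.8]
weeks_left = periods_until(weekly_delay, 3.0)
```

With these numbers the delay grows by 0.2 seconds per week, so the 3-second limit is about six weeks away, which is the kind of lead time proactive provisioning needs.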

Revenue

[Image: Coins. Photo credit: Zsuzsanna Kilian]

Last but not least: revenue. There is a good chance that your business is interested in a constant revenue stream, where more transactions mean more profits. There is also a good chance that your IVR application serves as one of your company’s income vehicles, where each customer call might be converted into a profitable transaction, either directly or indirectly. In this regard, each lost call may also mean a lost transaction and hence potential revenue gone… forever. A web transaction is, by its redundant nature, more tolerant of failures: connections can be re-established, transactions rolled over, and so on. That is not quite the case for over-the-phone transactions. Once the communication with your customer is gone, it is really gone. There is no such thing as a callback failover, with agents or systems calling back your potential-revenue customer. That could be an interesting business idea though. But that is another story : )

Mirador monitors your IVR system to make sure it is up and running, ready to accept incoming calls and perform transactions. That is, potential revenue-generating transactions.

Conclusion

Can your customers truly reach your IVR application? What about now? …And now? …now?

Coming up next: What’s in your monitoring alarm notification triggers?

Nu Echo Introduces the Mirador IVR Application Monitoring Service

The most effective way to make sure your speech or Touch-Tone IVR systems are up and running and provide the user experience you expect

MONTREAL, QC, February 8, 2011 – Nu Echo, creator of the NuBot Automated IVR Testing Platform, is introducing Mirador, an IVR Monitoring Service that makes sure your speech or Touch-Tone IVR systems are up and running and provide the user experience you expect.

Mirador continuously calls your IVR applications at regular intervals and simulates callers going through various application transactions, providing real-time notifications of performance degradations or system failures, as well as periodic reports detailing the system’s performance over time. Because this is all done remotely from our hosted platform, there is nothing to install at the IVR premises and there is no need to modify the IVR applications.

Read the full version.

Testing dynamic grammars

In my post on NuGram and CouchDB, I neglected to mention how the dynamic grammar was authored and, most importantly, tested. Having a repeatable process for testing grammars is very important when developing a speech application, as most grammars change and get more complex over time.

Of course, the grammar was authored with NuGram IDE. NuGram IDE has some great features for testing grammars, especially dynamic grammars. Dynamic grammars (like the streets grammar) have always been more difficult to debug than static grammars. They can be very easy to write for small applications or prototypes (or blog posts…), but in real applications their coverage tests are often (and should be!) run in batch as part of an automated build process. Setting this up is often too cumbersome in practice. For instance, a dynamic grammar implemented as a JSP page requires a web application server to run, and if the JSP page queries a database, the DB must be running somewhere too. This greatly complicates the setup for batch coverage tests. Moreover, writing and testing the dynamic grammar requires some programming skills that speech scientists don’t always have (at least not in large organizations).

With NuGram’s template language, a dynamic grammar can be tested in NuGram IDE Basic Edition in two different ways:

  • Using predefined data encoded as a JSON object (a JSON context), or
  • Using some custom Java code (a Java context).

Both ways require the creation of an instantiation context, which is simply a mapping between variable names and values. An instantiation context must provide a value for each and every variable used in the grammar template. The values are used to populate the template and produce the resulting (ABNF or XML) grammar. The way the instantiation context is created depends on the type of context. For a JSON context, the instantiation context is the JSON document itself. For a Java context, some Java code populates a map from strings to objects.
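The idea of an instantiation context can be illustrated without NuGram at all. The Python sketch below uses the standard library's `string.Template` as a stand-in: the `$name` placeholder syntax, the template text, and the variable names are assumptions for the example and are not NuGram's template language. It shows the two properties described above: the JSON document is the context, and a variable with no value is an error.

```python
import json
from string import Template

# Stand-in template. `$language` and `$streets` are template variables in
# Python's syntax; `$$street` is an escaped dollar sign, so the output
# contains the ABNF rule name `$street`.
GRAMMAR_TEMPLATE = Template(
    "#ABNF 1.0 UTF-8;\n"
    "language $language;\n"
    "root $$street;\n"
    "public $$street = $streets;\n"
)


def instantiate(template, json_context):
    """Populate `template` from a JSON document mapping variable names to values.

    List values are rendered as ABNF alternatives. A variable that has no
    value in the context is reported as an error.
    """
    context = json.loads(json_context)
    rendered = {
        name: " | ".join(value) if isinstance(value, list) else value
        for name, value in context.items()
    }
    try:
        return template.substitute(rendered)
    except KeyError as missing:
        raise ValueError(f"context has no value for variable {missing}") from None


json_context = '{"language": "en-US", "streets": ["main street", "elm street"]}'
grammar = instantiate(GRAMMAR_TEMPLATE, json_context)
```

Because the context is plain data, the same template can be instantiated in a batch test with several JSON documents, with no application server or database running.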

The following video shows how to create a JSON context for the street grammar:

This one shows the steps required to create and use a Java context:

Note: there was a subtle (uncovered) bug in the previous version of NuGram IDE. If you want to create Java contexts like in the video above, please make sure to download the latest version.

The whole project used in the videos is available on github. The Java context initializers use the following open-source libraries:

In the next post, I will show how to use the Java context initializer to deploy the streets grammar on the Java-based version of NuGram Server.

And you, how do you test your dynamic grammars?