IVR unit testing in CVP Studio

In one of my previous posts, I presented the concept of IVR unit testing. Although it is a very nice concept in theory, I am sure many of you said to yourselves: “Great, but I can’t do that since I use a graphical service creation environment (SCE)”. This may be true for some SCEs, but certainly not for all.

There are a few SCEs, like Cisco Unified Call Studio (formerly CVP Studio), that let you extend the environment with Java code. That’s what we did for one of our professional services projects. Let me briefly explain what we did in this project and present some of the benefits as well as a few lessons learned.

The application

The application was a very typical DTMF-only hierarchical IVR menu: lots of options, many optional messages at various places in the menu tree triggered by dynamic configuration options, information messages, etc. Each menu had to support a number of common navigation commands, like * to repeat, # to go back to the previous menu, and so on. The difficulty with such an application is that duplicating the dialog for each menu is quite time-consuming and highly error-prone when customer requirements are constantly evolving.

CVP Studio does provide a way to define reusable dialog patterns but, unfortunately, once a pattern has been copied to various places in your application, it cannot be modified in such a way that all its uses are automatically updated. You have to modify each use of the pattern manually.

In addition to those reusable dialog patterns, CVP Studio provides a programmatic API to implement custom elements. These elements can then be added to the SCE’s palette and used to implement nodes at various places in the application, each with its own configuration. Typically, such custom elements simply add some elements to a VoiceXML page (when they extend the VoiceElementBase class).

For our application, we implemented all the menu nodes using a custom element. The element encapsulates the common behaviors shared by all menus, like how to handle no input, errors, the repeat key, etc. A key advantage of doing this is that when we need to change one of those behaviors, all the nodes in the application are updated at once (saving us a lot of maintenance headaches). Another advantage is that this custom element can be easily reused in other applications as well.

Of course, some will note that we could have used VoiceXML subdialogs rather than custom elements to implement our reusable dialogs. However, since the management console used to configure all the dynamic elements of the application was already written in Java, it was more natural for us to build the custom elements in Java as well.

But the coolest thing about this approach is that the whole dialog for a single menu is driven by a small state machine that generates objects representing interactions with the caller (instead of plain XML elements) and accepts objects representing interaction results (like a no input, a no match, a recognition result, a DTMF input, etc.). And we ensured that it is possible to interact with the state machine without having to execute the dialog at run time. The state machines are completely decoupled from CVP Studio’s programmatic API!

Guess what? We could unit test all the menus very easily. We just wrote a test controller that injects interaction results programmatically into the state machine, retrieves the next interaction, and asserts some properties (which prompts are played, what options are available, etc.). It is thus very easy to test all the different situations that can be encountered at run time (call center open or closed, optional message activated or not, and so on).

Here is an example of a simple unit test:

The most interesting lines in this method are the ones near the end, beginning with testCase.addInputAssertion. They tell the test case which answers (interactions) from the user to simulate, as well as the assertions that must hold after the application has processed each interaction. For example, the first call simulates a no match event, and the test case makes sure that the next step from the application is to play the prompt (message) identified by the constant MenuConfiguration.NO_MATCH_1. The next one simulates a no input event and asserts that the generic options are enabled and that the menu prompts are played. Finally, the third one simulates another no input event and ensures that the call is transferred.
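To make this flow concrete, here is a minimal sketch of a menu state machine and a test that drives it. It is not the project's code: the `MenuSketch` class, its `Result`, `Interaction` and `MenuStateMachine` types, the prompt identifiers, and the "transfer on the third consecutive error" policy are all assumptions made for illustration. Only the idea of injecting interaction results and asserting on the next interaction comes from the description above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these types belong to the CVP Studio API.
final class MenuSketch {
    enum Result { NO_MATCH, NO_INPUT, DTMF }

    static final String NO_MATCH_1 = "no_match_1"; // assumed prompt identifiers
    static final String MENU = "menu";

    // The next step requested by the application: play prompts or transfer.
    static final class Interaction {
        final List<String> prompts;
        final boolean genericOptionsEnabled;
        final boolean transfer;

        Interaction(List<String> prompts, boolean genericOptionsEnabled, boolean transfer) {
            this.prompts = prompts;
            this.genericOptionsEnabled = genericOptionsEnabled;
            this.transfer = transfer;
        }
    }

    // A small state machine with no dependency on any IVR platform.
    static final class MenuStateMachine {
        private int errors = 0;

        // Takes the result of the last interaction, returns the next interaction.
        Interaction next(Result result) {
            if (result == Result.DTMF) {
                // Simplification: a real menu would navigate to another node here.
                errors = 0;
                return new Interaction(Arrays.asList(MENU), true, false);
            }
            errors++;
            if (errors >= 3) {
                // Third consecutive error: give up and transfer the call.
                return new Interaction(Collections.<String>emptyList(), false, true);
            }
            if (result == Result.NO_MATCH) {
                return new Interaction(Arrays.asList(NO_MATCH_1), true, false);
            }
            // No input: replay the menu with the generic options enabled.
            return new Interaction(Arrays.asList(MENU), true, false);
        }
    }

    public static void main(String[] args) {
        MenuStateMachine menu = new MenuStateMachine();
        Interaction afterNoMatch = menu.next(Result.NO_MATCH);
        Interaction afterNoInput = menu.next(Result.NO_INPUT);
        Interaction afterSecondNoInput = menu.next(Result.NO_INPUT);

        check(afterNoMatch.prompts.contains(NO_MATCH_1), "no-match prompt expected");
        check(afterNoInput.genericOptionsEnabled, "generic options expected");
        check(afterNoInput.prompts.contains(MENU), "menu prompts expected");
        check(afterSecondNoInput.transfer, "transfer expected");
        System.out.println("all assertions hold");
    }

    private static void check(boolean condition, String message) {
        if (!condition) throw new AssertionError(message);
    }
}
```

Because the state machine only exchanges plain objects, the same `next` calls can be issued from a unit test or from a thin adapter that turns each `Interaction` into VoiceXML at run time.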

This example only illustrates the testing of a simple generic behavior. The more interesting test cases involve specific nodes in the call-flow that depend heavily on the dynamic configuration of the application. By stubbing the clock, for instance, we can make sure that the messages saying that the contact center is closed are properly played outside of business hours and that the option to transfer the call to an agent is disabled. Other tests ensure that, during business hours, the proper transfer reasons are set before transferring to the call center queue manager.
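Clock stubbing of this kind can be sketched with the standard `java.time.Clock` class. The `BusinessHoursSketch` class, its method names, and the 08:00 to 17:00 opening hours are assumptions for illustration; the sketch also ignores weekends, holidays and time zones other than UTC.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.LocalTime;
import java.time.ZoneOffset;

// Hypothetical sketch: the opening hours and names are assumptions.
final class BusinessHoursSketch {
    private static final LocalTime OPEN = LocalTime.of(8, 0);
    private static final LocalTime CLOSE = LocalTime.of(17, 0);

    // The clock is passed in so that tests can substitute a fixed one.
    static boolean isOpen(Clock clock) {
        LocalTime now = LocalTime.now(clock);
        return !now.isBefore(OPEN) && now.isBefore(CLOSE);
    }

    // Agent transfer is only offered while the contact center is open.
    static boolean agentTransferEnabled(Clock clock) {
        return isOpen(clock);
    }

    public static void main(String[] args) {
        // Clock.fixed always returns the same instant, which makes tests repeatable.
        Clock evening = Clock.fixed(Instant.parse("2024-01-15T22:30:00Z"), ZoneOffset.UTC);
        Clock morning = Clock.fixed(Instant.parse("2024-01-15T10:00:00Z"), ZoneOffset.UTC);
        System.out.println("22:30 open? " + isOpen(evening));
        System.out.println("10:00 open? " + isOpen(morning));
    }
}
```

In production code, the same methods would simply receive `Clock.systemDefaultZone()` or an equivalent real clock.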

Some lessons learned

Note that there are also some drawbacks to this approach. First, this technique does not make it possible to test the sequencing of these custom nodes in the application. For that, we had to rely on manual testing. But that was not such a big deal after all. The way those custom nodes are connected in CVP Studio makes the validation process quite trivial: it is just a matter of comparing the call-flow design document with the call-flow in CVP Studio. For instance, if the application goes from A to B when DTMF 8 is pressed in the former, there is a transition labelled “8” from custom node A to custom node B in the latter.

Reporting was another issue we faced due to this approach. CVP already provides extensive reporting capabilities when the application uses only predefined elements. When using custom elements, you have to carefully log events in a special table of the Informix DB, and this greatly complicates the consolidation of information to get a precise understanding of what’s happening with the calls.

Was it worth it?

All in all, the advantages of this approach far outweighed the issues just mentioned, at least for this project, in which a large part of the configuration is dynamic and requires a fair amount of Java code anyway. We still maintain the application and it evolves quite rapidly, even several years after its initial deployment.

Hey, I don’t use CVP!

You don’t use CVP? There are other approaches giving you some of these benefits. I’ll outline some of them in upcoming posts. Stay tuned!

And of course, if you have experience implementing unit testing using graphical service creation environments, please share it with us!

The out-of-grammar challenge

The life of speech application developers would be so much simpler if callers were kind enough to only say things that are covered by the grammars. Unfortunately, because life was never meant to be simple, we will always have to deal with people who:

  • use all kinds of creative sentence constructions
  • stutter, correct themselves, or repeat portions of their utterance
  • find it impossible to just answer the question
  • have side conversations
  • don’t listen to the prompts
  • fumble while they look for the requested information
  • express their displeasure in a colorful way
  • say something that makes no sense
  • etc., etc., etc.

Then, of course, there are all the utterances truncated by the endpointer, all the false barge-ins caused by noise, etc.

All of this explains why so many applications that work so well in demos actually perform so poorly in the field. There’s no avoiding that we have to build applications that real people can use and, unfortunately, real people quite often don’t behave the way we would like them to. And that’s OK. It’s our job to make sure that as many callers as possible get the best possible user experience.

The out-of-grammar impact on tuning

Many of the biggest tuning challenges relate to “out-of-grammar” utterances (see the previous post for a discussion of the different meanings of “out-of-grammar”), which mostly fall into two categories:

  1. Valid utterances —These are perfectly understandable utterances that provide the information that is expected by the application but which, for one reason or another, are not covered (i.e., can’t be parsed) by the grammar.
  2. Invalid utterances —These are utterances that are unusable by the application because they have no useful meaning.

Here is a list of ways in which out-of-grammar utterances can impact tuning:

  • Inflated False Accept rate — Valid utterances that are incorrectly labeled “out-of-grammar” can significantly inflate the False Accept rate and force us to set the high threshold much higher than necessary. See below for details.
  • Computing the reference semantic interpretation — In order to evaluate key performance metrics, we need to have the correct semantic interpretation for each valid utterance in our test set (the “reference semantic interpretation”) so that we can compare with the semantic interpretation obtained from the recognition result. For those utterances whose transcription can be parsed by the grammar, that’s trivial. Unfortunately, there are usually quite a few valid utterances whose transcription produces no parse.
  • Grammar coverage optimization — Careful analysis of field utterances almost always reveals grammar coverage problems that should be addressed. Without tools to suggest improvements to the grammar, though, this can be a lot of work. Moreover, optimum coverage – which is different from maximum coverage – can only be established through iterative experimentation.
  • Avoiding false accepts — Quite often, an invalid utterance will produce a recognition result with a high confidence score, leading to a false accept and, potentially, a dialogue failure. In some cases, this can be a very significant problem.
  • Prior probability considerations — Let’s say we use a speech menu in which a certain choice is used very rarely. If we assume that all choices are equally likely to falsely match out-of-grammar utterances, then the out-of-grammar impact on the rare choice will be proportionally much greater than on the other choices. This should be taken into consideration.
  • When to propose a second choice — Let’s say a user just said no to the confirmation: “I think you said ‘Austin’. Is that correct?” Should we propose the second choice in the N-best list (Boston)? That depends on the probability that this second choice is correct, which to a large extent depends on the proportion of out-of-grammar responses.
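The prior probability point in the list above can be illustrated with a little arithmetic. The sketch below assumes, as the bullet does, that out-of-grammar false accepts spread equally over all menu choices, and it further assumes (for simplicity) that every in-grammar utterance is accepted with the correct choice. The `PriorSketch` class and all the counts are made up for illustration.

```java
// Hypothetical sketch: all counts are invented to illustrate the reasoning.
final class PriorSketch {
    // Share of the results labeled with one choice that actually come from
    // out-of-grammar false accepts, assuming those false accepts are spread
    // equally over all choices.
    static double oogShare(int inGrammarCount, int oogFalseAccepts, int numChoices) {
        double oogPerChoice = (double) oogFalseAccepts / numChoices;
        return oogPerChoice / (inGrammarCount + oogPerChoice);
    }

    public static void main(String[] args) {
        int oogFalseAccepts = 30; // out-of-grammar utterances wrongly accepted
        int numChoices = 3;       // size of the menu
        // A common choice (600 genuine uses) versus a rare one (20 genuine uses).
        System.out.printf("common choice: %.3f%n", oogShare(600, oogFalseAccepts, numChoices));
        System.out.printf("rare choice:   %.3f%n", oogShare(20, oogFalseAccepts, numChoices));
    }
}
```

With these invented numbers, the same ten false accepts per choice are a small fraction of the results for the common choice but about a third of the results for the rare one.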

In upcoming posts, I’ll discuss each of these issues in more detail. For the time being, I’ll focus on the first one.

The inflated false accept problem

Let me illustrate this problem using a simple speech menu where people can select between three choices: “correct address”, “wrong address”, and “repeat the address”. The grammar naturally supports many variations of these key phrases, with a number of appropriate prefixes and suffixes.

The problem is that, in practice, responses contain a fair proportion of disfluencies (stuttering, corrections, repeats, etc.). As a result, there are quite a few transcriptions for which the grammar produces no parse. In the graph below (showing Correct Accept vs. False Accept, see previous post for definitions), the blue curve shows what happens if these are left “out-of-grammar” while the red curve shows what happens when all valid utterances are classified “in-grammar” and labeled with the correct reference semantic interpretation.

As we can see, the difference is quite significant. Let’s suppose we want to set the high threshold so that we have a maximum false accept rate of 0.5%. In the first case (blue curve), we would need to use a high threshold of 0.98, resulting in a Correct Accept rate of around 50%, while in the second case (red curve), we could get a Correct Accept rate of 96%, using a confidence threshold of 0.05.

In other words, properly managing these OOG utterances can mean the difference between a lot of needless confirmations and almost no confirmation, which makes a huge difference in user experience.

You can only tune what you can measure

This is the first in a series of posts I’ll do on speech application tuning over the coming weeks. Hopefully, this will provoke interesting feedback and, who knows, even spark some lively discussions.

I’m starting with the topic of speech recognition metrics because that’s necessary in order to set the stage for most of what I’ll talk about next. Although this may not be the most exciting topic, it is clearly a very important one.

Some terminology

Tuning a speech application involves attempts to optimize a certain number of key performance metrics. Improvement or deterioration in these metrics is what tells us whether or not we’re making progress. Although that should be intuitively obvious to most people, what’s perhaps less obvious is how to select metrics that correlate best with the application’s success rate and user experience in the field.

Let me start by defining some terminology:

  • In-grammar / out-of-grammar —This determines whether an utterance is covered or not by the grammar. Later on in this post, I’ll talk at length about the different ways the word “covered” may be interpreted.
  • Accepted / rejected —This determines whether the recognition result’s confidence score is greater (accepted) or smaller (rejected) than the given confidence threshold. Note that there may be more than one confidence threshold for a given recognition context. If we’re talking about the high threshold, then “accepted” usually means that no confirmation is required. If we’re talking about the low threshold, then “accepted” means a confirmation will be required and “rejected” means that the user needs to be re-prompted.
  • Correct / incorrect —This determines whether or not a recognition result is correct. Although the definition of “correct” may vary, it is often interpreted to mean that the top recognition result in the N-best list has the correct semantic result (i.e., it doesn’t matter that not all words were correctly recognized as long as the semantic result is correct). Note that we assume here that only in-grammar utterances can be classified as either correct or incorrect.

When we perform a recognition test for a grammar using utterances collected in the field, we compute a set of 6 counters for each confidence threshold value in a range from 0.0 to 1.0. These counters are:

  • AC — Number of in-grammar utterances that are accepted and correct (often called CA-in in the industry)
  • AI — Number of in-grammar utterances that are accepted and incorrect (often called FA-in)
  • RC — Number of in-grammar utterances that are rejected and correct
  • RI — Number of in-grammar utterances that are rejected and incorrect
  • Aoog — Number of out-of-grammar utterances that are accepted (often called FA-out)
  • Roog — Number of out-of-grammar utterances that are rejected (often called CR-out)

Note that the value FR-in, often seen in the industry, is equal to RC+RI. We like to keep these two values separate since they allow us to distinguish recognition errors from rejection errors. We add two important variables that are computed from the above counters:

  • ing — Number of in-grammar utterances (= AC+AI+RC+RI)
  • oog — Number of out-of-grammar utterances (= Aoog+Roog)
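As a minimal sketch, the six counters and the two derived totals could be computed as follows for one threshold value. The `CounterSketch` class, the `Utterance` fields and the sample data are assumptions for illustration. Treating a confidence score exactly equal to the threshold as accepted is also an assumption, since the definition above does not say which way ties go.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: class names, fields and data are illustrative only.
final class CounterSketch {
    static final class Utterance {
        final boolean inGrammar;   // reference label for the utterance
        final boolean correct;     // top result has the right semantics (in-grammar only)
        final double confidence;   // recognizer confidence score, 0.0 to 1.0

        Utterance(boolean inGrammar, boolean correct, double confidence) {
            this.inGrammar = inGrammar;
            this.correct = correct;
            this.confidence = confidence;
        }
    }

    static final class Counters {
        int ac, ai, rc, ri, aoog, roog;
        int ing() { return ac + ai + rc + ri; }
        int oog() { return aoog + roog; }
    }

    // Classifies every utterance into exactly one of the six counters.
    static Counters count(List<Utterance> utterances, double threshold) {
        Counters c = new Counters();
        for (Utterance u : utterances) {
            boolean accepted = u.confidence >= threshold; // tie handling is an assumption
            if (!u.inGrammar) {
                if (accepted) c.aoog++; else c.roog++;
            } else if (u.correct) {
                if (accepted) c.ac++; else c.rc++;
            } else {
                if (accepted) c.ai++; else c.ri++;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        List<Utterance> testSet = Arrays.asList(
            new Utterance(true, true, 0.9),    // AC
            new Utterance(true, false, 0.8),   // AI
            new Utterance(true, true, 0.3),    // RC
            new Utterance(true, false, 0.2),   // RI
            new Utterance(false, false, 0.7),  // Aoog
            new Utterance(false, false, 0.1)); // Roog
        Counters c = count(testSet, 0.5);
        System.out.println("AC=" + c.ac + " AI=" + c.ai + " RC=" + c.rc + " RI=" + c.ri
            + " Aoog=" + c.aoog + " Roog=" + c.roog + " ing=" + c.ing() + " oog=" + c.oog());
    }
}
```

Running `count` once per threshold value in the range 0.0 to 1.0 yields the data needed to plot a full curve.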

With these, we define the two key metrics that we’ll use constantly:

  • Correct accept rate (CA-rate) —This is the percentage of in-grammar utterances that are accepted with a correct result. It is computed as CA-rate = AC/ing.
  • False accept rate (FA-rate) — We use two versions of this metric:
    1. The percentage of all utterances that are incorrectly accepted. It is computed as: FA-rate = (AI+Aoog)/(ing+oog) = (AI+Aoog)/all
    2. The percentage of accepted utterances that are incorrect. It is computed as FA-rate = (AI+Aoog)/(AC+AI+Aoog) = (AI+Aoog)/A
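The metrics above are simple ratios of the counters, as the following sketch shows. The `RateSketch` class and the sample counts are invented for illustration. For the second FA-rate version, the sketch takes the number of accepted utterances to be A = AC + AI + Aoog, that is, every accepted utterance whether in-grammar or not.

```java
// Hypothetical sketch: the class name and the sample counts are illustrative.
final class RateSketch {
    // CA-rate = AC / ing, with ing = AC + AI + RC + RI.
    static double caRate(int ac, int ai, int rc, int ri) {
        int ing = ac + ai + rc + ri;
        return ing == 0 ? 0.0 : (double) ac / ing;
    }

    // FA-rate, version 1: incorrectly accepted utterances over all utterances.
    static double faRateOverAll(int ac, int ai, int rc, int ri, int aoog, int roog) {
        int all = ac + ai + rc + ri + aoog + roog;
        return all == 0 ? 0.0 : (double) (ai + aoog) / all;
    }

    // FA-rate, version 2: incorrectly accepted utterances over accepted utterances.
    static double faRateOverAccepted(int ac, int ai, int aoog) {
        int accepted = ac + ai + aoog;
        return accepted == 0 ? 0.0 : (double) (ai + aoog) / accepted;
    }

    public static void main(String[] args) {
        // Invented counts for one threshold value.
        int ac = 900, ai = 20, rc = 60, ri = 20, aoog = 10, roog = 90;
        System.out.printf("CA-rate            = %.4f%n", caRate(ac, ai, rc, ri));
        System.out.printf("FA-rate (all)      = %.4f%n", faRateOverAll(ac, ai, rc, ri, aoog, roog));
        System.out.printf("FA-rate (accepted) = %.4f%n", faRateOverAccepted(ac, ai, aoog));
    }
}
```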

Here’s an example that will hopefully help clarify all this. The following graph plots the CA-rate as a function of the FA-rate for a phone number recognition experiment. I’ll use this type of graph constantly, so you might want to familiarize yourself with it. Note that, in order to avoid any confusion, the axes are labeled with the metric’s definition, not its name.

In the graph, the hidden variable is the confidence threshold. As the confidence threshold decreases from 1.0 to 0.0, both the CA-rate and the FA-rate increase. If we are using two thresholds then we would want to set the high threshold so that the FA-rate is very low (less than 1%, say), while the low threshold would be set in order to have an appropriate balance between confirming too many incorrect results and rejecting too many correct results.
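Choosing the high threshold from such a curve can be sketched as a simple search: among the measured operating points, keep the one with the lowest threshold whose FA-rate does not exceed the target. The `ThresholdSketch` class and the sample curve are invented for illustration, and the sketch relies on the property stated above, namely that both rates grow as the threshold decreases.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the operating points below are invented.
final class ThresholdSketch {
    static final class Point {
        final double threshold;
        final double caRate;
        final double faRate;

        Point(double threshold, double caRate, double faRate) {
            this.threshold = threshold;
            this.caRate = caRate;
            this.faRate = faRate;
        }
    }

    // Returns the point with the lowest threshold whose FA-rate stays within
    // maxFaRate, or null when no point qualifies. Since the CA-rate grows as
    // the threshold drops, this is also the qualifying point with the best CA-rate.
    static Point pickHighThreshold(List<Point> curve, double maxFaRate) {
        Point best = null;
        for (Point p : curve) {
            if (p.faRate <= maxFaRate && (best == null || p.threshold < best.threshold)) {
                best = p;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Point> curve = Arrays.asList(
            new Point(0.9, 0.60, 0.002),
            new Point(0.7, 0.80, 0.004),
            new Point(0.5, 0.90, 0.008),
            new Point(0.3, 0.95, 0.020));
        Point chosen = pickHighThreshold(curve, 0.01); // at most 1% false accepts
        System.out.println("threshold=" + chosen.threshold + " CA-rate=" + chosen.caRate);
    }
}
```

The low threshold calls for a different trade-off (confirming too much versus rejecting too much), so it is not covered by this search.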

Using a graphical representation of results has several advantages. One advantage is that it provides a visually clear view of how effectively we avoid false accepts. A curve that grows slowly from left to right is a clear indication that we’re not effectively rejecting out-of-grammar utterances. We’d like the curve to initially have a very steep slope and then to taper off when the CA-rate gets close to the maximum value.

Another very important advantage is that it makes it easy to compare results from different experiments. A curve that’s above another immediately tells us that it’s a better result (for a given FA-rate, we have a better CA-rate). That, however, assumes that the results truly are comparable, which brings us back to an issue that we had earlier postponed: the definition of “in-grammar”.

The importance of correctly defining “in-grammar”

I had mentioned earlier that an in-grammar utterance is an utterance that is “covered” by the grammar. But what does “covered” mean? For much of the industry, this simply means that the sentence transcription can be parsed by the grammar.

This definition turns out to be quite problematic. Fundamentally, the in-grammar utterances are those we consider valid, and therefore those that we should recognize as well as possible, while out-of-grammar utterances are those the application should be rejecting. However, the definition of “valid” should be based on what application users perceive, not on what we have decided that the grammar should cover. If, for speech recognition accuracy considerations, we decide not to cover certain forms of user responses that are not used very often, this doesn’t make them any less valid. It just makes them less frequent.

Let me illustrate this with a simple example. The graph below shows the results from three date recognition experiments. The blue curve shows the result obtained using a grammar that supports two main forms: <month><day>[<year>] (“January fifth”) and <day><month>[<year>] (“Fifth of January”). Let’s say we want to see what happens if we decide not to support the second, rarer form. The result is the red curve which, as we can see, seems to indicate better performance.

But that’s an illusion. The problem is that the red curve considers all dates of the form “Fifth of January” as out-of-grammar. We’re of course doing a better job of recognizing the now reduced set of in-grammar utterances, but a larger proportion of user utterances are considered invalid. In other words, we’re now correctly handling a smaller proportion of valid user utterances. From the user’s perspective, that’s certainly not an improvement. In fact, if we consider both forms as in-grammar (i.e., valid), then we get the green curve, which clearly tells us that the results have in fact deteriorated considerably.

In order to get meaningful results, we need to determine which utterances are in-grammar and which are out-of-grammar based on what should be considered a valid response, regardless of what we decide to include in our grammar. This has several important benefits:

  • It makes it possible to get meaningful comparisons between results obtained with grammars having different coverages since the sets of in-grammar and out-of-grammar utterances are the same in all cases.
  • We get results that are much more representative of user experience. To make sure that this is the case, we normally decide that an utterance is “in-grammar” if a human listener would consider it to be a valid and unambiguous response to the question (within an acceptable “domain” of responses).
  • Last, but not least, we get applications that deliver similar or better success rates with fewer confirmations, and therefore better user experience. The reason is that, very often, many sentences that can’t be parsed by the grammar nonetheless give a high-confidence, correct semantic result. If these were considered out-of-grammar, then they would become false accepts. It takes very few of these to have a significant impact on the FA-rate, with the unfortunate consequence that we would end up using confidence thresholds much higher than necessary, resulting in many unnecessary confirmations.
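The last point can be illustrated with a small sketch that scores the same recognition results under both labeling policies. The `LabelingSketch` class, its fields and the ten sample utterances are invented for illustration; the FA-rate used is the first version, that is, incorrectly accepted utterances over all utterances.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the data set is invented to illustrate the effect.
final class LabelingSketch {
    static final class Utterance {
        final boolean parses;     // transcription can be parsed by the grammar
        final boolean valid;      // a human listener considers the response valid
        final boolean correct;    // recognition result has the right semantics
        final double confidence;

        Utterance(boolean parses, boolean valid, boolean correct, double confidence) {
            this.parses = parses;
            this.valid = valid;
            this.correct = correct;
            this.confidence = confidence;
        }
    }

    // FA-rate over all utterances: (AI + Aoog) / all.
    static double faRate(List<Utterance> utterances, double threshold, boolean labelByParse) {
        int falseAccepts = 0;
        for (Utterance u : utterances) {
            if (u.confidence < threshold) continue; // rejected, cannot be a false accept
            boolean inGrammar = labelByParse ? u.parses : u.valid;
            if (!inGrammar || !u.correct) falseAccepts++;
        }
        return utterances.isEmpty() ? 0.0 : (double) falseAccepts / utterances.size();
    }

    static List<Utterance> sample() {
        List<Utterance> list = new ArrayList<>();
        for (int i = 0; i < 7; i++) {
            list.add(new Utterance(true, true, true, 0.9));   // covered and correct
        }
        // Valid answers with disfluencies: no parse, yet recognized correctly.
        list.add(new Utterance(false, true, true, 0.85));
        list.add(new Utterance(false, true, true, 0.85));
        // Truly invalid utterance, rejected at this threshold.
        list.add(new Utterance(false, false, false, 0.2));
        return list;
    }

    public static void main(String[] args) {
        List<Utterance> data = sample();
        System.out.println("FA-rate, labeled by parse:    " + faRate(data, 0.8, true));
        System.out.println("FA-rate, labeled by validity: " + faRate(data, 0.8, false));
    }
}
```

In this toy data set, the two disfluent but correctly recognized answers count as false accepts only when in-grammar is defined by the parse, which is the inflation described above.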

So this pretty much concludes what I wanted to talk about today. In the next post, I’ll talk about tuning challenges related to out-of-grammar utterances.