Author Archives: Dominique Boucher

Introducing Rivr, an open-source Java dialog engine for VoiceXML applications

After more than a decade building the most demanding, performance-driven speech applications in the market, Nu Echo is proud to announce that it is open-sourcing its underlying Java dialog engine for VoiceXML: Rivr.

What is it?

Rivr is a lightweight application framework designed to easily create enterprise-grade VoiceXML applications. With Rivr, code is king. No Templates. No Session Management. Rivr allows the experienced developer to use all known OO concepts: abstraction, reuse, composition, modularization, aggregation. No compromise on this. And with your choice of JVM language: Java, Scala, Groovy, Jython, JRuby, Clojure, etc.

Here is a complete dialog written on top of Rivr:

SpeechRecognitionConfiguration speechRecConfig =
  new SpeechRecognitionConfiguration(
    new GrammarReference("builtin:grammar/number"));

InteractionTurn question =
  newInteractionBuilder("question")
    .addPrompt(new SynthesisText("Say a number."))
    .build(speechRecConfig, TimeValue.seconds(5));

VoiceXmlInputTurn answer = doTurn(context, question);

int number = Integer.parseInt(
  answer.getRecognitionInfo()
        .getRecognitionResult()
        .getJsonObject(0)
        .getString("interpretation"));

String feedbackMessage = number + ": That's "
  + (number > 1000 ? "a big number." : "reasonable.");

doTurn(context, new MessageTurn("feedback",
    new SynthesisText(feedbackMessage)));

return new VoiceXmlExitTurn("exit");

Nice and simple, isn’t it? Rivr completely hides all request/response handling so typical of web applications. With Rivr, your code is straightforward.

Is Rivr a toy project? Absolutely not! It is already in use by several corporate organizations and powers some of the most demanding speech applications on the market. It has been built by Java developers, for Java developers. Among its main features, Rivr integrates nicely with:

  • Your own workflow and tool set such as code coverage, unit tests, continuous build server, war deployment, and so on.
  • Your own testing and mocking framework including JUnit, TestNG, and Mockito (think about support for multiple environments: dev, staging, production, etc.).
  • Your favorite web application server: JBoss, Jetty, WebSphere, Weblogic, Tomcat, and the like.
  • Your own back-end, the way you want, with the tool you want (JAX-WS, JNDI, JDBC, JAX-RS).
  • Your own dependency injection framework (Spring, Guice, etc).
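Because a Rivr dialogue is plain Java code, the parts that don't touch VoiceXML can be covered by ordinary unit tests. As a sketch, the feedback-message logic from the dialogue above could be extracted into a plain method (FeedbackFormatter is a hypothetical helper for illustration, not part of Rivr):

```java
// Hypothetical helper: the message logic from the dialogue above, extracted
// into a plain method so it can be exercised by JUnit, TestNG, or plain
// assertions, with no VoiceXML platform in the loop.
public class FeedbackFormatter {
    static String feedbackFor(int number) {
        return number + ": That's " + (number > 1000 ? "a big number." : "reasonable.");
    }

    public static void main(String[] args) {
        // Stand-in for a real unit test case.
        if (!feedbackFor(1001).equals("1001: That's a big number.")) throw new AssertionError();
        if (!feedbackFor(42).equals("42: That's reasonable.")) throw new AssertionError();
        System.out.println("feedback logic ok");
    }
}
```

The same idea scales up: keep the dialogue flow thin and push decisions into plain methods, and your existing test and CI tooling applies unchanged.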

With Rivr, you have full control over generated VoiceXML for custom platform support. Rivr offers support for all VoiceXML primitives including DTMF, speech rec (no kidding!), recording, TTS, subdialogs, objects, scripts, transfers. It works readily on standards-compliant platforms such as Genesys GVP, Cisco CVP, Avaya Aura, and Voxeo Prophecy.

Rivr is distributed under the Apache 2 License and is hosted on GitHub. Forking the project is only a click away. Go grab the code! And read the wiki to get started. Your contributions and comments are more than welcome.

And what’s next?

In the upcoming weeks, we will blog more about Rivr, the philosophy that drove its design, its main features, its underlying architecture, some code recipes, and more.

Also, I will give a talk on Rivr at the SpeechTEK conference in New York, on August 19th at 1:15 (track D103).

Stay tuned!

We are hiring!

Nu Echo is on the lookout for dynamic and talented people who are passionate about their work, motivated by our obsession with delivering products of uncompromising quality and performance, and driven by the excellence of the professional services for which we are now widely recognized in the industry. We are currently looking to fill three full-time positions.

Interested? Send us your resume at hr@nuecho.com.

Grammar conversion: lessons learned

Lately, I have been involved in a number of grammar conversion projects. This has been a great opportunity to put our process and tools to the test once again. And since every project has its peculiarities, we learn constantly.

The process we outlined about a year ago omitted a number of small details. That was fine for small-scale conversion projects. But when you have to deal with much larger projects (with thousands of grammars to convert), these details add up significantly. Let me share some of the issues we face daily.

It’s not just semantic tags

When you have tools to automatically convert semantic tags from one format to another, grammar conversion can seem like a no-brainer. But reality is not that simple. Grammars are not written against an abstract specification; they are written for a very specific recognition engine. They often contain:

  • Words (tokens) that map to very specific pronunciations or that try to model some disfluencies (like hesitations, for instance), but for which the SRGS $GARBAGE rule is more appropriate.
  • Multiword duplicates, with one sequence of space-separated words, and a similar sequence of underscore-separated words to allow cross-word phonetization (like “thirty one” and “thirty_one”).
  • Words that map to very specific, tuned pronunciations. Such words often have an unusual orthography to make sure they are not confused with real words.

All this means that a number of transformations must be applied, either to the original grammars or to the converted ones, whether by means of regular-expression search-and-replace or by manual inspection of the grammars.
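For the multiword duplicates above, for example, a simple regular-expression pass can restore the space-separated form in bulk. A minimal sketch (the pattern and class name are illustrative; real projects usually need a per-project list of patterns):

```java
import java.util.regex.Pattern;

public class MultiwordFixer {
    // Matches two-word underscore-joined tokens such as "thirty_one".
    // Longer chains and mixed-case tokens would need additional patterns.
    private static final Pattern MULTIWORD = Pattern.compile("\\b([a-z]+)_([a-z]+)\\b");

    // Replaces each underscore-joined duplicate with its space-separated form.
    static String normalizeMultiwords(String grammarText) {
        return MULTIWORD.matcher(grammarText).replaceAll("$1 $2");
    }

    public static void main(String[] args) {
        System.out.println(normalizeMultiwords("thirty_one | thirty two"));
        // prints: thirty one | thirty two
    }
}
```

Applying such passes in batch keeps the transformations reproducible, which matters when you have to re-run a conversion over thousands of grammars.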

Generation of coverage sets

When dealing with hundreds (if not thousands) of grammars, it is not feasible to create initial coverage test sets manually. This is way too time-consuming. That means you have to find a way to generate those initial coverage test sets automatically, in batch. But how do you do that?

Fortunately, NuGram IDE already provides sophisticated tools to analyze grammars and generate sentences from them. On this foundation, we built a tool to automatically generate coverage test sets for a set of ABNF grammars. The tool also reports problems found in the grammars, like the use of digits in voice grammars, or of words in DTMF grammars.

The coverage set generation tool uses a combination of configuration and sophisticated analyses to determine how to generate sentences and how many sentences to generate. For example, it's not possible to generate all sentences from a grammar that covers an infinite number of sentences. When that's the case (or when the number of sentences covered by the grammar is above a certain threshold), the tool falls back on other generation strategies.
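NuGram's actual generation strategies aren't shown here, but the core idea — switch from exhaustive enumeration to sampling when the language is infinite or too large — can be sketched with a toy grammar (the rule encoding and the sampling policy below are illustrative, not NuGram's):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class CoverageSampler {
    // Toy grammar: $digits covers an infinite language (one or more digits),
    // so exhaustive generation is impossible and we sample instead.
    static final Map<String, List<List<String>>> RULES = Map.of(
        "$digits", List.of(List.of("$d"), List.of("$d", "$digits")),
        "$d", List.of(List.of("one"), List.of("two"), List.of("three")));

    static final Random RANDOM = new Random(42); // fixed seed for reproducible test sets

    // Randomly expands a symbol; unknown symbols are terminal words.
    static String generate(String symbol) {
        List<List<String>> alternatives = RULES.get(symbol);
        if (alternatives == null) return symbol;
        List<String> chosen = alternatives.get(RANDOM.nextInt(alternatives.size()));
        StringBuilder out = new StringBuilder();
        for (String part : chosen) {
            if (out.length() > 0) out.append(' ');
            out.append(generate(part));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample until we have 5 distinct sentences for the coverage set.
        Set<String> coverage = new LinkedHashSet<>();
        while (coverage.size() < 5) coverage.add(generate("$digits"));
        coverage.forEach(System.out::println);
    }
}
```

A real tool would first try to count (or bound) the language's size and only fall back on sampling above the threshold; the sketch shows just the sampling half of that decision.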

Recognition tests as part of the QA process

Finally, even a syntactically valid grammar may fail to load in the ASR for a variety of reasons, the most common one being a limitation or constraint of the ASR itself. For this reason, we came to the conclusion that doing recognition tests (ideally benchmarking of the converted grammars) is a very useful addition to the QA process. Of course, simply compiling the grammars may catch a number of problems. But doing a "before and after" comparison can detect conversion problems that were not caught by the coverage tests when those tests are not exhaustive.

Another benefit of doing recognition tests is the ability to check the performance of the converted grammars to identify those needing additional work. Some converted grammars may have words that prove difficult to recognize with the new engine because they are not properly phonetized, thus calling for application-specific (or even grammar-specific) phonetic dictionaries.
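A minimal sketch of such a before/after comparison, assuming recognition results have been collected into utterance-to-interpretation maps (all names are illustrative):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BeforeAfterCheck {
    // Compares recognition results (utterance -> interpretation) obtained with
    // the original grammars vs. the converted ones, flagging any divergence.
    static List<String> mismatches(Map<String, String> before, Map<String, String> after) {
        List<String> diffs = new ArrayList<>();
        for (Map.Entry<String, String> e : before.entrySet()) {
            String converted = after.get(e.getKey());
            if (!e.getValue().equals(converted))
                diffs.add(e.getKey() + ": " + e.getValue() + " -> " + converted);
        }
        return diffs;
    }

    public static void main(String[] args) {
        Map<String, String> before = new LinkedHashMap<>();
        before.put("thirty one", "31");
        before.put("forty two", "42");
        Map<String, String> after = new LinkedHashMap<>(before);
        after.put("forty two", "40");
        // Flags the utterance whose interpretation changed after conversion.
        System.out.println(mismatches(before, after));
    }
}
```

In practice the "interpretation" would be a full semantic result plus confidence score, and a mismatch would trigger a closer look at both the converted grammar and the engine's phonetization of the words involved.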

What about DTMF?

In the specific case of converting GSL grammars to GrXML or ABNF, a complication arises from the presence, in the same grammar, of both DTMF sequences and words. I will discuss this issue in a separate post.

Grammar problem #2 – ambiguous grammars

While working on a grammar conversion project from Nuance GSL to SRGS ABNF, I stumbled upon a few grammars all having the same design problem: using optional parts to make a few words repeat a varying number of times. This is a pattern we’ve observed regularly on various projects.

Here is an example of such a grammar for recognizing sequences of 4 to 8 digits (I omitted the semantic tags for clarity):

#ABNF 1.0 ISO-8859-1;

mode voice;
language en-US;
root $digits4To8;

public $digits4To8 =
  $digit $digit $digit $digit [$digit] [$digit] [$digit] [$digit]
;
...

The original GSL grammar looked like this:

Digit4To8 (
  Digit Digit Digit Digit ?Digit ?Digit ?Digit ?Digit
)

GSL does not support ABNF's <N-M> operator, which repeats an expansion from N to M times; that's a reason why the grammar was written this way in the first place. In ABNF, it would instead have been written as:

#ABNF 1.0 ISO-8859-1;

mode voice;
language en-US;
root $digits4To8;

public $digits4To8 = $digit <4-8>
;
...

In GSL, it would have been better to write the grammar as:

Digit4To8 (
  Digit Digit Digit Digit ?Digit1To4
)

Digit1To4 ( Digit ?Digit1To3 )
Digit1To3 ( Digit ?Digit1To2 )
Digit1To2 ( Digit ?Digit )

Both grammars are equivalent, right? So what’s the problem?

Ambiguities

Well, both grammars recognize the same language (the same set of sentences), but the first grammar has a very different behavior. It is highly ambiguous. That means some sentences can be parsed in two or more different ways. See what you get when you interpret one such sentence in NuGram IDE:

The interpreter tells us (at the top-left of the window) that there are 6 different parses for the sentence. (I’ve seen grammars generating more than 100 parses for a given sentence!).
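The parse count follows directly from the grammar's shape: an n-digit input must fill the 4 required slots, and the remaining n − 4 digits can land in any of the 4 optional slots, giving C(4, n − 4) parses. A six-digit sentence, for instance, yields C(4, 2) = 6 parses, consistent with the interpreter's report (assuming the interpreted sentence had six digits). A quick sketch of the computation:

```java
// Counts parses of an n-digit string against
//   $digit $digit $digit $digit [$digit] [$digit] [$digit] [$digit]
// The n - 4 extra digits can fill any of the 4 optional slots,
// so the count is the binomial coefficient C(4, n - 4).
public class ParseCount {
    static long parses(int n) {
        int k = n - 4;                      // extra digits beyond the 4 required
        if (k < 0 || k > 4) return 0;       // outside 4..8 digits: no parse at all
        long result = 1;
        for (int i = 0; i < k; i++)         // multiplicative binomial formula
            result = result * (4 - i) / (i + 1);
        return result;
    }

    public static void main(String[] args) {
        for (int n = 4; n <= 8; n++)
            System.out.println(n + " digits: " + parses(n) + " parse(s)");
        // 6 digits yield C(4,2) = 6 parses.
    }
}
```

The rewritten grammar with nested optional rules, by contrast, admits exactly one parse per input: each extra digit can only extend the single right-nested chain.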

The problem with ambiguous grammars is that they can hurt both recognition accuracy and recognition performance. Suppose a grammar covers a sentence that is highly ambiguous and another sentence which is not, but which is phonetically close to the former. Since speech recognition engines limit their recognition search space, it is possible for the latter to be pruned from the search space early in the recognition window, even if it is the one that would have come up with the best score in the end.

The other problem is recognition performance. All semantic tags are typically executed at the end of the recognition process, once the user has finished talking. If there are lots of identical hypotheses with the same score, the recognition engine will have to execute all tags (interpreted ECMAScript code), most of them being redundant and useless, thus causing longer delays in the speech application.

Determining whether a grammar is ambiguous is a very hard problem (in general, it's undecidable). That means any tool that claims to decide ambiguity will inevitably make mistakes on some grammars. But that doesn't mean there are no tools available to help detect ambiguities. For instance, NuGram IDE will tell you if there are two or more different parses for a given sentence. And the sentence generator tool can also be configured to detect sentences that are ambiguous at the semantic level (sentences producing two or more different semantic values).

NuGram IDE new licensing scheme

Last week, we released a new version of NuGram IDE. In addition to supporting UTF-16 and UTF-8 with byte-order mark (BOM), the free Basic Edition also comes with a new licensing scheme.

What does this mean for you? Well, simply that you will have to request a new license file every 90 days. The installation process is fairly simple:

  1. You install NuGram IDE as before, by adding http://nugram.nuecho.com/update-site to your Eclipse update sites.
  2. You request a new license file from our web site. You will then receive an email containing a link to the license file.
  3. You follow the link and save the downloaded document to a file in the $HOME/nuecho directory (this can be changed in the Eclipse preferences).

It’s as simple as that. And we have an automatic process that will remind you by email to renew your license (steps 2 and 3) just a few days before its expiration.

The rationale

You may wonder why we decided to change the licensing scheme. Downloading and installing the Basic Edition was much more straightforward before. And as a user of free software myself, I don't like complicated registration processes: when I need to enter lots of personal information, I'm usually turned off and often simply go away.

Then why? NuGram IDE is a relatively specialized tool, but we get downloads at a surprising rate. However, we don't really know how many people or organizations use it on a regular basis. Do they simply take a look at it and uninstall it? Do they use it only once a year, for the occasional one-off task? The problem lies in the way the software is obtained (by means of an Eclipse update site). By asking our users to request a new license at a regular interval, we hope to get to know our user base better. We do think that the free edition of NuGram IDE is software with real value, and that this is not too much to ask in return.

Of course, if you don’t like the idea of updating your license every 90 days, you can still buy the Professional Edition and not be annoyed anymore…