Category Archives: Speech Technologies

Bridging NuGram and CouchDB

Mark Headd, a new member of the Voxeo family, published a blog post last week on how to build speech recognition applications with Tropo. Since his post covered things like SRGS grammars and dynamic grammars, I couldn’t resist. I had to enter the fray and show how the dynamic SRGS grammar could be built using NuGram Hosted Server. And while I’m at it, let’s use CouchDB instead of an SQL database.

The Database

Mark’s example is a simple address capture dialog. It consists of asking for the zip code, and then asking for the civic number and street name/type. The grammar for the second question is built dynamically based on the entered zip code. All street names/types and their associated zip codes are stored in an SQL database and retrieved by some PHP code.

In my case, I decided to store all the data in a CouchDB database called “zipcode”. (CouchDB is a nice RESTful, HTTP-based document-oriented database, where documents are stored as plain JSON strings.) Once CouchDB is up and running (I assume here it’s running on the local host, on port 5984, but it could be on any hosting service, like CouchOne), we simply create the database and populate it using the curl command-line tool:

% curl -X PUT http://localhost:5984/zipcode
{"ok":true}
% curl -X POST http://localhost:5984/zipcode/_bulk_docs \
       -H 'Content-Type: application/json' \
       -d "`cat zipcodes.json`"

where the file zipcodes.json contains the following data:

{"docs": [
{
    "_id": "18752",
    "type": "zipcode",
    "streets" : [
       {"name":"First", "type":"Avenue"},
       {"name":"Grant", "type":"Avenue"},
       {"name":"Josiah", "type":"Parkway"},
       {"name":"Murphy", "type":"Lane"},
       {"name":"Cherry Blossom", "type":"Circle"}
    ]
},
{
    "_id": "19752",
    "type": "zipcode",
    "streets" : [
       {"name":"Milberry", "type":"Extension"},
       {"name":"Jones", "type":"Street"},
       {"name":"Martin Luther King", "type":"Boulevard"},
       {"name":"Halsey", "type":"Place"}
    ]
}
]}

Each document (whose ID is a zip code) contains an attribute streets that lists all street names/types for the given zip code. Here there are only a few streets for two zip codes.

(Of course there are other ways to model the data, but that’s the simplest I could think of.)

The grammar template

Instead of using some code to create the streets grammar dynamically, we create a grammar template and push it to www.grammarserver.com (NuGram Hosted Server). The template will later be populated with data from the database and rendered in GrXML (or ABNF).

To do that, we just need to register an account (but don’t worry, it’s absolutely free).

So here is the grammar template:

#ABNF 1.0 ISO-8859-1;

language en-US;
mode voice;
tag-format <semantics/1.0>;

root $streets;

public $streets =
    $civicNumber $name [$direction]
    {out = rules.civicNumber.number + "," + rules.name + "," + rules.direction}
;

$civicNumber =
    {out.number = ''} ($number {out.number += rules.number}) <1->
;

$name =
    @alt
        @for (street : zipcode.streets)
           (@word street.name @word street.type)
        @end
    @end
;

$number =
     (zero | oh) {out = "0"}
   | one   {out = "1"}
   | two   {out = "2"}
   | three {out = "3"}
   | four  {out = "4"}
   | five  {out = "5"}
   | six   {out = "6"}
   | seven {out = "7"}
   | eight {out = "8"}
   | nine  {out = "9"}
;

$direction =
     north (west {out = 'nw'} | east {out = 'ne'} )
   | south (west {out = 'sw'} | east {out = 'se'} )
;

As you can see, it’s plain ABNF, with the exception of some simple dynamic directives on lines 19-23. It is also a bit more involved than Mark’s grammar: it contains semantic tags to better format the recognized utterance.
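
To make the effect of the dynamic directives more concrete, here is a rough JavaScript sketch of the kind of expansion the @alt/@for block is expected to produce for the $name rule. The function expandNameRule is a hypothetical helper written for illustration only; it is not part of NuGram, and the exact text generated by the server may differ.

```javascript
// Hypothetical helper: mimics what the @alt/@for directives are expected to
// produce for the $name rule. This is NOT NuGram's implementation.
function expandNameRule(zipcode) {
  // One alternative per street: the street name followed by its type.
  const alternatives = zipcode.streets.map(
    (street) => "(" + street.name + " " + street.type + ")"
  );
  return "$name =\n    " + alternatives.join("\n  | ") + "\n;";
}

// Same shape as the documents stored in the "zipcode" database.
const doc = {
  _id: "19752",
  type: "zipcode",
  streets: [
    { name: "Milberry", type: "Extension" },
    { name: "Jones", type: "Street" },
  ],
};

console.log(expandNameRule(doc));
```

Each street in the document becomes one alternative of the rule, which is why the grammar changes with the zip code without any change to the template itself.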

To publish the grammar, we use curl again:

% curl -X PUT http://www.grammarserver.com/api/grammar/streets.abnf \
       -u username:password \
       -d "`cat streets.abnf`"

We are now ready to write the application.

Connecting the dots

Now that the database is set up and the template published on NuGram Hosted Server, the only thing we need to do is create a simple app that bridges the two. For this, I decided to use Tropo’s web API, and more specifically the Ruby webapi gem (as well as the couchrest and nugramserver-api gems). The app mimics Mark’s. The lines specific to CouchDB and NuGram Hosted Server are the two extra require statements, the database setup at the top, and the session and grammar instantiation calls in the second handler:

require 'rubygems'
require 'sinatra'
require 'tropo-webapi-ruby'
require 'nugramserver-ruby'
require 'couchrest'

couch_server = CouchRest.new "http://localhost:5984"
database = couch_server.database "zipcode"

post '/start.json' do
  tropo = Tropo::Generator.new do
    on :event => 'continue', :next => '/ask_street.json'
    on :event => 'hangup', :next => '/hangup.json'
    ask({ :name => 'zip_code',
          :bargein => 'true' }) do
      say     :value => "Say your 5 digit zip code"
      choices :value => "[5 DIGITS]"
    end
  end
  tropo.response
end

post '/ask_street.json' do
  session = GrammarServer.new.create_session "username", "password"

  tropo_event = Tropo::Generator.parse request.env["rack.input"].read
  zipcode = tropo_event.result.actions.zip_code.value

  grammar = session.instantiate "streets.abnf",
                                :zipcode => database.get(zipcode)

  tropo = Tropo::Generator.new do
    on :event => 'continue', :next => '/say_street.json'
    on :event => 'hangup', :next => '/hangup.json'
    ask({ :name => 'street',
          :bargein => 'true' }) do
      say :value => "What is your street address, beginning with your street number?"
      choices :value => grammar.get_url("grxml")
    end
  end
  tropo.response
end

#...

The app is not complete: some handlers are missing. But you get the idea.

A final note

Of course, this post just covers the basics of integrating a dynamic grammar in a speech app. A real address capture application is certainly a bit more complex than that. For instance, given the large number of streets covered by a single zip code, it may not be desirable to generate grammars dynamically. They may have to be compiled in advance, with a periodic update process. Or you may want to implement some clever grammar caching strategies. Either way, you may instead consider the Java version of NuGram Server (not the hosted one).
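
As an illustration of the caching idea, here is a minimal JavaScript sketch of a time-limited cache keyed by zip code. Everything in it is an assumption made for the example: createGrammarCache, the instantiate callback, and the example URL are hypothetical, and a real deployment would also have to account for how long an instantiated grammar stays valid on the server.

```javascript
// Hypothetical cache: keeps one instantiated grammar URL per zip code for a
// limited time. `instantiate` stands in for whatever call builds the grammar.
function createGrammarCache(instantiate, ttlMs, now = Date.now) {
  const entries = new Map();
  return function grammarUrlFor(zipcode) {
    const cached = entries.get(zipcode);
    if (cached && now() - cached.createdAt < ttlMs) {
      return cached.url; // still fresh: reuse it
    }
    const url = instantiate(zipcode); // expired or missing: rebuild
    entries.set(zipcode, { url, createdAt: now() });
    return url;
  };
}

// Usage with a stand-in instantiation function that counts its calls.
let calls = 0;
const grammarUrlFor = createGrammarCache((zip) => {
  calls += 1;
  return "http://example.invalid/grammars/" + zip;
}, 60 * 1000);

grammarUrlFor("18752");
grammarUrlFor("18752"); // served from the cache
console.log(calls);
```

The clock is passed in as a parameter so that the expiry logic can be exercised without waiting for real time to pass.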

What’s in your IVR application monitoring report?

In a recent discussion on Hacker News, someone came up with a request for an IVR application monitoring service, suggesting that this is something which should be rather easy to build. Indeed, the dialing is rather easy. A few hacks with Tropo, Twilio or some custom Asterisk scripts would do the trick, but keep in mind that such a monitoring service should interact with the IVR the same way a user would (but that’s another story and an upcoming blog post!).

However, as I have pointed out myself, it is one thing to periodically call a given number, and quite another to send daily, weekly, monthly and yearly reports that reflect the actual state of the IVR application over time.

Moreover, those reports need to provide insightful and reliable information. That’s where Mirador comes in handy.

Stability Metrics

Mirador Report - Stability Metrics

First and foremost, your report should give a quick overview of the overall stability of your IVR application over a given time period (daily, weekly, monthly, or yearly). Such metrics essentially provide the overall success rate of your application, where setup failures could be caused by various telephony/network errors such as timeout, busy or congestion, while transaction failures are errors occurring once the connection is established.
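
To illustrate how such an overview could be derived from raw call results, here is a small JavaScript sketch. The record format and the status names ("success", "setup_failure", "transaction_failure") are assumptions made for the example; they do not describe Mirador’s actual data model.

```javascript
// Hypothetical call records: `status` is one of "success", "setup_failure"
// or "transaction_failure". These names are assumptions for illustration.
function stabilityMetrics(calls) {
  const counts = { success: 0, setup_failure: 0, transaction_failure: 0 };
  for (const call of calls) {
    counts[call.status] += 1;
  }
  const total = calls.length;
  // Avoid dividing by zero when the period contains no calls.
  const rate = (n) => (total === 0 ? 0 : n / total);
  return {
    total,
    successRate: rate(counts.success),
    setupFailureRate: rate(counts.setup_failure),
    transactionFailureRate: rate(counts.transaction_failure),
  };
}

const report = stabilityMetrics([
  { status: "success" },
  { status: "success" },
  { status: "setup_failure" },
  { status: "transaction_failure" },
]);
console.log(report);
```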

Performance Metrics

Mirador Report - Performance Metrics 1


Next, we have some performance metrics, which include average call duration, setup time, transaction duration and greeting delay for both all and successful calls. This raw data is also used to depict an interesting performance over time chart, where one can visually spot specific time periods.

While most data can be gathered quite simply, the greeting delay is a totally different beast. It corresponds to the actual delay to get the initial application prompt following a successful call setup, as a user would feel it. To compute such data, we used a few interesting speech recognition tricks of ours :)

Mirador Report - Timing Distributions

How do you know whether a user is waiting 1s or 10s for your application to answer? Or, when a user is supposed to take 2 minutes to complete a given transaction or task, how do you know if that is really the case? Performance metrics would not be complete without some distribution charts to highlight such information. To get a better understanding of how well your IVR application responds to some peak periods in production, we have crafted two distribution charts which not only depict setup times but also transaction durations.
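
A distribution of this kind boils down to grouping durations into buckets. The following JavaScript sketch shows one simple way to do it; the bucket width and the sample values are illustrative assumptions, not figures from an actual report.

```javascript
// Groups durations (in seconds) into fixed-width buckets. The bucket width
// and the sample values are illustrative assumptions.
function distribution(durations, bucketWidth) {
  const buckets = new Map();
  for (const d of durations) {
    const start = Math.floor(d / bucketWidth) * bucketWidth;
    buckets.set(start, (buckets.get(start) || 0) + 1);
  }
  // Sort by bucket start so the result reads like a chart axis.
  return [...buckets.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([start, count]) => ({ from: start, to: start + bucketWidth, count }));
}

const setupTimes = [0.8, 1.2, 1.9, 2.4, 9.7];
console.log(distribution(setupTimes, 2));
```

With these sample values, the single call that took almost 10 seconds to set up stands out in its own bucket, which is exactly what an average would hide.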

Alarm History

Any serious monitoring service should provide email or SMS notification whenever a defect occurs (otherwise, what’s the point of monitoring?). Mirador can be configured to act upon certain thresholds or specific criteria and send alarm notifications right away, in real-time. While alarm occurrence is one thing, alarm restore is another. Indeed, you not only want to know whenever a problem occurred but also the moment the situation has been acted upon and restored.

Mirador Report - Alarm History

That is why a good monitoring report should present a list of all the alarms for a given time period!
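
The distinction between alarm occurrence and alarm restore can be sketched in a few lines of JavaScript. This is only an illustration under simple assumptions (a single numeric metric and a fixed threshold); the function and field names are hypothetical and do not reflect how Mirador is implemented.

```javascript
// Hypothetical alarm tracker: raises an alarm when a metric crosses a
// threshold and records a restore when it comes back under it.
function alarmHistory(samples, threshold) {
  const events = [];
  let active = false;
  for (const { time, value } of samples) {
    if (!active && value > threshold) {
      active = true;
      events.push({ type: "raised", time, value });
    } else if (active && value <= threshold) {
      active = false;
      events.push({ type: "restored", time, value });
    }
  }
  return events;
}

const history = alarmHistory(
  [
    { time: "10:00", value: 1.1 },
    { time: "10:05", value: 6.3 },
    { time: "10:10", value: 7.0 },
    { time: "10:15", value: 1.4 },
  ],
  5
);
console.log(history);
```

Note that the second sample above the threshold does not raise a new alarm: only the transitions are recorded, which keeps the alarm list in a report short and readable.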

Call Detail Records

Mirador Report - Call Detail Records

Last but not least: the ability to review call detail records (CDRs), especially those generating alarms. You might want to know when such calls occurred, what their actual status and duration were, and so on. You might even be interested in listening to the complete call recordings while you are at it.

Conclusion

Reports are an integral part of any monitoring service. Plus, you certainly would like to review them within your email client, online in a secure location or as a PDF document, to share with your peers. Ideally, you would have a web dashboard where you could access report history, set up new monitoring configurations, reschedule a configuration, define alarm thresholds and notification targets, and so on.

Mirador - PDF Email Web

The Mirador IVR application monitoring service features all of the previously mentioned characteristics, except for the dashboard. But we are working on it, so stay tuned for more!

So, what’s in your IVR application monitoring report?

Grammar tips & tricks #3 – Use of global tags

Tip #3: In SRGS grammars, use global semantic interpretation (SI) tags to simplify SI tags in rule expansions.

It is fairly common in SRGS grammars to put some form of computation in semantics tags. For example, checksum algorithms (like the Luhn algorithm) are commonly used in credit card number grammars.

When grammars contain this kind of computation, it is good practice to put function and constant definitions in global SI tags. These SI tags are declared before the definition of the first rule, as part of the other grammar headers. They must be followed by a semicolon. In ABNF form, this looks like:

#ABNF 1.0;

mode voice;
root $rootRule;

{
  // header tag
};

$rootRule =
 ...
;
...

The use of global tags has several advantages:

  • Functions are more easily testable. The functions declared in the global tags can be developed and tested outside of the grammar file (using a JavaScript interpreter like SpiderMonkey, Rhino, or V8), and later copied into the global tag.
  • It avoids code duplication. The use of functions usually reduces code duplication, which lowers the risk of fixing a problem at one place only and missing one.
  • Semantic interpretation is less CPU-intensive. Another side effect of using functions is that SI tags usually get smaller, which reduces the time taken to parse and interpret them, leading to faster interpretation and better response time. (Semantic interpretation tags are usually executed after the last word has been uttered, so it’s sometimes important to optimize them.)

What about GrXML?

In the XML form, you simply put tag elements before the first rule element. But you don’t really need to know that, right? NuGram IDE can convert ABNF grammars to their XML counterpart so easily!

A concrete example

Let’s illustrate this by considering a simple grammar for a 12-digit account number using the Luhn algorithm to validate the number. Here is a first version of the grammar:

#ABNF 1.0 UTF-8;

language en-US;
mode voice;
tag-format <semantics/1.0>;

root $accountNumber;

public $accountNumber =
    { out.number = ""; var checksum = 0; }
    ( $digit {!{
                 out.number += rules.digit;
                 var digit = parseInt(rules.digit);
                 var doubledigit = digit * 2;
                 if (doubledigit > 9)
                    checksum += (doubledigit % 10) + 1;
                 else
                    checksum += doubledigit;
              }!}
      $digit {
                 out.number += rules.digit;
                 checksum += parseInt(rules.digit);
             }) <6>
    { out.valid = (checksum % 10) == 0;}
;

private $digit =
    one    {out = "1"} | two         {out = "2"}
  | three  {out = "3"} | four        {out = "4"}
  | five   {out = "5"} | six         {out = "6"}
  | seven  {out = "7"} | eight       {out = "8"}
  | nine   {out = "9"} | (zero | oh) {out = "0"}
;

The code to calculate the checksum is mixed with the rule references to collect the digits. This makes the grammar look much more complex than it really is. And its performance is much worse than it could be.

If we move the checksum computation into a header tag, we obtain the following grammar:

#ABNF 1.0 UTF-8;

language en-US;
mode voice;
tag-format <semantics/1.0>;

root $accountNumber;

{!{
function luhnCheck(digits) {
  var checksum = 0;
  for (var i = 0; i<12; i++) {
    var digit = parseInt(digits.charAt(i));
    if (i % 2 == 0) {
      var doubledigit = digit * 2;
      if (doubledigit > 9)
        checksum += (doubledigit % 10) + 1;
      else
        checksum += doubledigit;
    }
    else
      checksum += digit;
  }
  return (checksum % 10) == 0;
}
}!};

public $accountNumber =
    { out.number = "";}
    ( $digit { out.number += rules.digit; }) <12>
    { out.valid = luhnCheck(out.number); }
;

private $digit =
    one    {out = "1"} | two         {out = "2"}
  | three  {out = "3"} | four        {out = "4"}
  | five   {out = "5"} | six         {out = "6"}
  | seven  {out = "7"} | eight       {out = "8"}
  | nine   {out = "9"} | (zero | oh) {out = "0"}
;

Now the accountNumber rule is much simpler and it is clear that it only accepts 12 digits. Moreover, the validation function can be tested independently. If the code is copied to a file named checksum.js, I can launch the SpiderMonkey interpreter and test the function like this:

[tmp] js
js> load("checksum.js")
js> luhnCheck("123456789012")
false
js> luhnCheck("123456789015")
true
js> ^D
[tmp]

In fact, these test cases can be put in the source file along with the code. But you get the idea.
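
For instance, a checksum.js file carrying its own test cases could look like the sketch below. The table of cases and the failures variable are additions made for illustration, and console.log is assumed to be available (as in Node.js; an older SpiderMonkey shell would use print instead). Only the function itself would be copied into the grammar’s header tag.

```javascript
function luhnCheck(digits) {
  var checksum = 0;
  for (var i = 0; i < 12; i++) {
    var digit = parseInt(digits.charAt(i), 10);
    if (i % 2 == 0) {
      // Every other digit is doubled; two-digit results have their digits added.
      var doubledigit = digit * 2;
      checksum += doubledigit > 9 ? (doubledigit % 10) + 1 : doubledigit;
    } else {
      checksum += digit;
    }
  }
  return (checksum % 10) == 0;
}

// Test cases kept next to the code; each entry is [input, expected result].
var cases = [
  ["123456789012", false],
  ["123456789015", true],
];
var failures = cases.filter(function (c) {
  return luhnCheck(c[0]) !== c[1];
});
console.log(failures.length === 0 ? "all tests passed" : failures);
```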

Global scope is read-only

Beware, when writing your SI tags, that the global scope is read-only for SI tags, while it is mutable for all global tags. That means a variable cannot be declared in a global SI tag and modified in a normal SI tag. For example, the following grammar

#ABNF 1.0;

mode voice;
root $rootRule;

{
  var globalVar = 1;
};

$rootRule =
  { globalVar = 2; } some words { out = globalVar; }
;

would raise an exception when “some words” is uttered. That’s because the first SI tag on line 11 tries to modify a read-only variable (globalVar).

There is of course a way to bypass this limitation. Simply declare a global variable, say GLOBAL, that holds an object whose properties will represent the variables you would have liked to be global. To illustrate, here is how the previous grammar would be modified:

#ABNF 1.0;

mode voice;
root $rootRule;

{
  var GLOBAL = new Object();
  GLOBAL.globalVar = 1;
};

$rootRule =
  { GLOBAL.globalVar = 2; } some words { out = GLOBAL.globalVar; }
;

This time, the grammar will return 2 when “some words” is uttered.
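
Why the workaround works can be illustrated in plain JavaScript with an analogy: a frozen object whose own properties cannot be reassigned, but whose nested objects remain mutable. This is only an analogy under that assumption; it is not a description of how any SISR engine actually implements its scopes.

```javascript
function demo() {
  "use strict"; // in strict mode, writing to a frozen object throws
  // `scope` plays the role of the read-only global scope (analogy only).
  const scope = Object.freeze({ globalVar: 1, GLOBAL: { globalVar: 1 } });
  const result = {};
  try {
    scope.globalVar = 2; // like `globalVar = 2` in a rule tag
    result.direct = "modified";
  } catch (e) {
    result.direct = "rejected";
  }
  scope.GLOBAL.globalVar = 2; // the nested object itself is not frozen
  result.viaObject = scope.GLOBAL.globalVar;
  return result;
}

console.log(demo());
```

The binding stays read-only in both cases; what changes is that the second assignment modifies a property of the object the binding points to, not the binding itself.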

It should be noted that the IBM engine, which supports an old version of the SISR specification, does allow global variables to be modified in SI tags. It is very important to be aware of that when converting grammars initially written for the IBM engine to another engine supporting the latest SISR specification (like, for instance, Loquendo or Nuance 9).

A proven yet simple grammar conversion process

Grammar Conversion Process

As old speech recognition engines are being replaced by newer ones, we see more and more organizations having to convert their old grammars to standard formats. Given the right process and set of tools, converting grammars from one engine to another should be a straightforward task with almost no risk of breaking the associated IVR application.

The issues

There are several issues associated with the conversion of grammars:

  • Syntax. First, there is the syntax of the grammar itself. If we are converting a grammar between two engines that support GrXML or ABNF, then there’s not much else to say. But if we are converting from Nuance GSL to GrXML or ABNF, that’s a different story. GSL has very different operator precedences than ABNF, for instance. We have to be careful.
  • Semantics. The second issue is the language used inside the semantic tags. Again, if both engines support SISR, we have nothing to do. But if we convert from GSL to ABNF+SISR, we may have a harder time. For example, SISR does not support the concept of a top-level slot that can be assigned from anywhere in the grammar (using the <slot value> syntax).
  • Pronunciation lexicons. Almost all speech engines use a different format for lexicons. Not to mention that even different versions of the same engine sometimes support different phonetic alphabets.

A proven process

If you follow a rigorous process, the first two issues above can be easily mitigated. Here is one that has proven very effective:

  1. A coverage test set is produced from the original grammar. The test set should ideally ensure that all semantic tags are executed at least once (this is not always sufficient if the semantic tags contain conditional code, but it is a good starting point).
  2. The grammar is converted to the new format.
  3. The converted grammar is tested against the coverage test set of the original grammar (and problems are fixed, if any, until all tests pass).
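
Step 3 of the process can be sketched as a small JavaScript harness. The names used here are hypothetical: runCoverageTests is written for illustration, and interpret stands in for whatever mechanism returns the semantic interpretation of a sentence with the converted grammar (an ASR engine, a grammar tool, and so on).

```javascript
// Hypothetical harness: `interpret` stands in for whatever call returns the
// semantic interpretation of a sentence with the converted grammar.
function runCoverageTests(testSet, interpret) {
  const failures = [];
  for (const { sentence, expected } of testSet) {
    const actual = interpret(sentence);
    // Simple structural comparison; assumes properties come in the same order.
    if (JSON.stringify(actual) !== JSON.stringify(expected)) {
      failures.push({ sentence, expected, actual });
    }
  }
  return failures;
}

// Test set as it would be produced from the ORIGINAL grammar (step 1).
const testSet = [
  { sentence: "one two", expected: { number: "12" } },
  { sentence: "oh nine", expected: { number: "09" } },
];

// Stand-in for the converted grammar, with a deliberate bug on "oh".
const words = { one: "1", two: "2", nine: "9", zero: "0" };
const converted = (s) => ({
  number: s.split(" ").map((w) => words[w] || "").join(""),
});

console.log(runCoverageTests(testSet, converted));
```

The list of failures tells you which sentences to look at while fixing the converted grammar; the process ends when that list is empty.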

Some tools

Some ASR engines already provide tools to convert grammars from old proprietary formats to the new standard ones. For instance, Nuance ships a tool to automatically convert GSL grammars to GrXML + SISR. It does not support all features of GSL, as some of them have no equivalent in GrXML and SISR. And one of the problems with this converter is that the semantic tags produced are not easily maintainable.

NuGram IDE also provides some tools to help with the above process. In particular, it offers:

  • Great support for creating and running coverage tests.
  • A sophisticated sentence generation tool. The tags coverage strategy, for instance, is very effective when converting grammars as it helps generate sentences that will cover all semantic tags.
  • Support for all major semantic tags formats (GSL, Nuance OSR extensions, IBM and Microsoft, etc.).

Of course, to use NuGram effectively, your grammars will need to be converted to ABNF first. No problem! NuGram provides GSL and GrXML to ABNF converters to help you, as well as converters from ABNF to GSL or GrXML. That means all you have to worry about is really the conversion of the semantic tags. In this case, the whole process now becomes:

  1. Grammars are first imported in ABNF.
  2. A coverage test set is produced from the original grammar.
  3. Semantic tags are converted.
  4. The converted grammars are checked for errors by running the coverage tests of the original grammars. In case of errors, they are fixed and all tests are re-run.
  5. The grammars are converted to the desired target format.

What about pronunciation lexicons?

Unfortunately, converting phonetic dictionaries is still a manual and error-prone process, for which there are no good solutions as of this writing. And this task is more part of the tuning process that follows the grammar conversion process anyway. In most cases, a grammar’s pronunciation lexicon is used to fix incorrect or missing pronunciations in the ASR engine’s own dictionary for very specific words. The phonetic dictionary of the target ASR engine may not have the same limitations or deficiencies. At best, the original grammar’s pronunciation lexicon can act as an inspiration for the creation of the new pronunciation lexicon.

Putting a VoiceXML gateway behind Asterisk

I’m a big fan of both Asterisk and VoiceXML. Each has its own sweet spot. Asterisk is great for building complete telephony systems (dial plans, conference calls, queues, voicemail, etc.), while VoiceXML is the standard way to develop full-blown telephony applications for large organisations.

But what if you want to bridge the two? There are situations where that would make sense. Consider a company using Asterisk as their front PBX. Now if they want to add a speech-enabled auto-attendant or some other self-service application, they could use a VoiceXML platform to run it instead of coding it in the Asterisk dialplan language. Of course, one could do the same using the Asterisk Gateway Interface (AGI) protocol, but one would still be limited to the capabilities of the Asterisk dialplan language. (For instance, the generic speech recognizer API only returns the matched text of each N-best result, not the semantic interpretation. This can be OK for some trivial applications, but it is clearly inadequate for serious speech application development.)

The other day, I decided to test this idea and try using a VoiceXML gateway (Voxeo Prophecy in this case) from behind Asterisk. Here is how I made things work.

Machine setup

My setup consists of a laptop running Ubuntu 9.04 with Asterisk 1.4.21. Since Prophecy is only supported on CentOS and Red Hat Enterprise Linux, I decided to run Prophecy on CentOS 5.5 inside a VMware virtual machine. The guest machine is configured to use a dedicated network between the guest and the host (the Host-only network configuration):

VMware guest network configuration

Asterisk configuration

On the Ubuntu (host) machine, in /etc/asterisk/sip.conf, I added the following entry:

[prophecy]
type=friend
username=prophecy
host=dynamic
canreinvite=yes
insecure=port,invite
qualify=yes
context=proph
auth=prophecy:none@asterisk

In /etc/asterisk/extensions.conf, I created a context proph with a dialplan that redirects all incoming calls to Prophecy:

[proph]
exten => _[A-Za-z].,1,Dial(SIP/prophecy/${EXTEN})
exten => _[A-Za-z].,n,Hangup

Configuring Prophecy

On the guest CentOS machine, in /opt/voxeo/prophecy/config/config.xml, I added the following lines in the VoIPCT category:

<category name="VoIPCT">
 ...
  <category name="Registrations">
    <category name="asterisk">
      <item name="Username">prophecy</item>
      <item name="AuthUsername">prophecy</item>
      <item name="Password">none</item>
      <item name="Domain">192.168.151.1</item>
      <item name="ContactIP">192.168.151.128:5060</item>
      <item name="ExpirationTimeout" type="int">3600</item>
      <item name="Registrar">192.168.151.1</item>
      <item name="ResolveRegistrar" type="int">0</item>
    </category>
  </category>
 ...
</category>

Here, the IP address 192.168.151.128 is the address assigned automatically by VMware to the guest, while 192.168.151.1 is the address of the host.

To call an application, I use SFLphone, an open-source softphone. One particularly appealing feature of this phone is its support for both the SIP and the IAX protocols. It is thus well suited for use with Asterisk.

Voilà! I am now able to make calls to VoiceXML applications from the comfort of my Ubuntu machine using only free/open-source solutions.