
Simple HTML-based IM widget with Phono

The Voxeo team unveiled Phono just a few weeks ago. Phono is an open-source SDK for embedding telephones and instant messaging services in your web pages. With Phono, you can (among other things) add click-to-call buttons to your call center or your applications hosted on one of Voxeo’s platforms (and you will soon be able to host your own server-side infrastructure).

To get a feel for how easy it is to embed these services in a web page, I decided to create an IM-like interface and a click-to-call button for a simple prototype application I developed some time ago. The voice application is VoiceXML-based and runs on Voxeo’s Evolution platform, while the IM application runs on IMified and uses grammarserver.com to interpret textual sentences. (Both applications are run by the same code, except for the presentation part. But what the application does is not relevant to what follows.)

In a matter of minutes, I was able to come up with an IM widget that looks like this:

(Please, no comments on the UI. If I were trying to get paid for my artistic skills, I’d be bankrupt by now…)

The code

The HTML code for this page is quite straightforward:

<html>
  <head>
    <title>Phono Demo</title>
    <link type="text/css" rel="stylesheet" href="im.css" />
    <script src="http://code.jquery.com/jquery-1.4.2.min.js"></script>
    <script src="http://s.phono.com/releases/0.1/jquery.phono.js"></script>
    <script src="im.js"></script>
  </head>
  <body>
    <div id="im_box">
      <h3 align=center>SiteWatch Application</h3>
      <div id="dialog" style="">
      </div>
      <input type="text" id="message" onkeydown="processKey(event)" disabled="true">
    </div>
    <button id="click2call" disabled="true" onclick="callApp()">Call Me!</button>
  </body>
</html>

The real meat is in im.js, which I will now describe.

The file starts with the following code:

var PHONO_OPTIONS = {
    apiKey: MY_API_KEY,
    onReady: function() {
        $("#message")[0].disabled = false;
        $("#click2call")[0].disabled = false;
        $("#message").focus();
    },
    messaging: {
        onMessage: function(event) {
           addInteraction('SiteWatch', event.message.body, 'server');
        }
    }
};

var phono = $.phono(PHONO_OPTIONS);

This code takes care of initializing the phono object. Once it is properly initialized and ready, the input area and the click-to-call button are both enabled. If something goes wrong during the initialization, they will remain disabled.

The next section handles the sending of messages to the application:

function processKey(event) {
  if (event.keyCode == 13) {
      sendMessage();
      return false;
  } else {
      return true;
  }
}

function sendMessage() {
    var msg = $('#message').val();
    if (msg == '') return;

    $('#message').val("");
    phono.messaging.send(MY_IM_ADDRESS, msg);
    addInteraction('Me', msg, 'client');
}

The processKey function is called when a key is pressed in the text input area. If the return key is pressed, the message is sent to the remote IM address (here MY_IM_ADDRESS).

The function that displays both the outgoing and the incoming messages is:

function addInteraction(who, msg, cls) {
    var dialogDiv = $('#dialog');
    dialogDiv.append($("<div class='msg_" + cls + "'><b>"
                       + who + ":</b> " + msg + "</div>"));
    dialogDiv[0].scrollTop = dialogDiv[0].scrollHeight;
}

Finally, the code that takes care of the click-to-call button is here:

// The call in progress, or null when no call is active.
var call = null;
function callApp() {
    if (call == null) {
        call = phono.phone.dial("app:MY_VXML_APP", {
            onAnswer: function() {
                $("#click2call").text("Hangup");
            },
            onHangup: function() {
                // Reset the state and restore the original button label.
                call = null;
                $("#click2call").text("Call Me!");
            }
        });
    } else {
        call.hangup();
    }
}

Can that be simpler?

Ok, I must admit my first version was not as clean as this one. But it was functionally equivalent. And it was only 50 lines long (HTML and JavaScript included).

Conclusion

The Phono API is simple, yet powerful. The fact that it does not provide a standard UI is a very good thing too. This will let developers innovate and use Phono in very creative ways.

But there’s a single feature that IMHO will drive a lot of innovation: the session ID. Once the Phono object is properly initialized, it has a unique ID that is used as the caller ID when a phone call is made from the web page. The ID is accessible from the phono object, which means that it can be sent to the application server so that both the HTML and the voice applications can share some data and communicate with each other. I’m pretty sure this will open the door to some really cool multi-channel AND multi-modal applications.
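
To make the idea a bit more concrete, here is a minimal sketch of how the page could hand its session ID to the application server. It rests on several assumptions that are not part of the demo above: that the ID is exposed as a `sessionId` property, that the server offers a `/sessions` endpoint, and the `buildSessionPayload` helper itself, which is purely hypothetical.

```javascript
// Hypothetical helper: package the session ID (plus some page state) so it
// can be posted to the application server.
// ASSUMPTION: the ID is exposed as a `sessionId` property on the object.
function buildSessionPayload(phonoLike, pageState) {
    if (!phonoLike || !phonoLike.sessionId) {
        throw new Error("Phono is not ready: no session ID available yet");
    }
    return JSON.stringify({
        sessionId: phonoLike.sessionId,
        page: pageState
    });
}

// In the browser, this could be called from the onReady callback, e.g.:
//   $.post("/sessions", buildSessionPayload(phono, { view: "sitewatch" }));
// ("/sessions" is a made-up endpoint.)
// Here the helper is only exercised with a stand-in object.
var examplePayload = buildSessionPayload(
    { sessionId: "abc123@example.invalid" },
    { view: "sitewatch" }
);
console.log(examplePayload);
```

The voice application would then receive the same ID as the caller ID of the incoming call and could use it to look up whatever the web page stored on the server.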

Update: Here is the CSS file that I used to create the screen shot above:

#im_box {
    width: 300px;
    border: 1px solid black;
    padding: 0px;
    background-color: #ddd;
}

#dialog {
    height: 300px;
    overflow: auto;
    font-size: 10pt;
    font-family: Arial;
    padding: 5px;
}

#message {
    width: 300px;
    border: 1px solid;
    margin: 1px;
}

h3 {
    border-bottom: 1px solid;
    margin-top: 0px;
    background-color: #eee;
}

.msg_client {
    margin: 5px;
    border: 1px solid;
    padding: 3px;
    background-color: white;
}

.msg_client b {
    color: blue;
}

.msg_server {
    margin: 5px;
    margin-left: 15px;
    border: 1px solid;
    padding: 3px;
    background-color: white;
}

.msg_server b {
    color: red;
}

Update 2 (2010/11/01): The whole source code is available on GitHub.

Putting a VoiceXML gateway behind Asterisk

I’m a big fan of both Asterisk and VoiceXML. Each has its own sweet spot. Asterisk is great for building complete telephony systems (dial plans, conference calls, queues, voicemail, etc.), while VoiceXML is the standard way to develop full-blown telephony applications for large organisations.

But what if you want to bridge the two? There are situations where that would make sense. Consider a company using Asterisk as their front PBX. If they want to add a speech-enabled auto-attendant or some other self-service application, they could use a VoiceXML platform to run it instead of coding it in the Asterisk dialplan language. Of course, one could do the same using the Asterisk Gateway Interface (AGI) protocol, but one would be limited to the capabilities of the Asterisk dialplan language. (For instance, the generic speech recognizer API only returns the matched text of each N-best entry, not the semantic interpretation. This can be ok for some trivial applications, but it is clearly inadequate for serious speech application development.)

The other day, I decided to test this idea and try using a VoiceXML gateway (Voxeo Prophecy in this case) from behind Asterisk. Here is how I made things work.

Machine setup

My setup consists of a laptop running Ubuntu 9.04 with Asterisk 1.4.21. Since Prophecy is only supported on CentOS and Red Hat Enterprise Linux, I decided to run Prophecy on CentOS 5.5 inside a VMware virtual machine. The guest machine is configured to use a dedicated network between the guest and the host (the Host-only network configuration):

VMware guest network configuration

Asterisk configuration

On the Ubuntu (host) machine, in /etc/asterisk/sip.conf, I added the following entry:

[prophecy]
type=friend
username=prophecy
host=dynamic
canreinvite=yes
insecure=port,invite
qualify=yes
context=proph
auth=prophecy:none@asterisk

In /etc/asterisk/extensions.conf, I created a context proph with a dialplan that redirects all incoming calls to Prophecy:

[proph]
exten => _[A-Za-z].,1,Dial(SIP/prophecy/${EXTEN})
exten => _[A-Za-z].,n,Hangup
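
To make clear which dialed names this pattern accepts, here is a rough JavaScript equivalent. In an Asterisk pattern, the leading underscore marks a pattern, `[A-Za-z]` matches a single letter, and the trailing `.` matches one or more further characters. The `matchesProphPattern` function and the sample extensions are only illustrations, not part of the actual setup.

```javascript
// Rough regular-expression equivalent of the dialplan pattern _[A-Za-z].
// (one letter, followed by at least one more character of any kind).
function matchesProphPattern(extension) {
    return /^[A-Za-z].+$/.test(extension);
}

// Hypothetical extensions, used only to illustrate the matching rule.
["myapp", "a", "1234", "test-app"].forEach(function (extension) {
    console.log(extension + " -> " + matchesProphPattern(extension));
});
```

In other words, any application name that starts with a letter and is at least two characters long is forwarded to Prophecy, while purely numeric extensions are left alone.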

Configuring Prophecy

On the guest CentOS machine, in /opt/voxeo/prophecy/config/config.xml, I added the following lines in the VoIPCT category:

<category name="VoIPCT">
 ...
  <category name="Registrations">
    <category name="asterisk">
      <item name="Username">prophecy</item>
      <item name="AuthUsername">prophecy</item>
      <item name="Password">none</item>
      <item name="Domain">192.168.151.1</item>
      <item name="ContactIP">192.168.151.128:5060</item>
      <item name="ExpirationTimeout" type="int">3600</item>
      <item name="Registrar">192.168.151.1</item>
      <item name="ResolveRegistrar" type="int">0</item>
    </category>
  </category>
 ...
</category>

Here, the IP address 192.168.151.128 is the address assigned automatically by VMware to the guest, while 192.168.151.1 is the address of the host.

To call an application, I use SFLphone, an open-source softphone. One particularly appealing feature of this phone is its support for both the SIP and the IAX protocols. It is thus well suited for use with Asterisk.

Voilà! I am now able to make calls to VoiceXML applications from the comfort of my Ubuntu machine using only free/open-source solutions.

Testing an Intervoice InVision app with Voxeo Prophecy

I’ve just started working on a DTMF-only VoiceXML application for one of our customers. The application is developed using Intervoice InVision Studio 3.1 (the native Windows version) and will be deployed on the Intervoice Voice Portal 5. The challenge in this project is three-fold:

  • Development is done on Nu Echo’s premises.
  • Nu Echo does not have IVP5 in its lab.
  • The only way to test the application is to connect to the customer’s network using VPN/pcAnywhere, deploy the application there and test using a local phone number.

Fortunately, except for all the VoiceXML code that handles attached data and transfers to the PBX, everything else can be easily tested on my own machine using only freely available tools.

The VoiceXML platform

InVision Studio is a tool that provides a graphical editor that maps an IVR call-flow to completely static, standards-compliant VoiceXML code (at least it’s the case for the application I have to develop). Once the application successfully passes the validation tests, it can be exported to VoiceXML code that can then be deployed on any web server.

InVision Studio

Since the resulting code does not depend on any proprietary extension, I decided to use Voxeo Prophecy to test it. It comes with a really decent ASR engine as well as a good TTS engine, both only for US English. The application is DTMF-only, so the ASR is not needed in my case, but TTS is handy when you don’t want to record all the application prompts (with InVision Studio, you have to specify a text for every prompt you define).

After installing Prophecy, I had to use Prophecy Commander, the web-based management console, to configure the application and the route to reach the application. The route is used to associate a number to call with the application. In my case, the app is CustomerApp and the route is test-customer-app:

Routing rules in Prophecy Commander

To call the application, I simply use the SIP phone that comes with Prophecy and dial test-customer-app.

Prophecy SIP phone

The Web server

For the web server, I use Yaws, a web server written in Erlang. It could just as well have been Apache, Tomcat, Jetty, IIS, or any other web server. I chose Yaws mainly because I do some Erlang programming in my spare time and happen to know Yaws a bit better than the alternatives.

I configured Yaws to serve static files on port 8080 from the Runtime directory of my InVision project. So whenever I export the VoiceXML code for the project, I just take the SIP phone and make a call to test the application. The Yaws configuration for the virtual server is:

<server localhost>
        port = 8080
        listen = 0.0.0.0
        docroot = "C:/InvisionProjects/CustomerApp/Runtime"
</server>

Extensive logging

First off, let me say that when it comes to debugging an app, the Prophecy logviewer is of tremendous help. I was first a bit overwhelmed by the vast quantity of information logged by the various parts that compose Prophecy, but the filtering capabilities make it easy to focus on only a fraction of it. (I have seen the logs of many VoiceXML platforms, and these ones are certainly among the most comprehensible.)

I’m writing this because I had to use the logviewer the minute I started testing the application interactively. Why don’t I just listen to the prompts? Well, the problem is that the prompt texts are in French, while the TTS is in English. The result is plainly incomprehensible, and trying to figure out where I am in the application is really painful and annoying. So I decided to add VoiceXML log elements extensively in the application, all starting with a very specific pattern: [CustomerApp].

Logging elements in application

It is then very easy to filter the logs based on this pattern and see only the progress of the application:

Prophecy Logviewer
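
The same filtering idea also works outside the logviewer, for instance on an exported log. The sketch below keeps only the lines carrying the marker; the `filterByMarker` function is hypothetical and the sample lines are made up, so they do not reflect the actual Prophecy log format.

```javascript
// Keep only the log lines that carry the application-specific marker.
function filterByMarker(lines, marker) {
    return lines.filter(function (line) {
        return line.indexOf(marker) !== -1;
    });
}

// Made-up sample lines; real Prophecy log entries look different.
var sampleLog = [
    "12:00:01 platform: call started",
    "12:00:02 [CustomerApp] entering main menu",
    "12:00:05 platform: DTMF received",
    "12:00:05 [CustomerApp] option 2 selected"
];

filterByMarker(sampleLog, "[CustomerApp]").forEach(function (line) {
    console.log(line);
});
```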

A final remark

Yes, I could use the debugger that comes with InVision Studio. But frankly, I do not find it very intuitive to use. I prefer making calls and testing the user experience at the same time.

A wishlist for VoiceXML 3.0

Over our many years working with VoiceXML 2.0/2.1, we at Nu Echo have found a number of annoyances in the specification that we would very much like to see addressed in the upcoming VoiceXML 3.0. These are not far-fetched, difficult-to-implement things. But they would certainly let us implement some frequent customer requirements more easily and yield much better VUIs in the end.

And fixing these issues is not in contradiction to the new direction that VoiceXML 3.0 is taking.

(I could have entitled this post “some complaints about VoiceXML 2.1”, but I decided to turn it into one with a more positive bias.)

So here is a first attempt at a VoiceXML wishlist:

  1. Flag indicating use of DTMF termchar. Many customers ask us to enforce the use of a termination character like ‘#’ at some point in the dialog (when entering a PIN, for example). If we simply use the built-in grammars and specify a term char property, it is not possible to know whether the key has been pressed at all when we get the result. Of course, the application can use a custom DTMF grammar, but it will have to explicitly strip the term char from the returned DTMF sequence. And custom DTMF grammars sometimes require the use of a speech recognition engine to work, so one must be provisioned even for DTMF-only applications (which is nonsense). Note that this feature exists for the <record> element.
  2. DTMF nomatch or speech nomatch? When a nomatch event occurs in a form allowing either speech or DTMF input, it is not possible to know whether it’s the result of a wrong DTMF sequence or of some speech not matching one of the active grammars. Such information would lead to better reprompting. For instance, suppose you activate some grammars for universal commands; hitting the wrong key could then lead to a prompt like “Invalid command. Please say …” instead of the more generic (and speech-specific) “I didn’t understand. Please say …”. VoiceXML 2.0 specifies that in the case of a nomatch (and a noinput as well), the value of application.lastresult$ is set, but the values are platform-dependent.
  3. DTMF sequence entered on nomatch. When a DTMF nomatch event occurs, the application does not know what DTMF sequence was entered. Having such information would also lead to more precise reprompting. (Well, some platforms may already provide it through the application.lastresult$ object, but that’s highly platform-specific. See the previous point.)
  4. Mark information on hangup. In VoiceXML 2.1, mark elements can be interspersed with the application prompts so the application can know during which segment the caller barged in. When callers hang up, however, the application cannot know the last mark reached. This information would improve reporting, letting us know whether people actually listen to messages before hanging up.
  5. Better support for RESTful services in data element. VoiceXML 2.1 only mandates support for GET and POST as the HTTP method in data elements. It should also support the other HTTP verbs (like PUT and DELETE) to enable the integration of VoiceXML applications with RESTful services. (Some platforms, like Voxeo Prophecy, already offer that kind of support.) Also, it would be nice if data returned from web services could be in JSON format instead of XML. (XML is so 2009, right? Just kidding.) VoiceXML interpreters already embed an ECMAScript interpreter, and it would be much more convenient to manipulate a JSON object than an XML object.
  6. Dynamic array of grammars. In VoiceXML 2.1, it is possible to build a sophisticated list of prompts using the foreach element. It would be handy to be able to do the same with grammars. Of course, one can generate a base grammar dynamically that references those grammars, but experience showed us that, with certain ASR engines, speech recognition performs differently on such grammars compared to parallel grammars. (This one would certainly be more difficult to specify, as grammars are not part of executable code, and they can appear at different scopes – document, form, field. So it’s a bit more far-fetched.) The main use case for such a feature is the writing of AJAX-like applications in VoiceXML.
  7. Barge-in modality. In some cases, it’d be nice to control the modality of barge-in, like allowing barge-in with DTMF but not with speech. It would be as simple as specifying the barge-in as “voice”, “dtmf”, “voice dtmf”, or “none” (instead of only “true” or “false”).
  8. Flushing the prompt queue. Being able to explicitly flush the prompt queue would really be handy for prompts like “one moment please” <flush>.
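
To illustrate the workaround mentioned in item 1, here is a minimal sketch of the kind of ECMAScript helper an application ends up writing when it uses a custom DTMF grammar that includes the termination character. The `splitTermChar` name and the shape of the returned object are assumptions made for this example only.

```javascript
// Split a raw DTMF sequence into its digits and a flag telling whether the
// termination character was actually pressed at the end.
function splitTermChar(rawDtmf, termChar) {
    var terminated = rawDtmf.length > 0 &&
        rawDtmf.charAt(rawDtmf.length - 1) === termChar;
    return {
        digits: terminated ? rawDtmf.slice(0, -1) : rawDtmf,
        terminated: terminated
    };
}

// The caller pressed '#' in the first case, but not in the second.
console.log(splitTermChar("1234#", "#"));
console.log(splitTermChar("1234", "#"));
```

With a termchar flag in the result, as wished for above, none of this would be necessary and the built-in grammars could be used directly.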

That’s it for now. We have a couple more in store, but I wanted to keep the list short.

What do you think? Are there other small issues that you would like to be addressed by the upcoming VoiceXML 3.0?

I would like to thank my colleague Jean-Philippe Gariépy for bringing most of these issues to my attention. He’s the one who has to deal the most with the VoiceXML code that our internal dialog framework generates.

Voice APIs: back to basics

We definitely live in interesting times. After years of pushing hard on VoiceXML (2.0 and 2.1), the industry comes up regularly with new approaches departing significantly from the newly proposed VoiceXML 3.0. And these approaches sometimes come from companies working hard on the VoiceXML standardization effort.

For instance, last week Voxeo announced a new interface to its Tropo platform, called Tropo WebAPI. To build a communications application, one has simply to write a web service/application producing JSON documents. These documents contain simple instructions for the communications platform like: play this prompt, ask a question, transfer the call, etc. Very simple instructions, indeed. The results are then sent server-side to the application for further processing and deciding what to do next.
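
To give an idea of this request/response style, here is a minimal sketch of the server-side logic: each call returns a small JSON document with the next instructions, based on the result of the previous step. The field names (`instructions`, `say`, `ask`, `choices`, `transfer`, `next`) and the `/menu-result` URL are illustrative assumptions, not the actual Tropo WebAPI schema.

```javascript
// Return the next instruction document, given the result of the previous
// step (null on the first request of a call).
// ASSUMPTION: all field names and the "/menu-result" URL are made up.
function nextDocument(previousResult) {
    var menu = { ask: "Press 1 for sales or 2 for support.", choices: "1,2" };
    if (previousResult === null) {
        return {
            instructions: [{ say: "Welcome to the demo." }, menu],
            next: "/menu-result"
        };
    }
    if (previousResult === "1" || previousResult === "2") {
        var target = previousResult === "1" ? "sales" : "support";
        return { instructions: [{ transfer: target }] };
    }
    // Anything else: apologize and ask again.
    return {
        instructions: [{ say: "Sorry, that is not a valid choice." }, menu],
        next: "/menu-result"
    };
}

console.log(JSON.stringify(nextDocument(null), null, 2));
console.log(JSON.stringify(nextDocument("2")));
```

All the dialog logic lives in ordinary server code; the platform only executes the instructions it receives.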

This approach reminds me of TwiML, Twilio’s own markup language for implementing voice applications, and (to a certain extent) FastAGI, the Asterisk way of developing server-side voice applications (the preferred way of deploying applications on the Cloudvox platform).

What do these approaches have in common? Well, they all offer a much simpler programming model than VoiceXML. In VoiceXML, there is the form-filling algorithm which tries to fill slots in a form automatically. VoiceXML applications can also contain a fair amount of scripting (in ECMAScript) with many scoping rules for variables. It also provides some exception mechanisms (with catch and throw elements), a root document for storing data, etc. No wonder most development environments targeting VoiceXML platforms only make use of a limited subset of VoiceXML.

In fact, the new approaches are not programming models; they essentially provide low-level instructions for the various voice platforms, much like a virtual machine. It’s up to the user of the platform to implement their own programming model on top of these instruction sets. And this is a very attractive offer, as it will most certainly ignite the development of new application programming environments and frameworks, some of which will be platform agnostic.

We lived through a somewhat similar period at the end of the last century. There were many non-interoperable proprietary IVR platforms, and the industry came up with a solution: VoiceXML. Will we see something similar happen with these new approaches? I doubt it. I think that all these approaches are sufficiently similar that a good abstraction layer on the application side can suffice to support them all easily. In the ’90s, porting an application to a new platform was plainly impossible without a complete rewrite.
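
Here is a minimal sketch of what such an abstraction layer could look like: the application describes a call as a list of abstract steps, and one small renderer per platform turns that list into the platform’s own format. The step format, the function names and the JSON shape are assumptions made for this example; the XML renderer is only loosely modeled on TwiML’s Response, Say and Dial elements.

```javascript
// Escape the characters that are unsafe in XML text content.
function escapeXml(text) {
    return text.replace(/&/g, "&amp;")
               .replace(/</g, "&lt;")
               .replace(/>/g, "&gt;");
}

// Renderer for a JSON-based platform (hypothetical document shape).
function toJsonDocument(steps) {
    return JSON.stringify({ steps: steps });
}

// Renderer for an XML-based platform (loosely modeled on TwiML).
function toXmlDocument(steps) {
    var body = steps.map(function (step) {
        if (step.type === "say") {
            return "<Say>" + escapeXml(step.text) + "</Say>";
        }
        if (step.type === "transfer") {
            return "<Dial>" + escapeXml(step.to) + "</Dial>";
        }
        throw new Error("Unsupported step type: " + step.type);
    }).join("");
    return "<Response>" + body + "</Response>";
}

// The application only deals with this platform-neutral description.
var steps = [
    { type: "say", text: "One moment please" },
    { type: "transfer", to: "+15551230000" }
];

console.log(toJsonDocument(steps));
console.log(toXmlDocument(steps));
```

Supporting another platform then means adding one more renderer, not rewriting the application.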

Strangely, the programming languages community lived through something similar a few years ago. From around 1997 to the start of the century, the craze for Java almost killed research in the field of object-oriented programming language design not targeting Java or the JVM. Then, in 2003 or so, some leading researchers decided consciously that it was time to start a post-Java era. And it’s at about that time that many programming languages started flourishing and that we saw a greater acceptance for dynamic/scripting languages (on the JVM or not). This period also coincided with the rise of the Web 2.0 and a new culture of entrepreneurship, thanks to Paul Graham’s Y Combinator.

I think we are living through something similar today in the communications industry, though a few years later. We see young entrepreneurs and new startups with innovative ideas enter the market. By the way, a few of them presented their ideas at StartupCamp Telephony last week, an event sponsored by Twilio and PhoneTag as part of the ITExpo conference.

The years to come promise to be very exciting.