Keywords: VoiceXML; integration model; voice browser
1 Introduction
With the popularization of information services such as e-commerce and customer service, Interactive Voice Response (IVR) has been widely used in commercial systems. However, this mode of voice interaction has the following shortcomings [1]: (1) poor portability and flexibility; (2) application development on real systems is difficult, especially the compilation and debugging of voice flows; (3) existing network resources cannot be fully exploited. An Internet-based IVR system increases the chances of reuse and reduces cost, and is likely to become a major trend in voice applications. On the other hand, people can so far obtain resources from the Internet only with the help of computers, yet telephones are far more widespread than computers. Allowing people to access Internet resources by telephone would be a qualitative leap for the application and development of the Internet.
Driven by this prospect, the World Wide Web Consortium (W3C) put forward the VoiceXML [2] standard. With this technology, users can access resources on the Internet through the telephone keypad or by voice; this is the core of voice browsing and the voice Internet. Like XML, VoiceXML is a text-based language that defines only how data is accessed; developers must write programs to interpret, generate, and transmit VoiceXML documents.
VoiceXML opens a broad prospect for voice applications and is widely used in voice portals, voice call centers, voice information services, voice e-commerce, and other fields. These applications or services can be combined easily with existing data systems and can even be extended from existing applications. In a VoiceXML-based system, new services can be added flexibly without learning a complicated high-level language or commissioning custom development again: writing a few VoiceXML pages is enough to realize a new business process. Moreover, a finished VoiceXML script can be added to the system at any time without affecting its normal operation.
This paper briefly introduces the VoiceXML specification and its main terms, and presents a VoiceXML-based voice and data integration model. The model accesses VoiceXML documents and databases on the Internet through a VoiceXML interpreter and browser, thereby integrating voice and data and achieving voice browsing.
2 VoiceXML specification
2.1 Structural model
The structural model of VoiceXML [2] is shown in Figure 1. It mainly comprises the document server, the VoiceXML interpreter, the VoiceXML interpreter environment, and the execution platform.
Figure 1 Structural model of VoiceXML
The document server, typically a WEB server, processes requests from the VoiceXML interpreter and returns VoiceXML documents to it. The interpreter parses the tags in a document, generates the corresponding data or action commands, and guides and controls the interaction between the user and the execution platform. Meanwhile, the VoiceXML interpreter environment monitors user input in parallel with the interpreter: for example, one environment may listen for the user's request for operational help, while another may listen for requests to change the volume or other characteristics of the text-to-speech output.
The execution platform is controlled by the interpreter environment and the interpreter. For example, in an interactive voice response application, the VoiceXML interpreter environment monitors incoming calls, obtains the initial VoiceXML document, and answers the call; after the call is answered, the VoiceXML interpreter guides the dialogue. The execution platform generates events in response to user actions (voice or key input) and system events (such as timer expiration). Some of these events are handled by the VoiceXML interpreter according to the current VoiceXML document, while others are handled by the VoiceXML interpreter environment. The execution platform provides character and voice input and audio output, including text-to-speech (TTS), audio file playback, automatic speech recognition (ASR), DTMF key recognition, and voice recording.
2.2 Terminology
The basic terms [2] in VoiceXML mainly include:
Dialogs and subdialogs:
A dialog describes the prompts the application presents to the user, defines and collects the user's responses, and describes the flow of application control; the user and the application interact through dialogs in turn. There are two kinds of dialogs: forms and menus. A form performs the operations described in its definition and encapsulates the commands related to the user's input and output; it may contain fields whose values are collected through the form, and each field can specify a grammar defining the inputs the user is allowed to give. A menu lets the user make a choice and then transitions to the chosen dialog. A subdialog is similar to a function call: it starts a new interaction and then returns to the original form.
For example, a subdialog can be used to create the confirmation sequence required for a database query, to build a batch of components shared by several documents within a single application, or to create a reusable dialog library shared among multiple applications.
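The two dialog types and a subdialog call can be sketched in one document. The document names (confirm.vxml, quit.vxml), dialog ids, and prompt wording below are illustrative assumptions, not taken from the paper:

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- Menu dialog: the user chooses by voice which dialog to enter -->
  <menu id="main">
    <prompt>Say query or quit.</prompt>
    <choice next="#query">query</choice>
    <choice next="quit.vxml">quit</choice>
  </menu>

  <!-- Form dialog: invokes a confirmation subdialog like a function call -->
  <form id="query">
    <subdialog name="confirm" src="confirm.vxml#ask">
      <filled>
        <!-- Assumes confirm.vxml returns a variable "ok" via <return namelist="ok"/> -->
        <if cond="confirm.ok">
          <prompt>Query submitted.</prompt>
        </if>
      </filled>
    </subdialog>
  </form>
</vxml>
```

When the subdialog returns, control resumes in the calling form, mirroring the function-call analogy in the text.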
Session:
A session begins when the user starts interacting with the VoiceXML interpreter context, continues as documents are loaded and processed, and ends when the user, a document, or the interpreter environment requests termination.
Request:
A request (the concept called an application in the VoiceXML specification) is a set of documents sharing the same root document. Whenever the user interacts with a document in the set, its root document is loaded as well; the root document's variables are available to the other documents as request-level variables, and its grammars remain active for the duration of the request. The root document stays loaded while the user moves between documents of the same request, and is unloaded only when the user transitions to a document of another request.
Grammar:
Each dialog has one or more voice and/or DTMF grammars. In machine-directed applications, a dialog's grammars are active only while the user is in that dialog. In mixed-initiative dialogs, where the computer and the user take turns deciding what to do next, some dialogs are flagged so that their grammars (for example, for monitoring calls) remain active even while the user is in another dialog of the same document. In that case, if the user's input matches the active grammar of another dialog, execution transfers to that dialog.
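A DTMF grammar attached to a single field can be sketched as follows, in the SRGS XML grammar form; the field name, prompt, and rule id are hypothetical:

```xml
<field name="dept">
  <prompt>For sales press 1, for support press 2.</prompt>
  <!-- DTMF grammar: only key presses 1 and 2 are valid while this field is active -->
  <grammar mode="dtmf" version="1.0" root="digit">
    <rule id="digit">
      <one-of>
        <item>1</item>
        <item>2</item>
      </one-of>
    </rule>
  </grammar>
</field>
```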
Event:
VoiceXML provides a form-filling mechanism for handling "ordinary" user input. In addition, it defines a mechanism for handling exceptional events: the platform generates an event when, for example, the user fails to respond within a certain period or asks for help, and the interpreter generates an event when it finds a semantic error in a VoiceXML document.
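These events are caught with handler elements such as noinput, nomatch, and help. A minimal sketch, with an illustrative field name and messages:

```xml
<field name="city">
  <prompt>Which city do you want?</prompt>
  <!-- Platform-generated event: the user said nothing within the timeout -->
  <noinput>I did not hear anything. <reprompt/></noinput>
  <!-- Platform-generated event: the input did not match any active grammar -->
  <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
  <!-- Thrown when the user asks for help -->
  <help>Please say the name of a city.</help>
</field>
```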
Link:
A link supports mixed-initiative dialogs; the grammar it specifies is active whenever the user is within the link's scope. If the user's input matches the link's grammar, control transfers to the link's destination URI. A <link> can also be used to throw an event instead of jumping to a destination URI.
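Both uses of a link can be sketched as follows; the target document operator.vxml and the rule ids are hypothetical:

```xml
<!-- Active anywhere in its scope; saying "operator" jumps to the URI -->
<link next="operator.vxml">
  <grammar version="1.0" root="r">
    <rule id="r">operator</rule>
  </grammar>
</link>

<!-- A link may instead throw an event rather than transition -->
<link event="help">
  <grammar version="1.0" root="h">
    <rule id="h">help</rule>
  </grammar>
</link>
```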
Application:
An application consists of documents that share the same application root document. When any of these documents is active, the root document is also loaded; it stays resident in memory while the user jumps between documents of the same application, and is discarded only when the user jumps to a different application. The variables and grammar definitions of the application root document are accessible to every document in the application.
3 Voice and data integration based on VoiceXML
3.1 Overall structural model
The VoiceXML application model is shown in Figure 2. It consists mainly of the following parts: a VoiceXML gateway, a WEB server, and a database server. The functions of each part are introduced below.
Figure 2 VoiceXML application model
3.2 File structure and its implementation process
VoiceXML builds the application structure in units of applications, sessions, and documents, and takes the dialog as the unit of interaction, completing dialogs one by one and determining the flow of control. The <vxml> element can be regarded as a container of dialogs, and every VoiceXML document consists of a series of dialogs. A group of VoiceXML documents that can jump to one another forms a conversational finite state machine. The user is always in exactly one dialog at a time, and each dialog decides which dialog to transfer to next. Transitions are specified by URIs, which name the next document and the dialog within it.
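A dialog-to-dialog transition of this kind might look like the following sketch, where menu.vxml#main is a hypothetical target URI (document menu.vxml, dialog id main):

```xml
<form id="welcome">
  <block>
    <prompt>Welcome.</prompt>
    <!-- The URI names the next document and, after #, the dialog within it -->
    <goto next="menu.vxml#main"/>
  </block>
</form>
```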
The root document is the starting point of a VoiceXML program and may include forms, scripts, variables, grammars, and other elements. Execution always begins at the first dialog element of a document, and when the program needs to transfer control it jumps from one dialog to another. A multi-document structure is commonly used, with one root document per application.
An example application follows:
Application root document (app-root.vxml)
<?xml version="1.0"?>
<vxml version="2.0">
  <var name="test" expr="'Man'"/>
  <link next="operator_xfer.vxml">
    <grammar>
      <rule id="root" scope="public">operator</rule>
    </grammar>
  </link>
</vxml>
Leaf document (leaf.vxml)
<?xml version="1.0"?>
<vxml version="2.0" application="app-root.vxml">
  <form id="say_hello">
    <field name="answer" type="boolean">
      <prompt>Shall we say <value expr="application.test"/>?</prompt>
      <filled>
        <if cond="answer">
          <exit/>
        </if>
        <clear namelist="answer"/>
      </filled>
    </field>
  </form>
</vxml>
A VoiceXML application is a collection of VoiceXML documents. Every application contains a root document, somewhat like the default.asp or index.asp of a dynamic web site: whenever the application is invoked, the root document is loaded.
3.3 VoiceXML gateway
3.3.1 Speech recognition
Speech recognition enables the computer to understand the user's spoken commands, generate the corresponding text, and return it to the VoiceXML parser for processing. In the VoiceXML gateway, the speech recognition engine is a command-style, grammar-constrained engine: it recognizes the user's speech according to a finite grammar and produces results corresponding to the grammar definition. In a VoiceXML voice browser, the grammar determines what the user can say and how to say it; a well-designed grammar gives the user a good interactive experience and effectively improves the recognition rate of the engine.
In the VoiceXML gateway, speech recognition must handle not only the user's voice signal but also the user's key presses; DTMF key input and voice input are processed and delivered through the same mechanism.
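This unified handling can be sketched with one of VoiceXML's built-in field types, which accept either spoken or keyed input through the same field; the form id and prompt wording below are illustrative:

```xml
<form id="account_form">
  <!-- The built-in "digits" type accepts both spoken digits and DTMF key presses -->
  <field name="account" type="digits">
    <prompt>Please say or key in your account number.</prompt>
    <filled>
      <!-- Either input modality fills the same field variable -->
      <prompt>You entered <value expr="account"/>.</prompt>
    </filled>
  </field>
</form>
```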
A typical speech recognition process [3] is shown in Figure 3.
Fig. 3 Typical speech recognition process
Some adjustments can be made to the software structure for data compression and transmission [4]; Figure 4 shows an improved method based on the client/server mode.
Fig. 4 Speech recognition based on client/server mode