Future Emu Development Plans

This document attempts to set out our plans for future development of the Emu Speech Database System.

Current Version 1.x

Release 1.6, July 2001

Double-click template files to open a database.

This has some consequences since now Emu will accept a template file name as a database name as well as the basename of the template. Templates need not be in the standard directory and so can be distributed on CDROM with a database. Other changes needed are to make the paths for AutoBuild and EmulabelModules relative to the template file; in fact, the labeller will now look in a few obvious places for these files if they are given as relative paths.
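The lookup described above can be sketched roughly as follows. This is an illustration in Python rather than Emu's own C++/Tcl code; the function names, the `.tpl` extension constant and the search order are all assumptions made for the example, not Emu's documented behaviour.

```python
import os

TEMPLATE_EXT = ".tpl"  # assumed template file extension


def find_template(name, template_dirs):
    """Return the template path for a database name.

    `name` may be the path of a template file, or the bare basename of a
    template stored in one of `template_dirs` (a hypothetical search path).
    """
    if os.path.isfile(name):
        return os.path.abspath(name)
    for directory in template_dirs:
        candidate = os.path.join(directory, name + TEMPLATE_EXT)
        if os.path.isfile(candidate):
            return os.path.abspath(candidate)
    raise FileNotFoundError("no template found for %r" % name)


def resolve_relative(path, template_path, extra_dirs=()):
    """Resolve a path named in a template, e.g. an AutoBuild script.

    Absolute paths are returned unchanged. Relative paths are tried first
    against the template's own directory and then against `extra_dirs`;
    this order is an assumption for the sketch.
    """
    if os.path.isabs(path):
        return path
    search = [os.path.dirname(os.path.abspath(template_path))]
    search.extend(extra_dirs)
    for directory in search:
        candidate = os.path.join(directory, path)
        if os.path.exists(candidate):
            return candidate
    # Nothing found: fall back to the template-relative location.
    return os.path.join(search[0], path)
```

Resolving against the template's own directory first is what would let a template and its database travel together on a CDROM without editing.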

Template editor

A simple text editor is now built into the Emu labeller. On Windows this can be invoked via the 'edit' option after right-clicking on a template file on the desktop.

Dialogue Annotation and Transcriber

Some features have been added to Emu to make annotation of dialogue feasible. One such feature is an active link between Transcriber and Emu. Another is a plugin for Transcriber which allows export of annotations in Emu HLB format. A new module for the labeller has been added which shows, for example, turn-level labels in a list and loads speech data for only one turn at a time. In this way, long utterances which have already been segmented with Transcriber can be annotated in detail with the Emu labeller.

Future 1.x releases

There may be some future releases in this series if there is demand for a particular feature or if a serious bug is found. Currently the following items are possible contenders for features to be added here.

Macintosh port

We are very close to having a Macintosh version of Emu. A version 1.x release will be made for the Macintosh.

Cut & paste in labeller

The ability to cut and paste label text in the labeller.

Emu Version 2.x

The next version of Emu will have a radically different internal organisation. The new design incorporates lessons we have learned in the deployment of Emu 1.x and integrates the work of a number of different annotation projects. Emu 2.x is part of an effort to establish standards by which annotation tools can inter-operate. It does this on two levels. Firstly, by making use of the Annotation Graph library from the LDC, we support a standard data model for annotations and share file input/output code for different kinds of annotations. Secondly, the new Emu labeller is based on a new modular architecture which has a public interface; the outcome of this is that Emu can use components of other annotation systems where appropriate.

Another major change in the architecture of Emu is the separation of data file handling and signal processing code from the Emu database core. Emu has never done much signal processing itself but it has included code to read SSFF and WAV format data files. The new version of Emu will make use of the Edinburgh Speech Tools library (C++) and/or the Snack library (Tcl) to read, play and process speech data files.

Annotation Graph based internal representation.

Steven Bird's group at LDC has developed an Annotation Graph library written in C++ which can be linked in to other applications. The next version of Emu will use this library as the core internal representation of annotations. This library has a Tcl interface and so can be used immediately to build annotation tools as part of the Emu system.

Tasks

File Handling
Currently the AG module reads annotations given a single filename. In many cases we'll want to read annotations from sets of files (eg. .hlb, .lab, .tone) and the full filenames will come via the Emu template library. We need to investigate what changes might be needed to the AG library to enable this kind of interface. SC
HLB file import
Write an import filter to enable the AG library to read (and possibly write) Emu HLB formatted annotation files. This could make use of the existing Emu file input routines. SC

Modular Labeller Architecture

The replacement for the Emu labeller will be made from components which are responsible for displaying one kind of signal or one kind of view on the annotation. Features of the new labeller will include:

  • Support for labelling arbitrarily long files.
  • Able to display signals after some DSP, eg. gradient of a track or dynamically generated pitch track.
  • Simple linear (score) annotation view.
  • Emu style hierarchy annotation view.
  • Transcriber style annotation view.
  • Spectrograms with editable overlaid formant tracks.

All views on an annotation should if possible allow in-place editing of the annotation.

Tasks

The following components need to be developed:

Manager Widgets
The modular display architecture calls for manager widgets to bind groups of displays together and coordinate their activity. SC, SH
Labeller
A labeller built from these components will be a manager widget with additional functionality such as choosing databases and files to edit. SC, SH
Basic signal & spectrogram display.
This can be a port of the existing Emu signal display code but modules based on Snack and Wavesurfer are also possible. SH, SC, MS
Basic single level score label view.
Similar to the current Emu Labeller signal label view. Needs to take into account the added complications introduced by the Annotation Graph format, in particular overlapping and non-contiguous labels. SH
Hierarchy View.
A view of the annotation as a hierarchy. This will use the Annotation Graph model and the modular display architecture. Care needs to be taken where the underlying annotation is not a true hierarchy. The view should be editable. SH
Transcriber style display/labeller.
A Transcriber style dialogue annotation tool. It's possible that some of the existing Transcriber code could be used here. Claude et al. should contribute here if possible. SH, SC, CB?
EPG display
A display for EPG signals derived from various sources. Modelled on the XASSP display but with support for the modular display architecture and if possible a range of EPG capture systems. MS
Video Display
A display for video data using the modular display architecture. A good resource will be the QuickTimeTcl package which gives access to video on Windows and Macintosh. DS, SH, MS, SC
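The score label view task above notes that Annotation Graph annotations may contain overlapping and non-contiguous labels. A display component would need to detect both cases before drawing a level. The following Python sketch shows one way to do so; the `(start, end, text)` tuple layout is hypothetical and is not the Annotation Graph library's API.

```python
def find_gaps_and_overlaps(labels, tolerance=1e-9):
    """Find gaps and overlaps between labels on a single level.

    `labels` is a list of (start, end, text) tuples with times in seconds
    (an assumed layout for this sketch). Each label is compared with the
    earlier label that reaches furthest in time, so a long label that spans
    several later ones is reported against each of them.
    Returns (gaps, overlaps), each a list of
    (earlier_text, later_text, seconds) tuples.
    """
    ordered = sorted(labels, key=lambda item: (item[0], item[1]))
    gaps, overlaps = [], []
    if not ordered:
        return gaps, overlaps
    _, furthest_end, furthest_text = ordered[0]
    for start, end, text in ordered[1:]:
        difference = start - furthest_end
        if difference > tolerance:
            gaps.append((furthest_text, text, difference))
        elif difference < -tolerance:
            overlaps.append((furthest_text, text, -difference))
        if end > furthest_end:
            furthest_end, furthest_text = end, text
    return gaps, overlaps
```

A view could use the two lists to decide, for instance, whether a level can be drawn on a single row or needs stacked rows for the overlapping labels.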

XML Based Database Templates

Emu version 2 uses a new extended database template format based on XML. This template duplicates all of the information stored in current Emu template files but is designed to be more easily extended and used in other tools. Important features of the new template file system are as follows:

  • Provides a simple upgrade path from existing Emu templates: no information will be lost in upgrading and tools will be provided to convert existing template files.
  • A template file editor will be developed which will ensure that template files are valid and which will hide the details of the file format from the casual user.
  • Templates can be derived from other templates. For example a ToBI annotation template might define the shape of ToBI annotations (levels and their relationships). A real database template could refer to this and add information about the location of files etc.
  • Configuration information for various tools can be included in the template. Currently this is done via template variables (eg. LabellerLevels). The new format allows much richer configuration to be stored and passed to any kind of annotation tool.
  • Maintain platform independence. You should be able to use a single database template file on any platform by modifying only the location of files. We will look at ways of making even this modification unnecessary so that, for example, a database on CDROM which includes a template file can be used on any platform without change.
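To illustrate the idea of deriving one template from another, here is a small Python sketch that merges a base template into a derived one. The XML element and attribute names (`template`, `base`, `level`, `parent`, `path`, `ext`, `dir`), the level names and the directories are all invented for this example; they are not the prototype DTD.

```python
import xml.etree.ElementTree as ET

# Both documents use made-up element and attribute names for illustration.
BASE_XML = """
<template name="tobi">
  <level name="Intonational"/>
  <level name="Word" parent="Intonational"/>
  <level name="Tone" parent="Word"/>
</template>
"""

DERIVED_XML = """
<template name="mydb" base="tobi">
  <level name="Break" parent="Word"/>
  <path ext="wav" dir="/data/mydb/signals"/>
  <path ext="hlb" dir="/data/mydb/labels"/>
</template>
"""


def load_template(xml_text, known_templates):
    """Return (levels, paths) for a template, including inherited entries.

    `levels` maps a level name to its parent level (or None) and `paths`
    maps a file extension to a directory. Entries in the derived template
    override entries of the same name inherited from its base.
    `known_templates` maps template names to their XML text.
    """
    root = ET.fromstring(xml_text)
    levels, paths = {}, {}
    base_name = root.get("base")
    if base_name is not None:
        levels, paths = load_template(known_templates[base_name], known_templates)
    for level in root.findall("level"):
        levels[level.get("name")] = level.get("parent")
    for path in root.findall("path"):
        paths[path.get("ext")] = path.get("dir")
    return levels, paths


levels, paths = load_template(DERIVED_XML, {"tobi": BASE_XML})
```

In this sketch the database template contributes only its file locations and one extra level, while the shape of the annotation comes from the shared base.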

Tasks

Define core XML DTD
The DTD defines the syntax of the template file. There is already a prototype DTD but my recent reading has convinced me that RDF is a better way to go. SC
C++ template interface.
A C++ library which reads and writes XML template files and provides an interface to the rest of the system. First version already implemented. At the moment this uses the Expat XML parser; it may be useful to port this to the Xerces parser since that is used by the Annotation Graph library and by tclxml. SC
Tcl binding
This will allow template manipulation from Tcl scripts. Required to build the template editor and for all other user level tools in Emu. SC
Python, Perl or other language bindings?
Although we use Tcl to build tools, others prefer different languages. We should investigate using SWIG to build the scripting language interface to see if we can generate alternate interfaces at no additional cost.
Template editor
Since the template is an XML application, we may be able to use a general XML editor here. Look at Zveno's waxml (nee swish) which has a plugin architecture that might be useful.
Database Installer
Look at what is needed to install a database+template file on a system. This might include modifying the template file (simplified editor). We might also look at packaging issues like using MetaKit to pack up all of the label files for distribution.

Database Query

The current query language in Emu is sufficient for many needs but doesn't have sufficient power to properly query annotation graphs. With the new AG core library we will lose the current query language implementation and so a new query system will be needed. This is not a simple undertaking as the design of a new QL has many complex considerations. As a stopgap we might consider implementing a variation on the current QL on top of the annotation graph system. This would at least support a useful range of queries in the short term. In the longer term a new QL proposal needs to be made and implemented.

Tasks

Query Language
Implement some kind of query system as a stopgap measure to make the system useable. Probably in Tcl to save on implementation time. SC
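As a rough indication of how small a stopgap could be, the following Python sketch evaluates simple `Level=label` and `Level!=label` queries, with `|` separating alternative labels, over a flat list of annotations. The dictionary layout, the function names and the exact syntax accepted are assumptions for illustration; the sketch matches literal labels only and does not implement the current Emu QL (for instance label classes or hierarchical queries).

```python
import re

# Accepts "Level=label", "Level!=label" and "Level=a|b|c".
QUERY_PATTERN = re.compile(r"^\s*(\w+)\s*(!=|=)\s*(.+?)\s*$")


def parse_query(query):
    """Split a query into (level, want_match, set_of_labels)."""
    match = QUERY_PATTERN.match(query)
    if match is None:
        raise ValueError("unsupported query: %r" % query)
    level, operator, alternatives = match.groups()
    labels = {label.strip() for label in alternatives.split("|")}
    return level, operator == "=", labels


def run_query(query, annotations):
    """Return the annotations on the queried level that satisfy the query.

    `annotations` is a list of dicts with the keys "level", "label",
    "start" and "end" (a hypothetical layout, not the AG data model).
    """
    level, want_match, labels = parse_query(query)
    return [
        item
        for item in annotations
        if item["level"] == level and (item["label"] in labels) == want_match
    ]
```

The result plays the role of a segment list: each returned item carries the start and end times needed for later data extraction.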

Speech Data Handling

The current Emu system does very little signal processing and relies on third-party packages to perform formant and pitch tracking etc. This is a significant problem since there is little integration between signal processing software and Emu database functions. Data file input has been handled by some Emu-only routines (SSFF and WAV formats) and by third-party libraries (ESPS for the Entropic sd and fea formats, NIST Sphere for the NIST format and the Edinburgh Speech Toolkit for a variety of other formats). Emu has never had a good interface for writing speech data files of any kind.

There are two areas where Emu makes use of data file input: display of signals in the labeller and data extraction for analysis and signal processing. The requirements for both of these areas will be discussed here.

Signal Display

Components of the Emu labeller need to be able to display all or part of a speech data file aligned with the annotation. These data files can be speech waveforms, physiological data, formants, pitch traces or any other time series data either derived from a speech waveform or captured separately. This data is stored in a number of file formats (wav, sd, au, Entropic FEA, SSFF etc.); an important capability is to be able to add support for a new file format relatively easily.

For signal display, the requirement is to be able to read data from these many file formats into memory and then pass the data to the appropriate visualisation routines. We already have a good time series display widget for Tcl (padgraph) which offers all of the features needed by Emu and supports a C interface for adding data points to the display. The only new work needed here is to improve the file input/output interface.

Another option for signal display is to make use of the Snack toolkit. Snack is a Tcl extension package which handles input and output of speech signal data in various file formats and supports a number of signal processing operations. Snack provides a very good platform independent (Unix, Windows and Macintosh) audio recording and playback facility.

One issue with Snack is that it only supports reading and writing of sampled speech data and not other data such as pitch or formant tracks. As such it is very useful for display of speech signals (including spectrograms, as it includes a very flexible spectrogram display system) and for writing waveform editing and manipulation programs. My current feeling is that we should make use of Snack where appropriate to write utility scripts and display modules for Emu but not make it a central part of the Emu system.

Signal Processing and Analysis

In the current Emu system, data can be extracted for each segment in a segment list (the result of a database query) and is generally written to a text file or imported into the Splus/R environment for further processing. In some cases C++ code has been written to perform some signal processing on raw speech data corresponding to segments.

The idea of performing signal processing operations on the results of queries is very powerful and support for it should be extended in the new system. For example, one could query the database for vowels and then perform a series of spectral calculations (eg. ERB weighted spectra) on each vowel, storing the data for later analysis. To do this will require a high level interface to a signal processing library.

The Edinburgh Speech Tools library provides most of the facilities that might be used in this kind of application. It supports input and output of data files in many formats, resampling of time series data, windowing of data and many signal processing operations. The library is well written and modular and would be easily extended if, for example, we wanted to add a formant tracker or other signal processing operations to the library.

In order to make writing new signal processing programs easy, a scripting interface to the Edinburgh Speech Tools library is needed. The obvious choice here is to interface to Tcl and this should be a first step. This would enable Tcl scripts to be written to do complex signal processing operations and for GUI interfaces to be constructed so that common operations (eg. doing one of a set of DSP operations on each segment in a query result) could be presented to users in a relatively simple way. I envisage the Tcl interface enabling scripts such as:

set segments [$dbase query "Phonetic=vowel"]
set dft_data [emu_data -format <some kind of format specification>]

foreach segment $segments {
      ## retrieve sampled speech data for this segment
      set data [emu_get_data $segment "samples"]

      ## calculate DFT
      set dft [$data process -operation dft -window hamming -ncoeffs 5]

      ## append dft data to the new data object
      $dft_data append $dft
}
## write out the dft data
$dft_data write "newdata.dft"
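
The same per-segment pattern can be tried out today without any Emu or ESTools code. The Python sketch below uses only the standard library; the function names and the `(start, end, label)` segment layout are assumptions for illustration, and the direct DFT is written for clarity rather than speed.

```python
import cmath
import math


def hamming(n):
    """Return an n-point symmetric Hamming window."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft_magnitudes(samples, ncoeffs):
    """Magnitudes of the first `ncoeffs` DFT coefficients of the
    Hamming-windowed samples (direct O(n * ncoeffs) evaluation)."""
    n = len(samples)
    windowed = [s * w for s, w in zip(samples, hamming(n))]
    result = []
    for k in range(min(ncoeffs, n)):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(windowed)
        )
        result.append(abs(total))
    return result


def process_segments(signal, sample_rate, segments, ncoeffs=5):
    """Apply dft_magnitudes to the samples of each segment.

    `segments` is a list of (start_seconds, end_seconds, label) tuples,
    standing in for the result of a database query.
    Returns a list of (label, magnitudes) rows.
    """
    rows = []
    for start, end, label in segments:
        first = int(round(start * sample_rate))
        last = int(round(end * sample_rate))
        rows.append((label, dft_magnitudes(signal[first:last], ncoeffs)))
    return rows
```

A scripting interface of the kind proposed here would replace `dft_magnitudes` with calls into the signal processing library while keeping the loop over segments at script level.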
    

Another option is to look at a direct interface between ESTools and Splus/R so that Splus scripts can read data directly from speech data files and perform DSP operations on the data before importing it into Splus/R for analysis and visualisation. We would need to study the low level interfaces between Splus/R and C/C++ to see if this is feasible and/or desirable.

Tasks

Develop a Tcl interface to ESTools.
This will make all of the functions and datatypes in ESTools available to Tcl scripts. SC, DS
Port ESTools to Macintosh
ESTools currently runs on Unix and Windows; we need to port at least the file i/o and DSP parts of ESTools to the Macintosh platform. If this proves impractical, we will need to rethink the proposed reliance of Emu on ESTools. DS
Develop a framework for DSP programs operating on segment lists.
This will mimic the structure of the existing get_bark and get_fft programs by Catherine Watson but allow script level control over what operations are applied to each signal. MS, SC
Investigate the interface to Splus and/or R.
It may be possible to link ESTools into Splus/R at the C code level. If not we need to look at how else Splus code could perform DSP operations on large datasets. SC, DS
Formant tracker
Port the XASSP/Kiel formant tracker to the Emu framework. At a minimum this would mean that it could output data readable by ESTools; a more complete port would enable any fragment of data to be formant tracked by integrating the tracker into ESTools. MS

Splus/R Interface

Emu currently provides a library for Splus or R which includes an interface to database query and data extraction as well as a large collection of functions for data analysis and visualisation. The majority of this code will be retained in the next version but we should look at better integration between Splus/R and the Emu core and at any extensions to the library that might be appropriate.

Tasks

R/Splus integration
Develop the interface between Emu and R/Splus to replace the current emu.query and emu.track functions. These should be backward compatible where possible to enable old programs to work on new databases. We might look at using the tcltk library in R to provide a tighter interface between the two systems and build a graphical interface to some common operations.
R/Splus library development
We need to continue developing the library to make it more useable and functional.

New User Level Applications

In the current version of Emu the user level applications are the labeller (emulabel) and Splus or R. Some special purpose applications have been written, such as the segmentation tool and the speech capture tool, but these have not been well documented or made widely available to the user community. As the functionality in Emu is extended, we foresee a need for a new set of applications using the core Emu components. Some of these are outlined here.

Tasks

A waveform editor
A common task in speech corpus work is to edit waveforms to remove sections or to perform some signal processing (eg. filtering) on a waveform and then save it to a new file. While we don't wish to duplicate the functionality of programs like Cool Edit, it would be useful to have a waveform editing program in the Emu suite of tools.
Generating Stimuli for Perception Experiments
An old program called muppf was designed to take a list of speech segments, perform some normalisation on them and then output them to a composite file possibly separated by beeps or pauses. This functionality can be provided in the new Emu system by an application that makes use of the underlying DSP facilities.
Speech Synthesis Interface
It would be useful to be able to synthesise or resynthesise speech segments for perception experiments. For example, we might want to resynthesise vowels from the first two formant tracks or resynthesise segments from LPC coefficients using a fixed pitch trace. Again the underlying DSP facilities should be capable of this. The functionality could be integrated into the application mentioned in the previous task.

For more information, please send mail to Steve.Cassidy@mq.edu.au.

Copyright © 2001, Department of Linguistics, Macquarie University.