Chapter 13. Statistics and Classification Experiments

Table of Contents

Gaussian Classification
Data Reduction

The Splus system provides most of the statistical analysis software that you will need in speech analysis. However, Emu provides some specialist functions which are often used in speech research and are either not present in Splus or are not suited to the large amounts of data common in speech problems.

Gaussian Classification

A common way to evaluate the discriminatory power of a set of features in speech research is to carry out a gaussian classification analysis. In a gaussian model, each class in a set of data is characterised by its mean (the class centroid) and covariance along a number of dimensions. New data points can then be classified by measuring their distance from each centroid and assigning the class of the closest centroid. Emu provides functions to build a gaussian model from a set of multi-dimensional data and to classify new data with either of two distance measures.

The function train takes a vector or matrix of data representing a number of segments (one segment per row) such as returned by track or muspec, and a parallel label vector. It returns the class centroid and covariance matrix for each unique label in the label vector. Obviously, if the dimensionality of the input data is high this procedure will take some time. For such data it is often useful to first carry out a data-reduction step such as principal components analysis or canonical discriminant analysis and then build the gaussian model. This will be discussed further later in this chapter.
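As a rough illustration of what such a model contains, the following base-R sketch computes a centroid and covariance matrix per label. The function name train_sketch, the data values, and the labels are all made up for this example; it is not the Emu implementation of train, whose return format may differ.

```r
# Hypothetical stand-in for train(); NOT the Emu implementation.
train_sketch <- function(data, labs) {
  data <- as.matrix(data)
  # Group row indices by label, then summarise each group.
  lapply(split(seq_len(nrow(data)), labs), function(rows) {
    x <- data[rows, , drop = FALSE]
    list(mean = colMeans(x), cov = cov(x))
  })
}

# Made-up two-dimensional data: three segments in each of two classes.
data <- rbind(c(700, 1200), c(720, 1250), c(680, 1150),
              c(400, 2000), c(420, 2100), c(380, 1900))
labs <- c("A", "A", "A", "I", "I", "I")

model <- train_sketch(data, labs)
model$A$mean   # centroid of class A
model$A$cov    # 2 x 2 covariance matrix of class A
```

Each class needs enough segments to estimate a full covariance matrix, which is one reason a data-reduction step helps when the dimensionality is high.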

Bayesian Classification

Once you have a gaussian model, you can use one of two procedures to classify new data points: Bayesian distance or Mahalanobis distance. The Bayesian distance measure treats each centroid and covariance matrix as the specification of a probability distribution for that class. For each new data point we calculate the probability that the point came from each class; the data point is then assigned to the class which gave the highest probability. To illustrate, consider the one-dimensional example shown in Figure 13.1, “A one dimensional probability distribution for two classes A and B”. Here we have two probability distributions: class A is centered at 5 and has a narrow distribution, while class B is centered at 10 and has a wider distribution. The y-axis shows the probability density for each distribution. The point P is intermediate between the two centroids, but the probability that it was derived from class B is larger than the probability that it was derived from class A. Consequently, this point would be classified as B. Point Q, on the other hand, is closer to the center of A and so has a higher probability in the A distribution than in that of B; it would be classified as A. The Bayesian distance measure is similar to the straight line (Euclidean) distance measure but takes into account the shape of the probability distribution for each class. This probability distribution is estimated by the train function, which finds the centroid and covariance matrix of the training data for each class.
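The one-dimensional case can be tried out in base R with dnorm. This is only a sketch: the text gives the centres of the two classes (5 and 10) but not their spreads or the positions of P and Q, so the standard deviations and the two test points below are assumptions, and classify_1d is a hypothetical helper name.

```r
# Assumed parameters: centres are from the text, spreads are made up.
mean_a <- 5;  sd_a <- 1   # narrow class A
mean_b <- 10; sd_b <- 3   # wide class B

# Hypothetical helper: pick the class with the higher probability density.
classify_1d <- function(x) {
  dens <- c(A = dnorm(x, mean_a, sd_a), B = dnorm(x, mean_b, sd_b))
  names(which.max(dens))
}

classify_1d(7.5)  # assumed P: midway between the two centroids
classify_1d(6)    # assumed Q: nearer the centre of A
```

With these assumed spreads the midway point is assigned to B, because the wide distribution still has appreciable density there, while the point at 6 is assigned to A.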

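The Bayesian distance measure is not defined formally here. For a gaussian class model with centroid \mu_k and covariance matrix \Sigma_k, the standard textbook form, assuming equal prior probabilities for the classes, is $d_k(x) = (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \ln|\Sigma_k|$. The point x is assigned to the class with the smallest d_k(x), which is the class giving the highest probability density; the expression used internally by bayes.lab may differ in detail.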

Figure 13.1. A one dimensional probability distribution for two classes A and B.


To classify a set of data using the Bayesian distance measure, use the bayes.lab function, which takes two arguments: a matrix of data and a gaussian model as returned by train. The data must have the same dimensionality as that used to generate the model. As an example, we can attempt to distinguish the vowels [A], [O], and [V] based on the first two formant values at the midpoint. First we extract the data using track, and then we classify using train and bayes.lab:

segs <- emu.query("demo", "*", "Phonetic=A|O|V")
data <- track(segs, "fm", cut=0.5)
labs <- label(segs)
model <- train(data[,1:2], labs)
blabs <- bayes.lab(data[,1:2], model)
confusion(labs, blabs)

   O  V  A 
O 16  0  0
V  0 10  0
A  0  0 15
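A table like the one above can be approximated in base R with table. The sketch below uses made-up label vectors and is not the Emu confusion function; true_labs and pred_labs are hypothetical names standing in for labs and blabs.

```r
# Made-up labels: true_labs plays the role of labs, pred_labs of blabs.
classes   <- c("O", "V", "A")
true_labs <- c("O", "O", "V", "V", "V", "A", "A")
pred_labs <- c("O", "O", "V", "O", "V", "A", "A")

# Fixing the factor levels keeps the table square even if a class
# never appears among the predictions.
conf <- table(true = factor(true_labs, levels = classes),
              predicted = factor(pred_labs, levels = classes))
conf
```

Rows are the true classes and columns the assigned classes, so off-diagonal counts are misclassifications.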

With two dimensional data we can also visualise the distribution of data using eplot:

eplot(data[,1:2], labs, dopoints=T, formant=T)

The result is shown in Figure 13.2, “The distribution of [A], [O], and [V] in the F1/F2 plane”.

Figure 13.2. The distribution of [A], [O], and [V] in the F1/F2 plane.


The Mahalanobis Distance Measure

An alternative distance measure in common use is the Mahalanobis distance. This is similar to the Bayesian distance in that it takes into account the shape of the covariance matrix of the class model. However, the derivation of the Mahalanobis distance formula assumes that the covariance matrices of all classes are the same, in order to simplify the calculations involved. It is therefore valid to use the Mahalanobis distance measure if the data for each class are similarly distributed, although nothing prevents you from using it if they are not. The Mahalanobis distance of a point x from a class with centroid \mu_k, given the covariance matrix \Sigma, is defined as $d_k(x) = \sqrt{(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}$.
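The distance can be computed in base R with the mahalanobis function from the stats package, which returns squared distances. The following is a minimal sketch of nearest-centroid classification under the shared-covariance assumption; the centroids, covariance matrix, and data points are invented for illustration, and this is not the Emu mahal function.

```r
# Made-up class model: centroids and the shared covariance are assumptions.
centroids  <- list(A = c(700, 1200), O = c(500, 900))
shared_cov <- matrix(c(400, 100, 100, 2500), nrow = 2)

# One row per data point (two dimensions, e.g. F1 and F2).
points <- rbind(c(690, 1180), c(520, 950))

# mahalanobis() returns SQUARED distances; column k holds the
# distances of every point to class k.
d2 <- sapply(centroids, function(mu) {
  mahalanobis(points, center = mu, cov = shared_cov)
})

# Assign each point the label of its nearest centroid.
mlabs_sketch <- colnames(d2)[apply(d2, 1, which.min)]
mlabs_sketch
```

Because only the ordering of the distances matters for classification, there is no need to take the square root.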

The mahal function takes a matrix of data and a gaussian model generated by train, and assigns a label to each data point. The data must have the same dimensionality as that used to build the model. In the following example we classify the data derived above using the Mahalanobis distance measure:

mlabs <- mahal(data[,1:2], model)
confusion(labs, mlabs)

   O V  A 
O 16 0  0
V  2 8  0
A  0 0 15

Compare these results with those given by bayes.lab above. Although the Bayesian distance measure gave better results in this case, this is not universally so. The choice of distance measure in a given experiment should be based on the shape of the class distributions: if they are similar, use of the Mahalanobis distance is justified (and it is significantly faster to compute); if they are very different, the Bayesian distance is more appropriate.
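A small one-dimensional sketch shows how the two measures can disagree when the class spreads differ. The class parameters and the test point are assumptions made for illustration. The Bayesian form used here is the standard gaussian one (squared Mahalanobis distance plus the log of the variance, with equal priors), which is not confirmed to be exactly what bayes.lab computes, and each class's own variance is used for the Mahalanobis distance.

```r
# Assumed one-dimensional classes: a narrow A and a wide B.
mu  <- c(A = 5, B = 10)
sdv <- c(A = 1, B = 3)

x <- 6.5  # made-up test point

maha  <- ((x - mu) / sdv)^2   # squared Mahalanobis distance to each class
bayes <- maha + log(sdv^2)    # adds the log-variance (log-determinant) term

names(which.min(maha))   # Mahalanobis choice
names(which.min(bayes))  # Bayesian choice
```

For this point the Mahalanobis distance favours the wide class B, while the extra log-variance term makes the Bayesian measure favour the narrow class A.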

Open and Closed Testing

In the examples above the same data was used to build the gaussian model (using train) and to evaluate it (using mahal or bayes.lab). This is known as a closed test of the model, since the set of data being considered is closed. In a true open test, the test data should be independent of the training data, for example, from a different set of speakers. To perform an open test with the functions described above it is only necessary to derive two segment lists (for training and testing segments) and the corresponding track data from each. The model is then trained using the first set of data and tested using the second, as in the following example:

# train.data and test.data are assumed names for the track data
train.segs <- emu.query("demo", "msajc*", "Phonetic=A|O|U")
train.data <- emu.track(train.segs, "fm", cut=0.5)

test.segs <- emu.query("demo", "msadb*", "Phonetic=A|O|U")
test.data <- emu.track(test.segs, "fm", cut=0.5)

# train on the first set of utterances only
model <- train(train.data[,1:2], label(train.segs))

blabs <- bayes.lab(test.data[,1:2], model)    # perform open test
confusion(label(test.segs), blabs)

   U  O A 
U 11  0 0
O  0 11 0
A 12  2 7
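The confusion matrix above can be summarised as proportions correct. The sketch below re-enters the counts by hand in base R and assumes that rows are the true labels and columns the assigned labels, which the text does not state explicitly.

```r
# Counts copied from the open-test confusion matrix above.
# Assumption: rows are true labels, columns are assigned labels.
conf <- matrix(c(11,  0, 0,
                  0, 11, 0,
                 12,  2, 7),
               nrow = 3, byrow = TRUE,
               dimnames = list(true = c("U", "O", "A"),
                               predicted = c("U", "O", "A")))

diag(conf) / rowSums(conf)    # per-class proportion correct
sum(diag(conf)) / sum(conf)   # overall proportion correct
```

Under that reading, U and O are classified perfectly while only a third of the A tokens are, giving 29 of 43 correct overall, noticeably worse than the closed-test results earlier in the chapter.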