# Computer Science 401 Homework Assignment #3 Speech solved

\$35.00

## Description

5/5 - (1 vote)

1 Introduction
This assignment introduces you to Gaussian mixture modelling, and two basic tasks in speech technology:
speaker identification, in which we try to determine who is talking, and speech recognition, in which we try
to determine what was said.
The assignment is divided into two sections. In the first, you will experiment with speaker identification
by training mixtures of Gaussians to the acoustic characteristics of individual speakers, and then identify
speakers based on these models. In the second section, you will evaluate two speech recognition engines.
The data come from the CSC Deceptive Speech corpus, which was developed by Columbia University, SRI International, and University of Colorado Boulder. It consists of 32 hours of audio interview
from 32 native speakers of Standard American English (16 male,16 female) recruited from the Columbia
University student population and the community. The purpose of the study was to distinguish deceptive
speech from non-deceptive speech using machine learning techniques on extracted features from the corpus.
Data are in /u/cs401/A3/data/; each sub-folder represents speech from one speaker and contains raw
audio, pre-computed MFCCs, and orthographic transcripts. Further file descriptions are in Appendix A.
2 Speaker Identification
Speaker identification is the task of correctly identifying speaker sc from among S possible speakers si=1..S
given an input speech sequence X, consisting of a succession of d-dimensional real vectors. In the interests
of efficiency, d = 13 in this assignment. Each vector represents a small 25 ms unit of speech called a frame.
Speakers are identified by training data that are ascribed to them. This is a discrete classification task
(choosing among several speakers) that uses continuous-valued data (the vectors of real numbers) as input.
Gaussian Mixture Models
Gaussian mixture models are often used to generalize models from sparse data. They can tightly constrain
large-dimensional data by using a small number of components but can, with many more components,
model arbitrary density distributions. Sometimes, they are simply used because the domain being modelled
appears to have multiple modes.
Given M components, GMMs are modelled by a collection of parameters, θ = {ωm=1..M, µm=1..M, Σm=1..M},
where ωm is the probability that an observation is generated by the mth component. These are subject
1
to the constraint that P
m ωm = 1 and 0 ≤ ωm ≤ 1. Each component is a multivariate Gaussian distribution, which is characterized by that component’s mean, µm, and covariance matrix, Σm. For reasons
of computational efficiency, we will reintroduce some independence assumptions by assuming that every
component’s covariance matrix is diagonal, i.e.:
Σm =


Σm[1] 0 · · · 0
0 Σm[2] · · · 0
.
.
.
0 0 · · · Σm[d]


for some vector Σ~ m. Therefore, only d parameters are necessary to characterize a component’s (co)variance.
2.1 Utility functions [10 marks]
First, we implement three utility functions in /u/cs401/A3/code/a3 gmm.py. First, implement log b m x,
which implements the log observation probability of xt for the mth mixture component, i.e., the log of:
bm (~xt) =
exp ”

1
2
X
d
n=1
(xt
[n] − µm[n])2
Σm[n]
#
(2π)
d/2
qQd
n=1 Σm[n]
(1)
Next, implement log p m x, which is the log probability of m given xt using model θ, i.e., the log of:
p (m|~xt
; θ) = ωmbm (~xt)
PM
k=1 ωkbk (~xt)
(2)
Finally, implement logLik, which is the log likelihood of a set of data X, i.e.:
log P

X˜; θs

=
X
T
t=1
log p (~xt
; θs) (3)
where
p (~xt
; θ) = X
M
m=1
ωmbm(~xt) (4)
and bm is defined in Equation 1. For efficiency, we just pass θ and precomputed bm(~xt) to this function.
2.2 Training Gaussian mixture models [5 marks]
Now we train an M-component GMM for each of the speakers in the data set. Specifically, for each speaker
s, train the parameters θs = {ωm=1..M, µm=1..M, Σm=1..M} according to the method described in Appendix B. In all cases, assume that covariance matrices Σm are diagonal. Start with M = 8. You’ll be asked
to experiment with that in Section 2.4. Complete the function train in /u/cs401/A3/code/a3 gmm.py.
2.3 Classification with Gaussian mixture models [5 marks]
Now we test each of the test sequences we’ve already set aside for you in the main function. I.e., we check
if the actual speaker is also the most likely speaker, ˆs:
sˆ = argmax
s=1,…,S
log P

X˜; θs

(5)
2
Complete the function test in /u/cs401/A3/code/a3 gmm.py. Run through a train-test cycle, and save
the output that this function writes to stdout, using the k = 5 top alternatives, to the file gmmLiks.txt.
2.4 Experiments and discussion [10 marks]
Experiment with the settings of M and maxIter (or  if you wish). For example, what happens to
classification accuracy as the number of components decreases? What about when the number of possible
speakers, S, decreases? You will be marked on the detail with which you empirically answer these questions
and whether you can devise one or more additional valid experiments of this type.
• How might you improve the classification accuracy of the Gaussian mixtures, without adding more
training data?
• When would your classifier decide that a given test utterance comes from none of the trained speaker
models, and how would your classifier come to this decision?
• Can you think of some alternative methods for doing speaker identification that don’t use Gaussian
mixtures?
Put your experimental analysis and answers to these questions in the file gmmDiscussion.txt.
3 Speech Recognition [10 marks]
Automatic speech recognition (ASR) is the task of correctly identifying a word sequence given an input
speech sequence X. To simplify your lives, we have ran two popular ASR engines on our data: the opensource and highly customizable Kaldi (specifically, a bi-directional LSTM model trained on the Fisher
corpus), and the neither-open-source-nor-particularly-customizable Google Speech API.
We want to see which of Kaldi and Google are the most accurate on our data. For each speaker in
our data, we have three transcript files: transcripts.txt (the gold-standard transcripts, from humans),
transcripts.Kaldi.txt (the ASR output of Kaldi), and transcripts.Google.txt (the ASR output of
Complete the file at /u/cs401/A3/code/a3 levenshtein.py. Specifically, in the Levenshtein function, accept lists of words r (Reference) and h (hypothesis), and return a 4-item list containing the floatingpoint WER, the number of substitutions, the number of insertions, and the number of deletions where
W ER =
(numSubstitutions + numInsertions + numDeletions)
numReferenceW ords
Assume that the cost of a substitution is 0 if the words are identical and 1 otherwise. The costs of insertion
and deletion are both 1.
In the main function, iterate through each of the speakers, and iterate through each line i of their
transcripts. For each line, preprocess these transcripts by removing all punctuation (other than [ and ])
and setting the text to lowercase. Output the following to stdout:
[SPEAKER] [SYSTEM] [i] [WER] S:[numSubstitutions], I:[numInsertions], D:[numDeletions]
where [SYSTEM] is either ‘Kaldi’ or ‘Google’.
Save this output and put it into asrDiscussion.txt.
On the second-to-last line of asrDiscussion.txt, in free text, summarize your findings by reporting
the average and standard deviation of WER for each of Kaldi and Google, separately, over all of these
3
lines. If you want to be fancy, you can compute a statistical test of significance to see if one is better than
the other, but you don’t need to.
On the last line of asrDiscussion.txt, add a sentence or two describing anything you observe about
the types of errors being made by each system, by manually examining the transcript files.
4
4 Bonus [up to 10 marks]
We will give up to 10 bonus marks for innovative work going substantially beyond the minimal requirements.
These marks can make up for marks lost in other sections of the assignment, but your overall mark for
this assignment cannot exceed 100%. You may decide to pursue any number of tasks of your own design
related to this assignment, although you should consult with the instructor or the TA before embarking
on such exploration. Certainly, the rest of the assignment takes higher priority. Some ideas:
Voice banking
We are running a large study or ‘normative’ data in which people from the general population donate their
speech (and language) data so that we can learn subtle differences in pathological populations. If you go
to https://www.cs.toronto.edu/talk2me/, you can obtain 5 bonus points if you complete 10 sessions,
each at least 1 day apart. There is no limit on your age, or first language. Currently, only the Chrome
and most recent Firefox browser are supported. Create a new username for this assignment, and indicate
Dimensionality reduction
Principal components analysis (PCA) is a method that converts some multivariate representation of data
into a lower-dimensional representation by transforming the original data according to mutually orthogonal
principal components.
Implement an algorithm that discovers a d×d
0 matrix W that transforms a d-dimensional vector, ~x into
a d
0
-dimensional vector ~y through a linear transformation based on PCA, where d
0 < d. Repeat speaker
identification using data that has been transformed by PCA and report on what you observe, e.g., for
different values of d
0
. Submit all code and materials necessary to repeat your experiments.
ASR with sequence-to-sequence models
Try to do better than Kaldi or Google by implementing:
Chiu C-C, Sainath TN, Wu Y, et al. (2017) State-of-the-art Speech Recognition With Sequence-toSequence Models. http://arxiv.org/abs/1712.01769.
Consider using open-source end-to-end ASR using TensorFlow, e.g., deepSpeech.
Truth-and-lie detection
Each of the utterances has been labelled as either truthful or deceitful (see Appendix A). Train and test
models to tell these utterances apart using the provided data. E.g.,
• Train a GMM for each of the Truth and Lie categories, using your code from Section 2.
• Try recurrent neural networks that read one MFCC vector at a time.
• Extract engineered features, such as those extracted in Assignment 1, from the text transcripts and
classify using discriminative models in scikit-learn. Are words more discriminative than the audio?
• Consider how errors in ASR transcripts affect those extracted features and therefore overall system
accuracy in a manner not dissimilar to Zhou L, Fraser KC, Rudzicz F. (2016) Speech recognition in
Alzheimer’s disease and in its assessment. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH.
5
5 General specification
We may test your code on different training and testing data in addition to those specified above. Where
may run your programs using test harness Python scripts. It is therefore important that your code precisely
meets the specifications and formatting requirements, including arguments and file names.
If your code uses a file or helper script, it must read it either from the directory in which that code is
being executed (i.e., probably ‘pwd’), or it must read it from a subdirectory of /u/cs401 whose absolute
path is completely specified in the code. Do not hardwire the absolute address of files in your home
documentation is not required.
This assignment is in Python 3.
5.1 Submission requirements
This assignment is submitted electronically. Submit your assignment on MarkUs. You should submit:
1. All your code for a3 gmm.py and a3 levenshtein.py (including helper files, if any).
2. The output file gmmLiks.txt
3. Your discussion files gmmDiscussion.txt and asrDiscussion.txt
4. The ID file available from the course web-site.
Do not tar or compress your files, and do not place your files in subdirectories.
If you want to do some or all of this assignment on your laptop or other computer, you will have to do
the extra work of downloading and installing the requisite software and data. You take on the risk that
of your work to teach.cs, so that if your machine fails or proves to be inadequate, you can immediately
continue working on the assignment at teach.cs. When you have completed the assignment, you should
try your programs out on teach.cs to make sure that they run correctly there. A submission that does
not work on teach.cs will get zero marks.
6
A Appendix: Details on CSC data set
Each utterance is represented by the following file types:
*.wav The original speech waveform sampled at 16kHz.
*.mfcc.py The Mel-frequency cepstral coefficients obtained from an analysis of the waveform,
in numPy format. Each row represents a 25ms frame of speech and consists of 13
floating point values.
*.txt Label and orthographic transcription of each utterance, for each of Kaldi and Google
ASR, and human gold-standard.
Participants were told that they were participating in a communication experiment which sought to
identify people who fit the profile of top entrepreneurs in America. To this end, participants performed
of those areas and did not fit the profile. The subjects then participated in an interview where they were
told to convince the interviewer that they had actually achieved high scores in all areas and that they did
indeed fit the profile. The interviewer’s task was to determine how he thought the subjects had actually
performed, and he was allowed to ask them any questions other than those that were part of the subjects’
tasks. For each question from the interviewer, subjects were asked to indicate whether the reply was true
or contained any false information by pressing one of two pedals hidden from the interviewer under a table.
Interviews were conducted in a double-walled sound booth and recorded to digital audio tape on
two channels using Crown CM311A Differoid headworn close-talking microphones, then downsampled
to 16 kHz. Interviews were orthographically transcribed by hand using the NIST EARS transcription
guidelines. Labels for local lies were obtained automatically from the pedal-press data and hand-corrected
for alignment, and labels for global lies were annotated during transcription based on the known scores of
the subjects versus their reported scores.
MFCCs were obtained using the python speech features module using default parameters, i.e., 25
ms windows, 13 cepstral coefficients, and 512 fast Fourier transform coefficients
Each transcript file has the same format, where the i
th line is:
[i] [LABEL] [TRANSCRIPT]
where i corresponds to i.wav and i.mfcc.npy, [LABEL] is the Global Lie label, and [TRANSCRIPT]
is the actual transcript orthography. Global Lie valence and the version of the pre-interview task for
the utterance appears before the colon (e.g., T/H) and the section name appears after the colon (e.g.,
INTERACTIVE).
Global Lie valence is indicated as: T == Truth; LU == Lie Up (subject claims better performance
than was actually achieved); and LD == Lie Down (subject claims worse performance). Task version is
indicated as: H == Hard; and E == Easy. So, for example, T/H:INTERACTIVE indicates that the
subject is telling the truth based on having performed the hard version of the Interactive task.
7
B Appendix: Training Gaussian mixture models
Input: MFCC data X, number of components M, threshold , and maxIter
begin
Initialize θ ;
i := 0 ;
prev L := −∞ ; improvement = ∞;
while i =< maxIter and improvement >=  do
ComputeIntermediateResults ;
L := ComputeLikelihood (X, θ) ;
θ := UpdateParameters (θ, X, L) ;
improvement := L − prev L ;
prev L := L ;
i := i + 1 ;
end
end
Algorithm 1: GMM training algorithm.
For ComputeIntermediateResults, it is strongly recommended that you create two M × T numPy
arrays – one to store each value from Equation 1 and the other to store each value from Equation 2. In
fact, we’ve set up the function logLik to encourage you to do this, to avoid redundant computations. You
will use these values in both ComputeLikelihood and updateParameters, where the latter is accomplished
thus:
ωˆm =
PT
t=1 p (m|~xt
; θ)
T
ˆ~µm =
PT
t=1 p (m|~xt
; θ) ~xt
PT
t=1 p (m|~xt
; θ)
Σˆm =
PT
t=1 p (m|~xt
; θ) ~x2
t
PT
t=1 p (m|~xt
; θ)
− ˆ~µ
2
m
(6)
In the third equation, the square of a vector on the right-hand side is defined as the component-wise square
of each dimension in the vector. Note that you don’t need to break up Algorithm 1 into separate functions
as implied – it is only written that way above to emphasize the sequence of steps
8