Description
P1: DCT and PCA [5 points]
1. I like walking on the B-Line Trail (although that doesn’t mean I have time to walk there
frequently). IMG_1878.JPG is a photo I took there. Load it and divide the 3D array
(1024 × 768 × 3) into its three channels. Let’s call them XR, XG, and XB.
2. Randomly choose a block of 8 consecutive (entire) rows from XR, e.g. XR(113:120, 1:768) (see
the red box in Figure 1). This will be an 8 × 768 matrix. Collect another 2 such blocks, each
of which starts from a randomly chosen first row position. Move on to the green channel and
extract another three 8 × 768 blocks, then do the same for the blue channel. You have now
collected 9 blocks from all three channels. Concatenate all of them horizontally. This will be
a matrix of 8 × 6912 pixels.
Let’s call it R.
3. Subtract the 8 × 1 mean vector (the average of the 6912 columns of R) from each of the 6912 column vectors.
4. Calculate the covariance matrix, which will be an 8 × 8 matrix.
5. Do eigendecomposition on the covariance matrix (feel free to use a toolbox) and extract 8
eigenvectors, each with 8 dimensions. Yes, you just did PCA. Imagine that you convert
the original 8 × 6912 matrix into the other space using the learned eigenvectors. For example,
if your eigenvector matrix is W, then W⊤R will do it. Plot your W⊤ and compare it to the
DCT matrix shown in M02-S21. Are they similar? Submit your plot and code.
Figure 1: B-Line trail
6. We just saw that PCA might be able to replace DCT. But, it seems to depend on the quality
of PCA. One way to improve the quality is to increase the size of your data set, so that you
can start from a good sample covariance matrix. To do so, go back to the R matrix generation
procedure. But, this time, increase the total number of blocks to 90 (30 blocks per channel).
Note that each block is again 8 × 768 pixels. See if the eigenvectors look better
(submit the plot).
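The block sampling, mean subtraction, covariance, and eigendecomposition steps above can be sketched as follows. This is only an illustration under stated assumptions, not a reference solution: the photo is replaced by a synthetic stand-in (each column is a first-order autoregressive process down the rows, which mimics the strong correlation between neighbouring pixels), and the helper names `fake_channel`, `sample_blocks`, `pca_basis`, and `dct_matrix` are hypothetical. With the real image you would load the three channels instead of calling `fake_channel`.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_channel(rows=1024, cols=768, rho=0.95):
    """Synthetic stand-in for one colour channel (assumption, not the photo)."""
    x = np.empty((rows, cols))
    x[0] = rng.standard_normal(cols)
    for r in range(1, rows):
        # AR(1) along the rows: neighbouring rows are strongly correlated
        x[r] = rho * x[r - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(cols)
    return x

def sample_blocks(channel, n_blocks, height=8):
    """Concatenate n_blocks randomly placed (height x cols) blocks horizontally."""
    starts = rng.integers(0, channel.shape[0] - height + 1, size=n_blocks)
    return np.hstack([channel[s:s + height, :] for s in starts])

def pca_basis(R):
    """Eigenvectors of the sample covariance as columns, largest variance first."""
    R = R - R.mean(axis=1, keepdims=True)    # subtract the 8 x 1 mean vector
    cov = R @ R.T / (R.shape[1] - 1)         # 8 x 8 sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    return eigvecs[:, np.argsort(eigvals)[::-1]]

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix; row k is the k-th cosine basis vector."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    D[0, :] /= np.sqrt(2.0)
    return D

channels = [fake_channel() for _ in range(3)]
R = np.hstack([sample_blocks(c, n_blocks=3) for c in channels])  # 8 x 6912
W = pca_basis(R)

# Absolute cosine similarity between each PCA vector (rows of W.T)
# and each DCT basis vector (rows of the DCT matrix).
similarity = np.abs(W.T @ dct_matrix().T)
```

Plotting the rows of `W.T` next to the rows of `dct_matrix()` gives a visual comparison, and raising `n_blocks` from 3 to 30 corresponds to the larger data set in step 6. Eigenvectors are only defined up to sign, which is why the comparison uses absolute values.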
P2: Instantaneous Source Separation [6 points]
1. x_ica_1.wav through x_ica_4.wav are four recordings observed in an audio scene. In this
audio scene, three speakers are talking at the same time while a motorcycle
passes by. You may want to listen to the recordings to check who says what, but
I mixed them carefully so that you cannot understand what they are saying. In other
words, I multiplied the four sources by a 4 × 4 mixing matrix A to create the four-channel
mixture:
$$
\begin{bmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \\ x_4(t) \end{bmatrix}
= A
\begin{bmatrix} s_1(t) \\ s_2(t) \\ s_3(t) \\ s_4(t) \end{bmatrix}
\tag{1}
$$
2. But, as you’ve learned how to do source separation using ICA, you should be able to separate
them out into four sources: three clean speech signals and the motorcycle noise. Listen to
your separated sources and transcribe what they are saying. Submit your separated .wav files
along with your transcription.
3. At every iteration of the ICA algorithm, use these as your update rules:
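$$
\begin{aligned}
\Delta W &\leftarrow \left( N I - g(Y)\, f(Y)^{\top} \right) W && (2) \\
W &\leftarrow W + \rho\, \Delta W && (3) \\
Y &\leftarrow W Z && (4)
\end{aligned}
$$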
where
W : The ICA unmixing matrix you’re estimating (5)
Y : The 4 × N source matrix you’re estimating (6)
Z : Whitened version of your input (using PCA) (7)
g(x) : $\tanh(x)$ (works element-wise) (8)
f(x) : $x^3$ (works element-wise) (9)
ρ : learning rate (10)
N : number of samples (11)
4. Don’t forget to whiten your data before applying ICA!
5. Implementation notes: The convergence of the ICA algorithm varies depending on the choice
of the learning rate, but on my desktop computer it always converges within 5 to 90
seconds.
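The whitening step and update rules (2) to (4) can be sketched on toy data as below. This is a minimal illustration under assumptions, not the reference solution: the sources are synthetic Laplacian noise rather than the .wav recordings, the mixing matrix `A_toy`, the learning rate, and the iteration count are arbitrary choices, and the helper names `whiten` and `ica` are hypothetical. Because the gradient term scales with N, the learning rate here is divided by the number of samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def whiten(X):
    """PCA whitening: zero mean and (near) identity covariance."""
    X = X - X.mean(axis=1, keepdims=True)
    cov = X @ X.T / X.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    return np.diag(eigvals ** -0.5) @ eigvecs.T @ X

def ica(Z, rho, n_iter):
    """Iterate update rules (2)-(4), starting from W = I."""
    K, N = Z.shape
    W = np.eye(K)
    Y = W @ Z
    for _ in range(n_iter):
        # (2): g is tanh, f is the element-wise cube
        dW = (N * np.eye(K) - np.tanh(Y) @ (Y ** 3).T) @ W
        W = W + rho * dW   # (3)
        Y = W @ Z          # (4)
    return W, Y

# Toy experiment (assumption): four heavy-tailed sources and a fixed mixing matrix.
n_samples = 20000
S_toy = rng.laplace(size=(4, n_samples))
A_toy = np.array([[1.0, 0.4, 0.2, 0.1],
                  [0.3, 1.0, 0.4, 0.1],
                  [0.2, 0.3, 1.0, 0.4],
                  [0.4, 0.2, 0.3, 1.0]])
X_toy = A_toy @ S_toy

Z = whiten(X_toy)
W, Y = ica(Z, rho=0.02 / n_samples, n_iter=1000)

# Absolute correlation between each estimated source and each true source;
# ICA recovers sources only up to permutation, sign, and scale.
corr = np.abs(np.corrcoef(Y, S_toy)[:4, 4:])
```

On the real recordings, `X_toy` would be the 4 × N matrix of loaded waveforms, and each row of `Y` would need rescaling before being written to a .wav file, since the output scale of ICA is arbitrary.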
P3: Ideal Masks [4 points]
1. piano.wav and ocean.wav are two sources you’re interested in. Load them separately and
apply STFT with 1024 point frames and 50% overlap. Use Hann windows. Let’s call these
two spectrograms S and N, respectively. Discard the complex conjugate part, so that each
of them ends up as a 513 × 158 matrix.¹ Later on in this problem, when you recover the time
domain signal, you can easily restore the discarded half from the existing half so
that you can do the inverse DFT on column vectors of the full 1024 points. Hint: Why 513, not
512? Create a very short random signal with 16 samples, and apply the DFT to convert
it into a spectrum of 16 complex values. Check the complex coefficients to see why you
need N/2 + 1, not N/2.
I will allow you to use other implementations, such as librosa.stft, but I strongly encourage
you to reuse your code from Homework 2.
2. Now you build a mixture spectrogram by simply adding the two source spectrograms: X =
S + N. Note that all the numbers here are complex values.
3. Since you know the sources, the source separation job is trivial. One way is to calculate the
ideal masks $M = \frac{S}{S+N}$ (once again, note that they are all complex valued and the division
is element-wise). By the definition of the mixture spectrogram, S = M ⊙ X, where ⊙ stands
for a Hadamard product. But we won’t use this one today.
¹ The exact number of columns may differ depending on your STFT setup. If it’s in the same
ballpark, it’s okay.
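The hint in item 1 (why 513 bins rather than 512) can be checked numerically with a 16-sample signal. The sketch below uses NumPy's FFT routines; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_fft = 16
x = rng.standard_normal(n_fft)   # short real-valued random signal
X = np.fft.fft(x)                # 16 complex DFT coefficients

# For a real signal, bin k and bin n_fft - k are complex conjugates.
mirror_ok = np.allclose(X[1:], np.conj(X[1:][::-1]))

# Bins 0 and n_fft/2 are their own mirror images, so they are real-valued
# and have no partner to be recovered from. That is why n_fft/2 + 1 bins
# must be kept.
half = X[: n_fft // 2 + 1]       # bins 0 .. 8, i.e. 9 values

# Rebuild the discarded half by conjugating bins 7, 6, ..., 1.
full = np.concatenate([half, np.conj(half[-2:0:-1])])
x_rec = np.fft.ifft(full).real
```

For 1024-point frames the same argument gives 1024/2 + 1 = 513 bins. `np.fft.rfft` and `np.fft.irfft` perform the truncation and the restoration in one call each.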
4. Sometimes we can only estimate a nonnegative real-valued masking matrix M̄, especially if
we don’t have access to the phase of the sources. For example, $\bar{M} = \frac{|S|^2}{|S|^2 + |N|^2}$. Go
ahead and calculate M̄ from your sources, and multiply it element-wise with your mixture
spectrogram, i.e. S ≈ M̄ ⊙ X. Convert your estimated piano spectrogram back to the time domain. Submit
the .wav file of your recovered piano source.
5. Listen to the recovered source. Is it too different from the original? One way to objectively
measure the quality of the recovered signal is to compare it to the original signal by using a
metric called Signal-to-Noise Ratio (SNR):
$$
\mathrm{SNR} = 10 \log_{10} \frac{\sum_t \{s(t)\}^2}{\sum_t \{s(t) - \hat{s}(t)\}^2}, \tag{12}
$$
where s(t) is the t-th sample of the original source and ŝ(t) is that of the recovered one.
Evaluate the SNR between piano.wav and your reconstruction of it. Note: their lengths
could be slightly different. Just ignore the small difference at the end.
6. Yet another masking scheme is something called Ideal Binary Masks (IBM). This time, we
use a binary (0 or 1) masking matrix B, which is defined by
$$
B_{ft} = \begin{cases} 1 & \text{if } |S|_{ft} > |N|_{ft} \\ 0 & \text{otherwise} \end{cases} \tag{13}
$$
7. Create your IBM from the sources, and apply it to your mixture spectrogram, S ≈ B ⊙ X.
Do the inverse STFT. How does it sound? What’s its SNR value?
8. Don’t forget to create audio players for the sound examples in your IPython (i.e., Jupyter or
Google Colab) notebook. Check that they play sound in the .html version.
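The masking pipeline in this problem (STFT, ratio mask M̄, binary mask B, inverse STFT, SNR) can be sketched end to end as below. This is an illustration under assumptions rather than the expected solution: the two sources are synthetic stand-ins (a decaying 440 Hz tone and white noise at an assumed 16 kHz sampling rate) instead of piano.wav and ocean.wav, and `stft`, `istft`, and `snr_db` are hypothetical helpers whose zero-padding scheme may differ from your Homework 2 code or from librosa.stft. A small constant is added to the denominator of M̄ only to avoid division by zero.

```python
import numpy as np

def stft(x, n_fft=1024, hop=512):
    """Hann-windowed STFT; returns an (n_fft//2 + 1) x n_frames complex matrix."""
    win = np.hanning(n_fft + 1)[:-1]   # periodic Hann: 50%-overlapped copies sum to 1
    x = np.concatenate([np.zeros(hop), x, np.zeros(n_fft)])  # pad so edges are covered
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)], axis=1)
    return np.fft.rfft(frames, axis=0)  # keep only the non-redundant half

def istft(X, length, n_fft=1024, hop=512):
    """Inverse of stft above via overlap-add; returns `length` samples."""
    frames = np.fft.irfft(X, n=n_fft, axis=0)  # restores the conjugate half
    out = np.zeros(hop * (X.shape[1] - 1) + n_fft)
    for i in range(X.shape[1]):
        out[i * hop:i * hop + n_fft] += frames[:, i]
    return out[hop:hop + length]               # drop the front padding

def snr_db(s, s_hat):
    """Equation (12), ignoring any length mismatch at the end."""
    n = min(len(s), len(s_hat))
    s, s_hat = s[:n], s_hat[:n]
    return 10 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))

# Synthetic stand-ins for the two sources (assumption).
rng = np.random.default_rng(0)
fs = 16000
t = np.arange(2 * fs) / fs
piano = np.sin(2 * np.pi * 440 * t) * np.exp(-2 * t)
ocean = 0.3 * rng.standard_normal(len(t))

S, N = stft(piano), stft(ocean)
X = S + N                                       # mixture spectrogram

M_bar = np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12)  # ratio mask
B = (np.abs(S) > np.abs(N)).astype(float)                            # ideal binary mask

piano_ratio = istft(M_bar * X, len(piano))
piano_ibm = istft(B * X, len(piano))

snr_mix = snr_db(piano, piano + ocean)   # baseline: the unprocessed mixture
snr_ratio = snr_db(piano, piano_ratio)
snr_ibm = snr_db(piano, piano_ibm)
```

In a notebook, `IPython.display.Audio(piano_ratio, rate=fs)` creates an audio player for a recovered signal.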