This HTML version of is provided for convenience, but it is not the best format for the book. In particular, some of the symbols are not rendered correctly.
You might prefer to read the PDF version.
Chapter 6 Discrete Cosine Transform
The topic of this chapter is the Discrete Cosine Transform (DCT), which is used in MP3 and related formats for compressing music; JPEG and similar formats for images; and the MPEG family of formats for video.
Here are the steps to get there:
Suppose I give you a list of amplitudes and a list of frequencies, and ask you to construct a signal that is the sum of these frequency components. Using objects in the thinkdsp module, there is a simple way to perform this operation, which is called synthesis:
def synthesize1(amps, fs, ts): components = [thinkdsp.CosSignal(freq, amp) for amp, freq in zip(amps, fs)] signal = thinkdsp.SumSignal(*components) ys = signal.evaluate(ts) return ys
amps is a list of amplitudes, fs is the list of frequencies, and ts is the sequence of times where the signal should be evaluated.
Finally, evaluate computes the value of the signal at each time in ts.
We can test this function like this:
amps = np.array([0.6, 0.25, 0.1, 0.05]) fs = [100, 200, 300, 400] framerate = 11025 ts = np.linspace(0, 1, framerate) ys = synthesize1(amps, fs, ts) wave = thinkdsp.Wave(ys, framerate)
This example makes a signal that contains a fundamental frequency at 100 Hz and three harmonics (100 Hz is a sharp G2). It renders the signal for one second at 11,025 frames per second and puts the results into a Wave object.
Conceptually, synthesis is pretty simple. But in this form it doesn’t help much with analysis, which is the inverse problem: given the wave, how do we identify the frequency components and their amplitudes?
6.2 Synthesis with arrays
Here’s another way to write synthesize:
def synthesize2(amps, fs, ts): args = np.outer(ts, fs) M = np.cos(PI2 * args) ys = np.dot(M, amps) return ys
This function looks very different, but it does the same thing. Let’s see how it works:
Figure 6.1 shows the structure of this computation. Each row of the matrix, M, corresponds to a time from 0.0 to 1.0 seconds; tn is the time of the nth row. Each column corresponds to a frequency from 100 to 400 Hz; fk is the frequency of the kth column.
I labeled the nth row with the letters a through d; as an example, the value of a is cos[2 π (100) tn].
The result of the dot product, ys, is a vector with one element for each row of M. The nth element, labeled e, is the sum of products:
And likewise with the other elements of ys. So each element of y is the sum of four frequency components, evaluated at a point in time, and multiplied by the corresponding amplitudes. And that’s exactly what we wanted.
We can use the code from the previous section to check that the two versions of synthesize produce the same results.
ys1 = synthesize1(amps, fs, ts) ys2 = synthesize2(amps, fs, ts) max(abs(ys1 - ys2))
The biggest difference between ys1 and ys2 is about 1e-13, which is what we expect due to floating-point errors.
Writing this computation in terms of linear algebra makes the code smaller and faster. Linear algebra provides concise notation for operations on matrices and vectors. For example, we could write synthesize like this:
Now we are ready to solve the analysis problem. Suppose I give you a wave and tell you that it is the sum of cosines with a given set of frequencies. How would you find the amplitude for each frequency component? In other words, given ys, ts and fs, can you recover amps?
In terms of linear algebra, the first step is the same as for synthesis: we compute M = cos(2 π t ⊗ f). Then we want to find a so that y = M a; in other words, we want to solve a linear system. NumPy provides linalg.solve, which does exactly that.
Here’s what the code looks like:
def analyze1(ys, fs, ts): args = np.outer(ts, fs) M = np.cos(PI2 * args) amps = np.linalg.solve(M, ys) return amps
The first two lines use ts and fs to build the matrix, M. Then np.linalg.solve computes amps.
In this example, we have only 4 frequencies, but we evaluated the signal at 11,025 times. So we have many more equations than unknowns.
In general if ys contains more than 4 elements, it is unlikely that we can analyze it using only 4 frequencies.
For simplicity, I’ll use the first 4 samples from the signal. Using the values of ys, fs and ts from the previous section, we can run analyze1 like this:
n = len(fs) amps2 = analyze1(ys[:n], fs, ts[:n])
And sure enough, amps2 is
[ 0.6 0.25 0.1 0.05 ]
This algorithm works, but it is slow. Solving a linear system of equations takes time proportional to n3, where n is the number of columns in M. We can do better.
6.4 Orthogonal matrices
One way to solve linear systems is by inverting matrices. The inverse of a square matrix M is written M−1, and it has the property that M−1M = I. I is the identity matrix, which has the value 1 on all diagonal elements and 0 everywhere else.
So, to solve the equation y = Ma, we can multiply both sides by M−1, which yields:
On the right side, we can replace M−1M with I:
If we multiply I by any vector a, the result is a, so
This implies that if we can compute M−1 efficiently, we can find a with a simple matrix multiplication (using np.dot). That takes time proportional to n2, which is better than n3.
Inverting a matrix is slow, in general, but some special cases are faster. In particular, if M is orthogonal, the inverse of M is just the transpose of M, written MT. In NumPy transposing an array is a constant-time operation. It doesn’t actually move the elements of the array; instead, it creates a “view” that changes the way the elements are accessed.
So let’s see what the matrix looks like in synthesize2. In the previous example, M has 11,025 rows, so it might be a good idea to work with a smaller example:
def test1(): amps = np.array([0.6, 0.25, 0.1, 0.05]) N = 4.0 time_unit = 0.001 ts = np.arange(N) / N * time_unit max_freq = N / time_unit / 2 fs = np.arange(N) / N * max_freq ys = synthesize2(amps, fs, ts)
ts is a vector of equally spaced sample times in the range from 0 to 1 time unit. I chose the time unit to be 1 millisecond, but it is an arbitrary choice, and we will see in a minute that it drops out of the computation anyway.
Since the frame rate is N samples per time unit, the Nyquist
With these values of ts and fs, the matrix, M, is:
[[ 1. 1. 1. 1. ] [ 1. 0.707 0. -0.707] [ 1. 0. -1. -0. ] [ 1. -0.707 -0. 0.707]]
You might recognize 0.707 as an approximation of √2/2, which is cosπ/4. You also might notice that this matrix is symmetric, which means that the element at (j, k) always equals the element at (k, j). This implies that M is its own transpose; that is, MT = M.
But sadly, M is not orthogonal. If we compute MTM, we get:
[[ 4. 1. -0. 1.] [ 1. 2. 1. -0.] [-0. 1. 2. 1.] [ 1. -0. 1. 2.]]
And that’s not the identity matrix.
One simple option is to shift ts and fs by a half unit. This version is called DCT-IV, where “IV” is a roman numeral indicating that this is the fourth of eight versions of the DCT.
Here’s an updated version of test1:
def test2(): amps = np.array([0.6, 0.25, 0.1, 0.05]) N = 4.0 ts = (0.5 + np.arange(N)) / N fs = (0.5 + np.arange(N)) / 2 ys = synthesize2(amps, fs, ts)
If you compare this to the previous version, you’ll notice
two changes. First, I added 0.5 to ts and fs.
Second, I canceled out
With these values, M is
[[ 0.981 0.831 0.556 0.195] [ 0.831 -0.195 -0.981 -0.556] [ 0.556 -0.981 0.195 0.831] [ 0.195 -0.556 0.831 -0.981]]
And MTM is
[[ 2. 0. 0. 0.] [ 0. 2. -0. 0.] [ 0. -0. 2. -0.] [ 0. 0. -0. 2.]]
Some of the off-diagonal elements are displayed as -0, which means that the floating-point representation is a small negative number. This matrix is very close to 2I, which means M is almost orthogonal; it’s just off by a factor of 2. And for our purposes, that’s good enough.
def analyze2(ys, fs, ts): args = np.outer(ts, fs) M = np.cos(PI2 * args) amps = np.dot(M, ys) / 2 return amps
Instead of using np.linalg.solve, we just multiply by M/2.
Combining test2 and analyze2, we can write an implementation of DCT-IV:
def dct_iv(ys): N = len(ys) ts = (0.5 + np.arange(N)) / N fs = (0.5 + np.arange(N)) / 2 args = np.outer(ts, fs) M = np.cos(PI2 * args) amps = np.dot(M, ys) / 2 return amps
Again, ys is the wave array. We don’t have to pass
ts and fs as parameters;
If we’ve got it right, this function should solve the analysis problem; that is, given ys it should be able to recover amps. We can test it like this.
amps = np.array([0.6, 0.25, 0.1, 0.05]) N = 4.0 ts = (0.5 + np.arange(N)) / N fs = (0.5 + np.arange(N)) / 2 ys = synthesize2(amps, fs, ts) amps2 = dct_iv(ys) max(abs(amps - amps2))
6.6 Inverse DCT
def inverse_dct_iv(amps): return dct_iv(amps) * 2
amps = [0.6, 0.25, 0.1, 0.05] ys = inverse_dct_iv(amps) amps2 = dct_iv(ys) max(abs(amps - amps2))
Again, the biggest difference is about 1e-16.
6.7 The Dct class
signal = thinkdsp.TriangleSignal(freq=400) wave = signal.make_wave(duration=1.0, framerate=10000) dct = wave.make_dct() dct.plot()
The result is the DCT of a triangle wave at 400 Hz, shown in Figure 6.2. The values of the DCT can be positive or negative; a negative value in the DCT corresponds to a negated cosine or, equivalently, to a cosine shifted by 180 degrees.
import scipy.fftpack # class Wave: def make_dct(self): N = len(self.ys) hs = scipy.fftpack.dct(self.ys, type=2) fs = (0.5 + np.arange(N)) / 2 return Dct(hs, fs, self.framerate)
The results from dct are stored in hs. The corresponding frequencies, computed as in Section 6.5, are stored in fs. And then both are used to initialize the Dct object.
wave2 = dct.make_wave() max(abs(wave.ys-wave2.ys))
The biggest difference between ys1 and ys2 is about 1e-16, which is what we expect due to floating-point errors.
# class Dct def make_wave(self): n = len(self.hs) ys = scipy.fftpack.idct(self.hs, type=2) / 2 / n return Wave(ys, framerate=self.framerate)
For the following exercises, I provide some starter code in chap06starter.ipynb. Solutions are in chap06soln.ipynb.
Exercise 1 In this chapter I claim that analyze1 takes time proportional to n3 and analyze2 takes time proportional to n2. To see if that’s true, run them on a range of input sizes and time them. In Jupyter, you can use the “magic command”
If you plot run time versus input size on a log-log scale, you should get a straight line with slope 3 for analyze1 and slope 2 for analyze2.
Exercise 2 One of the major applications of the DCT is compression for both sound and images. In its simplest form, DCT-based compression works like this:
Implement a version of this algorithm and apply it to a recording of music or speech. How many components can you eliminate before the difference is perceptible?
In order to make this method practical, you need some way to store a sparse array; that is, an array where most of the elements are zero. NumPy provides several implementations of sparse arrays, which you can read about at http://docs.scipy.org/doc/scipy/reference/sparse.html.
Exercise 3 In the repository for this book you will find a Jupyter notebook called
Are you using one of our books in a class?We'd like to know about it. Please consider filling out this short survey.