The digital state variable filter

The digital state variable filter was described in Hal Chamberlin’s Musical Applications of Microprocessors. Derived by straightforward replacement of components from the analog state variable filter with digital counterparts, the digital state variable is a popular synthesizer filter, as was its analog counterpart.

The state variable filter has several advantages over biquads as a synthesizer filter. Lowpass, highpass, bandpass, and band reject outputs are available simultaneously. Also, frequency and Q controls are independent, and their coefficient values are easily calculated.

The frequency control coefficient, f, is defined as

f = 2 sin(π Fc / Fs)

where Fs is the sample rate and Fc is the filter’s corner frequency you want to set. The q coefficient is defined as

q = 1 / Q

where Q normally ranges from 0.5 to infinity (where the filter oscillates).

Like its analog counterpart, and biquads, the digital state variable has a cutoff slope of 12 dB/octave.

The main drawback of the digital state variable is that it becomes unstable at higher frequencies. It depends on the Q setting, but basically the upper bound of stability is about where f reaches 1, which is at one-sixth of the sample rate (8 kHz at 48 kHz). The only way around this is to oversample. A simple way to double the filter’s sample rate (and thereby double the filter’s frequency range) is to run the filter twice with the same input sample, and discard one output sample.
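
Here’s a sketch of what that looks like in code, using the common Chamberlin update order (the struct, function, and variable names are mine, and f must be computed for the doubled sample rate when oversampling):

typedef struct { double low, band; } SVFState;

// run one input sample through the filter twice (2x oversampling),
// keeping only the second set of outputs; f here is computed for 2*Fs,
// and q = 1/Q
void svfTick2x(SVFState *s, double in, double f, double q,
               double *low, double *high, double *band, double *notch) {
    for (int pass = 0; pass < 2; pass++) {
        s->low  += f * s->band;               // lowpass
        *high    = in - s->low - q * s->band; // highpass
        s->band += f * *high;                 // bandpass
        *low     = s->low;
        *band    = s->band;
        *notch   = *high + *low;              // band reject
    }
}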

As a sine oscillator

The state variable makes a great low frequency sine wave oscillator. Just set the Q to infinity, and make sure it has an impulse to get it started. Simply preset the delays (set the “cosine” delay to 1, or other peak value, and the other to 0) and run, and it will oscillate forever without instability, with fixed point or floating point. Even better, it gives you two waves in quadrature—simultaneous sine and cosine.

Simplified to remove unnecessary parts, the oscillator looks like this:

For low frequencies, we can reduce the calculation of the f coefficient equation to

f ≈ 2π Fc / Fs

Here’s an example in C to show how easy this oscillator is to use; first initialize the oscillator amplitude, amp, to whatever amplitude you want (normally 1.0, for output swinging between −1.0 and +1.0):

// initialize oscillator
sinZ = 0.0;
cosZ = amp;

Then, for every new sample, compute the sine and cosine components and use them as needed:

// iterate oscillator
sinZ = sinZ + f * cosZ;
cosZ = cosZ - f * sinZ;
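
For example, to set it up as a 2 Hz quadrature LFO at a 48 kHz sample rate (illustrative values, using the low-frequency approximation of f given above):

// example setup: 2 Hz oscillator at a 48 kHz sample rate
amp = 1.0;                                     // ±1.0 output
f = 2.0 * 3.141592653589793 * 2.0 / 48000.0;   // about 0.000262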

The sine purity is excellent at low frequencies (becoming asymmetrical at high frequencies).


Biquads

One of the most-used filter forms is the biquad. A biquad is a second order (two poles and two zeros) IIR filter. It is high enough order to be useful on its own, and—because of coefficient sensitivities in higher order filters—the biquad is often used as the basic building block for more complex filters. For instance, a biquad lowpass filter has a cutoff slope of 12 dB/octave, useful for tone controls; if you need a 24 dB/octave slope, you can cascade two biquads, and it will have fewer coefficient-sensitivity problems than a single fourth-order design.

Biquads come in several forms. The most obvious is a direct implementation of the second order difference equation, y[n] = a0*x[n] + a1*x[n-1] + a2*x[n-2] - b1*y[n-1] - b2*y[n-2], called direct form I:

Direct form I

Direct form I is the best choice for implementation in a fixed point processor because it has a single summation point (fixed point DSPs usually have an extended accumulator that allows for intermediate overflows).
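
In code, a floating point direct form I looks something like this sketch (the struct and function names are mine; a fixed point version would keep the sum in the extended accumulator before storing to the delays):

typedef struct {
    double a0, a1, a2, b1, b2;   // coefficients
    double x1, x2, y1, y2;       // input and output delay memory
} BiquadDF1;

double processDF1(BiquadDF1 *bq, double x) {
    double y = bq->a0 * x + bq->a1 * bq->x1 + bq->a2 * bq->x2
             - bq->b1 * bq->y1 - bq->b2 * bq->y2;
    bq->x2 = bq->x1;  bq->x1 = x;   // shift input history
    bq->y2 = bq->y1;  bq->y1 = y;   // shift output history
    return y;
}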

We can take direct form I and split it at the summation point like this:

We then take the two halves and swap them, so that the feedback half (the poles) comes first:

Now, notice that one pair of the z delays is redundant, storing the same information as the other. So, we can merge the two pairs, yielding the direct form II configuration:

Direct form II

In floating point, direct form II is better because it saves two memory locations, and floating point is not sensitive to overflow in the way fixed point math is. We can improve this a little by transposing the filter. To transpose a filter, reverse the signal flow direction—output becomes input, distribution nodes become summers, and summers become nodes. The characteristics of the filter are unchanged, but in this case it happens that the floating point characteristics are a little better. Floating point has better accuracy when the values being summed are closer in magnitude (adding a small number to a large number in floating point is less precise than adding numbers of similar size). Here is the transposed direct form II:

Transposed direct form II
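
In code, the transposed direct form II needs only two state variables (again, a floating point sketch with my own names):

typedef struct {
    double a0, a1, a2, b1, b2;   // coefficients
    double z1, z2;               // state
} BiquadTDF2;

double processTDF2(BiquadTDF2 *bq, double x) {
    double y = bq->a0 * x + bq->z1;
    bq->z1 = bq->a1 * x - bq->b1 * y + bq->z2;
    bq->z2 = bq->a2 * x - bq->b2 * y;
    return y;
}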

Notes and recommendations

Again, direct form I is usually the best choice for fixed point, and transposed direct form II for floating point.

At low frequency settings, biquads are more susceptible to quantization error, mainly from the feedback coefficients (b1 and b2) and the delay memory. Lack of resolution in the coefficients makes precise positioning of the poles difficult, which is particularly a problem when the poles are positioned near the unit circle. The second problem, delay memory, is because multiplication generates more bits, and the bits are truncated when stored to memory. This quantization error is fed back in the filter, causing instability. 32-bit floating point is usually good enough for audio filters, but you may need to use double precision, especially at very low frequencies (for control filtering) and at high sample rates.

For fixed point filters, 24-bit coefficients and memory work well for most filters, but start to become unstable below about 300 Hz at 48 kHz sample rate (or twice that at 96 kHz). Double precision is always costly on a fixed point processor, but fortunately there is a simple technique for improving stability. Looking at the direct form I drawing, the quantization occurs when the higher precision accumulator is stored in the lower precision delay memory on the right side. By taking the quantization error (the difference between the full accumulated value and its value after storing it to memory) and adding it back in for the next sample calculation, the filter performs nearly as well as using full double precision calculations, but at a much lower computational cost. This technique is called first order noise shaping. There are higher order noise shapers, but this one works well enough to handle almost all audio needs, even at high sample rates.

Direct form I with first-order noise shaping
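
Here’s a rough sketch of the idea in fixed point (the Q24 format, 64-bit accumulator, and names are illustrative assumptions; the point is only that the error discarded when quantizing to the delay memory is fed back in on the next sample):

#include <stdint.h>

typedef struct {
    int32_t a0, a1, a2, b1, b2;   // Q24 coefficients
    int32_t x1, x2, y1, y2;       // delay memory
    int64_t err;                  // quantization error from the last sample
} BiquadNS;

int32_t processNS(BiquadNS *bq, int32_t x) {
    int64_t acc = (int64_t)bq->a0 * x
                + (int64_t)bq->a1 * bq->x1 + (int64_t)bq->a2 * bq->x2
                - (int64_t)bq->b1 * bq->y1 - (int64_t)bq->b2 * bq->y2;
    acc += bq->err;                        // add back the previous error
    int32_t y = (int32_t)(acc >> 24);      // quantize to the delay word size
    bq->err = acc - ((int64_t)y << 24);    // error introduced by quantizing
    bq->x2 = bq->x1;  bq->x1 = x;
    bq->y2 = bq->y1;  bq->y1 = y;
    return y;
}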

In general, 16-bit fixed point processing is not suitable for audio without double precision coefficients and computation.

Finally, biquads are just one of a DSP programmer’s tools—they aren’t always the best filter form. There are other filters that don’t share the biquad’s low-frequency sensitivities (in general, biquad coefficient precision is very good at high frequencies, and poor at low ones; there are other filter forms that spread the precision out more evenly, or trade off reduced high frequency performance for better low frequency performance). However, biquads are well known and design tools are plentiful, so they are usually the first choice for an IIR filter unless you find a reason to use another.

There are too many filter forms to cover, but one other filter form that is popular for synthesizers is the state variable filter. It has excellent low frequency performance, and limitations in the high frequencies that have to be worked around; most importantly, its frequency and Q coefficients are separate and easy to change for dynamic filtering. It also makes a great low frequency sine wave generator.


Pole-Zero placement

Use the new, improved pole-zero calculator—but be sure to read the “Experiments with standard biquads” section below for tips on placing poles and zeros for standard filters.

Here’s a Java applet that illustrates pole-zero placement. It lets you design a filter with two poles and two zeros, while showing the resulting frequency response and filter coefficients. It’s also handy for learning more about how poles and zeros work.

You can set the two poles (or zeros) independently, or as complex conjugate pairs by using the Pair checkboxes. Note that the frequency response plot and the coefficients are gain compensated automatically, so that the maximum output is 0 dB.

A pole or zero located at the origin has no effect, so position them there if you want to disable them (to examine single pole or zero filters, for instance).


Experiments with standard biquads

Here are some experiments that show how the standard biquads (derived with the bilinear transform) relate to the z plane.

For each of these filters, the pole angle dictates filter frequency, and the pole radius dictates Q:

Experiment with pole and zero placement to better understand how these filters work. See what happens when you swap the poles and zeros. Change the pole angle and radius and see how it affects the frequency response. Think about why the poles and zeros are positioned where they are: For lowpass, the zeros are at -1 to pull down the response at the highest frequency; for highpass, they are at 1 to pull down the lowest; for bandpass, they pull down the response at each end of the spectrum. For bandreject (notch), the zeros are on the unit circle at the notch frequency to completely remove it, and the poles are at the same angle; as the poles move closer to the zeros, they get closer to canceling them, and the notch narrows.
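
As a small illustration of how pole and zero positions map to coefficients (a general property of conjugate pairs, using the same a/b naming as the biquad article above; this is not code from the applet, and it omits the applet’s automatic gain compensation):

#include <math.h>

// biquad coefficients from a conjugate zero pair (radius rz, angle wz)
// and a conjugate pole pair (radius rp, angle wp), angles in radians
void pairsToCoefs(double rz, double wz, double rp, double wp,
                  double *a0, double *a1, double *a2,
                  double *b1, double *b2) {
    *a0 = 1.0;
    *a1 = -2.0 * rz * cos(wz);   // zeros: 1 - 2*rz*cos(wz)*z^-1 + rz^2*z^-2
    *a2 = rz * rz;
    *b1 = -2.0 * rp * cos(wp);   // poles: 1 + b1*z^-1 + b2*z^-2
    *b2 = rp * rp;
}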


A gentle introduction to the FFT

Some terms: The Fast Fourier Transform is an algorithm optimization of the DFT—Discrete Fourier Transform. The “discrete” part just means that it’s an adaptation of the Fourier Transform, a continuous process for the analog world, to make it suitable for the sampled digital world. Most of the discussion here addresses the Fourier Transform and its adaptation to the DFT. When it’s time for you to implement the transform in a program, you’ll use the FFT for efficiency. The results of the FFT are the same as with the DFT; the only difference is that the algorithm is optimized to remove redundant calculations. In general, the FFT can make these optimizations when the number of samples to be transformed is an exact power of two, for which it can eliminate many unnecessary operations.

Background

From Fourier we know that periodic waveforms can be modeled as the sum of harmonically-related sine waves. The Fourier Transform aims to decompose a cycle of an arbitrary waveform into its sine components; the Inverse Fourier Transform goes the other way—it converts a series of sine components into the resulting waveform. These are often referred to as the “forward” (time domain to frequency domain) and “inverse” (frequency domain to time domain) transforms. For most people, the forward transform is the baffling part—it’s easy enough to comprehend the idea of the inverse transform (just generate the sine waves and add them). So, we’ll discuss the forward transform; however, it’s interesting to note that the inverse transform is identical to the forward transform (except for scaling, depending on the implementation). You can essentially run the transform twice to convert from one form to the other and back!

Probing for a match

Let’s start with one cycle of a complex waveform. How do we find its component sine waves? (And how do we describe it in simple terms without mentioning terms like “orthogonality”? oops, we mentioned it.) We start with an interesting property of sine waves. If you multiply two sine waves together, the resulting wave’s average (mean) value is proportional to the sines’ amplitudes if the sines’ frequencies are identical, but zero for all other frequencies.

Take a look: To multiply two waves, simply multiply their values sample by sample to build the result. We’ll call the waveform we want to test the “target” and the sine wave we use to test it with the “probe”. Our probe is a sine wave, traveling between -1.0 and 1.0. Here’s what happens when our target and probe match:

See that the result wave’s peak is the same as that of the target we are testing, and its average value is half that. Here’s what happens when they don’t match:

In the second example, the average of the result is zero, indicating no match.

The best part is that the target need not be a sine wave. If the probe matches a sine component in the target, the result’s average will be non-zero, and half the component’s amplitude.

In phase

The reason this works is that multiplying a sine wave by another sine wave is balanced modulation, which yields the sum and difference frequency sine waves. Any sine wave averaged over an integral number of cycles is zero. Since the Fourier transform looks for components that are whole number multiples of the waveform section it is analyzing, and that section is also presumed to be a single cycle, the sum and difference results are always integral to the period. The only case where the results of the modulation don’t average to zero is when the two sine waves are the same frequency. In that case the difference is 0 Hz, or DC (though DC stands for Direct Current, the term is often used to describe steady-state offsets in any kind of waveform). Further, when the two waves are identical in phase, the DC value is a direct product of the multiplied sine waves. If the phases differ, the DC value is proportional to the cosine of the phase difference. That is, the value drops following the cosine curve, and is zero at pi/2 radians, where the cosine is zero.
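
In equation form, this is the product-to-sum identity:

sin(A) × sin(B) = ½ cos(A − B) − ½ cos(A + B)

When the two frequencies are the same and in phase, the first term is a constant (the DC value); in every other case, both terms are whole numbers of cycles over the record and average to zero.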

So this sine measurement doesn’t work well if the probe phase is not the same as the target phase. At first it might seem that we need to probe at many phases and take the best match; this would result in the ESFT—the Extremely Slow Fourier Transform. However, if we take a second measurement, this time with a cosine wave as a probe, we get a similar result except that the cosine measurement results are exactly in phase where the sine measurement is at its worst. And when the target phase lies between the sine and cosine phase, both measurements get a partial match. Using the identity

A sin(θ + φ) = A cos(φ) sin(θ) + A sin(φ) cos(θ)

for any theta, we can calculate the exact phase and amplitude of the target component from the sine and cosine probes. This is it! Instead of probing the target with all possible phases, we need only probe with two. This is the basis for the DFT.

Completing the series

Besides probing with our single cycle sine (and cosine), the presumed fundamental of the target wave, we continue with the harmonic series (2x, 3x, 4x…) through half the sample rate. At that point, there are only two sample points per probe cycle, the Nyquist limit. We also probe with 0x, which is just the average of the target and gives us the DC offset.

We can deduce that having more points in the “record” (the group of samples making up our target wave cycle) allows us to start with a lower frequency fundamental and fit more harmonic probes into the transform. Doubling the number of target samples (higher time resolution) doubles the number of harmonic probes (higher frequency resolution).

Getting complex

By tradition, the sine and cosine probe results are represented by a single complex number, where the cosine component is the real part and the sine component the imaginary part. There are two good reasons to do it this way: The relationship of cosine and sine follows the same mathematical rules as do complex numbers (for instance, you add two complex numbers by summing their real and imaginary parts separately, as you would with sine and cosine components), and it allows us to write simpler equations. So, we refer to the resulting average of the cosine probe as the real part (Re), and the sine component as the imaginary part (Im), where a complex number is represented as Re + i*Im.

To find the magnitude (which we have called “amplitude” until now—magnitude is the same as amplitude when we are only interested in a positive value—the absolute value):

magnitude = sqrt(Re² + Im²)

In the way we’ve presented the math here, this is the magnitude of the average, so again we’d have to multiply that value by two to get the peak amplitude of the component we’re testing for.

Many computer languages and math packages support the atan2 function. Basically, this gives the arc tangent of Im/Re, while handling the special cases of the four quadrants and division by zero for you. This gives you the phase shift of each harmonic in radians. Since the real part corresponds to cosine, you can see that a harmonic with an imaginary part of zero results in a phase of zero—corresponding to a cosine.
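
Putting the probes, the magnitude, and atan2 together, a plain (slow) DFT measurement of a single harmonic might look like this sketch (the function and variable names are mine):

#include <math.h>

// measure harmonic k of an N-sample record: average the record against
// cosine and sine probes, then convert to magnitude and phase
void probeHarmonic(const double *target, int N, int k,
                   double *mag, double *phase) {
    double re = 0.0, im = 0.0;
    for (int n = 0; n < N; n++) {
        double w = 2.0 * 3.141592653589793 * k * n / N;
        re += target[n] * cos(w);    // cosine probe
        im += target[n] * sin(w);    // sine probe
    }
    re /= N;                         // averages, as described above
    im /= N;
    *mag   = sqrt(re * re + im * im);  // half the component's peak amplitude (for k > 0)
    *phase = atan2(im, re);            // 0 for a pure cosine component
}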

Making it “F”

Viewing the DFT in this way, it’s easy to see where the algorithm can be optimized. First, note that all of the sine probes are zero at the start and in the middle of the record—no need to perform operations for those. Further, all the even-numbered sine probes cross zero at one-fourth increments through the record, every fourth probe at one-eighth, and so on. Note the powers of two in this pattern. The FFT works by requiring a power of two length for the transform, and splitting the process into cascading groups of two (that’s why it’s sometimes called a radix-2 FFT). Similarly, there are patterns for when the sine and cosine are at 1.0, and multiplication is not needed. By exploiting these redundancies, the savings of the FFT over the DFT are huge. While the DFT needs N^2 basic operations, the FFT needs only N*log2(N). For a 1024 point FFT, that’s 10,240 operations, compared to 1,048,576 for the DFT.

Let’s take a look at the kinds of symmetry exploited by the FFT. Here’s an example showing even harmonics crossing at zero for integer multiples of pi/2 on the horizontal axis:

Here we see that every fourth harmonic meets at 0, 1, 0, and -1, at integer multiples of pi/2:

Caveats and Extensions

The Fourier transform works correctly only within the rules laid out—transforming a single cycle of the target periodic waveform. In practical use, we often sample an arbitrary waveform, which may or may not be periodic. Even if the sampled waveform is exactly periodic, we might not know what that period is, and if we did it may not exactly fit our transform length (we may be using a power-of-two length for the FFT).

We can still get results with the transform, but there is some “spectral leakage.” There are ways to reduce such errors, such as windowing to reduce the discontinuities at the ends of the group of sample points (where we snipped the chunk to examine from the sampled data). And for arbitrarily long signals (analyzing a constant stream of incoming sound, for instance), we can perform FFTs repeatedly—much in the way a movie is made up of a constant stream of still pictures—and overlap them to smooth out errors.

There is a wealth of information on the web. Search for terms used here, such as Fourier, FFT, DFT, magnitude, phase… The purpose here is to present the transform in an intuitive way. With an understanding that there is no black magic involved, perhaps the interested reader is encouraged to dig deeper without fear when it’s presented in a more rigorous and mathematical manner. Or maybe having a basic idea of how it works is good enough to feel more comfortable with using the FFT. You can find efficient implementations of the FFT for many processors, and links to additional information, at http://www.fftw.org. For another source on the transform and basic C code, try Numerical Recipes in C.


A bit about reverb

Reverb is one of the most interesting aspects of digital signal processing effects for audio. It is a form of processing that is well-suited to digital processing, while being completely impractical with analog electronics. Because of this, digital signal processing has had a profound effect on our ability to place elements of our music into different “spaces.”

Before digital processing, reverb was created by using transducers—a speaker and a microphone, essentially—at two ends of a physical delay element. That delay element was typically a set of metal springs, a suspended metal plate, or an actual room. The physical delay element offered little variation in the control of the reverb sound. And these reverb “spaces” weren’t very portable; spring reverb was the only practically portable—and generally affordable—option, but it was the least acceptable in terms of sound.

First a quick look at what reverb is: Natural reverberation is the result of sound reflecting off surfaces in a confined space. Sound emanates from its source at 1100 feet per second, and strikes wall surfaces, reflecting off them at various angles. Some of these reflections meet your ears immediately (“early reflections”), while others continue to bounce off other surfaces until meeting your ears. Hard and massive surfaces—concrete walls, for instance—reflect the sound with modest attenuation, while softer surfaces absorb much of the sound, especially the high frequency components. The combination of room size, complexity and angle of the walls and room contents, and the density of the surfaces dictate the room’s “sound.”

In the digital domain, raw delay time is limited only by available memory, and the number of reflections and simulation of frequency-dependent effects (filtering) are limited only by processing speed.

Two possible approaches to simulating reverb

Let’s look at two possible approaches to simulating reverb digitally. First, the brute-force approach:

Reverb is a time-invariant effect. This means that it doesn’t matter when you play a note—you’ll still get the same resulting reverberation. (Contrast this to a time-variant effect such as flanging, where the output sound depends on the note’s relationship to the flanging sweep.)

Time-invariant systems can be completely characterized by their impulse response. Have you ever gone into a large empty room—a gym or hall—and listened to its characteristic sound? You probably made a short sound—a single handclap works great—then listened as the reverberation tapered off. If so, you were listening to the room’s impulse response.

The impulse response tells everything about the room. That single handclap tells you immediately how intense the reverberation is and how long it takes to die out, and whether the room sounds “good.” Not only is it easy for your ears to categorize the room based on the impulse response, but we can perform sophisticated signal analysis on a recording of the resulting reverberation as well. Indeed, the impulse response tells all.

The reason this works is that an impulse is, in its ideal form, an instantaneous sound that carries equal energy at all frequencies. What comes back, in the form of reverberation, is the room’s response to that instantaneous, all-frequency burst.


An impulse and its response

In the real world, the handclap—or a popping balloon, an exploding firecracker, or the snap of an electric arc—serves as the impulse. If you digitize the resulting room response and look at it in a sound-editing program, it looks like decaying noise. After some density build-up at the beginning, it decays smoothly toward zero. In fact, smoother sounding rooms show a smoother decay.

In the digital domain, it’s easy to realize that each sample point of the response can be viewed as a discrete echo of the impulse. Since, ideally, the impulse is a single non-zero sample, it’s not a stretch to realize that a series of samples—a sound played in the room—would be the sum of the responses of each individual sample at their respective times (this is called superposition).

In other words, if we have a digitized impulse response, we can easily add that exact room characteristic to any digitized dry sound. Multiplying each point of the impulse response by the amplitude of a sample yields the room’s response to that sample; we simply do that for each sample of the sound that we want to “place” into that room. This yields a bunch—as many as we have samples—of overlapping responses that we simply add together.

Easy. But extremely expensive computationally. Each sample of the input is multiplied individually by each sample of the impulse response, and added to the mix. If we have n samples to process, and the impulse response is m samples long, we need to perform n×m multiplications and additions. So, if the impulse response is three seconds (a big room), and we need to process one minute of music, we need to do about 350 billion multiplications and the same number of additions (assuming a 44.1 kHz sampling rate).
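
The brute-force version is only a few lines, which makes the n×m operation count easy to see (a sketch; the names are mine):

// convolve n dry samples with an m-sample impulse response;
// out must have room for n + m - 1 samples
void convolve(const float *dry, int n, const float *ir, int m, float *out) {
    for (int i = 0; i < n + m - 1; i++)
        out[i] = 0.0f;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            out[i + j] += dry[i] * ir[j];   // each dry sample adds a scaled
                                            // copy of the room's response
}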

This may be acceptable if you want to let your computer crunch the numbers for a day before you can hear the result, but it’s clearly not usable for real-time effects. Too bad, because it’s promising in several aspects. In particular, you can accurately mimic any room in the world if you have its impulse response, and you can easily generate your own artificial impulse responses to invent your own “rooms” (for instance, a simple decaying noise sequence gives a smooth reverb, though not one with much personality).

Actually, there’s a way to handle this more practically. We’ve been talking about time-domain processing here, and the multiply-and-sum process we just described is called “convolution.” While convolution in the time domain requires many operations, the equivalent in the frequency domain requires drastically reduced computation (convolution in the time domain is equivalent to multiplication in the frequency domain). I won’t elaborate here, but you can check out Bill Gardner’s article, “Efficient Convolution Without Input/Output Delay,” for a promising approach. (I haven’t tried his technique, but I hope to give it a shot when I have time.)

A practical approach to digital reverb

The digital reverbs we all know and love take a different approach. Basically, they use multiple delays and feedback to build up a dense series of echoes that dies out over time. The functional building blocks are well known; it’s the variations and how they are stacked together that give a digital reverb unit its characteristic sound.

The simplest approach would be a single delay with part of the signal fed back into the delay, creating a repeating echo that fades out (the feedback gain must be less than 1). Mixing in similar delays of different sizes would increase the echo density and get closer to reverberation. For instance, using different delay lengths based on prime numbers would ensure that each echo fell between other echoes, enhancing density.

In practice, this simple arrangement doesn’t work very well. It takes too many of these hard echoes to make a smooth wall of reverb. Also, the simple feedback is the recipe for a comb filter, resulting in frequency cancellations that can mimic room effects, but can also yield ringing and instability. While useful, these comb filters alone don’t give a satisfying reverb effect.


Comb filter reverb element

By feeding forward (inverted) as well as back, we fill in the frequency cancellations, making the system an all-pass filter. All-pass filters give us the echoes as before, but with a smoother frequency response. They have the effect of a frequency-dependent delay, smearing the harmonics of the input signal and getting closer to a true reverb sound. Combinations of these comb and all-pass recirculating delays—in series, parallel, and even nested—and other elements, such as filtering in the feedback path to simulate high-frequency absorption, result in the final product.


All-Pass filter reverb element
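
Here are minimal sketches of the two elements (each instance needs its own delay buffer of DLEN zeroed floats and its own index; the length and gain values are arbitrary illustrations, not a recipe for a good-sounding reverb):

#define DLEN 1687                  // delay length in samples (a prime, for example)

// comb: output the delayed signal, feeding part of it back into the delay
float comb(float in, float g, float *buf, int *idx) {
    float out = buf[*idx];
    buf[*idx] = in + g * out;      // feedback (g < 1.0 so the echoes decay)
    *idx = (*idx + 1) % DLEN;
    return out;
}

// all-pass: the same recirculating delay plus an inverted feedforward path,
// which flattens the frequency response while keeping the echoes
float allpass(float in, float g, float *buf, int *idx) {
    float delayed = buf[*idx];
    float w = in + g * delayed;    // feedback
    buf[*idx] = w;
    *idx = (*idx + 1) % DLEN;
    return delayed - g * w;        // inverted feedforward
}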

I’ll stop here, because there are many readily available texts on the subject and this is just an introduction. Personally, I found enough information for my own experiments in “Musical Applications of Microprocessors” by Hal Chamberlin, and Bill Gardner’s works on the subject, available here on the web.


The Fourier series

(Java is no longer supported by many popular browsers, and can be difficult to enable in others…)

Experiment with harmonic (Fourier) synthesis with this Java applet! The sliders represent the levels of the first eight harmonics in the harmonic series. The second harmonic is twice the frequency of the first, the third is three times that of the first, and so on. The graph shows one cycle of the resulting waveform.



If you had a Java-equipped browser, you’d see an applet here that looks like this.

Press the Sawtooth button to get an eight-harmonic approximation of a sawtooth waveform. A sawtooth waveform contains all harmonics; the second harmonic is one-half the level of the first, the third harmonic is one-third the level of the first, and so on. (Continuing the series yields a more accurate sawtooth.)

Similarly, press the Square button for a square-wave approximation. A square wave is made of only odd-numbered harmonics, in the same relationship as those of the sawtooth.
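
In equation form (ignoring overall scale), the sawtooth series is

sin(ωt) + ½ sin(2ωt) + ⅓ sin(3ωt) + ¼ sin(4ωt) + …

and the square wave keeps only the odd terms:

sin(ωt) + ⅓ sin(3ωt) + ⅕ sin(5ωt) + …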

One way of looking at this is that the sliders represent the frequency domain of a waveform (the level of its frequency components—how we hear), and the graph represents its conversion to the time domain (the signal as it is routed through audio equipment and speakers, only to be converted back to the frequency domain by our ears!).


A question of phase

If you’ve paid attention for long enough, you’ve seen heated debate in online forums and letters to the editor in magazines. One side will claim that it has been proven that people can’t hear the effects of phase errors in music, and the other is just as adamant that the opposite is true.

Much of the confusion about phase lies with the fact that there are several facets to this issue. Narrow arguments on the subject can be much like the story of the blind men and the elephant—one believes that the animal is snake-like, while another insists that it’s more like a wall. Both sides may be right, as far as their knowledge allows, but both are equally wrong because they’re hampered by a limited understanding of the subject.

What is phase?

Phase is a frequency dependent time delay. If all frequencies in a sound wave (music, for instance) are delayed by the same amount as they pass through a device, we call that device “phase linear.” A digital delay has this characteristic—it simply delays the sound as a whole, without altering the relationships of frequencies to each other. The human ear is insensitive to this kind of phase change (a simple overall delay), as long as the delay is constant and we don’t have another signal to reference it to. The audio from a CD player is always delayed due to processing, for instance, but it has no effect on our listening enjoyment.

Relative phase

Now, even if the phase is linear (simply an overall delay), we can easily detect a phase difference if we have a reference. For instance, you can get closer to one of your stereo speakers than the other; even if you use the stereo balance control to even out the relative loudness between the speakers, it won’t sound the same as being equidistant between them.

Another obvious case is when we have a direct reference to compare to. When you delay music and mix it with the un-delayed version, for instance, it’s easy to hear the effect; short delays cause frequency-dependent cancellation between the two signals, while longer delays result in an obvious echo.

If you connect one of your stereo speakers up backwards, inverting the signal, you’ll get phase cancellation between many harmonic components simultaneously as they cancel in the air. This is particularly noticeable with mono input and at low frequencies, where the distance between the speakers has less effect.

The general case

Having dispensed with linear phase, let’s look at the more general case of phase as a frequency-dependent delay.

Does it seem likely that we could hear the difference between a music signal and the same signal with altered phase?

First, I should point out that phase error, in the real world, is typically constant and affects a group of frequencies, usually by progressive amounts. By “constant”, I mean that the phase error is not moving around, as in the effect a phase shifter device is designed to produce. By “group of frequencies”, I mean that it’s typically not a single frequency that’s shifted, or unrelated frequencies; phase shift typically “smears” an area of the music spectrum.

Back to the question: Does it seem likely that we could hear the difference between an audio signal and the same signal with altered phase? The answer is… No… and ultimately Yes.

No: The human ear is insensitive to a constant relative phase change in a static waveform. For instance, you cannot hear the difference between a steady sawtooth wave (which contains all harmonic frequencies) and a waveform that contains the same harmonic content but with the phase of the harmonics delayed by various (but constant) amounts. The second waveform would not look like a sawtooth on an oscilloscope, but you would not be able to hear the difference. And this is true no matter how ridiculous you get with the phase shifting.

Yes: Dynamically changing waveforms are a different matter. In particular, it’s not only reasonable, but easy to demonstrate (at least under artificially produced conditions) that musical transients (pluck, ding, tap) can be severely damaged by phase shift. Many frequencies of short duration combine to produce a transient, and phase shift smears their time relationship, turning a “tock!” into a “thwock!”.

Because music is a dynamic waveform, the answer has to be “yes”—phase shift can indeed affect the sound. The second part is “how much?” Certainly, that is a tougher question. It depends on the degree of phase error, the area of the spectrum it occupies, and the music itself. Clearly we can tolerate phase shift to a degree. All forms of analog equalization—such as on mixing consoles—impart significant phase shift. It’s probably wise, though, to minimize phase shift where we can.


The jitters

When samples are not output at their correct time relative to other samples, we have clock jitter and the associated distortion it causes. Fortunately, the current state of the art is very good for stable clocking, so this is not a problem for CD players and other digital audio units. And since the output from the recording media (CD, or DAT, for instance) is buffered and servo-controlled, transport variations are completely isolated from the digital audio output clocking.

Clocking external sources

Clock jitter can arise when we combine multiple units, though. When each unit runs on its own clock, compensating for small differences between the clocks can cause output errors. For instance, even if both clocks are at exactly the same frequency, they will almost certainly not be in phase.

For example, consider connecting the digital output of your computer-based digital recording system to a DAT recorder, and monitoring the analog output of the DAT unit. Because the digital output (S/PDIF or AES/EBU) doesn’t carry a separate clock signal, the DAT unit must output the audio using its own clock.

Since the DAT player can’t synchronize its clock to that of the source, it has to either derive a clock signal from the digital input (using a Phase Locked Loop—PLL), or make the digital input march to its own clock (buffering and reclocking, or sample rate conversion). The PLL method will certainly be subject to jitter on playback, dependent on the quality of the digital signal at the input. In other words, poor cables would make the audio sound worse! It’s important to note that this will only affect monitoring; if you record the signal and play it back, there will be no change from the original (barring serious problems with the cabling or other transfer factors). This is because the recorder will store the correct sample values, despite jitter, then reclock the digital stream on playback.

If the clock rate of the input digital stream and the playback unit differ (44.1 kHz and 48 kHz, for instance), the playback unit has no choice but to sample rate convert. If they are the same, the playback unit may use sample rate conversion to oversample the input, then pick the samples that “line up” with its own clock, or it may simply buffer the incoming digital stream and reclock it for output. Neither method will be subject to jitter, since the D/A convertor is using its own local clock.

Note that the resampling (sample rate conversion) techniques actually change the digital stream before converting it to analog, whereas buffering does not. This is a particularly important distinction when making digital copies and transfers.

Be sure to check out Bob Katz’s web article on the subject for a more detailed look.


What is aliasing?

It’s easiest to describe aliasing in terms of a visual sampling system we all know and love—movies. If you’ve ever watched a western and seen the wheel of a rolling wagon appear to be going backwards, you’ve witnessed aliasing. The movie’s frame rate isn’t adequate to describe the rotational frequency of the wheel, and our eyes are deceived by the misinformation!

The Nyquist Theorem tells us that we can successfully sample and play back frequency components up to one-half the sampling frequency. Aliasing is the term used to describe what happens when we try to record and play back frequencies higher than one-half the sampling rate.

Consider a digital audio system with a sample rate of 48 kHz, recording a steadily rising sine wave tone. At lower frequencies, the tone is sampled with many points per cycle. As the tone rises in frequency, the cycles get shorter and fewer and fewer points are available to describe it. At a frequency of 24 kHz, only two sample points are available per cycle, and we are at the limit of what Nyquist says we can do. Still, those two points are adequate, in a theoretical world, to recreate the tone after conversion back to analog and low-pass filtering.

But, if the tone continues to rise, the number of samples per cycle is not adequate to describe the waveform, and the inadequate description is equivalent to one describing a lower frequency tone—this is aliasing.

In fact, the tone seems to reflect around the 24 kHz point. A 25 kHz tone becomes indistinguishable from a 23 kHz tone. A 30 kHz tone becomes an 18 kHz tone.
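
In other words, a component at a frequency f between half the sample rate and the sample rate comes back at Fs − f: at a 48 kHz sample rate, 48 − 25 = 23 kHz, and 48 − 30 = 18 kHz, as in the examples above.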

In music, with its many frequencies and harmonics, aliased components mix with the real frequencies to yield a particularly obnoxious form of distortion. And there’s no way to undo the damage. That’s why we take steps to avoid aliasing from the beginning.


What is dither?

To dither means to add noise to our audio signal. Yes, we add noise on purpose, and it is a good thing.

How can adding noise be a good thing??!!!

We add noise to make a trade. We trade a little low-level hiss for a big reduction in distortion. It’s a good trade, and one that our ears like.

The problem

The problem results from something Nyquist didn’t mention about a real-world implementation—the shortcoming of using a fixed number of bits (16, for instance) to accurately represent our sample points. The technical term for this is “finite wordlength effects”.

At first blush, 16 bits sounds pretty good—96 dB dynamic range, we’re told. And it is pretty good—if you use all of it all of the time. We can’t. We don’t listen to full-amplitude (“full code”) sine waves, for instance. If you adjust the recording to allow for peaks that hit the full sixteen bits, that means much of the music is recorded at a much lower volume—using fewer bits.

In fact, if you think about the quietest sine wave you can play back this way, you’ll realize it’s one bit in amplitude—and therefore plays back as a square wave. Yikes! Talk about distortion. It’s easy to see that the lower the signal levels, the higher the relative distortion. Equally disturbing, components smaller than the level of one bit simply won’t be recorded at all.

This is where dither comes in. If we add a little noise to the recording process… well, first, an analogy…

An analogy

Try this experiment yourself, right now. Spread your fingers and hold them up a few inches in front of one eye, and close the other. Try to read this text. Your fingers will certainly block portions of the text (the smaller the text, the more you’ll be missing), making reading difficult.

Wag your hand back and forth (to and fro!) quickly. You’ll be able to read all of the text easily. You’ll see the blur of your hand in front of the text, but definitely an improvement over what we had before.

The blur is analogous to the noise we add in dithering. We trade off a little added noise for a much better picture of what’s underneath.

Back to audio

For audio, dithering is done by adding noise of a level less than the least-significant bit before rounding to 16 bits. The added noise has the effect of spreading the many short-term errors across the audio spectrum as broadband noise. We can make small improvements to this dithering algorithm (such as shaping the noise to areas where it’s less objectionable), but the process remains simply one of adding the minimal amount of noise necessary to do the job.
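
Here is a rough sketch of the word-length reduction step. The article calls only for noise below the least-significant-bit level; the triangular (TPDF) dither, the scaling, and the names here are my own choices for illustration:

#include <math.h>
#include <stdint.h>
#include <stdlib.h>

// reduce a sample in the range -1.0 to 1.0 to 16 bits with triangular dither
int16_t ditherTo16(double x) {
    double scaled = x * 32767.0;
    // the sum of two uniform random values has a triangular distribution,
    // spanning roughly +/-1 LSB around zero
    double noise = ((double)rand() / RAND_MAX)
                 + ((double)rand() / RAND_MAX) - 1.0;
    double d = scaled + noise;
    if (d >  32767.0) d =  32767.0;     // clip to the 16-bit range
    if (d < -32768.0) d = -32768.0;
    return (int16_t)floor(d + 0.5);     // round to the nearest integer
}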

An added bonus

Besides reducing the distortion of the low-level components, dither lets us hear components below the level of our least-significant bit! How? By jiggling a signal that’s not large enough to cause a bit transition on its own, the added noise pushes it over the transition point for an amount of time statistically proportional to its actual amplitude level. Our ears and brain, skilled at separating such a signal from the background noise, do the rest. Just as we can follow a conversation in a much louder room, we can pull the weak signal out of the noise.

Going back to our hand-waving analogy, you can demonstrate this principle for yourself. View a large text character (or an object around you), and view it by looking through a gap between your fingers. Close the gap so that you can see only a portion of the character in any one position. Now jiggle your hand back and forth. Even though you can’t see the entire character at any one instant, your brain will average and assemble the different views to put the characters together. It may look fuzzy, but you can easily discern it.

When do we need to dither?

At its most basic level, dither is required only when reducing the number of bits used to represent a signal. So, an obvious need for dither is when you reduce a 16-bit sound file to eight bits. Instead of truncating or rounding to fit the samples into the reduced word size—creating harmonic and intermodulation distortion—the added dither spreads the error out over time, as broadband noise.

But there are less obvious reductions in wordlength happening all the time as you work with digital audio. First, when you record, you are reducing from an essentially unlimited wordlength (an analog signal) to 16 bits. You must dither at this point, but don’t bother to check the specs on your equipment—noise in your recording chain typically is more than adequate to perform the dithering!

At this point, if you simply played back what you recorded, you wouldn’t need to dither again. However, almost any kind of signal processing causes a reduction of bits, and prompts the need to dither. The culprit is multiplication. When you multiply two 16-bit values, you get a 32-bit value. You can’t simply discard or round off the extra bits—you must dither.

Any form of gain change uses multiplication, so you need to dither. This means not only when the volume level of a digital audio track is something other than 100%, but also when you mix multiple tracks together (which generally has an implied level scaling built in). And any form of filtering uses multiplication and requires dithering afterwards.

The process of normalizing—adjusting a sound file’s level so that its peaks are at full level—is also a gain change and requires dithering. In fact, some people normalize a signal after every digital edit they make, mistakenly thinking they are maximizing the signal-to-noise ratio. In reality, they are doing nothing except increasing noise and distortion, since the noise level is “normalized” along with the signal and the signal has to be redithered or suffer more distortion. Don’t normalize until you’re done processing and wish to adjust the level to full code.

Your digital audio editing software should know this and dither automatically when appropriate. One caveat is that dithering does require some computational power itself, so the software is more likely to take shortcuts when doing “real-time” processing as compared to processing a file in a non-real-time manner. So, an application that presents you with a live on-screen mixer with live effects for real-time control of digital track mixdown is likely to skimp in this area, whereas an application that must complete its process before you can hear the result doesn’t need to.

Is that the best we can do?

If we use high enough resolution, dither becomes unnecessary. For audio, this means 24 bits (or 32-bit floating point). At that point, the dynamic range is such that the least-significant bit is equivalent to the amplitude of noise at the atomic level—no sense going further. Audio digital signal processors usually work at this resolution, so they can do their intermediate calculations without fear of significant errors, and dither only when it’s time to deliver the result as 16-bit values. (That’s OK, since there aren’t any 24-bit accurate A/D convertors to record with. We could compute a 24-bit accurate waveform, but there are no 24-bit D/A convertors to play it back on either! Still, a 24-bit system would be great because we could do all the processing and editing we want, then dither only when we want to hear it.)
