What we perceive as sound corresponds to vibrations in the air. Though we don't often think much about it, air is a gas, made up of mostly nitrogen and oxygen molecules moving around rapidly. These molecules are extremely light, so their rapid motion and constant collisions keep them from all settling to the ground. We are swimming in a sea of particles. Just like swimming in a sea of water, as we move around, we move the sea around with us.

When an object (like your vocal cords or a guitar string or a flute) vibrates, it creates waves of higher and lower pressure in the air around it. These waves ripple out from their source the way still water forms ripples when we drop a stone into it. As you read this, take a moment to pay attention to the sounds around you. As I'm writing, I can hear the hum of my fridge, a bird outside, and a car passing by. Each sound you hear is the result of something disturbing the air around it and those disturbances making their way to your ears.

The Cocktail Party Effect

To get an idea of how remarkable our sense of hearing is, consider a phenomenon that we all experience frequently, called the *Cocktail Party Effect*. Picture yourself in a crowded place where everyone around you is talking simultaneously - it could be a party, a concert, a plane, a sporting event. As we've said, each person talking is sending vibrations out through the air. So, there are different patterns of vibration moving out from every person in every direction, like a pond with thirty stones dropped into it. And yet, somehow, you are still able to hear and understand what the person next to you is saying. Even more remarkably, you can shift your focus to what someone ten feet away from you is saying. You are able to selectively filter out sounds you don't care about. How is that possible?

You've probably heard the phrase *sound waves*. The image of ocean waves rising and falling might come to mind. While the image of ocean waves is very useful, that image doesn't quite capture what's happening in the air.

Waves like those on the ocean are called *transverse waves*. These are waves in which the medium moves up and down as the wave travels forward. By contrast, gradations in air pressure create *longitudinal waves*. Here's a sketch comparing the two types:

Notice that with transverse waves, the medium moves perpendicular to the direction of the wave. With ocean waves, the water (the medium) moves up and down as the wave moves horizontally (towards shore, let's say). With longitudinal waves, the medium is moving in the direction of the wave. In the case of sound waves, the pressure changes are occurring in the same direction that the wave is moving.

In the figure above, we could say the peaks of the transverse wave correspond to higher pressure, which is represented by tightly bunched lines in the longitudinal wave. Valleys in the transverse wave correspond to lower pressure, which we can represent with spread out lines. Because transverse waves are much easier to draw, analyze, transform, and make sense of, we can still use them as an *abstraction* of longitudinal waves.

Going forward, we'll use transverse waves to represent sound waves, but we should always remember that they are an abstraction.

So, now we might ask, “how do we *hear* sound (air pressure) waves?” The short answer is that our ears convert pressure waves into electrical signals that our brain interprets as sound. Here's a simplified description of that process ^{1}:

1. Sound waves enter the outer ear and pass through the ear canal until they hit the ear drum.

2. The ear drum vibrates, passing the waves along to three bones in the middle ear: the malleus, incus, and stapes.

3. The malleus, incus, and stapes amplify the vibrations and send them to the cochlea. The cochlea is like a tube filled with fluid, curled up into a snail shape. Inside the cochlea there is a partition, called the basilar membrane, which separates the cochlea into an upper and lower portion.

4. The amplified vibrations create ripples in the fluid in the cochlea, and those ripples then form waves along the basilar membrane.

5. Hair cells in the cochlea move with the waves. One end of the cochlea picks up faster vibrations while the other end picks up slower vibrations. This allows us to hear higher and lower pitches.

6. Ultimately, the cochlea turns the spectrum of higher and lower vibrations into electrical signals, which the brain interprets, creating our perception of sound.

Our perception of sound can be broken down into four main attributes, each of which corresponds to physical features of sound waves.

- Loudness: The loudness of a sound corresponds to the *amplitude* of a sound wave. In a transverse wave, the amplitude is the height of the wave (as measured from the center).

- Pitch: We use the word pitch to refer to the “highness” or “lowness” of a sound - e.g., a bird's chirp is high-pitched; Darth Vader's voice is low-pitched. Our sense of pitch corresponds to the frequency of a vibration. On average, humans can hear frequencies between \(20\) Hz and \(20,000\) Hz - i.e., between \(20\) vibrations per second and \(20,000\) vibrations per second.

- Duration: How long a sound lasts is its duration. Our perception of duration corresponds to the length of a wave. When you snap your fingers, you create a very short-lived vibration, which you perceive as a quick percussive sound. When you let out a long audible “ahhhh,” you create a long wave.

- Timbre: Timbre is the most difficult of the four qualities to describe. We can think of timbre as the fingerprint of a sound. It is the quality that allows us to distinguish between different sounds. For example, if a saxophone, a flute, and a singer all play/sing the same note, at the same volume, for the same length of time, timbre is the characteristic that allows you to tell the difference between the three sounds. Timbre corresponds to the *frequency spectrum* of a sound. We'll discuss spectrum in greater detail, but for now let's say that spectrum is the unique pattern of simple waves that make up a more complex wave.

In summary:

Perception | Physical characteristic |
---|---|
Loudness | Amplitude |
Pitch | Frequency |
Duration | Length |
Timbre | Spectrum |

We can split the sounds we hear into two main types: pitched and not pitched. A pitched sound is one that you can sing, whistle, or hum along with. Singing is pitched. The sound of a vibrating guitar string is pitched. An ambulance siren is pitched. Conversely, clapping is not pitched. The sound of rocks hitting the ground is not pitched.

Write down three examples of pitched sounds and three examples of non-pitched sounds, none of which come from musical instruments.

But what makes some sounds pitched and others not? The answer: *pattern.* When a sound wave contains a repeating pattern, we hear that as pitch. Without repeating patterns, we get sounds that are not pitched. *Frequency* refers to how many times the pattern repeats over a given length of time. For sound, we measure frequency in *hertz* (Hz), or repetitions per second. E.g., \(220\) Hz means \(220\) times per second.

The average human can hear frequencies between \(20\) Hz and \(20,000\) Hz. Different animals have different ranges of hearing. Dogs, for example, can hear frequencies well above \(20,000\) Hz (in some cases, up to \(60,000\) Hz). A dog whistle is inaudible to the human ear because it generates a frequency that is greater than \(20,000\) Hz.

In the figures below, we have four waveforms. Remember, these are transverse waves that we're using to represent longitudinal waves. The first two are pitched sounds - a piano and a banjo, respectively. The last two are non-pitched sounds - a snare drum and a tambourine. Notice that, while the patterns look different, both the piano and the banjo produce repeating patterns. The snare drum and tambourine, on the other hand, have more erratic-looking waveforms.

Going back to the four attributes we looked at in the previous section, we can say something about each of those attributes based on these pictures. The height of each wave tells us the amplitude, or loudness. We might guess that the banjo played its note louder than the piano did. In the next section, we'll approximate pitch by measuring the length of each repeating portion of the waveform. For duration, we could zoom all the way out and see how long the wave is from start to finish. And finally, we'll see in the last section how we can analyze a waveform to say something about the timbre - should we expect the sound to be harsh, smooth, bright, warm?

This brings us to two important definitions:

- Period: The length of time of one instance of the smallest repeating part of a repeating pattern. We refer to the portion of a pattern that takes up one period as a *cycle*.
- Frequency: The frequency of a repeating pattern is the number of times it repeats in a second.

Let's take a closer look at the piano waveform from above.

We've highlighted one *cycle*. It begins at about \(0.08755\) seconds and ends at about \(0.08975\) seconds. This gives us the following:

\[\begin{aligned} \text{Period } &\approx 0.08975 - 0.08755\\ &\approx 0.0022\; \text{seconds}\\ \text{Frequency } &\approx \frac{1}{0.0022}\\ &\approx 454.5\; \text{Hz}\end{aligned}\]

Of course, these are rough estimates done by visual inspection, so we've written them as approximations.
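If you'd like to check these estimates with a few lines of Python, here's a short sketch (the cycle boundaries are the values we read off the waveform by eye):

```python
# Cycle boundaries read off the piano waveform by eye.
cycle_start = 0.08755  # seconds
cycle_end = 0.08975    # seconds

period = cycle_end - cycle_start  # length of one cycle, in seconds
frequency = 1 / period            # cycles per second (Hz)

print(f"Period    = {period:.4f} s")      # 0.0022 s
print(f"Frequency = {frequency:.1f} Hz")  # 454.5 Hz
```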

Take a few moments to complete the following exercises where you'll convert between period and frequency.

Convert a period of \(0.004\) seconds to frequency.

\[\begin{aligned} f &= \frac{1}{0.004}\\ &= \boxed{250 \; \text{Hz}} \end{aligned} \]

Convert a period of \(0.02\) seconds to frequency.

\[ \begin{aligned} f &= \frac{1}{0.02}\\ &= \boxed{50 \; \text{Hz}} \end{aligned} \]

Convert a frequency of \(440\) Hz to period.

\[ \begin{aligned} p &= \frac{1}{440}\\ &\approx \boxed{0.002272727 \ \text{seconds}} \end{aligned} \]

Convert a frequency of \(160\) Hz to period.

\[ \begin{aligned} p &= \frac{1}{160}\\ &= \boxed{0.00625 \ \text{seconds}} \end{aligned} \]

In physics, we learn that when two different forces are applied to a common point, the resultant force is the sum of the two applied forces. For example, suppose you push on a box with a force of 70 pounds directed east, and your friend pushes on the same box, at the same time, with a force of 60 pounds directed west. The resultant force is 10 pounds (70 pounds - 60 pounds) to the east. In other words, the end result is the same as one person pushing with a force of 10 pounds to the east.

When two waves move through the same medium at the same time, we get a similar effect: the result of two concurrent waveforms is the sum of the two waveforms. Let's clarify what we mean by “the sum of two waveforms.” We can think of a waveform as a function of time. If we have two functions \(f\) and \(g\), their sum \(f + g\) is a new function whose output at each point in time is the sum of the outputs of \(f\) and \(g\). Consider the following example:

Below, we have the graphs of three functions, \(f(t)\) (red), \(g(t)\) (blue) and \(h(t)\) (purple), each of which is a sine wave. The function \(h\) is the sum of \(f\) and \(g\): i.e., \(h(t) = f(t) + g(t)\). We can also write this as \((f+g)(t)\).

Notice that at \(t=1\), we have \(f(1) = 0.52\), \(g(1) = 0.707\) and \(h(1) = 1.227\). So, \(h(1) = f(1) + g(1)\). The same applies for \(t = 2\). ^{2}
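We can verify this pointwise-addition idea with a short Python sketch. The two sine waves below are arbitrary stand-ins, not the exact curves from the figure:

```python
import math

# Two example sine waves. These are arbitrary stand-ins, not the exact
# curves from the figure.
def f(t):
    return math.sin(2 * math.pi * 2 * t)        # 2 Hz sine

def g(t):
    return 0.5 * math.sin(2 * math.pi * 3 * t)  # 3 Hz sine, half the amplitude

def h(t):
    return f(t) + g(t)  # superposition: the pointwise sum

# At every time t, h(t) equals f(t) + g(t).
for t in (0.1, 0.25, 1.0):
    assert abs(h(t) - (f(t) + g(t))) < 1e-12
```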

This principle of superposition has another hugely important consequence: any periodic function (waveform) can be broken down into a sum of simple sine waves. Here's an example.

In this example, we have five sine waves, all of different intensities, being added up. In the first graph, we see the five individual sine waves. In the second, we see the resulting sum.

Here's another example of superposition in the audio application, Audacity:

In this case, we're adding a *sine wave*, \(f\), and a *square wave*, \(g\) - we'll discuss the different types of waves in greater detail in the chapter on synthesis. The horizontal line in each graph is the \(x\)-axis (\(y=0\)). Notice that the \(y\)-value for \(f+g\) at each point is the sum of the \(y\)-values for \(f\) and \(g\).

Below, we have a table with values for three functions \(a\), \(b\), and \(c\). Complete the table for the function values of \(a+b\), \(a+c\), \(b+c\) and \(a+b+c\).

\(t\) | \(a(t)\) | \(b(t)\) | \(c(t)\) | \((a+b)(t)\) | \((a+c)(t)\) | \((b+c)(t)\) | \((a+b+c)(t)\) |
---|---|---|---|---|---|---|---|
\(1\) | \(0.81\) | \(-0.2\) | \(0.4\) | | | | |
\(2\) | \(0.1\) | \(0.73\) | \(-0.1\) | | | | |

Here's the completed table.

\(t\) | \(a(t)\) | \(b(t)\) | \(c(t)\) | \((a+b)(t)\) | \((a+c)(t)\) | \((b+c)(t)\) | \((a+b+c)(t)\) |
---|---|---|---|---|---|---|---|
\(1\) | \(0.81\) | \(-0.2\) | \(0.4\) | \(0.61\) | \(1.21\) | \(0.2\) | \(1.01\) |
\(2\) | \(0.1\) | \(0.73\) | \(-0.1\) | \(0.83\) | \(0\) | \(0.63\) | \(0.73\) |
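A few lines of Python can confirm the completed table. The dictionaries below just hold the table's values at \(t = 1\) and \(t = 2\):

```python
# Values of a, b, and c taken from the table above.
a = {1: 0.81, 2: 0.1}
b = {1: -0.2, 2: 0.73}
c = {1: 0.4, 2: -0.1}

# Sum the function values pointwise at each time t, rounding away
# floating-point noise.
for t in (1, 2):
    row = (round(a[t] + b[t], 2), round(a[t] + c[t], 2),
           round(b[t] + c[t], 2), round(a[t] + b[t] + c[t], 2))
    print(t, row)
# 1 (0.61, 1.21, 0.2, 1.01)
# 2 (0.83, 0.0, 0.63, 0.73)
```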

You may have noticed that for \(t=2\), we got \((a+c)(2) = 0\). So, at that time, the values of \(a\) and \(c\) cancel one another out. It's possible for two (non silent) sound waves to cancel one another out at every point in time. Explore that in the following activity using Audacity.

Open up a new file in Audacity and do the following.

1. Create a new `Mono Track`. In the menu bar, go to `Tracks > Add New > Mono Track`.

2. Set the recording to Mono.

3. Hit the `record` button (red circle) and record yourself saying something. Hit stop when you're done.

4. Duplicate your track. To do this, hit the `Select` button on the bottom of the track (you can also just double-click on the waveform). You should see the waveform background highlight.

5. Hit play. You should hear your recording.

6. Now, select your second track and invert it by going to `Effect > Invert` in the menu bar.

7. Hit play. You should hear silence. But, if you hit `solo` on either track and then play, you'll find that both tracks are producing sound!

When you inverted the wave, you took each point \((t,f(t))\) on the original wave and generated a new wave with points of the form \((t,-f(t))\). The result is a new wave where the \(y\)-value is \(f(t) + (-f(t)) = 0\) at every point in time. Noise-canceling headphones use superposition to “cancel out” external sounds by playing the inverse wave, just like you did above. How that's accomplished in real-time is a complicated technical feat, but the underlying principle is superposition.
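Here's a minimal Python sketch of the same idea, using a simple sine wave as a stand-in for your recording:

```python
import math

def f(t):
    # A stand-in for your recorded waveform; any function would do.
    return 0.8 * math.sin(2 * math.pi * 220 * t)

def inverted(t):
    # Inverting a wave flips the sign of every sample.
    return -f(t)

# The original plus its inverse is exactly zero at every moment: silence.
for i in range(100):
    t = i / 1000
    assert f(t) + inverted(t) == 0.0
```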

Superposition can yield some other weird and interesting auditory phenomena. One of these is the *beats phenomenon*. This occurs when we play two frequencies that are close together. For example, if we play a frequency of \(200\) Hz and a frequency of \(204\) Hz, we'll hear pulsing. This pulsing can be worked out exactly.

If we play two frequencies, \(f_1\) and \(f_2\), simultaneously, we'll hear beats at a rate of \(|f_1 - f_2|\) beats per second.

The frequency of those beats - the pitch of the pulsing tone we hear - will be the average of \(f_1\) and \(f_2\): \(\frac{f_1 + f_2}{2}\).
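Both facts follow from the sum-to-product identity \(\sin A + \sin B = 2\sin\frac{A+B}{2}\cos\frac{A-B}{2}\). Adding two equal-amplitude sine waves gives

\[\sin(2\pi f_1 t) + \sin(2\pi f_2 t) = 2\cos\!\left(2\pi\,\frac{f_1 - f_2}{2}\,t\right)\sin\!\left(2\pi\,\frac{f_1 + f_2}{2}\,t\right)\]

The sine factor oscillates at the average frequency \(\frac{f_1 + f_2}{2}\) - that's the pitch we hear. The cosine factor acts as a slow envelope at \(\frac{|f_1 - f_2|}{2}\) Hz, and since the loudness peaks twice per envelope cycle, we hear \(|f_1 - f_2|\) beats per second.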

If you've ever tuned a guitar, ukulele, violin, cello, or other string instrument by ear, you've made use of the beats phenomenon. As two frequencies get closer, the beats slow down; as they get further apart, the beats speed up. To tune a string instrument by ear, we play the same note on two different strings; if we hear beats, we change the tension of one of the strings until the beats disappear.

Suppose we play frequencies of \(300\) Hz and \(308\) Hz.

We should expect to hear beats at \(|308 - 300| = 8\) Hz (\(8\) beats per second).

The frequency of each beat will be \(\frac{308 + 300}{2} = 304\) Hz.

Suppose we want to generate \(6\) beats per second with a frequency of \(400\) Hz. What two frequencies should we play to make that happen?

To get \(6\) beats per second, we need to play two frequencies that are \(6\) Hz apart. To get a frequency of \(400\) Hz for the pulse, we need the average of the frequencies to be \(400\). Putting these two facts together, we want \(f_1\) to be \(3\) less than \(400\) and \(f_2\) to be \(3\) more than \(400\):

\[\begin{aligned} f_1 &= 400 - 3 = 397\\ f_2 &= 400 + 3 = 403 \end{aligned}\]

Playing frequencies of \(397\) Hz and \(403\) Hz will generate the desired beats.
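These two calculations are easy to package up as small Python functions (the function names here are just for illustration):

```python
def beat_info(f1, f2):
    """Beat rate (beats per second) and perceived pitch (Hz) for two
    simultaneous frequencies."""
    return abs(f1 - f2), (f1 + f2) / 2

def frequencies_for(beat_rate, pitch):
    """Two frequencies that produce the given beat rate centered on a pitch."""
    return pitch - beat_rate / 2, pitch + beat_rate / 2

print(beat_info(300, 308))      # (8, 304.0)
print(frequencies_for(6, 400))  # (397.0, 403.0)
```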

Suppose we play frequencies of \(220\) Hz and \(230\) Hz. How many beats per second should we hear? What will the frequency of each beat be?

Beats: \(230 - 220 = 10\) beats per second. Frequency: \(\frac{220 + 230}{2} = 225\) Hz.

Suppose we play frequencies of \(380\) Hz and \(381\) Hz. How many beats per second should we hear? What will the frequency of each beat be?

Beats: \(381 - 380 = 1\) beat per second. Frequency: \(\frac{381 + 380}{2} = 380.5\) Hz.

Suppose we want to generate \(2\) beats per second, where each beat has a frequency of \(110\) Hz. What two frequencies should we play?

Following the reasoning above, the frequencies must be \(2\) Hz apart with an average of \(110\) Hz: \(f_1 = 110 - 1 = 109\) Hz and \(f_2 = 110 + 1 = 111\) Hz.

You've already done the math to create \(2\) beats per second with a frequency of \(110\) Hz; now you're going to do it.

- Open Audacity.

- Generate your two tones (frequencies) on two separate mono tracks.

- Mix and render your tracks to a new track.

- Zoom in enough so that the sine waves are visible and take a screenshot. Add the screenshot to your document.

A frequency of \(220\) Hz corresponds to the note we call \(A\). If you're singing this note, your vocal cords vibrate the air passing through them at a rate of \(220\) times per second. But the sound you produce is not quite that simple. There are actually many, many vibrations, all of different frequencies, occurring at the same time. The lowest-frequency vibration is called the *fundamental frequency*. All the rest are called *overtones*.

With pitched sounds, the fundamental frequency is what we perceive as the pitch of a sound. The overtones are all integer (whole-number) multiples of the fundamental frequency. The overtones all have different amplitudes, which tend to decrease as the overtone frequency increases.

Suppose you sing a frequency of \(220\) Hz. The overtones you naturally generate will all be (very close to) integer multiples of \(220\). Let's call the fundamental \(f = 220\). The overtones would be \(2f, \; 3f, \; 4f, \; 5f, \; 6f, \dots\). Here are the first five overtones: ^{3}

\(f\) | \(2f\) | \(3f\) | \(4f\) | \(5f\) | \(6f\) |
---|---|---|---|---|---|
\(220\) Hz | \(440\) Hz | \(660\) Hz | \(880\) Hz | \(1100\) Hz | \(1320\) Hz |
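A short helper makes it easy to list the overtones for any fundamental. A quick Python sketch (the function name is ours, not standard):

```python
def overtones(fundamental, count=5):
    # Overtones are integer multiples of the fundamental: 2f, 3f, ...
    return [n * fundamental for n in range(2, 2 + count)]

print(overtones(220))  # [440, 660, 880, 1100, 1320]
```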

Write down the first five overtones for a frequency of \(150\) Hz.

Starting with \(f = 150\) Hz, we multiply \(150\) by each integer from \(2\) to \(6\).

\(f\) | \(2f\) | \(3f\) | \(4f\) | \(5f\) | \(6f\) |
---|---|---|---|---|---|
\(150\) Hz | \(300\) Hz | \(450\) Hz | \(600\) Hz | \(750\) Hz | \(900\) Hz |

The unique pattern of overtones for a given instrument or voice is what we perceive as *timbre*. Timbre, as we discussed earlier, is the quality of a sound that we can think of as its audio fingerprint; when you describe one voice or instrument as piercing or grating (e.g., nails on a chalkboard) and another as smooth or warm (e.g., a harp), you're describing timbre. This attribute of sound is independent of pitch or loudness: sounds at any pitch, and at any volume, can be harsh, pleasant, warm, smooth, abrasive, etc.

Here's a plot of the frequency spectrum for a cello playing a fundamental frequency of \(440\) Hz (an \(A\) note).

The peaks correspond to the pitch and the overtones. Notice that the first peak is near \(440\) Hz, the next one is near \(880\) Hz, and the next one is near \(1320\) Hz. We can also see that the amplitudes of the overtones fall toward \(0\) as the frequency increases, so that the very highest frequencies are barely audible.
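We can reproduce this kind of plot numerically. The Python sketch below synthesizes a toy pitched tone (a stand-in for the cello, not real recorded data) and computes its magnitude spectrum with NumPy's FFT; the tallest peak lands at the fundamental:

```python
import numpy as np

# Synthesize one second of a toy pitched tone: a 440 Hz fundamental plus
# overtones at 880 Hz and 1320 Hz with decreasing amplitudes. (A stand-in
# for the cello recording, not real data.)
sample_rate = 8000
t = np.arange(0, 1, 1 / sample_rate)
signal = (1.0 * np.sin(2 * np.pi * 440 * t)
          + 0.5 * np.sin(2 * np.pi * 880 * t)
          + 0.25 * np.sin(2 * np.pi * 1320 * t))

# Magnitude spectrum: peaks appear at the fundamental and its overtones.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), 1 / sample_rate)

print(freqs[np.argmax(spectrum)])  # 440.0 -- the tallest peak is the fundamental
```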

A non-pitched sound will create overtones that are more erratic. When the overtones do not occur at integer multiples of the fundamental frequency, we lose the ability to hear the fundamental frequency - i.e., we don't hear a pitch. Remember, we said that non-pitched sounds do *not* have a repeating pattern.

A waveform without a repeating pattern corresponds to a sound with overtones that *are not integer multiples of the fundamental frequency.*

Here's the frequency spectrum for a tambourine. Notice how the locations of the peaks are highly erratic; they do not occur at integer multiples of the lowest frequency and their amplitudes do not tend to fall off toward zero as the frequencies get higher. As a result, we do not hear an identifiable pitch.

For your final task, you are going to record your voice and look at its frequency spectrum. Timbre, remember, is our perception of the unique overtone sequence that a sound produces.

- Create a blank track in Audacity.

- Take a deep breath and then record yourself singing *ah* for a few seconds.

- Estimate the frequency you sang the *ah* by zooming in and looking at the period, just like we did above with the piano.

- Now, zoom back out, and highlight a part of the waveform where you were singing *ah*. Then select *Plot Spectrum* from the Analyze menu.

- Write down the fundamental frequency and the first 4 overtones. You can find these by putting your cursor near each peak in the spectrum and then looking at the *peak* value.

Sound waves are longitudinal waves of variations in air pressure. When a sound wave enters our ear, it vibrates our ear drum. Those vibrations make their way to the cochlea, which converts those mechanical vibrations into electrical signals that our brain interprets as sound.

We perceive four main attributes of sound: loudness, pitch, duration, and timbre.

Our perception of pitch corresponds to frequency.

\[\text{Frequency} = \frac{1}{\text{Period}} \qquad\qquad \text{Period} = \frac{1}{\text{Frequency}}\]

Superposition means that when two or more waves overlap, the resulting wave is the sum of the individual waves. It also implies that complex sounds can be broken down into sums of simpler waves.

The beats phenomenon occurs when two frequencies are close together. Given frequencies \(f_1\) and \(f_2\), we get \(|f_2 - f_1|\) beats per second with a frequency of \(\frac{f_1 + f_2}{2}\).

Timbre is the character, color, texture, or quality of a sound. It corresponds to the overtone spectrum of a sound.