What comes to your mind first when I ask you to create an image of sound?
If you’re a musician, you might think about a score, which lets us see a sequence of individual sounds, their pitch and length. Or if you ever used an old audio player, you may still remember the crazy waves shown on your monitor, representing the changes in acoustic pressure that we, as humans, interpret as sound.
So how is it possible that, even though we think about sound as a sequence, we do most of our research on it using… images?
In this article (which is part 1), I’ll tell you a story of:
- representing sound as an image, and what those images mean,
- my process of asking – is audio merely computer vision in disguise?
Let’s dive into it.
Starting with easy mathematics (promise)
To understand how sound works in computers, it really helps to first look at how it works in humans.
Our ear consists of 3 parts: the outer, middle and inner ear. When we listen to something, the sound wave travels into the outer ear and makes a small membrane in our middle ear vibrate. The vibrating membrane then moves other elements: this time 3 bones with the absolutely adorable names of malleus, incus and stapes. With every step the wave is magnified through these mechanisms until it lands in its almost final destination – the cochlea.
Here is where the magic happens. The cochlea is filled with fluid, which carries the wave deeper and deeper. Each part of it registers different frequencies of sound: high frequencies are picked up near the entrance, and the deeper the wave travels, the lower the frequencies registered. But what are those frequencies, intuitively?
A picture is worth a thousand words – so let’s analyse the idea of frequency on a plot.
To keep it simple: frequency tells us how many times some repetitive event happened in a specific time frame. It is measured in Hertz (Hz), and for example 10 Hz means 10 events per second.
Let’s see an example. Assume our time frame is 1 second. Our repetitive event is one cycle of the sine function (you know, the up-down movement that goes from 0 up to 1, down to -1 and then back to 0). On our first plot we can see that only 1 cycle happened during 1 second, so the frequency of this wave is 1 Hz.
Similarly, you can analyse plots 2 and 3: in the former we have 2 cycles in 1 second (so 2 Hz) and in the latter – 3 Hz. But what happens on plot 4?
It turns out that one signal can be a sum of other signals with various frequencies. Our 4th plot is a sum of the frequencies 1 Hz, 2 Hz and 3 Hz – and here we know that, because we constructed it ourselves. But what happens when we don’t know it upfront?
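As a sanity check, the mixed signal from plot 4 can be built in a few lines of NumPy (the sampling rate of 1000 samples per second is an arbitrary choice for illustration):

```python
import numpy as np

sr = 1000                        # samples per second (arbitrary for this toy example)
t = np.arange(sr) / sr           # 1 second of time points

# the three single-frequency waves from plots 1-3
wave_1hz = np.sin(2 * np.pi * 1 * t)
wave_2hz = np.sin(2 * np.pi * 2 * t)
wave_3hz = np.sin(2 * np.pi * 3 * t)

# plot 4: their sum is one, more complicated-looking signal
mixture = wave_1hz + wave_2hz + wave_3hz
```

Looking only at `mixture`, you could no longer tell by eye which frequencies went into it – which is exactly the problem the Fourier Transform solves.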
Going back to the human ear, the cochlea does almost exactly this type of decomposition – because different parts register different frequencies, we get information about which frequencies were present in the sound. A similar technique is also known in mathematics – it’s called the Fourier Transform.
Let’s perform this decomposition on our sum of sines. We’ll be using the so-called Discrete Fourier Transform (DFT) – because our signal only looks continuous, but is in fact discrete (it consists of a finite number of samples).
Without using any prior information about the frequencies, we were able to decompose the signal and see which frequencies it contained.
I deliberately omit the problem of phase here – apologies to those more familiar with the topic, but I wanted to keep this as intuitive as possible.
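Here is a minimal sketch of that decomposition using NumPy’s DFT routines – the signal and sampling rate are the same toy values as before, and the peak-picking threshold is an arbitrary choice:

```python
import numpy as np

sr = 1000
t = np.arange(sr) / sr
mixture = (np.sin(2 * np.pi * 1 * t)
           + np.sin(2 * np.pi * 2 * t)
           + np.sin(2 * np.pi * 3 * t))

spectrum = np.fft.rfft(mixture)                   # DFT of the real-valued signal
freqs = np.fft.rfftfreq(len(mixture), d=1 / sr)   # frequency (in Hz) of each DFT bin

# keep only the bins with significant energy
magnitude = np.abs(spectrum)
peaks = freqs[magnitude > len(mixture) / 4]
print(peaks)  # → [1. 2. 3.]
```

The DFT hands us back exactly the 1 Hz, 2 Hz and 3 Hz components we mixed in, even though we never told it what they were.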
Going back to decomposition. You’re probably thinking now “Hey Zuzanna, you lied to me – I just moved from one signal to another, where’s my image?”.
To generate it, we’ll use the help of the so-called Short-Time Fourier Transform (STFT).
The idea behind STFT is very intuitive – instead of calculating DFT on the whole signal, we do it on a smaller window and then we move the window over the signal.
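A minimal sketch of this windowing idea, using SciPy’s `stft` (the toy signal, the window length of 1024 samples and the hop of 256 samples are all arbitrary choices for illustration):

```python
import numpy as np
from scipy.signal import stft

sr = 22050
t = np.arange(2 * sr) / sr
# a toy "audio" signal: 440 Hz for the first second, 880 Hz for the second
signal = np.where(t < 1.0,
                  np.sin(2 * np.pi * 440 * t),
                  np.sin(2 * np.pi * 880 * t))

# window of 1024 samples, moved forward by 256 samples each step
freqs, times, Z = stft(signal, fs=sr, nperseg=1024, noverlap=1024 - 256)
print(Z.shape)  # (frequency bins, time frames)
```

Each column of `Z` is one windowed DFT, and stacking the columns along the time axis is precisely what gives us a 2-D image.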
This way we obtain multiple DFT vectors with frequencies at different time points, and we can merge them along the time axis. Doing this on some example audio signals, we obtain the following.
Well, I don’t see much here. But there is an almost invisible yet present activity at the bottom of our plot (so in the low frequencies). To visually highlight it, we can turn the y axis from linear to logarithmic.
OK, that’s a bit better. Both of those plots are called spectrograms – we only changed the axis visually. But what is the value for a given (time, frequency) pair in this plot?
So far I deliberately omitted one mathematical fact – the result of the DFT (and therefore of the STFT too) is a complex number. This complex number can be decomposed in many ways – some people use its real or imaginary part, some use the phase, but in a classical spectrogram we use the squared modulus of this number.
To generate both STFT plots above, I had to make a specific choice of hyperparameters – the length of my window (the one I move over the signal) and how far I move it each step. And this is where it gets complicated, because it introduces a nasty trade-off.
Let’s go back to how the STFT works. With our sine function we knew exactly when each value happened – for any time you chose, like 1 second or 3.5 seconds, you could tell me the exact value of the function. When we used the DFT, we totally lost this information – the x axis suddenly showed frequencies instead – but those frequencies were as precise as they could be. In the STFT we also obtain DFT vectors, but each is calculated from fewer values – which makes it less precise.
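For a single complex STFT value, that choice looks as follows (the number itself is made up purely for illustration):

```python
import numpy as np

# one complex value, as if taken from a single (time, frequency) cell of the STFT
z = 3 + 4j

power = np.abs(z) ** 2   # squared modulus: the classical spectrogram value
print(power)  # → 25.0
```

Note that the phase of `z` is thrown away here – two complex numbers with the same modulus but different phases produce the same spectrogram cell.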
You may already see what is happening here:
- with a long window, we get more precise information about frequencies, but less precise information about time,
- with a short window, we get less precise information about frequencies, and more precise about time.
In the plot above we had a really small window, which is why we saw a lot of vertical stripes – the time information was precise. Below you can see the version with a large window.
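The trade-off is easy to verify numerically – below is a sketch comparing a short and a long window on the same toy tone (all parameter values are arbitrary):

```python
import numpy as np
from scipy.signal import stft

sr = 22050
signal = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s of a 440 Hz tone

# short window: many time frames, few frequency bins
f_short, t_short, Z_short = stft(signal, fs=sr, nperseg=256)
# long window: few time frames, many frequency bins
f_long, t_long, Z_long = stft(signal, fs=sr, nperseg=4096)

print(Z_short.shape, Z_long.shape)
```

The short window yields a spectrogram that is wide (fine time resolution) but short (coarse frequency resolution); the long window yields the opposite.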
Is this all I can get?
In many cases this is already a representation you can use as input to your model, but there are some other operations that can improve its quality even more.
We already learnt that a spectrogram shows us the squared modulus of complex numbers – doesn’t sound very interpretable, does it?
The measure we think about more often in audio is loudness, measured in decibels (dB). We can easily transform the values into decibels (which are logarithms), meaning that in the plot below not only is the y axis logarithmic, but the values themselves are also transformed with a logarithm.
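The conversion itself is one line of maths – here is a small sketch of a power-to-decibel function (the reference value and the clipping threshold are my assumptions, mirroring common conventions):

```python
import numpy as np

def power_to_db(power, ref=1.0, amin=1e-10):
    """Convert spectrogram power values to decibels (10 * log10)."""
    # amin clips near-zero power so the logarithm never sees 0
    return 10.0 * np.log10(np.maximum(amin, power) / ref)

power = np.array([1.0, 100.0, 0.01])
print(power_to_db(power))  # 1.0 → 0 dB, 100.0 → 20 dB, 0.01 → -20 dB
```

Applied to the whole spectrogram, this compresses the huge dynamic range of the raw power values into something both humans and networks find easier to digest.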
And noooow we have something to work with. We can see that a dog barking gives high energy over a short period (a single bark resembles an impact happening all at once), and then the black area simply means… silence. With a siren or a vacuum cleaner, the sound is more persistent over time and spans all frequencies – although in a siren it’s more structured. For church bells we see a single strong impact that then slowly fades – the sound spreading through the air.
However, one flaw of the representations so far is their dimensionality. If we have a long, good-quality recording, we can easily obtain a spectrogram with 1000-2500 dimensions on the y axis alone. Fortunately, there are more compact representations too.
One of them is the mel spectrogram. Besides reducing dimensionality, it also transforms the y axis to a scale closer to how people perceive sounds. It turns out that we humans not only perceive sound logarithmically, but also perceive differences between sounds differently at different frequencies. In the plot you can see what a decibel mel spectrogram looks like.
The catch with mel spectrograms is that they add another hyperparameter – the number of mels. The mel transform maps ranges of frequencies onto single bands, which means we have to choose how many of those bands we want.
As you can see, with a low number of mels we get a quite pixelated image. With more mels it appears more and more continuous.
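To make the idea of mapping frequency ranges onto bands concrete, here is a simplified NumPy sketch of a triangular mel filter bank (the Hz-to-mel formula is the common HTK-style one; the edge handling is deliberately naive):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters mapping FFT bins onto n_mels mel bands."""
    # band edges equally spaced on the mel scale, from 0 Hz to Nyquist
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / sr).astype(int)

    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):            # rising slope of the triangle
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling slope
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank(n_mels=40, n_fft=1024, sr=22050)
# a mel spectrogram is then: fb @ power_spectrogram
print(fb.shape)  # → (40, 513)
```

Here 513 frequency bins collapse into 40 mel bands – that is the dimensionality reduction, and `n_mels` is exactly the hyperparameter you have to choose.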
We could describe more and more representations that turn sound into images, but the more interesting question is – is sound really just computer vision in disguise? Let’s see.
Image vs. Image of Sound
Unfortunately, sound has its own problems – sometimes similar to and sometimes totally different from those in computer vision. Here are 4 of them you should definitely keep in mind when you’re just starting out.
The differences between training, test and evaluation sets are often a pain point in Machine Learning – regardless of whether it’s audio or computer vision.
When we record sound and want to use it as input to our network, we can encounter problems like:
- recording with different microphones (different quality, sampling rate, etc.),
- recording in different circumstances (like different rooms or even outside).
The solution is often augmenting the audio, but in a way that mimics recording with a different mic or in a different room (look up the term “impulse response” if you want to know more). We can also add some noise to add variety to the data. In vision, this resembles problems with lighting or camera quality.
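A minimal sketch of the noise part of such augmentation (the target signal-to-noise ratio is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(signal, snr_db):
    """Add white noise at a given signal-to-noise ratio (in dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

clean = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)
noisy = add_noise(clean, snr_db=20)  # 20 dB SNR: clearly audible signal, faint noise
```

Room and microphone simulation work similarly, except instead of adding noise you convolve the signal with a recorded impulse response.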
Dimensionality of input
A nasty problem, especially for people working in audio classification.
Imagine you have to classify whether a sound is a piano, glass breaking or a dog barking. One characteristic of those classes is that they all have different lengths – and we can understand this twofold:
- not only may glass breaking be shorter than a piano concert (variety between classes),
- but the barking of various dog breeds may also differ (variety inside a class).
To reduce the problem, we usually use 1 of 3 solutions:
- make all sounds equal in length – unfortunately, when there is a significant difference between the shortest and longest sound, we either lose a lot of information (by cutting) or flood the dataset with silence (when we e.g. pad the shortest with zeros),
- don’t care about length – we can define convolutional networks in a way that ignores length (e.g. using global pooling), but we may have a hard time building batches from examples of varying length,
- use shorter frames and aggregate the output – in this case we cut the sound into small chunks and use the meaningful chunks as input to the model. At inference we cut the sound into chunks, perform classification on each chunk, and then aggregate the outputs. Unfortunately, this solution may introduce a high class imbalance, especially if our task is audio classification.
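The first solution – cutting and zero-padding to a common length – can be sketched like this (the target length is an arbitrary choice):

```python
import numpy as np

def fix_length(signal, target_len):
    """Cut or zero-pad a 1-D signal to exactly target_len samples."""
    if len(signal) >= target_len:
        return signal[:target_len]           # cut the longer sounds
    padding = target_len - len(signal)
    return np.pad(signal, (0, padding))      # pad the shorter ones with silence

short_sound = np.ones(5)
long_sound = np.ones(12)
print(fix_length(short_sound, 8))  # → [1. 1. 1. 1. 1. 0. 0. 0.]
print(fix_length(long_sound, 8).shape)  # → (8,)
```

The trailing zeros are exactly the “silence” mentioned above – harmless for one example, but a problem when it dominates the dataset.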
As in other domains, we also normalise the model input in audio. The difference is that we can do it in 2 places – directly on the sound, or on its representation.
2 popular on-sound normalisation methods are:
- Peak Norm – we normalise with respect to the highest absolute peak (so the highest peak, whether negative or positive),
- RMS Norm – normalisation with respect to the Root Mean Square of the signal.
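Both methods can be sketched in a few lines (the target RMS value is an arbitrary choice):

```python
import numpy as np

def peak_normalize(signal):
    """Scale so the largest absolute sample becomes 1."""
    return signal / np.max(np.abs(signal))

def rms_normalize(signal, target_rms=0.1):
    """Scale so the root mean square of the signal equals target_rms."""
    rms = np.sqrt(np.mean(signal ** 2))
    return signal * (target_rms / rms)

x = np.array([0.1, -0.4, 0.2])
print(np.max(np.abs(peak_normalize(x))))  # → 1.0
```

Peak Norm guarantees the signal fits a fixed amplitude range, while RMS Norm equalises perceived loudness across recordings – which one helps more depends on the task.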
A major difference between an image and an image of sound is the lack of a rotation-invariance assumption.
In image, if you rotate or flip the dog – it’s, most likely, still a dog.
But in audio, if we play a recording backwards, we can’t really say it’s the same sound anymore (although there may be some exceptions). Moreover, we can’t rotate along the frequency axis – this would totally degrade the representation of our sound.
As you can see, although audio can easily be turned into images, and lots of methods from computer vision will probably work on it, it has its own problems you need to watch out for.
In the second part of this article, we’ll try to build an audio event classifier in the shortest time possible and deploy it in a web app that displays the representations we talked about in this part.
And if you’re wondering why the heck knowledge of sound even matters – check out how audio deepfakes can actually change our world for the better.
- How The Ear Works, Johns Hopkins Medicine
- 2-Minute Neuroscience: The Cochlea, Neuroscientifically Challenged
- The Sound of AI channel
- But what is the Fourier Transform? A visual introduction., 3Blue1Brown
This post was originally published in Polish on DeepDrivePL.