Sound Classification using Fastai

Plaban Nayak
Sep 5, 2020
In this post we will classify sounds by turning audio clips into images known as spectrograms.

A spectrogram is a visual representation of the frequencies of a signal as it varies with time.

Sound classification, or audio tagging, has various applications:

• Content-based multimedia indexing and retrieval
• Assisting deaf individuals in their daily activities
• Smart home use cases such as 360-degree safety and security capabilities
• Industrial uses such as predictive maintenance

Consider, for example, an oscillogram and spectrogram of the boatwhistle call of the toadfish Sanopus astrifer. The oscillogram presents the waveform and amplitude of the sound over time: the X-axis is time (sec) and the Y-axis is amplitude. The spectrogram plots the sound's frequency content over time: the X-axis is time (sec) and the Y-axis is frequency (kHz). The amount of energy present at each frequency is represented by the intensity of the color: the brighter the color, the more energy is present in the sound at that frequency.
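Here is a rough sketch of how such a pair of plots can be generated for one of our own clips. It is not part of the original pipeline; it simply uses the example file from fold1 that appears later in this post, and assumes librosa and matplotlib are installed.

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load one example clip from the dataset (any .wav will do)
y, sr = librosa.load('/content/drive/My Drive/AV_Hack/Sound/UrbanSound8K/audio/fold1/101415-3-0-2.wav')

plt.figure(figsize=(10, 6))

# Oscillogram: amplitude over time
plt.subplot(2, 1, 1)
plt.plot(np.arange(len(y)) / sr, y)
plt.title('Oscillogram')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

# Spectrogram: frequency content over time, shown on a decibel scale
plt.subplot(2, 1, 2)
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='hz')
plt.title('Spectrogram')

plt.tight_layout()
plt.show()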

Problem Statement:

Apply Deep Learning techniques to the classification of environmental sounds, specifically focusing on the identification of particular urban sounds.

Given an audio sample in a computer-readable format (such as a .wav file) of a few seconds' duration, we want to determine whether it contains one of the target urban sounds, along with a corresponding accuracy score.

Dataset

For this we will use a dataset called UrbanSound8K. The dataset contains 8732 sound excerpts (<= 4 s) of urban sounds from 10 classes:

• Air Conditioner
• Car Horn
• Children Playing
• Dog Bark
• Drilling
• Engine Idling
• Gun Shot
• Jackhammer
• Siren
• Street Music

Source : https://urbansounddataset.weebly.com/download-urbansound8k.html
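The download also ships with a metadata CSV describing every clip (file name, fold and class label). A minimal sketch for loading it with pandas and checking the class distribution; the path below is an assumption about where the dataset was extracted in this project.

import pandas as pd

# UrbanSound8K.csv maps each slice_file_name to its fold and class label
metadata = pd.read_csv('/content/drive/My Drive/AV_Hack/Sound/UrbanSound8K/metadata/UrbanSound8K.csv')

print(metadata.shape)                    # one row per excerpt
print(metadata['class'].value_counts())  # samples per class
metadata.head()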

Audio file overview

These sound excerpts are digital audio files in .wav format. Sound waves are digitised by sampling them at discrete intervals known as the sampling rate (typically 44.1 kHz for CD-quality audio, meaning samples are taken 44,100 times per second).

Listen to one of the audio files in Jupyter

import IPython

# Play an example clip from fold1 directly inside the notebook
IPython.display.Audio('/content/drive/My Drive/AV_Hack/Sound/UrbanSound8K/audio/fold1/101415-3-0-2.wav')

Data pre-processing

We need the following audio properties to be consistent across the whole dataset (a quick way to inspect them for a single file is sketched right after this list):

• Audio Channels
• Sample rate
• Bit-depth
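One way to check these properties is to read the .wav header directly. This is a small sketch using Python's standard-library wave module on the example clip from above (note that wave only handles uncompressed PCM files).

import wave

# Inspect channels, sample rate and bit depth from the .wav header
with wave.open('/content/drive/My Drive/AV_Hack/Sound/UrbanSound8K/audio/fold1/101415-3-0-2.wav', 'rb') as wav:
    print('Channels    :', wav.getnchannels())
    print('Sample rate :', wav.getframerate())
    print('Bit depth   :', wav.getsampwidth() * 8)
    print('Duration (s):', wav.getnframes() / wav.getframerate())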

Librosa is a Python package for music and audio processing that allows us to load audio into our notebook as a NumPy array for analysis and manipulation.
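For example, loading the clip we listened to earlier gives us an audio time series and its sample rate; these two variables (librosa_audio and librosa_sample_rate) are the ones used in the MFCC extraction below. By default librosa.load converts the signal to mono and resamples it to 22,050 Hz, which conveniently standardises channels and sample rate for us. The audio_file assignment is just the same example path as above.

import librosa

audio_file = '/content/drive/My Drive/AV_Hack/Sound/UrbanSound8K/audio/fold1/101415-3-0-2.wav'

# Returns a mono floating-point time series and the sample rate (22050 Hz by default)
librosa_audio, librosa_sample_rate = librosa.load(audio_file)
print(librosa_audio.shape, librosa_sample_rate)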

Extract features

The next step is to extract the features we will need to train our model. To do this, we are going to create a visual representation of each of the audio samples which will allow us to identify features for classification, using the same techniques used to classify images with high accuracy.

Spectrograms are a useful way of visualizing the spectrum of frequencies of a sound and how it varies over short periods of time. The feature-extraction technique used here is known as Mel-Frequency Cepstral Coefficients (MFCC).

Extracting a MFCC

For this we will use Librosa’s mfcc() function which generates an MFCC from time series audio data.

mfccs = librosa.feature.mfcc(y=librosa_audio, sr=librosa_sample_rate, n_mfcc=40)
print(mfccs.shape)
# Output: (40, 173)

This shows librosa calculated a series of 40 MFCCs over 173 frames.

Visualize the Spectrogram

import librosa.display
librosa.display.specshow(mfccs, sr=librosa_sample_rate, x_axis='time')

For much of the preprocessing we will be able to use Librosa's load() function. We will compare Librosa's output against the default output of scipy's wavfile module using a chosen file from the dataset.
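A small sketch of that comparison, reusing the audio_file path from the earlier snippet:

import librosa
from scipy.io import wavfile

librosa_audio, librosa_sample_rate = librosa.load(audio_file)
scipy_sample_rate, scipy_audio = wavfile.read(audio_file)

# Librosa normalises to mono float samples at 22050 Hz, while scipy returns
# the original sample rate, channel layout and (typically integer) PCM values
print('Librosa sample rate :', librosa_sample_rate)
print('Scipy sample rate   :', scipy_sample_rate)
print('Librosa audio shape :', librosa_audio.shape, librosa_audio.dtype)
print('Scipy audio shape   :', scipy_audio.shape, scipy_audio.dtype)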

Convert .wav files to .png files

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

src = '/content/drive/My Drive/AV_Hack/Sound/train/'
dest = '/content/drive/My Drive/AV_Hack/Sound/Spectrogram/'

# 'train' is the dataframe holding the training slice file names
for image in train['slice_file_name'].values.tolist():
    audio_file = src + image
    samples, sample_rate = librosa.load(audio_file, duration=5.0)

    # Small figure with no axes, ticks or frame, so only the spectrogram is saved
    fig = plt.figure(figsize=[0.72, 0.72])
    ax = fig.add_subplot(111)
    ax.axes.get_xaxis().set_visible(False)
    ax.axes.get_yaxis().set_visible(False)
    ax.set_frame_on(False)

    # Compute the mel spectrogram and draw it on a decibel scale
    filename = dest + image.replace('.wav', '.png')
    S = librosa.feature.melspectrogram(y=samples, sr=sample_rate)
    librosa.display.specshow(librosa.power_to_db(S, ref=np.max))

    plt.savefig(filename, dpi=400, bbox_inches='tight', pad_inches=0)
    plt.close('all')

Import Fastai libraries

from fastai import *
from fastai.vision import *
import warnings
warnings.filterwarnings('ignore')

Set path variable

path = '/content/drive/My Drive/AV_Hack/Sound/Spectrogram'

Fastai has a really nice class for handling everything related to the input images for vision tasks. It is called ImageDataBunch and has different constructors for the different ways data can be presented to the network. Since our image file names and labels are stored in a dataframe (sample_df), we will use the ImageDataBunch.from_df() function to create an object that contains our image data. This is super useful and makes it incredibly easy to read the data into our model, as you will see in a bit.

What’s even more handy is that Fastai can automatically split our data into train and validation sets, so we don't even need to create these on our own.

data = ImageDataBunch.from_df(path=path, df=sample_df,
                              fn_col=0, label_col=1,
                              ds_tfms=get_transforms(),
                              valid_pct=0.2, bs=8, size=64).normalize(imagenet_stats)
data

Output:

ImageDataBunch;

Train: LabelList (6124 items)
x: ImageList
Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64)
y: CategoryList
dog_bark,children_playing,children_playing,children_playing,children_playing
Path: /content/drive/My Drive/AV_Hack/Sound/Spectrogram;

Valid: LabelList (1530 items)
x: ImageList
Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64),Image (3, 64, 64)
y: CategoryList
siren,engine_idling,dog_bark,siren,street_music
Path: /content/drive/My Drive/AV_Hack/Sound/Spectrogram;

Test: None

Let’s take a random look at some of those images.
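A one-liner does the job here, using the data bunch's show_batch method (the number of rows and figure size are arbitrary choices):

# Display a random grid of spectrogram images with their class labels
data.show_batch(rows=3, figsize=(8, 8))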

Fastai supplies us with a function called cnn_learner() in its vision module. This function creates what is called a learner object, which we'll store in an appropriately named variable. Note that we specify the ResNet architecture as our base model for transfer learning. On the first call, the pretrained weights will be downloaded and cached locally.

learn = cnn_learner(data,models.resnet34,metrics=[accuracy])

Finding the learning rate

The learner object we created comes with a built-in function to find an optimal learning rate, or range of learning rates, for training. It does this by training the model over a number of iterations while gradually increasing the learning rate and recording how the loss changes at each value.

We want to choose a learning rate at which the loss is still decreasing, i.e. not the learning rate with the minimum loss, but the one on the steepest downward slope.

learn.lr_find()
learn.recorder.plot()

First fit and evaluation

Now let’s fit our model for 10 epochs, with default learning rate

learn.fit_one_cycle(10)

Second fit and evaluation

Now let’s fit our model for 10 epochs, with learning rate between 0.001 and 0.01

learn.fit_one_cycle(10,max_lr=slice(1e-3,1e-2))

Let’s see how the algorithm is performing on the validation images:

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
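Beyond the confusion matrix, the same interpretation object can also list the class pairs the model confuses most often and show the highest-loss validation images. A short sketch (the min_val threshold and grid size are arbitrary choices):

# Class pairs confused at least twice in the validation set
print(interp.most_confused(min_val=2))

# The validation spectrograms the model got most badly wrong
interp.plot_top_losses(9, figsize=(10, 10))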

Improving the Model

Unfreezing and fine-tuning

Before we unfreeze some of the layers and train again, we save the current weights so that we can go back if things go wrong.

learn.save('stage-1')   # checkpoint the current weights
learn.freeze_to(-3)     # freeze everything except the last three layer groups
learn.fit_one_cycle(5,1e-4,1e-3/5)
learn.recorder.plot_losses()
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
learn.fit_one_cycle(5,1e-4,1e-4/5)
learn.save('stage-2')
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

Train the model again, this time for a single cycle with a discriminative learning-rate slice:

learn.fit_one_cycle(1,slice(5e-4/(2.6 **4),1e-4/5))

The model would start to overfit the training data if we continued training beyond this point!

Show some prediction results

Predict on unseen data

Add test folder

After training the model, I used the following code to get predictions for the test set:

test_data = ImageList.from_df(X_test, '/content/drive/My Drive/AV_Hack/Sound/test')
data.add_test(test_data)

test_predictions = []
for test_image in test_data:
    # learn.predict returns (predicted category, class index, class probabilities)
    test_predictions.append(data.classes[learn.predict(test_image)[1]])

Sample prediction for the first test spectrogram

pred_class, pred_idx, outputs = learn.predict(data.test_ds[0][0])
pred_class, pred_idx, outputs

Output:

(Category tensor(9),
tensor(9),
tensor([1.7286e-03, 2.1696e-04, 1.6861e-03, 4.4558e-04, 4.0958e-05, 2.2941e-04,
4.0688e-05, 4.0342e-05, 3.3810e-04, 9.9523e-01]))

data.classes[pred_idx]

Output:

'street_music'

Load the predictions into the test dataframe

X_test['Predicted_class'] = test_predictions
X_test.head()

Visualize Predictions made by the resnet34 Model

# Plot the first 15 test spectrograms with their predicted classes
_, axs = plt.subplots(3, 5, figsize=(11, 8))
for i, ax in enumerate(axs.flatten()):
    img = data.test_ds[i][0]
    img.show(ax=ax, y=learn.predict(img)[0])

Here we achieve 93% accuracy in just a handful of lines of code!
