Machine Learning Two

By the end of this lecture, we'll have learnt about:
- how to represent audio numerically and visually
- how to use Python to represent and visualize audio data, with help from some libraries!

What is audio?🔈

Sound is the physical vibration that travels through a medium like air.

Audio refers to the electronic recording, transmission, or reproduction of that sound.

Examples of audio: recorded or streamed speech, music, environmental sounds, etc. How about when someone says "OK"?

What is sound? Dive deeper!

Sound is the physical vibration that propagates through a medium (e.g. air) as pressure waves.

- A YouTube tutorial on sound (DIY and well-made)

As with images and text, before a computer can do anything with audio (reading, editing, processing, generating, etc.),

we first need to represent it using numbers (aka a digital representation)!

Digital Audio Representation

How can we use numbers to represent the continuous WAVE of sound? 🌊

We sample it! 📝

Sampling - the process of converting analog sound waves into a digital format by measuring the amplitude (loudness) of the sound wave at regular intervals.

- Just like taking notes at regular intervals when observing the wave📝

Digital audio is a sequence of samples (numbers) representing the amplitude of sound over time.

Two main parameters for the sampling process:

Sampling Rate (Hz): how many samples per second (e.g., 44.1 kHz)
Bit Depth: how many bits per sample (e.g., 16-bit = 65,536 levels)
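To make the two parameters concrete, here is a minimal sketch in Python. It assumes NumPy is available, and the 440 Hz sine wave is only an illustrative stand-in for a real sound. The wave is sampled at 44.1 kHz and each sample is then rounded to a signed 16-bit integer.

```python
import numpy as np

sample_rate = 44_100      # samples per second (Hz)
bit_depth = 16            # bits per sample
duration = 1.0            # seconds
frequency = 440.0         # Hz, illustrative tone (an assumption)

# Sampling: measure the wave every 1 / sample_rate seconds.
num_samples = int(sample_rate * duration)
t = np.arange(num_samples) / sample_rate
wave = np.sin(2 * np.pi * frequency * t)       # floats in [-1, 1]

# Quantization: map each amplitude to one of 2**bit_depth integer levels.
num_levels = 2 ** bit_depth                    # 65,536 levels for 16-bit
max_int = num_levels // 2 - 1                  # 32,767 for signed 16-bit
samples = np.round(wave * max_int).astype(np.int16)

print(num_samples, num_levels, samples.dtype)
```

A higher sampling rate gives more numbers per second, and a higher bit depth gives finer steps between the quietest and loudest values.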

Raw Waveform

The waveform (the discrete sequence of samples in digital audio) is the most basic representation of audio.

It captures amplitude vs time.

Example: A 1-second 44.1 kHz mono clip = 44,100 numbers.

Visualizing a Waveform

We can plot the waveform as amplitude vs time to observe loudness and shape.

- hands-on practice later
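As a preview, plotting amplitude vs time could look roughly like the sketch below. It assumes NumPy and matplotlib are installed. The helper names `time_axis` and `plot_waveform` and the output file name are made up for illustration, and the 440 Hz tone stands in for a real recording.

```python
import numpy as np

def time_axis(num_samples, sample_rate):
    """Return the time in seconds at which each sample was taken."""
    return np.arange(num_samples) / sample_rate

def plot_waveform(samples, sample_rate, path="waveform.png"):
    """Plot amplitude vs time and save the figure (needs matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")              # render to a file, no window needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 3))
    ax.plot(time_axis(len(samples), sample_rate), samples)
    ax.set_xlabel("Time (s)")
    ax.set_ylabel("Amplitude")
    fig.savefig(path)
    plt.close(fig)

sample_rate = 44_100
t = time_axis(sample_rate // 100, sample_rate)   # the first 10 ms
samples = np.sin(2 * np.pi * 440.0 * t)          # illustrative tone
# plot_waveform(samples, sample_rate)            # uncomment to save waveform.png
```

Plotting only a few milliseconds keeps individual cycles visible; a full second at 44.1 kHz would look like a solid block.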

Limitations of the Raw Waveform

High dimensional (many samples per second)
Hard to interpret directly: what can we read from amplitudes separated by 1/44,100 of a second?
It does not explicitly show interpretable information such as the frequency content.

Time vs Frequency Domain

- The waveform is the time-domain audio representation.

- Audio can also be described in terms of its frequency components.

- We use transformations (like the Fourier Transform) to move from the time domain to the frequency or time-frequency domain. (No need to worry about the technical details here!)

Fourier Transform

- Decomposes a signal into a sum of sinusoids.

- Shows which frequencies are present and their amplitudes.

- The Fast Fourier Transform (FFT) is an efficient algorithm for computing it and is the version used in audio processing.

[optional] A good YouTube tutorial on the Fourier Transform
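A small sketch can show the "which frequencies are present and their amplitudes" idea in action. It assumes NumPy. The test signal (a 440 Hz sinusoid plus a weaker 1000 Hz one) and the low 8 kHz sampling rate are chosen only to keep the example small.

```python
import numpy as np

sample_rate = 8_000                         # Hz, kept small for illustration
t = np.arange(sample_rate) / sample_rate    # exactly 1 second of samples

# A signal made of two sinusoids: 440 Hz (amplitude 1.0) and 1000 Hz (amplitude 0.5).
signal = 1.0 * np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

# rfft is the FFT for real-valued input; it returns the non-negative frequencies only.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# Scale magnitudes so a sinusoid of amplitude A shows up as roughly A.
amplitudes = 2 * np.abs(spectrum) / len(signal)

# The two largest peaks should sit at the two frequencies we mixed in.
top_two = np.sort(freqs[np.argsort(amplitudes)[-2:]])
print(top_two)
```

The waveform alone does not reveal the two tones, but the spectrum recovers both their frequencies and their relative strengths.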

The technique behind extracting a spectrogram from a waveform: Short-Time Fourier Transform (STFT)

Audio is analyzed in short overlapping windows (frames).

Each window gives a spectrum → combined to form the spectrogram.

Provides both time and frequency resolution.
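The windowing idea can be sketched by hand with NumPy. This is a simplified version under stated assumptions (a Hann window, 1024-sample frames, a hop of 256 samples, no padding at the edges), not how any particular library implements it. The function name `stft_magnitude` is made up for illustration.

```python
import numpy as np

def stft_magnitude(signal, frame_length=1024, hop_length=256):
    """Magnitude spectrogram with shape (frequency_bins, frames)."""
    window = np.hanning(frame_length)
    num_frames = 1 + (len(signal) - frame_length) // hop_length
    # Cut the signal into overlapping frames and taper each one with the window.
    frames = np.stack([
        signal[i * hop_length : i * hop_length + frame_length] * window
        for i in range(num_frames)
    ])
    # One FFT per frame gives one spectrum per frame.
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 8_000
t = np.arange(sample_rate) / sample_rate
# The first half second is a 440 Hz tone, the second half a 1000 Hz tone.
signal = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 1000 * t))

spectrogram = stft_magnitude(signal)
freqs = np.fft.rfftfreq(1024, d=1 / sample_rate)
first_peak = freqs[spectrogram[:, 0].argmax()]    # strongest frequency in the first frame
last_peak = freqs[spectrogram[:, -1].argmax()]    # strongest frequency in the last frame
print(spectrogram.shape, first_peak, last_peak)
```

Unlike a single FFT over the whole clip, the spectrogram shows that the pitch changes partway through: the early frames peak near 440 Hz and the late frames near 1000 Hz.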

Mel Scale

Human perception of pitch is not linear.

The Mel scale compresses high frequencies to match human hearing sensitivity.

Mel Spectrogram = a spectrogram whose frequency bins are mapped onto the Mel scale.
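To see the compression in numbers, here is a sketch using one common Hz-to-mel formula (often called the HTK-style formula). Other variants exist, so treat the constants as an assumption rather than the one true definition. NumPy is assumed.

```python
import numpy as np

def hz_to_mel(frequency_hz):
    """HTK-style conversion from Hz to mels (one common convention)."""
    return 2595.0 * np.log10(1.0 + np.asarray(frequency_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# The same 1000 Hz step covers far fewer mels at high frequencies.
low_step = hz_to_mel(2000) - hz_to_mel(1000)
high_step = hz_to_mel(8000) - hz_to_mel(7000)
print(low_step, high_step)
```

With this formula, 1000 Hz lands at roughly 1000 mels, and the step from 7000 Hz to 8000 Hz is much smaller in mels than the step from 1000 Hz to 2000 Hz.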

Log-Mel Spectrogram

Logarithmic scaling of energy values (closer to how humans perceive loudness).

Common input representation for deep audio models (e.g., Whisper from OpenAI).
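The log step itself is tiny. A minimal sketch, assuming NumPy and power (squared magnitude) values as input; the helper name `to_decibels` and the floor value `eps` are illustrative choices.

```python
import numpy as np

def to_decibels(power, eps=1e-10):
    """Convert power values to decibels; eps is a floor that avoids log(0)."""
    return 10.0 * np.log10(np.maximum(power, eps))

# Each value is 10 times smaller than the one before it.
power = np.array([1.0, 0.1, 0.01, 0.001])
db = to_decibels(power)
print(db)
```

Values that differ by factors of ten become evenly spaced steps of 10 dB, so quiet details are no longer dwarfed by the loudest parts.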

MFCCs – Mel-Frequency Cepstral Coefficients

Compact representation of the spectral envelope of sound.

Derived from the log-mel spectrogram.

Widely used in speech recognition and speaker identification.
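The usual recipe for going from a log-mel spectrogram to MFCCs is a discrete cosine transform (DCT) along the mel axis, keeping only the first few coefficients. The sketch below hand-rolls an orthonormal DCT-II with NumPy. The random matrix is a placeholder for a real log-mel spectrogram, and 13 coefficients is a common but arbitrary choice.

```python
import numpy as np

def dct_basis(num_coeffs, num_bands):
    """Orthonormal DCT-II basis with shape (num_coeffs, num_bands)."""
    n = np.arange(num_bands)
    k = np.arange(num_coeffs)[:, None]
    basis = np.sqrt(2.0 / num_bands) * np.cos(np.pi / num_bands * (n + 0.5) * k)
    basis[0] /= np.sqrt(2.0)
    return basis

def mfcc_from_log_mel(log_mel, num_coeffs=13):
    """log_mel has shape (mel_bands, frames); the result is (num_coeffs, frames)."""
    return dct_basis(num_coeffs, log_mel.shape[0]) @ log_mel

# Placeholder input: 40 mel bands, 100 frames of random values.
rng = np.random.default_rng(0)
log_mel = rng.normal(size=(40, 100))
mfcc = mfcc_from_log_mel(log_mel)
print(mfcc.shape)
```

Each frame shrinks from 40 numbers to 13, which is where the "compact" in compact representation comes from.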

Good NEWS!🥰

In practice, we use simple code from libraries to extract spectrograms, log-mel spectrograms, MFCCs, and more from audio files.

Librosa
Torchaudio
Madmom
Essentia
etc.

Go beyond! What can we do with these audio representations?

Audio representations (features) become input to AI models for tasks like:

Speech recognition (e.g. transcription for a YouTube video)
Music genre classification
Emotion detection
and a lot more!

Example: processing pipeline for applications like BirdNET (audio classification, as we saw in week 2!)

Load raw audio waveform from an audio file
Extract Mel Spectrogram or MFCCs
Feed into an AI model
Predict label (e.g., “magpie”, “raven”, “blue jay”)
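The four steps above can be sketched end to end with toy stand-ins. Everything here is an assumption made for illustration: synthesized tones replace audio files, a crude band-energy feature replaces the Mel Spectrogram or MFCCs, a nearest-centroid classifier replaces the AI model, and the labels "low" and "high" replace bird names. Only NumPy is needed.

```python
import numpy as np

def load_waveform(frequency, sample_rate=8_000, duration=0.5):
    """Stand-in for loading a file: synthesize a tone instead."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    return np.sin(2 * np.pi * frequency * t), sample_rate

def extract_features(waveform, num_bands=8):
    """Crude spectral feature: log energy in num_bands equal-width bands."""
    power = np.abs(np.fft.rfft(waveform)) ** 2
    bands = np.array_split(power, num_bands)
    return np.log10(np.array([band.sum() for band in bands]) + 1e-10)

class NearestCentroidModel:
    """Toy model: predicts the label whose mean feature vector is closest."""

    def fit(self, features, labels):
        self.labels = sorted(set(labels))
        self.centroids = np.array([
            np.mean([f for f, lab in zip(features, labels) if lab == label], axis=0)
            for label in self.labels
        ])
        return self

    def predict(self, feature):
        distances = np.linalg.norm(self.centroids - feature, axis=1)
        return self.labels[int(distances.argmin())]

# Made-up training data: tone frequency in Hz and its label.
train = [(300, "low"), (400, "low"), (3000, "high"), (3500, "high")]
features = [extract_features(load_waveform(freq)[0]) for freq, _ in train]
model = NearestCentroidModel().fit(features, [label for _, label in train])

# Steps 1 to 4 for a new clip: load, extract, feed into the model, predict.
waveform, _ = load_waveform(350)
prediction = model.predict(extract_features(waveform))
print(prediction)
```

A real system keeps the same shape but swaps in recorded audio, richer features, and a trained neural network.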

Summary

Raw waveform is time-domain data
Apply a technique called the "Fourier transform" to the waveform to extract time-frequency-domain features
Spectrogram, Mel Spectrogram, Log-Mel Spectrogram & MFCCs → perceptually meaningful representations

Hands-on: Audio representations〜🌊🌈

Commonly used Python libraries for extracting audio features are Librosa, Madmom and torchaudio.

Our CCI WIKI for using Librosa to extract spectrograms.