Introduction to audio data representation and visualization using Python
Welcome 👩‍🎤🧑‍🎤👨‍🎤
By the end of this lecture, we'll have learnt about:
- how to represent audio numerically and visually
- how to use Python to represent and visualize audio data, with help from some libraries!

What is audio? 🔈

Sound is the physical vibration that travels through a medium like air.

Audio refers to the electronic recording, transmission, or reproduction of that sound.

Examples of audio: recorded or streamed speech, music, environmental sound, etc. How about when someone says "OK"?

Some examples of audio datasets
- A dataset of music loops
- A dataset of audio events
- A dataset with speech recordings
- Find more at this comprehensive list of open-source audio datasets

- From sound to data: bird song visualization

What is sound? Dive deeper!

Sound is the physical vibration that propagates through a medium (e.g. air) as pressure waves.

- A YouTube tutorial on sound (DIY and well-made)

As with images and text, before we can do anything with audio on a computer (reading, editing, processing, generating, etc.), we first need to represent it using numbers (aka digital representation)!

Digital Audio Representation

How can we use numbers to represent the continuous WAVE of sound? 🌊

We sample it!

Sampling - the process of converting analog sound waves into a digital format by measuring the amplitude (loudness) of the sound wave at regular intervals.

- Just like taking notes at regular intervals when observing the wave

Digital audio is a sequence of samples (numbers) representing the amplitude of sound over time.

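Two main parameters for the sampling process:
- Sampling rate: how many samples are taken per second (e.g. 44.1 kHz)
- Bit depth: how many bits are used to store each sample (e.g. 16-bit)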

Raw Waveform

The waveform (the discrete sequence of samples in digital audio) is the most basic representation of audio.

It captures amplitude vs time.

Example: A 1-second 44.1 kHz mono clip = 44,100 numbers.
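
To make the "44,100 numbers" concrete, here is a minimal sketch that samples a pure tone for one second using NumPy. The 440 Hz tone and the 0.5 amplitude are made-up illustration values, not something from a real recording.

```python
import numpy as np

sample_rate = 44_100   # samples per second (44.1 kHz)
duration = 1.0         # seconds
frequency = 440.0      # Hz; hypothetical test tone chosen for illustration

# Sampling instants: one every 1 / sample_rate seconds.
t = np.arange(int(sample_rate * duration)) / sample_rate

# "Measure" the wave at each instant; the result is the digital waveform.
waveform = 0.5 * np.sin(2 * np.pi * frequency * t)

print(waveform.shape)   # one number per sample
print(waveform[:5])     # the first few amplitude values
```

A mono clip is a single 1-D array like this one; a stereo clip would simply carry two such sequences.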

Visualizing a Waveform

We can plot the waveform as amplitude vs time to observe loudness and shape.

- hands-on practice later
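
As a preview of the hands-on part, a waveform plot could look roughly like the sketch below. It assumes Matplotlib is installed and uses a synthetic fading tone (a hypothetical signal) instead of a real audio file.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

sample_rate = 8_000
t = np.arange(sample_rate) / sample_rate            # 1 second of time stamps

# Hypothetical signal: a 220 Hz tone that fades out over time.
waveform = np.sin(2 * np.pi * 220 * t) * np.exp(-3 * t)

fig, ax = plt.subplots(figsize=(8, 2.5))
ax.plot(t, waveform, linewidth=0.5)                 # amplitude vs time
ax.set_xlabel("Time (s)")
ax.set_ylabel("Amplitude")
ax.set_title("Waveform")
fig.tight_layout()
fig.canvas.draw()  # render the figure; in a notebook, call plt.show() instead
```

The shrinking envelope of the curve is what "observing loudness and shape" means in practice.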

Limitations of the Raw Waveform

From raw waveform to spectrograms!

Time vs Frequency Domain

- The waveform is the time-domain audio representation.

- Audio can also be described in terms of its frequency components.

- We use transformations (like the Fourier Transform) to move from the time domain to the frequency (or time-frequency) domain. (No need to worry about the technical details here!)

Fourier Transform

- Decomposes a signal into a sum of sinusoids.

- Shows which frequencies are present and their amplitudes.

- The Fast Fourier Transform (FFT) is an efficient algorithm for computing the discrete Fourier transform, and is the version used in audio processing.

[optional] A good YouTube tutorial on the Fourier Transform
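
For the curious, here is a small sketch of what "which frequencies are present and their amplitudes" looks like in code, using NumPy's FFT. The two tones (440 Hz and 1000 Hz) and their amplitudes are made-up values for illustration.

```python
import numpy as np

sample_rate = 8_000
t = np.arange(sample_rate) / sample_rate   # 1 second, so FFT bins are 1 Hz apart

# Hypothetical signal: a loud 440 Hz tone plus a quieter 1000 Hz tone.
signal = 1.0 * np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

spectrum = np.fft.rfft(signal)                           # FFT for real-valued input
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)  # frequency of each bin (Hz)

# Scale so that a sine of amplitude A shows up as a peak of height about A.
magnitude = np.abs(spectrum) * 2 / len(signal)

# The two largest peaks recover the two tones we mixed in.
strongest = np.sort(freqs[np.argsort(magnitude)[-2:]])
print(strongest)
print(magnitude[440], magnitude[1000])
```

The waveform of this mix looks like a wobbly curve, but the spectrum separates it cleanly into its two ingredients.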

Spectrogram

A spectrogram shows how frequency content changes over time.

The technique for extracting a spectrogram from a waveform: Short-Time Fourier Transform (STFT)

Audio is analyzed in short overlapping windows (frames).

Each window gives a spectrum → combined to form the spectrogram.

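Provides both time and frequency resolution, with a trade-off set by the window length: longer windows resolve frequency better, shorter windows resolve time better.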
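
The windowing idea can be sketched in a few lines of NumPy. This is a simplified illustration, not how any particular library implements it: the function name `stft_magnitude`, the window size, and the hop length are all assumptions chosen for the example, and the test signal is synthetic.

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop_length=128):
    """Hypothetical helper: magnitude spectrogram of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)                       # taper each frame's edges
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)                     # overlapping windows
    ])
    # One spectrum per frame; transpose so rows are frequencies, columns are time.
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 8_000
t = np.arange(sample_rate) / sample_rate
# Hypothetical signal: 500 Hz for the first half second, then 1500 Hz.
signal = np.where(t < 0.5, np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 1500 * t))

spec = stft_magnitude(signal)
freqs = np.fft.rfftfreq(512, d=1 / sample_rate)

print(spec.shape)
print(freqs[spec[:, 0].argmax()])    # dominant frequency in the first frame
print(freqs[spec[:, -1].argmax()])   # dominant frequency in the last frame
```

A single FFT over the whole clip would report both tones but not when each one occurs; the frame-by-frame version keeps that timing.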

Mel Scale

Human perception of pitch is not linear.

The Mel scale compresses high frequencies to match human hearing sensitivity.

Mel spectrogram = a spectrogram whose frequency bins are mapped onto the Mel scale.
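
To see the "compression" in numbers, here is a sketch using one common Hz-to-Mel formula (the HTK-style variant, mel = 2595 * log10(1 + f / 700)). Libraries may use slightly different variants by default, so treat the exact values as illustrative; the function names are made up for this example.

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style Hz -> Mel conversion (one of several variants in use)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# The same 100 Hz step covers far fewer Mels at high frequencies.
low_step = hz_to_mel(600) - hz_to_mel(500)
high_step = hz_to_mel(8100) - hz_to_mel(8000)
print(low_step, high_step)

# Band centres equally spaced in Mel are increasingly far apart in Hz.
centres_hz = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(8000), 10))
print(np.round(centres_hz))
```

This is why a Mel spectrogram spends many bins on low frequencies and only a few on high ones.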

Log-Mel Spectrogram

Logarithmic scaling of energy values (closer to how humans perceive loudness).

Common input representation for deep audio models (e.g., Whisper from OpenAI).
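
The log scaling step is essentially a conversion to decibels. A minimal sketch, assuming power values as input and a small floor to avoid taking the log of zero (the helper name `to_decibels` and the floor value are assumptions for this example):

```python
import numpy as np

def to_decibels(power, floor=1e-10):
    """Hypothetical helper: convert power values to decibels, 10 * log10(power)."""
    power = np.maximum(np.asarray(power, dtype=float), floor)  # avoid log(0)
    return 10.0 * np.log10(power)

power = np.array([1.0, 0.1, 0.001, 0.0])
db = to_decibels(power)
print(db)   # roughly 0, -10, -30 and -100 dB
```

Values that differ by factors of 10 in energy end up evenly spaced, which keeps quiet details visible next to loud ones.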

MFCCs – Mel-Frequency Cepstral Coefficients

Compact representation of the spectral envelope of sound.

Derived from the log-mel spectrogram (by applying a discrete cosine transform).

Widely used in speech recognition and speaker identification.

Other Common Audio Features

Good news! 🥰

In practice, we use simple code from libraries to extract spectrograms, log-mel spectrograms, MFCCs, and more from audio files.

Going beyond: what can we do with these audio representations?

Audio representations (features) become input to AI models for tasks like:

Example: processing pipeline for applications like BirdNET (audio classification, as we saw in week 2!)

Summary

Hands-on: Audio representations 〜 🌊🌈

Commonly used Python libraries for extracting audio features are Librosa, Madmom and torchaudio.

See our CCI wiki for using Librosa to extract spectrograms.