ML One
Lecture 09
Introduction to convolutional neural networks (CNN)
+
example application: pose detection with PoseNet
Welcome ๐Ÿ‘ฉโ€๐ŸŽค๐Ÿง‘โ€๐ŸŽค๐Ÿ‘จโ€๐ŸŽค
By the end of this lecture, we'll have learnt about:
The theoretical:
- new AI model unlocked ๐Ÿ”“: convolutional neural network
- cool AI application unlocked ๐Ÿ”“๐Ÿ’ƒ: pose detection
The practical:
- a CNN implemented in Python using PyTorch
First of all, don't forget to confirm your attendance on Seats App!
As usual, a fun AI project, Discrete Figures, to wake us up
- we'll inspect the AI model related to this project later!!!
Recap of past two weeks
๐Ÿงซ Biological neurons: receive charges, accumulate charges, fire off charges
๐Ÿงซ Biological neurons (continued)
-- This "charging, thresholding and firing" neural process is the backbone of taking sensory input, processing it and producing motor output.
๐Ÿค– Artificial neurons
- Think of a neuron as a number container that holds a single number: its activation. 🪫🔋
- Neurons are grouped in layers. ๐Ÿ˜๏ธ
- Neurons in different layers are connected hierarchically (usually from left to right) 👉
- There are different weights assigned to each link between two connected neurons. ๐Ÿ”—
๐Ÿ˜๏ธ Layers in MLP
- Put each layer's activations into a vector: a layer activation vector ๐Ÿ“›
๐Ÿ˜๏ธ Layers in MLP
- Input layer: each neuron loads one value from the input (e.g. one pixel's greyscale value in the MNIST dataset)๐Ÿ‘๏ธ
๐Ÿ˜๏ธ Layers in MLP
- Output layer: for image classification task, one neuron represents the probability this neural net assigns to one class๐Ÿ—ฃ๏ธ
๐Ÿ˜๏ธ Layers in MLP
- Hidden layer: any layer between input and output layers and the size (number of neuron) is user-specified ๐Ÿดโ€โ˜ ๏ธ
๐Ÿ˜๏ธ Layers in MLP
- Put each layer's activations into a vector: a layer activation vector ๐Ÿ“›
- Input layer: each neuron loads one value from the input (e.g. one pixel's greyscale value in the MNIST dataset)๐Ÿ‘๏ธ
- Output layer: for an image classification task, one neuron represents the probability this neural net assigns to one class🗣️
- Hidden layer: any layer between the input and output layers; its size (number of neurons) is user-specified 🏴‍☠️
๐Ÿค–๐Ÿงฎ The computational simulation of the "charging, thresholding and firing" neural process
- The computation runs from left to right (a forward pass)๐Ÿ‘‰
๐Ÿค–๐Ÿงฎ The computational simulation of the "charging, thresholding and firing" neural process
- The computation runs from left to right (a forward pass)๐Ÿ‘‰
- The computation calculates each layer activation vector one by one from left to right.๐Ÿ‘‰
๐Ÿค–๐Ÿงฎ The computational simulation of the "charging, thresholding and firing" neural process
- The computation runs from left to right (a forward pass)๐Ÿ‘‰
- The computation calculates each layer activation vector one by one from left to right.๐Ÿ‘‰
- Overall, the computation takes the input layer activation vector as input and returns the output layer activation vector that ideally represents the correct answer.๐Ÿ‘‰
๐Ÿค–๐Ÿงฎ The computational simulation of the "charging, thresholding and firing" neural process
-- Zoom in to the function between every consecutive layers๐Ÿ”ฌ:
๐Ÿค–๐Ÿงฎ The computational simulation of the "charging, thresholding and firing" neural process
-- Zoom in to the function between every consecutive layers๐Ÿ”ฌ:
--- Multiply previous layer's activation vector with a weights matrix: "charging" โšก๏ธ
๐Ÿค–๐Ÿงฎ The computational simulation of the "charging, thresholding and firing" neural process
-- Zoom in to the function between every consecutive layers๐Ÿ”ฌ:
--- Multiply previous layer's activation vector with a weights matrix: "charging" โšก๏ธ
--- Add with a bias vector and feed into an activation function (e.g. relu): "thresholding" ๐Ÿชœ
๐Ÿค–๐Ÿงฎ The computational simulation of the "charging, thresholding and firing" neural process
-- Zoom in to the function between every consecutive layers๐Ÿ”ฌ:
--- Multiply previous layer's activation vector with a weights matrix: "charging" โšก๏ธ
--- Add with a bias vector and feed into an activation function (e.g. relu): "thresholding" ๐Ÿชœ
--- The result would be this layer's activation vector. โœŒ๏ธ
๐Ÿค–๐Ÿงฎ The computational simulation of the "charging, thresholding and firing" neural process
-- Zoom in on the function between each pair of consecutive layers🔬:
--- Multiply the previous layer's activation vector by a weights matrix: "charging" ⚡️
--- Add a bias vector and feed the result into an activation function (e.g. ReLU): "thresholding" 🪜
--- The result is this layer's activation vector. ✌️
--- We then feed this layer's activation vector as the input to the next layer, repeating until we reach the last layer (the output layer): "firing" ⛓️
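The per-layer computation above can be sketched in a few lines (a minimal sketch with toy numbers; all values here are illustrative):

```python
import numpy as np

def relu(x):
    # "thresholding": negative charges are cut off at zero
    return np.maximum(x, 0.0)

def layer_forward(a_prev, W, b):
    # "charging": weights matrix times previous layer's activation vector,
    # then add the bias vector and apply the activation function
    return relu(W @ a_prev + b)

a0 = np.array([1.0, 2.0])       # previous layer's activation vector
W = np.array([[1.0, -1.0],
              [0.5,  0.5]])     # weights matrix (2 neurons x 2 inputs)
b = np.array([-2.0, 0.0])       # bias vector
a1 = layer_forward(a0, W, b)    # this layer's activation vector: [0.0, 1.5]
```

Chaining `layer_forward` calls layer by layer is exactly the forward pass.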
Wait but what numbers shall we put in the weights matrices and bias vectors? ๐Ÿฅน
- Numbers in the weights matrices and bias vectors are called parameters ๐ŸŽ›๏ธ
- The training (aka parameter tweaking) process is done through backpropagation using a technique called gradient descent ⏪
- Intuitions on gradient descent: feel the slope and find the valley 🏂⛷️
- Feel the slope and find the valley 🏂⛷️: try to find the set of parameters that lands at the lowest point on the loss function
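"Feel the slope" in code: a toy 1-D gradient descent on the loss (x − 3)², whose valley sits at x = 3 (learning rate and step count are arbitrary illustrative choices):

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    # repeatedly take a small step against the gradient: downhill
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# loss (x - 3)^2 has gradient 2*(x - 3); starting from 0 we should slide to 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

With real neural nets the idea is identical, just with millions of parameters instead of one x.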
well done everyone ๐ŸŽ‰
we have gone through MSc-level content
Recall the amazing perceptual adaptation effect: repeated exposure makes things easier and more familiar! 🥸
end of recap ๐Ÿ‘‹
Today we are going to look at the stepping stone of (almost) every advanced image-processing AI model
- convolutional neural network
Fully connected layer ๐Ÿฅน
Confession: there is more than one way of connection between consecutive layers
What we have seen in the MLP hidden layer is "fully connected layer"
Simply because every neuron is connected to every neuron in the previous layer
This full connectivity brings a convenience for NN architects (you):
When defining a fully connected layer, the only quantity (aka hyperparameter) to be specified, i.e. left to our choice, is the number of neurons to set up in this layer.
Once that choice is made,
the shapes of the weights matrix and bias vector are calculated automatically: the weights matrix is (# of neurons in this layer) × (# of neurons in the previous layer), and the bias vector has one entry per neuron in this layer.
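The parameter count of a fully connected layer follows directly from those shapes (a small sketch; the 784→100→10 sizes are just an MNIST-style illustration):

```python
def fc_params(n_in, n_out):
    # weights matrix has n_out x n_in entries, plus one bias per neuron
    return n_out * n_in + n_out

print(fc_params(784, 100))  # 78500: input layer -> 100-neuron hidden layer
print(fc_params(100, 10))   # 1010: hidden layer -> 10-class output layer
```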
let's confirm we have learned something today by revisiting this model training process, especially "Set up the layers" part
Our first time hands-on tweaking with a neural network:
add another fully connected layer comprising [your choice here] neurons
Help me with some basic maths: how many parameters does this MLP need to handle more practical input data? Hint: use the shapes of the weights matrix and bias vector
We don't actually encounter 28x28 images very often; say we have more realistic images of 1024 x 1024
With the task being a dog-or-not classification problem
The size of the input layer?
1,048,576 ≈ 1M
With the first hidden layer having 1,000 neurons, what is the size of this weights matrix?
1,048,576 × 1,000 ≈ 1B
This number of parameters is probably more than the number of dogs in the world (look it up) 🎃
Recall that we only have 100 billion neurons 😈
Imagine 100 different 1024x1024 image-classification models like that, and our brain juice is drained 😵
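The back-of-envelope arithmetic above, spelled out (using the 1,000-neuron hidden layer from the slide):

```python
# one weight per connection between input pixels and hidden neurons
n_in = 1024 * 1024        # one input neuron per pixel of a 1024x1024 image
n_hidden = 1000           # first hidden layer
weights = n_in * n_hidden # size of the first weights matrix
print(weights)            # 1048576000, roughly one billion parameters
```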
When the task and data get slightly more difficult... the constraint of FC layer becomes prominent:
the curse of dimensionality (CoD)
sorry, fully connected layer ๐Ÿฅน
Reflection on why FC layer suffers from CoD:
1. everything connects to everything
2. every connection has its own weight
hence the big weight matrix size
๐Ÿซฅ bye MLP for now
Introducing... convolutional neural network!
CNN is the stepping stone for working with vision or any 2D data.
It is a variant built on top of MLP; most things remain the same (it has input and output layers, and even fully connected layers🤨)
The new ideas in CNN are essentially two new layer types: the convolutional layer and the pooling layer
Convolution and conv layer ๐Ÿซฃ
-- convolution is a math operation applied between two matrices, a special flavour of "multiplication"
-- just another meaningful computation rule 🤠
-- check this video out for a very smooth introduction
Good thing about CNN 01:
Because convolution operation can be defined in 2D, image data can retain its 2D structure.
It means that in the input layer, there is no need to flatten the 2D input.
(recall the cost of losing neighboring information in flattening)
Convolution is actually quite easy:
- "filter" in the image is the weights matrix
- element-wise multiplication between input and filter matrices
- sum up the element-wise products to one number, similar to dot product
- slide the filter to the next part of the input like a sliding window and compute again, all the way to the end
recall multiplication has a flavour of "computing similarity"
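The sliding-window recipe above can be written out directly (a minimal sketch; note that deep-learning "convolution" is usually computed without flipping the filter, i.e. as cross-correlation):

```python
import numpy as np

def conv2d_valid(image, kernel):
    # slide the filter over the image (stride 1, no padding)
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = image[i:i + kh, j:j + kw]
            # element-wise multiply, then sum up to one number (like a dot product)
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]], dtype=float)
kernel = np.array([[1, 0],
                   [0, 1]], dtype=float)   # a tiny "diagonal pattern" filter
result = conv2d_valid(image, kernel)       # high values where the pattern matches
```

The output is largest exactly where the image patch looks like the filter, which is the "pattern detection" intuition.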
Some intuitions on conv
- view the weights matrix here as a filter that has a graphical pattern on it,
- by convolving the filter with the input image,
- we can detect whether that graphical pattern is present in the image or not
- (i.e. how similar the filter pattern and the input image patch are)
3B1B smashes it again
Now we know what convolution is; a convolution layer is just a layer comprising a set of filters with the following specifications:
- Kernel size: size of filter (e.g. 3x3, 5x5)
- Number of channels: total number of filters (e.g. 32, 64)
- Stride: how far you slide the filter over the input at each step (e.g. 1, 3)
- Padding: extra border values, usually used when the filters do not fit the input image exactly
(more details here fyi )
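One handy consequence of these specifications is the standard output-size formula; here it is as a sketch (the formula is standard, the example sizes are illustrative):

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    # standard formula for one spatial dimension:
    # floor((n + 2*padding - kernel) / stride) + 1
    return (n + 2 * padding - kernel) // stride + 1

print(conv_output_size(28, 3, stride=1, padding=1))  # 28: padding keeps the size
print(conv_output_size(28, 5))                       # 24: no padding shrinks it
print(conv_output_size(28, 3, stride=2, padding=1))  # 14: stride 2 halves it
```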
Recall layer activation vectors in MLP?
in CNN, the input and output of a conv layer are called "feature maps" (2D feel huh)
Conv layer is profound
-- this single math operation is a game changer in AI: it made up the "first" working AI model, built in 1998 by Yann LeCun
-- reduced connectivity (see next slides)
-- shared(copied/tied) weights (see next slides)
-- its design coincides with image's intrinsic properties, right tool for the right problem (e.g. self-similarity as shown later)
Conv layer why cool:
-- reduced connectivity: a neuron only connects to some of the neurons in prev layer
-- not one-link-one-weight anymore: the connection weights are shared (copied) across different links
-- images have a self-similarity structure (patches at different locations look similar)
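Reduced connectivity and shared weights are exactly why conv layers are so cheap in parameters; a back-of-envelope comparison (the layer sizes here are illustrative):

```python
# FC layer on a 28x28 input: every input neuron links to every output neuron,
# and every link has its own weight
fc_weight_count = (28 * 28) * (28 * 28)   # same-size output: 614,656 weights

# conv layer: 32 filters of 3x3 on a 1-channel input; the same 9 weights are
# reused (shared) at every position in the image
conv_param_count = 32 * (3 * 3 * 1) + 32  # filter weights + one bias per filter

print(fc_weight_count)   # 614656
print(conv_param_count)  # 320
```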
another YouTube video on convolution
That's quite a lot, congrats! ๐ŸŽ‰
It also helped me get a job lol; this is the exact kind of CNN that helped a looot in my previous job
Pooling layer (max pooling or average pooling):
- no parameters/weights
- just an operation of taking the max/avg pixel values in a given neighborhood from the input
- "downsample" (like blurring the image)
max pooling is just another easy math operation, it reduces each dimension by half in this example
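Max pooling in code (a minimal sketch for a single-channel input with even dimensions):

```python
import numpy as np

def max_pool_2x2(x):
    # take the max over each non-overlapping 2x2 patch: halves each dimension,
    # and there are no parameters/weights to learn
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [9, 2, 1, 0],
              [3, 4, 5, 6]], dtype=float)
pooled = max_pool_2x2(x)   # 4x4 input -> 2x2 output
```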
Why pooling layer? some intuitions... ๐Ÿค–
- to compensate for the fact that a conv layer alone cannot reduce the dimension much.
why do we need dimension reduction?
- to squeeze the information into more general features,
- and max pooling does dimension reduction brutally, with no parameters to be learned
๐Ÿ˜ˆ End of noodling!!! Here is a takeaway message:
- In CNN, there is a typical mini-architecture of having a conv layer followed by a pooling layer.
- The ordered layering of "conv + pooling" is very conventional, as shown in VGG.
- Sometimes it is called a "conv block" or "conv module".
- The process of designing a CNN architecture is mostly just stacking conv blocks and adding some FC layers at the end
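Stacking conv blocks in PyTorch might look like this (a sketch with illustrative sizes for a 28x28 greyscale input; this is not the notebook's exact model):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    # conv block 1
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 1x28x28 -> 16x28x28
    nn.ReLU(),
    nn.MaxPool2d(2),                              # -> 16x14x14
    # conv block 2
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> 32x14x14
    nn.ReLU(),
    nn.MaxPool2d(2),                              # -> 32x7x7
    # FC layers at the end
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                    # 10-class output layer
)

x = torch.zeros(1, 1, 28, 28)  # one fake MNIST-sized image
out = model(x)                 # shape (1, 10): one score per class
```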
Here is what we have talked about so far:
- ๐ŸฐConvolution operation and conv layer
-- ๐Ÿค˜conv layer why cool? efficient (less parameters) and effective (very handy in image processing)
- ๐ŸฐPooling operation and pooling layer
-- ๐Ÿค˜pooling layer why cool? it does dimension reduction, VERY simply (no parameters)
- ๐Ÿ”A conv block: conv + pool layer
- ๐Ÿ”๐Ÿ” CNN: stacked conv blocks, sometimes with FC layers in the end
That's quite a lot, congrats! ๐ŸŽ‰
Intuitions time: the hierarchical feature extraction in CNN
Here is a visualisation of "hierarchical 2D feature extraction done properly by a CNN"
https://cs.colby.edu/courses/F19/cs343/lectures/lecture11/Lecture11Slides.pdf
Filters in different conv layers are doing pattern detection with an overall hierarchical structure, which is guess what, biologically inspired!
A modern state-of-the-art AI architecture: PoseNet.
- Maybe intimidating at first look
- but looking closely, you will find it is just many good old friends stacked together.

Introducing Pose detection ๐Ÿ’ƒ๐Ÿ•บ:
- Input: a digital image
- Output: a set of coordinates corresponding to different key points on the human body (similar to landmarks on a face)
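A pose-detection result is typically just such a set of coordinates; here is a hypothetical sketch of that data structure (keypoint names and values are made up for illustration):

```python
# hypothetical output: one (x, y, score) triple per keypoint,
# with x and y normalised to [0, 1] of the image size
keypoints = {
    "nose": (0.50, 0.20, 0.97),
    "left_shoulder": (0.40, 0.35, 0.93),
    "right_shoulder": (0.60, 0.35, 0.95),
}

def pixel_coords(kp, width, height):
    # convert a normalised keypoint back to pixel coordinates
    x, y, score = kp
    return (int(x * width), int(y * height), score)

print(pixel_coords(keypoints["nose"], 640, 480))  # (320, 96, 0.97)
```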
play around with PoseNet which does pose detection in real time
- you can use this model easily in your Apple App, check this out
Detect animal poses in Vision (from WWDC 2023)
That's quite a lot, congrats! ๐ŸŽ‰
Let's take a look at this notebook!
- on how to train a CNN using PyTorch
- 1. Make sure you have saved a copy to your GDrive or opened in playground. ๐ŸŽ‰
- 2. Most parts are out of the range of the content we have covered so far.
- 3. Your task: find the lines of code where the conv layer specifications aka hyperparameters are defined.
- 4. hint: it is something related to nn.Conv2d()
Today we have looked at:
- CNN is better at image tasks than MLP: efficient and effective
- CNN with two new layer types: conv layer and pooling layer
- Conv layer: a set of filters ๐Ÿ”Ž
- Pooling layer: take avg/max in each patch ๐ŸŒซ๏ธ
- A conv block: conv layer + pooling layer ๐Ÿ”
- CNN: input layer, conv blocks, fc layer, output layer ๐Ÿ”๐Ÿ”
- Pose detection application: a model that detects key points on the human body🩻
- PoseNet as an example of real-life sized CNN ๐Ÿ’ƒ
We have gone through some MSc-level content today, congrats!!!
- It's okay if you have not grasped all the details yet, this lecture serves as a demystifying purpose.
This is the simple takeaway message on what's going on inside an AI model (CNN) in action: extracting features hierarchically 🪜
We'll see you next Thursday same time and same place!