ML One
Lecture 08
Introduction to supervised learning
+
How does AI learn? - Intuitions on gradient descent
Welcome!
By the end of this lecture, we'll have learnt about:
The theoretical:
- what is supervised learning: the framework of training
- how to train a neural network on a labeled dataset
- aka how to find the right numbers to put in the weight matrices and bias vectors
- Intuitions on gradient descent
The practical:
- an MLP implemented in Python
First of all, don't forget to confirm your attendance on
Seats App!
Recap
Last lecture we saw how the dots (adding/multiplying matrices, functions) are connected!!!
🧫 Biological neurons: receive charges, accumulate charges, fire off charges
🧫 Biological neurons
-- A neuron is connected with some other neurons.
-- A neuron is charged by other connected neurons.
-- There are usually different levels of charges emitted from different neurons.
🧫 Biological neurons (continued)
-- The received charges within one neuron are accumulated.
-- We can refer to the level of accumulated charges in one neuron as its activation value.
-- Once a neuron is sufficiently charged, it fires off a charge to the next neurons.
🧫 Biological neurons (continued)
-- This "charging, thresholding and firing" neural process is the backbone of taking sensory input, processing it and
producing motor output.
🤖 Artificial neurons
- Think of a neuron as a number container that holds a single number, its activation.
- Neurons are grouped in layers.
- Neurons in different layers are connected hierarchically (usually from left to right).
- There are different weights assigned to each link between two connected neurons.
Layers in MLP
- Put each layer's activations into a vector: a layer activation vector
- Input layer: each neuron loads one value from the input (e.g. one pixel's greyscale value in the MNIST dataset)
- Output layer: for an image classification task, each neuron represents the probability this neural net assigns
to one class
- Hidden layer: any layer between the input and output layers; its size (number of neurons) is user-specified
🤖🧮 The computational simulation of the "charging, thresholding and firing" neural process
- The computation runs from left to right (a forward pass).
- The computation calculates each layer activation vector one by one, from left to right.
- Overall, the computation takes the input layer activation vector as input and returns the output layer
activation vector, which ideally represents the correct answer.
🤖🧮 The computational simulation of the "charging, thresholding and firing" neural process
-- Zoom in to the function between every pair of consecutive layers:
--- Multiply the previous layer's activation vector with a weights matrix: "charging"
--- Add a bias vector and feed the result into an activation function (e.g. ReLU): "thresholding"
--- The result is this layer's activation vector.
--- We then feed this layer's activation vector as the input to the next layer until we reach the last layer
(the output layer): "firing"
One layer function:
V_output =
ReLU(WeightsMat · V_input + Bias)
The big function for an MLP with one hidden layer (chaining each layer function together):
V_output =
ReLU(WeightsMat_o · ReLU(WeightsMat_h · V_input + Bias_h) + Bias_o)
V_input: the input layer's activation vector
V_output: the output layer's activation vector
WeightsMat_h, Bias_h: the weights matrix and bias vector for the hidden layer
WeightsMat_o, Bias_o: the weights matrix and bias vector for the output layer
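Here is a minimal NumPy sketch of this big function (one forward pass through an MLP with one hidden layer); the layer sizes 784/64/10 and the random starting weights are just illustrative assumptions, echoing the MNIST example:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)  # the "thresholding" activation function

# Assumed sizes for illustration: 784 input pixels, 64 hidden neurons, 10 output classes
rng = np.random.default_rng(0)
WeightsMat_h, Bias_h = rng.standard_normal((64, 784)) * 0.01, np.zeros(64)
WeightsMat_o, Bias_o = rng.standard_normal((10, 64)) * 0.01, np.zeros(10)

def mlp_forward(v_input):
    v_hidden = relu(WeightsMat_h @ v_input + Bias_h)   # "charging" then "thresholding"
    v_output = relu(WeightsMat_o @ v_hidden + Bias_o)  # "firing" into the output layer
    return v_output

v_input = rng.random(784)          # a made-up flattened 28x28 greyscale image
print(mlp_forward(v_input).shape)  # (10,) -- one activation per class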
Recall the amazing perceptual adaptation effect: repeated exposure makes things easier and more familiar!
Wait, but what numbers shall we put in the weights matrices and bias vectors?
end of recap
Let's forget about neural networks for now and zoom out a little bit
- Supervised (machine) learning (SL) is a subcategory of machine learning.
- Supervised learning involves training a model on a labeled dataset, where the algorithm learns the mapping from
input data to corresponding output labels.
Keywords: model, labeled dataset
Connected to what we have talked about:
- The handwritten digit recognition task on the MNIST dataset we introduced last week is a supervised
learning task.
- Neural networks (e.g. the MLP we introduced last week) are models (big functions with weight matrices and
bias vectors).
- The missing pieces: what does it mean to train a model? How do we train a model using a labeled dataset?
To be demystified today!
Let's consider a much easier labeled dataset and a much simpler model.
hi I'm a happy caveman,
I eat potatoes 🥔,
and I collect apples 🍎
can I have a model that predicts the number of 🍎 I can collect given the number of 🥔 I eat today?
This is my labeled dataset at hand:
I ate 0.5 🥔 and collected 2 🍎 on Monday
I ate 2 🥔 and collected 5 🍎 on Tuesday
I ate 1 🥔 and collected 3 🍎 on Wednesday
the input: number of 🥔 I ate
the labelled output: number of 🍎 I collected
the task: to learn a model from the dataset, i.e. a function that can predict the number of 🍎 I collect (the output)
given the number of 🥔 I eat (the input)
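In code, this labeled dataset is nothing more than two paired arrays, one for the inputs and one for the labeled outputs (a tiny sketch):

```python
import numpy as np

# inputs: number of 🥔 eaten; labeled outputs: number of 🍎 collected (Mon, Tue, Wed)
potatoes = np.array([0.5, 2.0, 1.0])
apples   = np.array([2.0, 5.0, 3.0])
```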
Now I'm puzzled... ๐คจ
Which function shall I choose?
Numbers are a headache to look at, so I want to plot them out and look at the relation between 🥔 and 🍎
visually!
an easy task: if today I ate 1 🥔, make an educated guess on how many 🍎 I can collect today
a harder task: if today I ate 1.5 🥔, make an educated guess on how many 🍎 I can collect today
Looking at the dataset, unfortunately we don't have a data point that corresponds to 1.5 🥔
What if... there were a continuous line going through all the points, so that I could look up the point corresponding to 1.5 🥔
from this line?
Back to the question:
Which function shall I choose?
- By looking at the dataset, a linear function (whose graph representation is a straight line) looks like a good
fit.
- a linear function: f(x) = wx + b, where w is the weight and b is the bias
- But there are infinitely many straight lines I can draw on the canvas! Which one is THE best model?
- BTW, the infinitely many straight lines can be parametrised by the weight and bias according to this expression: f(x) =
wx + b
- Being parametrised means that we can tweak the values of the weight and bias to create all the infinitely many
straight lines, where each line is determined by a unique combination of weight and bias.
Which one is THE best straight line?
- The line that "weaves" through all the labelled data points! It means that this line perfectly captures the relation
between the # of 🥔 and the # of 🍎 from my existing experience (the labeled dataset)
- And we can hand calculate THE weight and bias for this best-fit line easily.
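As a sanity check of that hand calculation, a least-squares fit (here via NumPy's polyfit, one possible tool among many) recovers the same line:

```python
import numpy as np

potatoes = np.array([0.5, 2.0, 1.0])  # inputs
apples   = np.array([2.0, 5.0, 3.0])  # labeled outputs

w, b = np.polyfit(potatoes, apples, deg=1)  # fit apples ≈ w * potatoes + b
print(w, b)                                 # 2.0 1.0 -> the best-fit line is f(x) = 2x + 1
```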
SOLVED!!! Happy caveman life!
- Note that we have made two choices for this SL task: we chose to use a straight line (a linear function) for
weaving through the labeled dataset,
- and we chose the best-fit weight and bias for the linear function.
Caution: this is an overly simplified example where it seems trivial to find a model that has the perfect-fit shape
(a straight line) and perfect parameters for weaving through all data points.
an example of real life data on the whiteboard...
In real life, data is a lot messier and does not fall onto a straight line
For real life data,
to find a good-fit model/function,
we can't just look at the data and draw a nice weaving curve;
we can only start with an initially guessed function and tweak its parameters until it is in a good weaving position.
That's quite a lot, congrats!
The supervised learning training process (a summary of what we just talked about)
1. have an initially guessed model (often imperfect)
->
2. feed the input data into the (imperfect) model
->
3. get the (imperfect) model output
->
4. measure how wrong this output is compared to the correct answer (labeled output)
->
5. use the measurement to update the model
->
back to step 2 and repeat
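Written as a rough Python skeleton (the names model, loss_fn and update here are placeholders for illustration, not a real library API), the loop above looks something like this:

```python
def train(model, dataset, loss_fn, update, num_epochs):
    # 1. `model` comes in as an initially guessed (often imperfect) model
    for epoch in range(num_epochs):
        for x, y_true in dataset:            # 2. feed the input data into the (imperfect) model
            y_pred = model(x)                # 3. get the (imperfect) model output
            loss = loss_fn(y_pred, y_true)   # 4. measure how wrong the output is vs the label
            model = update(model, loss)      # 5. use the measurement to update the model
    return model                             # ... and back to step 2, repeated num_epochs times
```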
Let me demonstrate the training process for the naive caveman 🍎 regression task on the whiteboard.
that is what the "learning" in SL is about
what about the "supervised" in SL then?
easy, it means learning/training using data that have "correct answers", aka labeled outputs
wait, can learning be done without "correct answers"?
introducing unsupervised learning (the one that uses unlabelled data, stay tuned!) and self-supervised learning (the
term the cool kids use)
think about how human and animal babies learn, it is amazing...
Connecting to what we have talked about last week:
1. have an initially guessed model (often imperfect)
->
2. feed the input data into the (imperfect) model
->
3. get the (imperfect) model output
->
4. measure how wrong this output is compared to the correct answer (labeled output)
->
5. use the measurement to update the model
->
back to step 2 and repeat
The MLP we talked about last week corresponds to steps 1, 2 and 3 (recall the forward pass with randomly
guessed numbers in the weight matrices and bias vectors?).
It also corresponds to our first decision in the caveman example, choosing a linear function as the model.
It is another function after all, just a more complicated/expressive one.
That's quite a lot, congrats!
Now let's zoom into steps 4 and 5:
- how to tweak the numbers in weight matrices and bias vectors according to the labeled data
so that the model is in a good weaving state.
Let's forget about neural networks and supervised learning for now.
on whiteboard:
let's start with a game!
GAME SETTINGS
--1. environment: a little 2D creature living on a curved terrain
--2. objective: find the valley
--3. player control: moving along the X axis, left or right (direction)? how far (step size)?
--4. world mist: mostly unknown, but some very local information:
all we know is that we can feel the slope under our feet
on the whiteboard:
start here,
question 1: shall I go left or right?
on the whiteboard:
answer 1: go with the downward slope direction
on the whiteboard:
question 2: what happens if we feel the slope is flat under our feet?
on the whiteboard:
answer 2: jackpot!
no slope means that we have reached the valley!!!
a gentle reminder: avoid being omniscient in this game; we know it is the valley not because we can see it being
the lowest point from outside the game world
on the whiteboard:
question 3: back to the start, now we know how to decide the direction but how about our step size?
the dangerous situation with a big step size
on the whiteboard:
answer 3: a good strategy is that we should decrease our step size when the slope gets flatter
on whiteboard:
question 4: game level up! new terrain unlocked...
What are the flat-slope points in the new terrain? How can we know if we are at THE valley (if a flat slope is all we
are looking for)???
on the whiteboard:
answer 4 part one:
these are the global minimum, the local minima, the local maxima and the saddle points.
on the whiteboard:
answer 4 part two: NO WE CAN'T
We can easily get trapped at local minima
on the whiteboard:
BONUS question 1: starting here, is there any chance we end up at the local maxima?
hint: run simulations in your 🧠, follow the
"feel the slope -> decide the direction
-> pick a step size -> jump to the point
-> repeat"
process
on the whiteboard:
BONUS answer 1: barely possible
don't worry too much about the local maxima
on the whiteboard:
BONUS question 2: starting here, is there any chance we end up at the saddle point?
hint: run simulations in your 🧠, follow the
"feel the slope -> decide the direction
-> pick a step size -> jump to the point
-> repeat"
process
on the whiteboard:
BONUS answer 2: likely!!!
we could get trapped at the saddle point
what can we do?
a larger step size helps us get carried over
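The whole game fits in a few lines of Python; the terrain below is an assumed toy curve, and the creature only ever uses the slope it feels under its feet (estimated numerically here) and a step size:

```python
def terrain(x):
    return (x - 1.5) ** 2 + 0.5   # an assumed toy terrain whose valley sits at x = 1.5

def slope(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)   # "feel the slope under our feet"

x, step_size = -2.0, 0.1          # starting position and step size
for _ in range(100):
    x = x - step_size * slope(terrain, x)   # go against the slope, scaled by how steep it is
print(x)                          # ends up very close to 1.5, the valley
```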
MISSION ACCOMPLISHED
wait how about training a neural network (tweaking its parameters)?
that's exactly how we train a neural network
Recall the SL training process:
1. have an initially guessed model (often imperfect)
->
2. feed the input data into the (imperfect) model
->
3. get the (imperfect) model output
->
4. measure how wrong this output is compared to the correct answer (labeled output)
->
5. use the measurement to update the model
->
back to step 2 and repeat
SAME SETTINGS
--1. curve terrain: a loss function measuring the distance between the predicted output and the labeled output (the ground truth),
and we are moving in the space of model parameters
--2. objective: navigate in the space of parameters and find the parameters that give the lowest loss, which corresponds to the
valley on the loss function terrain
--3. player control with step size: adjust the numbers in the weight matrices and bias vectors by deciding how much to
increase/decrease each parameter
--4. world mist: we are agnostic of what parameter values would give the perfect solution,
but we can compute the gradient (the slope) given the current parameter values (under our feet)
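To make the terrain concrete, here is a hedged sketch of one such loss function for the caveman model: a mean squared error over the two parameters w and b (MSE is an assumed choice for illustration):

```python
import numpy as np

potatoes = np.array([0.5, 2.0, 1.0])   # inputs
apples   = np.array([2.0, 5.0, 3.0])   # labeled outputs

def loss(w, b):
    predictions = w * potatoes + b                 # the model's guesses for the whole dataset
    return np.mean((predictions - apples) ** 2)    # distance between predictions and ground truth

print(loss(0.0, 0.0))   # a high point on the loss terrain: roughly 12.67
print(loss(2.0, 1.0))   # the valley: loss 0.0 for the best-fit line f(x) = 2x + 1
```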
SAME TECHNIQUES
--1. use the slope (gradient) direction to infer which direction to adjust each parameter
gradient: the derivative, aka the slope
direction really just refers to the plain binary choice of "increase or decrease / + or -"
--2. also use the slope (gradient) value as an indicator of how close we are to a potential valley
--3. also use the slope (gradient) value to determine the step size of the adjustment
step size: learning rate
SAME FINDINGS IMPLIED
-- we mostly end up in local minima
-- reaching the global minimum is practically impossible to guarantee
-- no need to worry about local maxima
-- extra caution for saddle points (a larger step size helps)
RECAP 1
numberify and rephrase the neural network parameter tweaking process:
1. the goal is to minimize the loss (cost) function by adjusting the parameter numbers
RECAP 2
2. after numberifying the training process, we can then apply a math trick called gradient descent
-- find the steepest decreasing direction
-- one gradient for one parameter
RECAP 3
3. we multiply the (minus) gradients by some learning rate to decide the parameter adjustment values
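In code, that last step is one line per parameter; the numbers below are made up, and the gradients are assumed to have already been computed (e.g. by backpropagation):

```python
learning_rate = 0.01              # the step size, a user-chosen value (assumed here)
w, b = 0.3, -0.1                  # current (imperfect) parameter values, made up for illustration
grad_w, grad_b = 2.5, -4.0        # gradients of the loss w.r.t. w and b, assumed precomputed

w = w - learning_rate * grad_w    # multiply the minus gradient by the learning rate...
b = b - learning_rate * grad_b    # ...and use it as the adjustment for each parameter
print(w, b)                       # 0.275 -0.06
```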
DONE
let's watch some of
this video together
to verify our intuitions
and to connect them with the practical process
well done everyone
we have gone through MSc-level content
two more pieces of jargon unlocked:
backpropagation: a gradient-calculation scheme
optimizer: conventionally, in Python DL libraries, all this backprop/GD stuff is handled by an object called an
"optimizer"
That's quite a lot, congrats!
Next, we are going to:
- take a look at how training an MLP on the Fashion MNIST dataset is implemented in Python with help from NumPy and
TensorFlow (a very popular deep learning library in Python)!
Alert: you are going to see quite advanced Python and neural network programming stuff; we are not expected to
understand it all at the moment.
Let's take a look at how some ideas we talked about today are reflected in the code,
especially how we choose a model and how we fit the model to the dataset by setting up the loss function and the
optimizer.
- It is just a one-liner really...
everything is prepared
here
Let's take a look at the notebook!
- 1. Make sure you have saved a copy to your GDrive or opened it in playground mode.
- 2. Most parts are beyond the content we have covered so far.
- 3. We only need to take a look at a few lines in the "Build the Model" and "Feed the Model" sections.
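For orientation, the relevant lines in the "Build the Model" and "Feed the Model" sections typically look roughly like the sketch below; the exact layer sizes, loss and optimizer are assumptions here and may differ from the notebook:

```python
import tensorflow as tf

# "Build the Model": choose the model (an MLP), our first decision in the caveman story
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),   # input layer: one neuron per pixel
    tf.keras.layers.Dense(128, activation="relu"),   # hidden layer
    tf.keras.layers.Dense(10),                       # output layer: one neuron per class
])

# "Feed the Model": pick the loss function and the optimizer (the gradient descent machinery)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

# and the training loop really is a one-liner (train_images / train_labels: the labeled dataset)
# model.fit(train_images, train_labels, epochs=10)
```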
Today we have looked at:
- The supervised learning process, which selects a parametrised model and trains the model to fit a labeled
dataset
- The training or parameter-tweaking process, which is done with a technique called gradient descent, using gradients
computed via backpropagation
- Intuitions on gradient descent: feel the slope and find the valley
We'll bring swift and fun applications back next week, see you next Thursday same time and same place!