ML One
Lecture 11
Introduction to RNN, LSTM and Transformer for modelling sequential data
+
Tips for presentation
Welcome 👩‍🎤🧑‍🎤👨‍🎤
By the end of this lecture, we'll have learnt about:
- Introduction to sequential data modelling
- Example applications of sequential data modelling: natural language processing🗣️💬
- A new AI architecture unlocked 🔓: recurrent neural network
- A very cool and almost universally useful AI architecture unlocked🔓: Transformer 🤖
*Tips for presentation (ML One assessment)*
First of all, don't forget to confirm your attendance on Seats App!
As usual, a fun AI application to wake us up: Music Source Separation (with LALAL.AI as an example)
- Neural networks for modelling sequential data -
(following content is adapted from Mick's slides)
〰️ Introduction to sequential data - what's a sequence? 〰️
- A tracking record of coin tosses 🪙
- A transcript/score of a drum beat 🎼🥁
- An audio recording of a drum beat 🎤🥁
- Letters in a word
- Words in a sentence
- Methods in a cooking recipe🔖
- Numbers in a list of any kind that represent anything that happens over time 💯
〰️ Introduction to sequential data - sequence of tokens 〰️
Let's use "token" to denote the "what" in any sequence.
- A transcript/score of a drum beat 🎼🥁: each music note as token
- An audio recording of a drum beat 🎤🥁: each audio sample as token
- Letters in a word: each letter as token
- Words in a sentence: each word as token
- Methods in a cooking recipe🔖: each method as token
〰️ Introduction to sequential data - token embeddings 〰️
🫲What does a token really look like if we want to feed sequence data into our computer for some computation later?🫱
- They are just representations (of a note, an audio sample, a letter, a word, etc.) in numbers, which are just vectors/matrices.
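For instance, here is a minimal Python sketch of turning a word into a sequence of token embeddings; the character vocabulary and embedding size below are made up purely for illustration:

```python
import numpy as np

# A minimal sketch of turning a sequence into token embeddings.
# The vocabulary and embedding size here are made up for illustration.
word = "drum"
vocab = sorted(set("abcdefghijklmnopqrstuvwxyz"))           # each letter is a token
token_to_id = {tok: i for i, tok in enumerate(vocab)}        # token -> integer id

embedding_dim = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))  # one vector per token

token_ids = [token_to_id[ch] for ch in word]                  # "drum" -> [3, 17, 20, 12]
token_embeddings = embedding_table[token_ids]                 # shape (4, embedding_dim)
print(token_ids)
print(token_embeddings.shape)
```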
〰️ Introduction to sequential data - sequential data modelling tasks 〰️
🫲Say we are given this sequence dataset, which comprises a list of token embeddings where the order implies the time dimension🫱
What tasks can we do? What information can we model and predict?
- 1. At any point in time, based on what has happened, which token is likely to be the next?
-- example application: language modelling -> given a context of previous text, predict the next word (this is what ChatGPT does)
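Here is a tiny sketch of how task 1 can be framed as supervised learning pairs; the example sentence is made up for illustration:

```python
# A minimal sketch of framing task 1 (next-token prediction) as supervised learning.
sentence = "the cat sat on the mat".split()   # each word is a token

# For every position, the input is the tokens so far and the target is the next token.
pairs = [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]
for context, target in pairs:
    print(context, "->", target)
```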
〰️ Introduction to sequential data - sequential data modelling tasks 〰️
- 2. At the end of the sequence, based on what has happened in the entire sequence, what kind of sequence is this?
-- example application: sentiment analysis
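As a quick sketch of task 2, assuming the Hugging Face transformers library is installed (the pipeline downloads a default public sentiment model the first time it runs), a ready-made sequence classifier looks roughly like this:

```python
# A minimal sketch of sequence classification (sentiment analysis).
# Assumes the Hugging Face `transformers` library is installed; the pipeline
# downloads a default public sentiment model on first use.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("I love this drum beat!"))   # e.g. [{'label': 'POSITIVE', 'score': ...}]
```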
〰️ Introduction to sequential data - sequential data modelling tasks 〰️
- 3. At the end of the sequence, based on what has happened in the entire sequence, can we predict another related sequence?
-- example application: machine translation -> given a sentence in French, generate the English version of it
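And a similar sketch for task 3, assuming the transformers library and the public Helsinki-NLP/opus-mt-fr-en checkpoint (chosen here just for illustration):

```python
# A minimal sketch of sequence-to-sequence modelling (machine translation).
# Assumes the `transformers` library and the public Helsinki-NLP/opus-mt-fr-en checkpoint.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
print(translator("Le chat est assis sur le tapis."))   # -> English translation
```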
Models that are made for sequential data modelling: RNN and LSTM
A very good introduction here.
RNN - summary
1. Recurrent Neural Networks are different from MLPs, CNNs etc.
2. They still have an input layer and hidden layers with weights and biases.
3. The hidden layers are applied recurrently along tokens of an input sequence - RNNs have loops in them! (Here is an illustration.)
4. The network doesn't hard memorise an exact copy of the sequence. Instead it encodes the compound information of tokens in a sequence.
5. This architecture has a kind of 'memory'.
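To make the recurrence concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass; the sizes and the random input sequence are made up, and this is not the notebook's exact code:

```python
import numpy as np

# A minimal sketch of a vanilla RNN forward pass.
input_dim, hidden_dim, seq_len = 4, 8, 5
rng = np.random.default_rng(0)

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden (the loop)
b_h  = np.zeros(hidden_dim)

x_seq = rng.normal(size=(seq_len, input_dim))  # one token embedding per time step
h = np.zeros(hidden_dim)                       # the 'memory', starts empty

for x_t in x_seq:
    # The same weights are applied recurrently at every time step.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

print(h.shape)  # (8,) - a compound summary of the whole sequence
```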
A Google Colab notebook for training an RNN to generate text (using text from Shakespeare as training data)
RNN - the problem🥲
- It is designed to "try to memorise everything from a sequence".
- Because of that recurrent loop, which can be unrolled into a looong computation graph, RNNs suffer from exploding/vanishing gradients.
(meaning that it is hard to train an RNN and to achieve good modelling performance)
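A rough numerical illustration of the problem: backprop through the unrolled loop multiplies the gradient by (roughly) the recurrent weight matrix once per time step. The sizes and weight scales below are made up:

```python
import numpy as np

# Repeated multiplication over many time steps either shrinks the gradient
# towards zero (vanishing) or blows it up (exploding).
rng = np.random.default_rng(0)
hidden_dim, seq_len = 8, 100

for scale in (0.05, 0.5):  # made-up scales, just to show the two regimes
    W_hh = rng.normal(scale=scale, size=(hidden_dim, hidden_dim))
    grad = np.ones(hidden_dim)
    for _ in range(seq_len):
        grad = W_hh.T @ grad               # one multiplication per unrolled step
    print(scale, np.linalg.norm(grad))     # tiny norm -> vanishing, huge norm -> exploding
```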
RNN - the mitigation🤗 -> LSTM
Other RNN variants have been proposed for better optimization performance.
- LSTM (Long Short-Term Memory networks; I usually refer to this article for an in-depth explanation.)
- LSTM networks add gates (yet another set of weights, biases and activation functions) to edit and filter the information that the RNN retains.
- These gates allow for "selectively forgetting information" in a way.
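Here is a minimal NumPy sketch of a single LSTM step, just to show where the gates sit; the sizes are made up, and real implementations pack the weights differently:

```python
import numpy as np

# A minimal sketch of one LSTM step, showing the gates that edit the memory.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(0)
# One (weight, bias) pair per gate: forget, input, output, plus the candidate cell update.
W = {name: rng.normal(scale=0.1, size=(hidden_dim, input_dim + hidden_dim))
     for name in ("forget", "input", "output", "cell")}
b = {name: np.zeros(hidden_dim) for name in W}

x_t = rng.normal(size=input_dim)       # current token embedding
h_prev = np.zeros(hidden_dim)          # previous hidden state
c_prev = np.zeros(hidden_dim)          # previous cell state (the long-term memory)
z = np.concatenate([x_t, h_prev])

f = sigmoid(W["forget"] @ z + b["forget"])    # how much old memory to keep
i = sigmoid(W["input"]  @ z + b["input"])     # how much new information to write
o = sigmoid(W["output"] @ z + b["output"])    # how much memory to expose
c_tilde = np.tanh(W["cell"] @ z + b["cell"])  # candidate new content

c = f * c_prev + i * c_tilde   # 'selectively forget' old memory, then add new content
h = o * np.tanh(c)             # the hidden state passed to the next step
print(h.shape, c.shape)
```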
A Google Colab notebook for training an LSTM to generate audio
Introduction to natural language processing🗣️
- It's about language!
- Our language, either written (text) or spoken (audio), is in the form of sequential data!
Introduction to natural language processing🗣️
Natural language processing, or NLP, combines computational linguistics (rule-based modeling of human language) with statistical and machine learning models
- to enable computers and digital devices
- to recognize, understand and generate text and speech.
Introduction to natural language processing🗣️
- Check here for example applications of NLP (we have seen some of them already🤗)
What does GPT in ChatGPT stand for?
- Generative Pre-trained Transformer.
Transformer is the neural network that does the heavy lifting behind ChatGPT.
Transformer
You can find the original epic-titled paper here, and many good explanations online.
‼️"Attention mechanism" is the key design.
- Key components:
-- Transformer block with self-attention layer, feedforward layer, residual connection and layer normalization
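Here is a minimal NumPy sketch of the self-attention computation itself; the shapes and projection weights are made up, and multi-head attention, the feedforward layer, residual connections and layer normalization are left out:

```python
import numpy as np

# A minimal sketch of scaled dot-product self-attention over a token sequence.
seq_len, d_model = 5, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))          # one embedding per token

W_q = rng.normal(scale=0.1, size=(d_model, d_model))
W_k = rng.normal(scale=0.1, size=(d_model, d_model))
W_v = rng.normal(scale=0.1, size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v              # queries, keys, values

scores = Q @ K.T / np.sqrt(d_model)              # how strongly each token attends to every other token
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over each row
attended = weights @ V                           # each output token is a weighted mix of all tokens

print(weights.shape)   # (5, 5) attention matrix
print(attended.shape)  # (5, 16) new token representations
```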
Transformer
- We use language tokens as input to transformer for NLP tasks.
- Transformer is quite versatile and can be applied to other domains and data modalities.
- It can be very powerful for multi-modal applications. 💪
*You just need to swap the language tokens for other tokens, for instance patches from an image, latent codes from an audio clip, etc. (see the sketch after this list)*
- An example: Vision Transformer (ViT)
- Hugging Face has a model hub for Transformers.
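As a small sketch of that token swap, here is one way to turn a toy image into a sequence of patch tokens, in the spirit of ViT; the image and patch sizes are made up:

```python
import numpy as np

# A minimal sketch of turning an image into a sequence of patch tokens (ViT-style).
image = np.random.rand(32, 32, 3)     # a toy 32x32 RGB image
patch = 8                             # 8x8 patches -> a 4x4 grid of patches

patches = [image[r:r + patch, c:c + patch].reshape(-1)   # flatten each patch into a vector
           for r in range(0, 32, patch)
           for c in range(0, 32, patch)]
tokens = np.stack(patches)            # (16, 192): 16 patch tokens, each 8*8*3 numbers

print(tokens.shape)  # this sequence can now be fed to a Transformer just like word embeddings
```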
🔥!!!Congrats!!!🔥
Tips for ML One assessment presentation
- Start preparing early!
*BIG ANNOUNCEMENT*
🎄Merry Xmas and Happy New Year🎉
- See you in 2025✋ (more fun ML applications to come...)