10 years of NLP history explained in 50 concepts | From Word2Vec, RNNs to GPT

Video Statistics and Information

Captions
2023 has been a key year in the space of artificial intelligence. ChatGPT rocked the world by introducing many people to the potential of AI to not only be a great generative text model, but also exhibit intelligence, creativity, and reliability. In this episode of Neural Breakdown, I'm going to talk about the last 10 or so years of deep learning research in the NLP space that have all contributed to the current AI boom that we are seeing. I'm going to stay mostly high level to give you an intuition about all the major concepts that have shaped NLP research. I hope you enjoy this video, and consider subscribing to the channel for more breakdowns of AI research and the occasional AI game development log. Without further ado, let's take a trip down NLP lane.

At the most basic level, language models are machine learning models designed to predict the likelihood of a sequence of tokens in a language. In order for us to insert text into a machine learning model, we first need to do tokenization, where we split the raw text into multiple tokens which are then inputted for processing into an ML model. Most models today use a subword tokenizer that splits rare words into smaller subword units, while frequently occurring words are inserted whole. For processing the stream of input tokens, we first convert each token into a word embedding vector. Word embeddings are dense, low-dimensional vectors that represent each word in the model's vocabulary. Word embeddings capture the semantic and syntactic structure of the words, such that similar words have their embeddings close to each other.

RNNs, or recurrent neural networks, are able to preserve the order of tokens in a sequence when creating embeddings. In a nutshell, RNNs work as follows: there is an RNN cell which is initialized with a random hidden state at the beginning. At time step t, it inputs a token embedding i_t, uses it to update its internal hidden state embedding to h_t, and outputs an embedding o_t. In the next time step, it inputs the next token, updates its internal hidden state embedding again, and produces another output for the next time step. Bidirectional RNNs make two passes through the data, once from left to right and then from right to left, to produce two separate embeddings that are then added or concatenated together to form a bidirectional embedding. GRUs and LSTMs are gated RNNs that add special gates to the basic RNN structure to selectively update and forget information into and from the hidden state. This showed a massive improvement in the neural network's ability to learn from longer sequences, and the gated variants outperform normal RNNs.

One of the most common tasks studied in NLP is the sequence-to-sequence task. This is a type of task where we need to generate a sequence as output from another input sequence: it could be machine translation, where the input is a French sentence and the output is an English sentence, or summarization, where the input is a long passage and the output is a short summary. In 2014 we got a seminal paper that created an architecture for learning sequence-to-sequence tasks, called the encoder-decoder architecture. We have an encoder which takes the input sequence and passes it token by token through an LSTM or a GRU; the final hidden state of the RNN after all tokens have passed represents the embedding of the input sequence. This encoder output vector is then passed into the decoder as the initial hidden state of another RNN network, and the decoder then has to generate the target output sequence from this encoder output vector.
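To make the recurrence concrete, here is a minimal sketch of a vanilla RNN cell in PyTorch (the dimensions, weight names, and toy data are illustrative, not from the video): at each time step the cell mixes the current token embedding with the previous hidden state and emits an output embedding, and the final hidden state can serve as the sequence embedding that an encoder-decoder model hands to its decoder.

```python
import torch

# Illustrative sizes (not from the video)
emb_dim, hidden_dim, out_dim, seq_len = 32, 64, 32, 10

# Parameters of a single vanilla RNN cell
W_ih = torch.randn(hidden_dim, emb_dim) * 0.1     # input -> hidden
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1  # hidden -> hidden
W_ho = torch.randn(out_dim, hidden_dim) * 0.1     # hidden -> output
b_h = torch.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """One time step: update the hidden state from the current token
    embedding and the previous hidden state, then emit an output."""
    h_t = torch.tanh(W_ih @ x_t + W_hh @ h_prev + b_h)
    o_t = W_ho @ h_t
    return h_t, o_t

# Unroll the same cell over a toy sequence of token embeddings
tokens = [torch.randn(emb_dim) for _ in range(seq_len)]
h = torch.zeros(hidden_dim)   # initial hidden state
outputs = []
for x_t in tokens:
    h, o_t = rnn_step(x_t, h)
    outputs.append(o_t)

# After the loop, `h` summarizes the whole sequence -- in an
# encoder-decoder setup this final state would seed the decoder.
```

Gated cells like GRUs and LSTMs replace the single tanh update above with learned gates that decide how much of the old state to keep and how much new information to write, which is what helps them cope with longer sequences.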
During inference, we feed each word that the model outputs back as the input word for the next time step and generate an entire sentence until the end token has been predicted. Greedily predicting the most likely token at each time step may not always result in the best sentence, so we often employ a beam search mechanism, where the model looks ahead multiple time steps into the future instead of just one to find the most likely sequence of words, by keeping track of a fixed number of top-scoring candidates at each step and expanding them in a branch-and-bound-like fashion until the end of the sequence is reached.

The main issue with the original sequence-to-sequence architecture is that the encoder has to map the entire sequence into a single vector embedding, from which the decoder then has to produce the entire target sequence. In 2015 we got the attention mechanism, where the decoder can selectively focus on specific tokens in the encoder sequence instead of relying on a single encoder output. At each step, the decoder can form new context vectors by combining relevant hidden states from the input sequence. Specifically, the decoder state at each time step is treated as a query vector, and all the encoder states are considered key vectors. The network learns special weights to compute an attention score, or relevance score, for each decoder query vector with respect to the encoder key vectors. Using these attention scores, we take a weighted average of the encoder vectors to form a special context vector for each decoder query. Training these attention weights allows the network to selectively focus on specific parts of the encoder sequence when generating specific parts of the target sentence, as can be seen in heat maps where specific English words focus on the corresponding French words while ignoring other parts of the French sentence.

There were still two main issues with LSTMs and GRUs. First, they still had long-term dependency issues, where the network tends to forget things that were said a long time in the past, because it simply can't encode everything into its hidden state vector. Second, training was quite slow, because we had to pass each token one by one sequentially through the RNN cell; there was no way to parallelize this computation, because each token's output embedding depends on everything that came before it.

A seminal paper called "Attention Is All You Need" introduced the concept of the Transformer. The ripple effect of this paper is still felt today, and LSTMs were about to be replaced. The Transformers introduced in "Attention Is All You Need" are also encoder-decoder architectures, but instead of using any kind of RNN, they use the attention mechanism. Specifically, they introduced the concept of multi-headed attention, where each query vector learns multiple attention maps over the key vectors and produces multiple context vectors instead of just one. All the input token embeddings are updated using a concept called self-attention. Self-attention is a mechanism used to compute attention weights between all pairs of tokens in the input sequence, allowing the model to weigh the importance of different tokens based on their contextual relevance to each other. Self-attention also alleviates the long-term dependency issues of RNNs, because each token gets a holistic view of the entire sequence and can attend to any other token regardless of its position in the sequence.
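As a rough sketch of the attention computation just described (all shapes and names here are illustrative): each query is scored against the keys, the scores are normalized with a softmax, and the values are averaged with those weights to form a context vector. This scaled dot-product form is the same building block the Transformer stacks into multi-headed attention, where several such maps run in parallel with separate learned projections.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d), K/V: (n_keys, d).
    Returns context vectors (n_queries, d) and the attention map."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5  # relevance of each key to each query
    weights = F.softmax(scores, dim=-1)          # attention weights sum to 1 per query
    context = weights @ V                        # weighted average of the values
    return context, weights

# Toy example: 3 decoder queries attending over 5 encoder states
Q = torch.randn(3, 16)
K = torch.randn(5, 16)
V = torch.randn(5, 16)
context, attn = scaled_dot_product_attention(Q, K, V)
print(context.shape, attn.shape)   # torch.Size([3, 16]) torch.Size([3, 5])
```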
The decoder uses two types of attention. Much like the decoders of LSTM-based sequence-to-sequence models, the decoder embedding is treated as the query and attends to the encoder's state embeddings. The decoder also attends to the embeddings of the target sequence, however only to tokens that have appeared in the past; the tokens in the future are masked away. This is called decoder masked self-attention, or causal attention, and it is very important because it ensures that during training the model can only attend to tokens that have already been generated, and not to future tokens. This is necessary to prevent the model from cheating by peeking at future tokens, which would not be possible during inference. At training time, each Transformer layer in the encoder and the decoder can process all the input tokens in parallel, speeding up training by a significant margin over LSTM training, which had to step through each input sequentially to generate its embeddings. To retain the sequential information, Transformers train special high-frequency dense embeddings which represent a specific index in the sequence. These positional encodings get added to the corresponding word embeddings to form the final input token embeddings that go into the network.

LLM stands for large language model: deep learning models that process natural language. These models typically have a large number of parameters, usually above a few hundred million, and are trained on huge corpora of text data, allowing them to capture complex relationships between words as well as generate coherent text. GPT is considered the first LLM; it had over 117 million parameters and was trained on text data acquired from the web. GPTs follow a Transformer decoder-like architecture and are trained on large corpora of web data with the task of next-word prediction. Next-word prediction is a task in which the model is trained to predict the most likely next word that comes after a sequence of context words. BERT uses a Transformer encoder architecture and is a pretrained language model that produces contextualized sentence embeddings. One of the pretraining tasks BERT was trained on is called masked language modeling: random tokens are masked from the input sequence, and the network is tasked with predicting these tokens by applying bidirectional self-attention over all the other tokens in the sequence.

Knowledge distillation is a process where a large, complex model like BERT or GPT is used to train a smaller, simpler model to mimic its behavior; the smaller model runs much faster than the original and can often be used on consumer machines. The smaller model trains on the same input data and learns to predict output probabilities and intermediate states similar to the larger model's. A limitation of Transformers is that, because they process entire sequences at once, they have to limit the maximum number of tokens that they can process, and there is still research going on about how to give Transformers a larger memory, for example by prompting the model with a summary of its previous outputs, or by keeping a vector database like Pinecone to fetch smaller relevant parts from a large corpus to then feed into the prompt. Instead of the absolute positional embeddings the original Transformer applied to each token, Transformer-XL introduced the idea of relative positional embeddings, which enables the model to learn positional information of tokens relative to each other and to pay attention to tokens appearing only in a fixed-length context window instead of the entire sequence.
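Here is a small sketch of the causal (masked) self-attention described above, with illustrative shapes: positions in the future are set to negative infinity before the softmax, so each token can only attend to itself and to earlier tokens.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(X):
    """X: (seq_len, d) token embeddings acting as queries, keys and values.
    Future positions are masked so token t only attends to tokens <= t."""
    seq_len, d = X.shape
    scores = X @ X.transpose(-2, -1) / d ** 0.5
    # Upper-triangular mask: True above the diagonal marks "future" tokens
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    weights = F.softmax(scores, dim=-1)   # each row is zero beyond its own position
    return weights @ X, weights

out, attn = causal_self_attention(torch.randn(6, 16))
print(attn[0])   # the first token can only attend to itself
```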
XLNet is a pretrained language model that combines the autoregressive training of GPT and the bidirectional nature of BERT to introduce the concept of permutation language modeling. XLNet predicts the likelihood of a token given the rest of the input sequence, but factorized in a random order of appearance. XLNet beat BERT on many downstream fine-tuning tasks, and by a large margin too. T5, or Text-to-Text Transfer Transformer, introduced the idea of a unified model that can be trained on a variety of NLP tasks by framing them as a text-to-text problem, allowing the model to learn the task-specific mapping through fine-tuning: you could prompt the same model to translate a sentence, summarize a passage, or even perform sentiment analysis. There is also a whole line of research focused on reducing the computation cost of the attention module itself; this often sacrifices a little bit of accuracy to speed up training and inference and significantly reduce memory requirements. Some examples of these models are Longformer, Linformer, Performer, etc. Low-rank adaptation, or LoRA, is a fine-tuning technique that freezes all the pretrained model weights and adds trainable rank-decomposition matrices into each layer of the Transformer architecture. This greatly reduces the number of trainable parameters during fine-tuning, by about 10,000 times, as well as the GPU requirements.

GPT-3 scaled up the GPT architecture of Transformer decoders trained on next-word prediction, with a massive dataset and 175 billion parameters, by far the largest language model trained up to that point. GPT-3 showed that large models can be few-shot or zero-shot learners. Few-shot learning refers to the ability where the model is shown some demonstrations or examples of a task during inference as conditioning; on some types of tasks, GPT-3 also demonstrated one-shot and zero-shot abilities. Note that, unlike fine-tuning, few-shot learning does not update the network's weights and instead just conditions the model during inference with the prompt. So GPT-3 can do a lot of smart things: it can write long passages of coherent text, it can correct grammar, it can translate sentences. But there were still a lot of issues with it. One of the major ones is hallucination, where a language model generates output that contains incorrect, invented, or unrealistic information that was not present in the input. Hallucination is bad because these models can lie or fabricate information to please the user, and are generally unsafe to use.

So far we had been using web data to train models, which gave LLMs a solid understanding of the semantics and syntax of language and of how to generate fluent sentences. GPT-3's objective was to predict likely sentences, but there is no notion of correctness or factuality built into the model. As humans, our goal in querying an LLM is not just to generate fluent text, but coherent, informative, trustworthy, and reliable outputs. In 2022, OpenAI published InstructGPT. They collected a bunch of prompts from existing OpenAI Playground users (after consent) and got human labelers to provide demonstrations of the desired behavior as responses to the prompts; GPT-3 was then fine-tuned on these annotated datasets. Next, the human labelers were also asked to rank multiple outputs generated by variants of GPT-3 according to their order of preference for a given input. A much smaller reward model, containing 6 billion parameters, was trained on these rankings to take a prompt and a response as input and output a human-preference reward score.
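As a sketch of how a preference reward model like the one described above can be trained on human rankings, here is one common formulation (the tiny architecture and the pairwise log-sigmoid loss are illustrative assumptions, not necessarily the exact InstructGPT recipe): rankings are broken into preferred/rejected pairs, and the model is trained to score the preferred response higher.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: embeds a (prompt + response) token sequence and outputs
    a scalar score. Hypothetical architecture -- a real reward model would
    reuse a pretrained LLM backbone instead of a bag-of-tokens encoder."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # crude sequence encoder
        self.score = nn.Linear(dim, 1)

    def forward(self, token_ids):
        return self.score(self.embed(token_ids)).squeeze(-1)

def ranking_loss(r_chosen, r_rejected):
    """Pairwise loss: push the preferred response's score above the rejected one."""
    return -torch.log(torch.sigmoid(r_chosen - r_rejected)).mean()

# Toy batch: token ids for human-preferred vs. rejected (prompt + response) pairs
model = RewardModel()
chosen = torch.randint(0, 1000, (4, 20))     # 4 preferred sequences of 20 tokens
rejected = torch.randint(0, 1000, (4, 20))   # 4 rejected sequences
loss = ranking_loss(model(chosen), model(rejected))
loss.backward()   # gradients train the model to respect the human rankings
print(loss.item())
```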
A technique known as reinforcement learning from human feedback (RLHF) was then used to further tune the model's weights using the reward model. RLHF has had many use cases in the field of reinforcement learning, especially in robotics, but in a nutshell, it fine-tunes GPT-3 to generate text that maximizes the human-derived reward signal output by the reward model, while retaining the semantic language understanding of the original GPT-3. I might do a separate video one day explaining the intricate details of reward models and RLHF, so stay tuned for that.

A year after InstructGPT, we got ChatGPT. InstructGPT was designed to provide guidance or follow instructions from the user, while ChatGPT is designed to have conversations with the user through a chat interface. ChatGPT was trained on a larger and more diverse dataset than InstructGPT, especially on human-to-human conversations, and has been shown to be less hallucinatory and less biased than InstructGPT. GPT-4 is an order of magnitude larger than GPT-3 and is generally more truthful, safe, creative, collaborative, and reliable than any other language model out there.

LLMs trained with human alignment are truly a paradigm shift, and one of the main reasons is their zero-shot and few-shot capabilities. The baseline of a good model is no longer how easy it is to fine-tune on some downstream task, but how good it is at responding to prompts. Via prompting, you can make GPT-3.5 or GPT-4 assume a role, ask it to do zero-shot tasks, make it output creative poems, and provide some examples in the prompt to do few-shot inference without needing to update its weights. It's honestly insane.

At the end of the day, LLMs are still black boxes. You can ask one something and it will answer you, and from there you can infer about its intelligence and its understanding of the topic. It might show empathy and signs of consciousness. Is it exhibiting those tendencies because it has some notion of consciousness stored within its billions of weights, or is it faking it to sound like a human, to please you? Is it smart enough to lie to you? If you give LLMs the power to connect to the internet, which some open-source projects already have, how safe is it when an LLM is out there in the wild interacting with people? This might be the first time in human history when we have encountered a program, or an entity, which might be smarter than a single human. How are we going to understand and learn to trust it? Without all the propaganda and the business and the race for an LLM and all the clickbait, how do we learn to trust this and make it safe for us to use and to shape our future? I'll leave you with that thought today. Hope you enjoyed this video. Your time means a lot to me. You're magnificent.
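To illustrate the few-shot prompting mentioned above, here is a hypothetical sketch (the prompt text and the `generate` stand-in are invented for illustration, not a real API): the task is specified entirely by in-context examples, and the model's weights are never updated.

```python
# A hypothetical few-shot sentiment-classification prompt. The examples
# condition the model at inference time; no fine-tuning is involved.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The movie was a delight from start to finish.
Sentiment: Positive

Review: I walked out halfway through, total waste of time.
Sentiment: Negative

Review: The soundtrack alone made the ticket worth it.
Sentiment:"""

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM completion endpoint."""
    raise NotImplementedError("Plug in your model or API client here.")

# completion = generate(few_shot_prompt)   # expected to continue with " Positive"
```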
Info
Channel: Neural Breakdown with AVB
Views: 23,460
Keywords:
Id: uocYQH0cWTs
Length: 17min 32sec (1052 seconds)
Published: Wed May 10 2023