Confused which Transformer Architecture to use? BERT, GPT-3, T5, Chat GPT? Encoder Decoder Explained

Video Statistics and Information

Reddit Comments

Useful video. I have subscribed to your YouTube channel. Keep up the good work buddy

👍︎ 1  👤︎ u/Advanced-Hedgehog-95  📅︎ Jan 14 2023  🗫︎ replies
Captions
Hello guys, welcome to Datafuse Analytics, where concepts are simplified and explained intuitively. In this video we will be studying the entire tree of Transformers, and I will introduce you to the important Transformer architectures that are available. As you all know, there are three important architecture families for Transformer models, namely encoder, decoder, and encoder-decoder architectures.

The initial success of the early Transformer models spurred an explosion in model development. In this explosion, researchers started creating models using a variety of datasets of varying sizes and types, adopting new pre-training objectives, and modifying the model architectures to further boost performance on different tasks. Although this family of models is still growing at a rapid pace, they can still be divided into the categories discussed above, namely encoder, decoder, and encoder-decoder architectures. So far there are more than 50 different architectures available in Hugging Face Transformers, and in this video I will cover a few of the important milestones.

Let's start with the encoder branch. Do you know the first encoder-only model based on the Transformer was BERT? When the BERT paper was published, it outperformed the existing state-of-the-art (SOTA) models on benchmarks such as GLUE. Various NLU (natural language understanding) challenges, like text classification and named entity recognition, can be solved using encoder-only models. Now let's look at the famous and important encoder-only models.

The first model that we will discuss is BERT, which stands for Bidirectional Encoder Representations from Transformers. BERT is pre-trained with the following two training objectives. The first objective is predicting masked tokens in the text, which is called masked language modeling (MLM). The second pre-training objective is determining whether one text passage is likely to follow another; this is called next sentence prediction (NSP).
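To make the MLM objective concrete, here is a minimal sketch using the Hugging Face `transformers` library; it assumes the library, a PyTorch backend, and the `bert-base-uncased` checkpoint are available.

```python
# Minimal sketch of masked language modeling (MLM): BERT fills in the [MASK]
# token using context from both directions. Assumes `transformers` and a
# backend such as PyTorch are installed and `bert-base-uncased` can be downloaded.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Transformers are a [MASK] architecture for NLP."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```

Each prediction is a candidate token for the masked position together with its probability, which is exactly what the MLM pre-training objective asks the model to estimate.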
The next encoder-only model is DistilBERT. As discussed, BERT outperformed many state-of-the-art architectures, but the industry needed a lightweight version of BERT to deploy in production environments. This gave rise to DistilBERT, a distilled version of BERT trained using the knowledge distillation technique. DistilBERT is a whopping 60 percent faster than BERT and its memory footprint is 40 percent smaller, and it achieves this while maintaining 97 percent of BERT's performance in terms of accuracy.

The next encoder-only architecture is RoBERTa, which stands for Robustly Optimized BERT Pre-training Approach. RoBERTa aims to improve BERT's performance by slightly modifying the pre-training scheme: it is trained on longer sequences and on more training data than BERT, and it drops BERT's NSP (next sentence prediction) objective. These two changes help RoBERTa improve its performance compared with BERT.

The next encoder-only architecture is XLM. XLM is an improved version of BERT that is capable of performing cross-lingual tasks like text classification and machine translation. XLM learns to map words from different languages by using byte pair encoding (BPE) and a dual-language training mechanism. XLM also introduces an additional training objective called translation language modeling (TLM), which can be viewed as an extension of the MLM objective we discussed for BERT. The XLM model achieved state-of-the-art results on multilingual NLU benchmarks as well as on translation tasks.

The next encoder-only architecture is XLM-RoBERTa, also called XLM-R. This is an extension of XLM that incorporates massive amounts of training data: the Common Crawl corpus is used to train XLM-R. Since this corpus does not contain parallel text, the TLM objective used in XLM was removed. An interesting point is that XLM-R beats XLM, and even BERT, by a huge margin on different tasks, especially those involving low-resource languages.

The next encoder-only architecture is ALBERT. ALBERT is an efficient Transformer architecture, and the following three modifications make it efficient. First, the token embedding dimension is decoupled from the hidden dimension, which lets the embedding dimension stay small even when the vocabulary is large and thereby saves model parameters. Second, all the layers share parameters, which decreases the total number of effective parameters. Third, the NSP objective is replaced with sentence ordering prediction, in which the model predicts whether the order of two consecutive sequences was swapped or not. These three changes let ALBERT train larger models with fewer parameters efficiently.

The next encoder-only architecture is ELECTRA. One of the major limitations of the masked language modeling (MLM) objective is that only the masked tokens are updated at each step while the other input tokens are left as they are. ELECTRA solves this issue by using a two-model approach: model one is a regular MLM and tries to predict the masked tokens, whereas model two acts as a discriminator whose aim is to predict which of the tokens in the first model's output were originally masked.

The final encoder-only architecture that we will discuss is DeBERTa. DeBERTa was the first model to beat the human baseline on the SuperGLUE benchmark. For people who do not know, SuperGLUE is a more difficult version of GLUE consisting of several subtasks used to measure NLU (natural language understanding) performance. DeBERTa makes two major architectural changes. The first is that each token is represented as two vectors, one for the content and another for the relative position; this allows the self-attention layers to better model the dependency of nearby token pairs. The second is that DeBERTa uses relative position representations, achieved by modifying the attention mechanism itself with a few additional terms and parameters.
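To see the parameter savings that ALBERT achieves through the changes described above, a quick sketch is simply to count parameters for comparable checkpoints; it assumes `transformers` and PyTorch are installed and the models can be downloaded from the Hub.

```python
# Rough sketch: compare the parameter count of a BERT checkpoint with an ALBERT
# checkpoint of similar depth and hidden size. Assumes `transformers` and
# PyTorch are installed and the checkpoints can be downloaded.
from transformers import AutoModel

def count_parameters(checkpoint: str) -> int:
    model = AutoModel.from_pretrained(checkpoint)
    return sum(p.numel() for p in model.parameters())

# ALBERT's factorized embeddings and cross-layer parameter sharing shrink the
# parameter count dramatically relative to BERT-base.
print("bert-base-uncased:", count_parameters("bert-base-uncased"))
print("albert-base-v2:   ", count_parameters("albert-base-v2"))
```

The ALBERT-base checkpoint reports roughly 12 million parameters against roughly 110 million for BERT-base, which is the practical effect of the embedding factorization and layer-wise parameter sharing discussed above.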
Now that we have gone through the important encoder-only architectures, let's look at the decoder-only architectures. The development of Transformer decoder models has mostly been driven by OpenAI. Because these models are so accurate at predicting the next word in a sequence, they are mostly used for text generation tasks. Let's examine the development of these interesting text generation models.

The first decoder model that we will talk about is GPT, which stands for Generative Pre-trained Transformer. GPT is pre-trained by predicting the next word based on the previous ones. It was trained on the BookCorpus and achieved great results on downstream tasks such as classification.

The next decoder model is CTRL, which stands for Conditional Transformer Language. We all know that GPT can auto-complete a given input prompt, but a major limitation is that the user has very little control over the style of the generated text. The CTRL model addresses this issue by introducing control tokens at the beginning of the sequence, which allow the user to steer the generation and produce more diverse text.

The next decoder-only model we will look into is GPT-2. You all might have heard about GPT-2: it is inspired by its predecessor GPT, and scaling up the model and the training data gave birth to GPT-2. The highlight of GPT-2 is that it can produce long, coherent text. GPT-2 was released in a stage-wise fashion due to concerns about its misuse; smaller models were published first and the full model was published later.

The next decoder model is GPT-3. GPT and GPT-2 were a huge success in the text generation domain, and an analysis was conducted relating compute, dataset size, model size, and language model performance. The result of this analysis was scaling GPT-2 up by roughly 100 times to yield GPT-3, which has a whopping 175 billion parameters. This model has excellent text generation capabilities, but the highlight of GPT-3 is its few-shot learning capability: GPT-3 is able to solve novel tasks with very few input examples. OpenAI has not open-sourced this model, but GPT-3 can be accessed via an interface provided by OpenAI.

The final decoder-only models that we will look into are GPT-Neo and GPT-J-6B. These are GPT-like models trained by EleutherAI. They are smaller than GPT-3, with about 1.3, 2.7, and 6 billion parameters.
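Since all of these decoder models share the same next-word-prediction objective, a minimal sketch of that behaviour with the openly available GPT-2 checkpoint (assuming `transformers` and PyTorch are installed) looks like this:

```python
# Minimal sketch of decoder-style generation: the model repeatedly predicts the
# next token given everything generated so far. Assumes `transformers` and
# PyTorch are installed and the `gpt2` checkpoint can be downloaded.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

result = generator(
    "The Transformer architecture is",
    max_new_tokens=30,        # how many tokens to append to the prompt
    num_return_sequences=1,
)
print(result[0]["generated_text"])
```

Sampling settings such as temperature or top-k can be passed in the same call to trade coherence against diversity in the generated text.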
The last branch in the Transformer tree is the encoder-decoder branch. Let's take a look at some of the famous models.

The first encoder-decoder architecture that we will see is T5, which stands for Text-to-Text Transfer Transformer. As the full form suggests, the T5 model unifies NLU and NLG tasks by converting them into text-to-text tasks: for text classification, the T5 encoder takes the input text and the decoder generates the label as a prediction. T5 is pre-trained on the Colossal Clean Crawled Corpus (C4), a cleaned version of the Common Crawl web corpus, and its training objective is MLM. T5 has several variants: T5-small with about 60 million parameters, T5-base with 220 million, T5-large with 770 million, T5-3B with 3 billion, and T5-11B with a whopping 11 billion parameters.

The next encoder-decoder architecture we will look into is BART, which stands for Bidirectional and Auto-Regressive Transformers. BART combines the pre-training objectives of BERT and GPT within the encoder-decoder architecture. The input sentence undergoes one of several transformations, such as simple masking, sentence permutation, token deletion, or document rotation; these transformations distort the input, and the decoder then tries to reconstruct the original sentence. This pre-training objective makes BART flexible and well suited for both NLG and NLU tasks.

Then we have M2M-100 as another encoder-decoder architecture. M2M-100 is the first model able to translate between 100 languages. While most translation models deal with only one language pair per translation, M2M-100 leverages the information and patterns of multiple languages to translate across 100 different languages. This model uses a prefix token, similar to the CLS token, to indicate the source and target language.

Then we have BigBird. The maximum context size in Transformer models is limited because the attention mechanism's memory requirements grow quadratically with sequence length. The BigBird model solves this memory challenge by using a sparse form of attention that scales linearly, which allows the context to grow from the 512 tokens of most BERT models to 4,096 tokens in BigBird. This model is heavily used in tasks like text summarization due to its ability to model long-range dependencies.

So guys, that is all concerning the major and important Transformer architectures, or what I call the Transformer tree. It should be noted that all the models discussed in this video are available on the Hugging Face Hub and can be fine-tuned for the problem statement we want to solve. If you liked this short video on the Transformer tree, please give it a like, share it with your friends, and subscribe to this channel. Thank you.
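Since every model in this Transformer tree is published on the Hugging Face Hub, a minimal sketch of pulling one down for fine-tuning looks like the following; the checkpoint name and the number of labels are illustrative placeholders, and `transformers` plus PyTorch are assumed to be installed.

```python
# Minimal sketch: load a Hub checkpoint with a task head for fine-tuning.
# The checkpoint name and num_labels below are illustrative placeholders.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased"   # any encoder checkpoint from the Hub
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tokenize a toy batch; in a real project these tensors would feed a training
# loop or the Trainer API for the downstream task.
batch = tokenizer(["a sample sentence", "another sample"], padding=True, return_tensors="pt")
outputs = model(**batch)
print(outputs.logits.shape)  # (batch_size, num_labels)
```

Swapping the checkpoint name for any of the architectures mentioned in the video (RoBERTa, ALBERT, DeBERTa, and so on) is usually the only change needed, which is the practical payoff of the shared Auto* loading API.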
Info
Channel: Datafuse Analytics
Views: 2,245
Keywords: deep learning, machine learning, transformers, attention is all you need, attention mechanism, neural networks, recurrent neural networks, Data Science, Artificial Intelligence, attention, attention neural networks, transformer neural networks, most important paper in deep learning
Id: wuj8Hao1TT4
Length: 15min 30sec (930 seconds)
Published: Sun Jan 08 2023