A brief history of the Transformer architecture in NLP

Captions
Hello! If you are here to join our little lesson about the recent advances in the Natural Language Processing (or NLP for short) field, you are exactly at the right place! We will follow the young but explosive life of the Transformer architecture that revolutionized the AI field of NLP!

It all started a few years ago, when the fun in AI was happening in Computer Vision, with the great success of CNNs on the ImageNet challenge (I am talking about you, AlexNet in 2012!), the success of Neural Style Transfer and of Generative Adversarial Networks, and the list goes on and on! Computer Vision was attracting headlines, and NLP was not so much in the public’s attention.

Of course, NLP researchers were working hard too. For example, 2013 was the year of Word2Vec, based on the idea that neural networks should be forced to learn similarities between words from word distributions. For example, the word “Apple” is often used near the same words as “Microsoft” or “IBM”, so it is mapped close to these in the vector space. But “Apple” is also often used with words like “pear” or other fruits, so we see that we cannot disambiguate words contextually with this method.

2014 and 2015 were marked by the rising popularity of the Recurrent Neural Network (RNN), which made it feasible not only to solve classification tasks better, but also tasks for sequence-to-sequence modelling. An example of generating a sequence from a sequence is Machine Translation. The idea of Recurrent Neural Networks, or RNNs, is, simply put, the following: the same neural network block is applied recurrently to the input sequence. Reaching the end of the sequence, the module ideally has learnt one single vector that captures the meaning of the whole sentence. And from this single vector, one can start to translate by reverting the process in the decoder.

But one problem of the method is: it works well for short sentences like “I am working hard” that can perhaps be captured by a single vector, but not for sentences like the second one. The RNN block, reaching the end of the sequence, is so “overwhelmed” by the new input that it forgets what the beginning of the sentence was.

2015 and 2016 were the years that took care of this “forgetting” problem by employing the attention mechanism. This attention is nothing more than a way to say which part of the sentence should get more importance. In other words, attention tells the RNN what it should focus on and not forget.

Yet another problem with Recurrent Neural Networks is that they are… recurrent! Unfortunately, sentences are processed sequentially, word by word, like humans generally do. For a machine, however, this means that it needs a lot of time to process huge corpora of text.

And here is where 2017 goes bam! The long-awaited ImageNet moment finally came for NLP too, with the Transformer architecture! It crashed the party with the “Attention Is All You Need” paper! Like the title says: if RNNs with attention are so slow because of the sequential processing of RNNs, let us just use only the attention and throw away the RNN part! And boom, here it was, the Transformer!

In practice this is a little more complicated than that, and we will explain the Transformer architecture in the next video. Here, we put it very simply: Transformers, like RNNs, handle sequence data, but not necessarily in order. This means that they can train much faster with parallelization!
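A minimal sketch (not from the video, added for illustration): the “attention” in the paper’s title can be written down in a few lines. Below is scaled dot-product self-attention in NumPy; the toy vectors and sizes are made up purely to show the shape of the computation.

```python
# Minimal sketch of scaled dot-product self-attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d) arrays. Returns attended values and attention weights."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # similarity of every position to every other
    weights = softmax(scores, axis=-1)     # each row sums to 1: "importance" over the sentence
    return weights @ V, weights

# Toy example: a "sentence" of 4 words, each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
output, weights = scaled_dot_product_attention(X, X, X)   # self-attention: Q = K = V = X
print(weights.round(2))   # row i shows which words word i focuses on
```

Because the weight matrix comes from plain matrix products, all positions are handled at once; that is the parallelization the captions mention, in contrast to an RNN’s word-by-word loop.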
Now, focusing less on how it works, we focus on what the Transformer means and what came out of the original Transformer. This architecture and its descendants proved to improve the state of the art across the board: in machine translation, sentiment classification, coreference resolution, commonsense reasoning, and so on and so on! It has even been successfully applied to translating from one programming language to another, or to solving symbolic mathematics! And if you want to see the Transformer in action, do not hesitate to play around with demos, like this one here.

Coming back to our history lesson: in 2018, Google researchers developed a bidirectional version of the Transformer and called it “BERT”. Yes, you heard right, BERT from Sesame Street. Now let’s not get distracted by the weird names researchers choose for their models and let us focus on the word “bidirectional” regarding the Transformer. Bidirectional means that it allows the information to flow forwards and backwards as the model trains, which results in better model performance. Versions of BERT are among the most successful advances in NLP.

Like Word2Vec, BERT also maps similar words closely, but it is context sensitive! This means that a different word vector is computed for a word if it is encountered in different contexts. For example, if we have two sentences like “This game is just not fair” and “I had a great time at this fair”, the word “fair” will have very different word vectors with BERT, but equal-valued word vectors would be computed by Word2Vec.

And after BERT, everything just escalated for attention-based architectures: in 2019, researchers developed RoBERTa, ERNIE 2.0, XLNet, and that is only to name a very small fraction! All improve upon BERT both conceptually and in performance. But these architectures just keep coming; there is no light at the end of the tunnel yet.

And now, to make it even crazier, we close the circle by going back to the computer vision community, because the Transformer is also making waves there: the Transformer proved to be so performant on text sequences that it was also applied to pixel sequences, or as we usually call them, images!

And wait, there is even more: if we can apply Transformers to text or images, why not apply them to text and images simultaneously? Here we are in the so-called realm of multi-modality, where the goal is to process different input sources at once. So 2019 was also the year of VL-BERT, VisualBERT, ViLBERT, UNITER, and this is just to name a few! All these papers came out around the same month of August 2019, three of them on the same day! What a time to be an NLP researcher! Hard not to be both overwhelmed and overly excited at the same time!

To finally end this crazy enumeration: the Transformer works so well on multiple tasks, in many cases even though it was not explicitly trained on the task of interest. So we are now at the point in NLP where we have to investigate whether a neural network trained on task A is also able to more or less solve a task B. For more information on how to “probe” for this, check out our previous video, linked here in the video and in the description below!

If you want to know more about the inner workings of the Transformer, wait until our next video! Thanks for watching, do not forget to like and subscribe, and see you next time!
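A hedged sketch (not from the video): the context sensitivity described in the captions can be checked directly. The snippet below assumes the Hugging Face transformers library and PyTorch are installed, and uses the bert-base-uncased checkpoint as an example; it extracts BERT’s vector for “fair” in the two example sentences and compares them.

```python
# Hedged sketch: BERT gives the word "fair" different vectors in different contexts.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    """Return BERT's last-layer vector for the first occurrence of `word`."""
    enc = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    idx = tokens.index(word)                      # "fair" is a single WordPiece token here
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state   # shape: (1, seq_len, 768)
    return hidden[0, idx]

v1 = vector_for("This game is just not fair", "fair")
v2 = vector_for("I had a great time at this fair", "fair")
print(torch.cosine_similarity(v1, v2, dim=0).item())  # below 1.0: context changes the vector
```

A static Word2Vec model, by contrast, stores one fixed vector per vocabulary word, so the same lookup would return identical vectors for both sentences.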
Info
Channel: AI Coffee Break with Letitia
Views: 5,390
Rating: 4.9725084 out of 5
Keywords: Neural networks, transformer, NLP, AI, probing, learning, machine learning, easy, explained, comprehensible, research, teaching, attention is all you need, word2vec, Vaswani, history, machine translation, translation, BERT, ViLBERT, multimodal, multi-modal, VL-BERT, UNITER, Image Transformer, Symbolic Mathematics, deep learning, ImageNet, SOTA, attention mechanism, self-attention, language model, basics, algorithm, beginner, short, example, for beginners, animated, animation
Id: iH-wmtxHunk
Length: 8min 23sec (503 seconds)
Published: Fri Jun 12 2020