Hello! If you are here to join our little lesson about the recent advances in the field of Natural Language Processing (or NLP for short), you are exactly in the right place! We will follow the young but explosive life of the Transformer architecture that revolutionized NLP! It all started a few years ago, when the fun in AI was happening in Computer Vision, with the great success of CNNs on the ImageNet challenge (I am talking about you, AlexNet in 2012!), the success of Neural Style Transfer, of Generative Adversarial Networks, and the list goes on and on! Computer Vision was attracting the headlines, and NLP was not getting nearly as much public attention.
Of course, NLP researchers were working hard too. For example, 2013 was the year of Word2Vec, built on the idea that a neural network should be forced to learn similarities between words from the distributions of the words around them. For example, the word "Apple" is often used near the same words as "Microsoft" or "IBM", so it is mapped close to them in the vector space. But "Apple" is also often used near words like "pear" and other fruits, so we see that we cannot disambiguate words contextually with this method: each word gets one single vector, no matter which sense is meant.
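If you want to see what that looks like in code, here is a minimal sketch using gensim and its downloadable, pretrained Google News Word2Vec vectors; the model name and the example queries are our own illustrative choices, not something from this history itself:

```python
# A minimal sketch, assuming gensim is installed and the pretrained
# "word2vec-google-news-300" vectors can be downloaded (roughly 1.6 GB).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # returns KeyedVectors

# One fixed vector per word, learned from co-occurrence statistics:
print(vectors.most_similar("apple", topn=5))     # mixes company and fruit neighbours
print(vectors.similarity("apple", "microsoft"))  # distributional similarity score
print(vectors.similarity("apple", "pear"))

# Note: "apple" has exactly one vector, so the company sense and the
# fruit sense cannot be told apart -- the contextual problem mentioned above.
```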
2014 and 2015 were marked by the rising popularity of Recurrent Neural Networks (RNNs), which made it feasible not only to solve classification tasks better, but also to tackle sequence-to-sequence modelling. An example of generating one sequence from another is Machine Translation. The idea of Recurrent Neural Networks, or RNNs, is, simply put, the following: the same neural network block is applied recurrently to the input sequence. By the time it reaches the end of the sequence, the module has ideally learnt a single vector that captures the meaning of the whole sentence.
And from this single vector one can start to translate, by reversing the process in the decoder.
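As a rough sketch of that encoder idea, here is what it could look like in PyTorch; the GRU cell and the layer sizes are assumptions made just for this illustration:

```python
# A minimal sketch of an RNN encoder, assuming PyTorch is installed.
# The same recurrent block is applied step by step over the input sequence;
# its final hidden state plays the role of the single "sentence vector".
import torch
import torch.nn as nn

embedding_dim, hidden_dim, vocab_size = 64, 128, 10_000

embed = nn.Embedding(vocab_size, embedding_dim)
encoder = nn.GRU(embedding_dim, hidden_dim, batch_first=True)

token_ids = torch.randint(0, vocab_size, (1, 6))   # a toy 6-word "sentence"
outputs, final_state = encoder(embed(token_ids))

print(final_state.shape)  # (1, 1, 128): one vector meant to summarize the whole sentence
# A decoder RNN would then be initialized with this vector to generate the translation.
```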
But the method has one problem: it works well for short sentences like "I am working hard", which can perhaps be captured by a single vector, but not for long sentences like the second one here. By the time the RNN block reaches the end of the sequence, it is so "overwhelmed" by the new input that it has forgotten what the beginning of the sentence was.
2015 and 2016 were the years that took care of this "forgetting" problem by employing the attention mechanism. Attention is nothing more than a way of saying which parts of the sentence should get more importance. In other words, attention tells the RNN what it should focus on and not forget.
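Put very roughly, at each decoding step the model scores every encoder state against the current decoder state and turns those scores into weights. Here is a minimal, self-contained sketch of that idea, using a plain dot-product score rather than the learned projections real models add:

```python
# A minimal sketch of attention weights, assuming only NumPy.
import numpy as np

def attention(decoder_state, encoder_states):
    # Score each encoder position against the current decoder state
    # (a simple dot product; real models add learned projections).
    scores = encoder_states @ decoder_state   # shape: (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax -> one importance weight per word
    context = weights @ encoder_states        # weighted summary the decoder attends to
    return weights, context

encoder_states = np.random.randn(7, 32)       # 7 words, 32-dimensional encoder states
decoder_state = np.random.randn(32)
weights, context = attention(decoder_state, encoder_states)
print(weights.round(2), weights.sum())        # importance per word, summing to 1.0
```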
Yet another problem with Recurrent Neural Networks is that they are… recurrent! Sentences are processed sequentially, word by word, much as humans generally do it. For a machine, however, this means that it needs a lot of time to process huge corpora of text. And then, in 2017: bam!
The long-awaited ImageNet moment finally came for NLP too, with the Transformer architecture! It first crashed the party with the "Attention Is All You Need" paper! As the title says: if RNNs with attention are so slow because of the sequential processing of the RNN, let us just keep the attention and throw away the RNN part! And boom, there it was, the Transformer! In practice it is a little more complicated than that, and we will explain the Transformer architecture in the next video. Here, we put it very simply: Transformers, like RNNs, handle sequence data, but they do not process it one word after the other. This means they can train much faster thanks to parallelization!
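Just to give a flavour of that parallelism, here is a tiny sketch with PyTorch's built-in multi-head self-attention: the whole sentence goes through in a single call, with no word-by-word loop; the sizes are arbitrary assumptions for illustration:

```python
# A minimal sketch, assuming PyTorch is installed.
import torch
import torch.nn as nn

seq_len, d_model = 6, 64
x = torch.randn(1, seq_len, d_model)   # a toy 6-word sentence

self_attention = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

# All positions attend to all positions in one call -- no word-by-word loop,
# which is what makes training so easy to parallelize on modern hardware.
attended, attention_weights = self_attention(x, x, x)
print(attended.shape)            # (1, 6, 64)
print(attention_weights.shape)   # (1, 6, 6): how much each word looks at every other word
```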
Now, focusing less on how it works, let us focus on what the Transformer meant and what came out of the original model: this architecture and its descendants have kept improving the state of the art across the board, in machine translation, sentiment classification, coreference resolution, commonsense reasoning, and so on and so on! It has even been successfully applied to translating from one programming language to another, and to solving symbolic mathematics! And if you want to see the Transformer in action, do not hesitate to play around with demos, like this one here.
Coming back to our history lesson: in 2018, Google researchers developed a bidirectional version of the Transformer and called it "BERT". Yes, you heard right, BERT from Sesame Street. Now let's not get distracted by the weird names researchers choose for their models, and let us focus on the word "bidirectional". Bidirectional means that, as the model trains, information can flow both forwards and backwards over the sentence, so every word is seen in the light of both its left and its right context, which results in better model performance.
Versions of BERT are among the most successful advances in NLP. Like Word2Vec, BERT also maps similar words close together, but it is context sensitive! This means that a different word vector is computed for a word each time it is encountered in a different context. For example, take the two sentences "This game is just not fair" and "I had a great time at this fair": the word "fair" will get two very different word vectors from BERT, whereas Word2Vec would compute exactly the same vector in both cases.
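Here is a minimal sketch of that contrast, using the Hugging Face transformers library with the publicly available bert-base-uncased checkpoint; the checkpoint choice, the helper function, and the cosine-similarity comparison are our own illustrative assumptions:

```python
# A minimal sketch, assuming torch and transformers are installed
# and the "bert-base-uncased" checkpoint can be downloaded.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(word, sentence):
    # Contextual vector of `word`, read off BERT's last hidden layer.
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    position = tokens.index(word)                      # assumes the word is a single wordpiece
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    return hidden[position]

v1 = vector_for("fair", "This game is just not fair")
v2 = vector_for("fair", "I had a great time at this fair")
print(torch.cosine_similarity(v1, v2, dim=0).item())   # clearly below 1.0: two different vectors
```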
And after BERT, everything just escalated for attention-based architectures: in 2019, researchers developed RoBERTa, ERNIE 2.0, XLNet, and that is only to name a very small fraction! All of them improve upon BERT, both conceptually and in performance. And these architectures just keep coming; there is no light at the end of the tunnel yet. Now, to make it even crazier, we
close the circle by going back to the computer vision community, because the
Transformer is also making waves there: having proved so performant on text sequences, the Transformer was also applied to pixel sequences, or as we usually call them, images! And wait, there is even more: if we can
apply Transformers to text or to images, why not apply them to text and images simultaneously? Here we enter the so-called realm of multi-modality, where the goal is to process different input sources at once. So 2019 was also the year of VL-BERT, VisualBERT, ViLBERT, UNITER, and this is just to name a few! All of these papers came out around the same month, August 2019, and three of them on the same day!
What a time to be an NLP researcher! It is hard not to be both overwhelmed and overly excited at the same time! To finally end this crazy enumeration: the Transformer works so well across many tasks, in many cases even on tasks it was not explicitly trained on. So we are now at a point in NLP where we have to investigate whether a neural network trained on task A is also able to more or less solve a task B. For more information on how to "probe" for this, check out our previous video, linked here in the video and in the description below! If you want to know more about
the inner workings of the Transformer, wait until our next video! Thanks for watching, do not forget to
like and subscribe and see you next time!