Hello! If you are here to join our little lesson about the recent advances in the field of Natural Language Processing (or NLP for short), you are exactly in the right place! We will follow the young but explosive life of the Transformer architecture that revolutionized NLP! It all started a few years ago, when the fun in AI was happening in Computer Vision, with the great success of CNNs on the ImageNet challenge (I am talking about you, AlexNet in 2012!), the success of Neural Style Transfer, of Generative Adversarial Networks, and the list goes on and on! Computer Vision was attracting the headlines, and NLP was not getting nearly as much public attention.
Of course, NLP researchers were working hard too. For example, 2013 was the year of Word2Vec, built on the idea that a neural network should be forced to learn similarities between words from the distributions of the words around them. For example, the word "Apple" is often used near the same words as "Microsoft" or "IBM", so it is mapped close to them in the vector space. But "Apple" is also often used near words like "pear" and other fruits, so we see that we cannot disambiguate words contextually with this method: each word gets one single vector, no matter which sense is meant.
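If you want to see what that looks like in code, here is a minimal sketch using gensim and its downloadable, pretrained Google News Word2Vec vectors; the model name and the example queries are our own illustrative choices, not something from this history itself:

```python
# A minimal sketch, assuming gensim is installed and the pretrained
# "word2vec-google-news-300" vectors can be downloaded (roughly 1.6 GB).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # returns KeyedVectors

# One fixed vector per word, learned from co-occurrence statistics:
print(vectors.most_similar("apple", topn=5))     # mixes company and fruit neighbours
print(vectors.similarity("apple", "microsoft"))  # distributional similarity score
print(vectors.similarity("apple", "pear"))

# Note: "apple" has exactly one vector, so the company sense and the
# fruit sense cannot be told apart -- the contextual problem mentioned above.
```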
2014 and 2015 were marked by the rising popularity of Recurrent Neural Networks (RNNs), which made it feasible not only to solve classification tasks better, but also to tackle sequence-to-sequence modelling. An example of generating one sequence from another is Machine Translation. The idea of Recurrent Neural Networks, or RNNs, is, simply put, the following: the same neural network block is applied recurrently to the input sequence. By the time it reaches the end of the sequence, the module has ideally learnt a single vector that captures the meaning of the whole sentence.
And from this single vector one can start to translate, by reversing the process in the decoder.
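As a rough sketch of that encoder idea, here is what it could look like in PyTorch; the GRU cell and the layer sizes are assumptions made just for this illustration:

```python
# A minimal sketch of an RNN encoder, assuming PyTorch is installed.
# The same recurrent block is applied step by step over the input sequence;
# its final hidden state plays the role of the single "sentence vector".
import torch
import torch.nn as nn

embedding_dim, hidden_dim, vocab_size = 64, 128, 10_000

embed = nn.Embedding(vocab_size, embedding_dim)
encoder = nn.GRU(embedding_dim, hidden_dim, batch_first=True)

token_ids = torch.randint(0, vocab_size, (1, 6))   # a toy 6-word "sentence"
outputs, final_state = encoder(embed(token_ids))

print(final_state.shape)  # (1, 1, 128): one vector meant to summarize the whole sentence
# A decoder RNN would then be initialized with this vector to generate the translation.
```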
But the method has one problem: it works well for short sentences like "I am working hard", which can perhaps be captured by a single vector, but not for long sentences like the second one here. By the time the RNN block reaches the end of the sequence, it is so "overwhelmed" by the new input that it has forgotten what the beginning of the sentence was.
2015 and 2016 were the years that took care of this "forgetting" problem by employing the attention mechanism. Attention is nothing more than a way of saying which parts of the sentence should get more importance. In other words, attention tells the RNN what it should focus on and not forget.
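Put very roughly, at each decoding step the model scores every encoder state against the current decoder state and turns those scores into weights. Here is a minimal, self-contained sketch of that idea, using a plain dot-product score rather than the learned projections real models add:

```python
# A minimal sketch of attention weights, assuming only NumPy.
import numpy as np

def attention(decoder_state, encoder_states):
    # Score each encoder position against the current decoder state
    # (a simple dot product; real models add learned projections).
    scores = encoder_states @ decoder_state   # shape: (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax -> one importance weight per word
    context = weights @ encoder_states        # weighted summary the decoder attends to
    return weights, context

encoder_states = np.random.randn(7, 32)       # 7 words, 32-dimensional encoder states
decoder_state = np.random.randn(32)
weights, context = attention(decoder_state, encoder_states)
print(weights.round(2), weights.sum())        # importance per word, summing to 1.0
```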
Yet another problem with Recurrent Neural Networks is that they are… recurrent! Sentences are processed sequentially, word by word, much as humans generally do it. For a machine, however, this means that it needs a lot of time to process huge corpora of text. And then, in 2017: bam!
The long-awaited ImageNet moment finally came for NLP too, with the Transformer architecture! It first crashed the party with the "Attention Is All You Need" paper! As the title says: if RNNs with attention are so slow because of the sequential processing of the RNN, let us just keep the attention and throw away the RNN part! And boom, there it was, the Transformer! In practice it is a little more complicated than that, and we will explain the Transformer architecture in the next video. Here, we put it very simply: Transformers, like RNNs, handle sequence data, but they do not process it one word after the other. This means they can train much faster thanks to parallelization!
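Just to give a flavour of that parallelism, here is a tiny sketch with PyTorch's built-in multi-head self-attention: the whole sentence goes through in a single call, with no word-by-word loop; the sizes are arbitrary assumptions for illustration:

```python
# A minimal sketch, assuming PyTorch is installed.
import torch
import torch.nn as nn

seq_len, d_model = 6, 64
x = torch.randn(1, seq_len, d_model)   # a toy 6-word sentence

self_attention = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

# All positions attend to all positions in one call -- no word-by-word loop,
# which is what makes training so easy to parallelize on modern hardware.
attended, attention_weights = self_attention(x, x, x)
print(attended.shape)            # (1, 6, 64)
print(attention_weights.shape)   # (1, 6, 6): how much each word looks at every other word
```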
Now, focusing less on how it works, let us focus on what the Transformer meant and what came out of the original model: this architecture and its descendants have kept improving the state of the art across the board, in machine translation, sentiment classification, coreference resolution, commonsense reasoning, and so on and so on! It has even been successfully applied to translating from one programming language to another, and to solving symbolic mathematics! And if you want to see the Transformer in action, do not hesitate to play around with demos, like this one here.
Coming back to our history lesson: in 2018, Google researchers developed a bidirectional version of the Transformer and called it "BERT". Yes, you heard right, BERT from Sesame Street. Now let's not get distracted by the weird names researchers choose for their models, and let us focus on the word "bidirectional". Bidirectional means that, as the model trains, information can flow both forwards and backwards over the sentence, so every word is seen in the light of both its left and its right context, which results in better model performance.
Versions of BERT are among the most successful advances in NLP. Like Word2Vec, BERT also maps similar words close together, but it is context sensitive! This means that a different word vector is computed for a word each time it is encountered in a different context. For example, take the two sentences "This game is just not fair" and "I had a great time at this fair": the word "fair" will get two very different word vectors from BERT, whereas Word2Vec would compute exactly the same vector in both cases.
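Here is a minimal sketch of that contrast, using the Hugging Face transformers library with the publicly available bert-base-uncased checkpoint; the checkpoint choice, the helper function, and the cosine-similarity comparison are our own illustrative assumptions:

```python
# A minimal sketch, assuming torch and transformers are installed
# and the "bert-base-uncased" checkpoint can be downloaded.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(word, sentence):
    # Contextual vector of `word`, read off BERT's last hidden layer.
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    position = tokens.index(word)                      # assumes the word is a single wordpiece
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    return hidden[position]

v1 = vector_for("fair", "This game is just not fair")
v2 = vector_for("fair", "I had a great time at this fair")
print(torch.cosine_similarity(v1, v2, dim=0).item())   # clearly below 1.0: two different vectors
```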
And after BERT, everything just escalated for attention-based architectures: in 2019, researchers developed RoBERTa, ERNIE 2.0, XLNet, and that is only to name a very small fraction! All of them improve upon BERT, both conceptually and in performance. And these architectures just keep coming; there is no light at the end of the tunnel yet. Now, to make it even crazier, we
close the circle by going back to the computer vision community, because the
Transformer is also making waves there: having proved so performant on text sequences, the Transformer was also applied to pixel sequences, or as we usually call them, images! And wait, there is even more: if we can
apply Transformers to text or to images, why not apply them to text and images simultaneously? Here we enter the so-called realm of multi-modality, where the goal is to process different input sources at once. So 2019 was also the year of VL-BERT, VisualBERT, ViLBERT, UNITER, and this is just to name a few! All of these papers came out around the same month, August 2019, and three of them on the same day!
What a time to be an NLP researcher! It is hard not to be both overwhelmed and overly excited at the same time! To finally end this crazy enumeration: the Transformer works so well across many tasks, in many cases even on tasks it was not explicitly trained on. So we are now at a point in NLP where we have to investigate whether a neural network trained on task A is also able to more or less solve a task B. For more information on how to "probe" for this, check out our previous video, linked here in the video and in the description below! If you want to know more about
the inner workings of the Transformer, wait until our next video! Thanks for watching, do not forget to
like and subscribe and see you next time!