Given a long document to read, our natural preference is to not read it, or at least to read just the main points. So rather than reading the full document, it would be great if we could have a summary. That will save us time and brain processing power. But of course, summarization can also be used in other applications. In a newsletter, we can automatically provide a summary for each article, rather than just its first X words, so that the snippet better represents the article. Or, given the transcript of a speaker, we can automatically generate a summary of the talk. These are useful applications made possible by summarization models, and in this video we will look at how Pegasus, a model by Google, can help us perform abstractive summarization. Before we start, there is a concept that we need
to clarify. You have probably noticed that the title of this video mentioned abstractive
summarization and not just summarization. Well, that's because there's another kind of
summarization, called extractive summarization, which literally extracts the important sentences from the article and combines them to form a summary. This is not something we will focus on now, although the concept will come up again later in the video. The type of summarization that we are focusing
on is abstractive summarization, which involves paraphrasing words and thus can potentially give
a more coherent and polished summary. This is obviously not easy to do, but let's start off from
a simple base. Essentially, what we want to achieve is to feed some text into a model and obtain a corresponding summary. The mainstream approach now is to use an encoder-decoder model, which is a form of sequence-to-sequence learning. That is to say, we are learning to convert one sequence of words into a different sequence of words, or in short, seq2seq learning. In such a model, the encoder first takes into consideration the context of the whole input text and encodes it into something called a context vector, which is basically a numerical representation of the input text. This numerical representation is then fed to the decoder, whose job is to decode the context vector and produce the summary, so there is a clean separation of tasks between the encoder and the decoder.
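To make this encoder-decoder split concrete, here is a minimal sketch using the Hugging Face transformers library (with sentencepiece installed); the publicly released "google/pegasus-xsum" checkpoint and the stand-in text are just illustrative choices, not a prescription.

```python
# A minimal sketch of the encoder-decoder flow, assuming the Hugging Face
# transformers and sentencepiece packages and the publicly released
# "google/pegasus-xsum" checkpoint (both are illustrative choices).
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

text = "Replace this with the long document you want to summarize."
inputs = tokenizer(text, truncation=True, return_tensors="pt")

# The encoder turns the input tokens into a numerical representation
# (one vector per token), i.e. the context that the decoder reads from.
encoder_outputs = model.get_encoder()(**inputs)
print(encoder_outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)

# generate() runs the full encode-then-decode loop and returns summary token ids.
summary_ids = model.generate(**inputs, num_beams=4, max_length=64)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

In practice you would rarely call the encoder yourself; generate() handles the whole encode-then-decode loop for you.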
Nowadays, state-of-the-art models also tend to use the transformer architecture, and if you would like to find out more about what a transformer is, I strongly encourage you to read the article titled "The Illustrated Transformer" by Jay Alammar; I have placed the link in the description box. Back to the model: what we have seen here is a high-level overview of the inference stage. During model
training, there are some nuances to take note of. First, for the training data, besides the
document text we also need a corresponding, good-quality summary for the model to learn to output. And second, during training, the reference summary itself is fed to the decoder so that the model can learn how to produce good summaries.
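As a rough illustration of this second point (not the Pegasus developers' actual training code), the Hugging Face API lets us pass the reference summary as labels: the model then feeds it to the decoder and returns the loss used for learning. The checkpoint and texts below are placeholders.

```python
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

model_name = "google/pegasus-xsum"  # placeholder checkpoint
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

document = "The long document text goes here ..."
reference_summary = "A good-quality, human-written summary of the document."

inputs = tokenizer(document, truncation=True, return_tensors="pt")
labels = tokenizer(reference_summary, truncation=True, return_tensors="pt").input_ids

# Passing labels makes the model feed the reference summary to the decoder
# (shifted by one position) and return the cross-entropy loss.
outputs = model(**inputs, labels=labels)
print(outputs.loss)    # an optimizer step on this loss would follow in training
outputs.loss.backward()
```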
Here comes one of our main challenges: collecting the training data. It is not so much the documents themselves that are the problem, although collecting those is not trivial either. The bigger problem is the reference summaries, which are hard to collect in sufficient quantities. The fact is, transformer models are quite data hungry and need a lot of data to train. How much is a lot? Hmm... well, for another model that also uses the transformer architecture, the smallest number of documents used for training was 90,000. Obviously, that is a lot of data for you and me to collect in order to reach state-of-the-art results. The million-dollar question is therefore: how can we get close to the state of the art without breaking our backs collecting that much data? And Google's answer to this question is... a heck of a lot more data. Well, okay, not really. Google's plan is to perform
pre-training of the model on a super large corpus: 350 million web pages and 1.5 billion news articles. With this pre-trained model, we can then perform fine-tuning on the actual data, and hopefully we will not need as much data for the fine-tuning task. For those of you who have heard of BERT and GPT, this idea of
pre-training probably sounds very familiar. But the question explored by Google's Pegasus
is what kind of pre-training specifically works well for abstractive summarization. Their hypothesis is that pre-training the model to output important sentences is suitable, as it closely resembles what abstractive summarization needs to do. How, then, can we automatically identify these important sentences? Recalling our example from the start of the video, this means we need to automatically extract the sentence in blue. To do this, the developers of Pegasus suggest using a metric called ROUGE. This metric is normally used for evaluating automatic summarization, and we can compute it for each sentence against the rest of the document and select the top few sentences with the highest scores. In our example, assuming we only want to select one sentence, we will take the sentence with the highest score as our target output for pre-training. The other thing we need to do is to mask the selected sentence in the original document, and this masked document then forms our input text. In this way, we can prepare millions or even billions of training examples for our pre-training.
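Here is a rough sketch of that data-preparation step (not the official Pegasus code): score each sentence with ROUGE-1 against the rest of the document using the rouge_score package, keep the top-scoring sentence as the target, and mask it in the input. The mask token and the toy document are illustrative.

```python
from rouge_score import rouge_scorer

MASK_TOKEN = "<mask_1>"  # Pegasus-style sentence-mask token (illustrative)

def make_pretraining_example(sentences):
    """Select the most 'important' sentence via ROUGE-1 and mask it."""
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    scores = []
    for i, sentence in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scores.append(scorer.score(rest, sentence)["rouge1"].fmeasure)

    best = max(range(len(sentences)), key=lambda i: scores[i])
    target = sentences[best]  # the pseudo-summary the model must generate
    masked_doc = " ".join(MASK_TOKEN if i == best else s
                          for i, s in enumerate(sentences))
    return masked_doc, target

doc = [
    "Pegasus is a model for abstractive summarization.",
    "It is pre-trained by masking important sentences and generating them.",
    "Completely unrelated filler text about the weather.",
]
model_input, target_summary = make_pretraining_example(doc)
print(model_input)
print(target_summary)
```

Run over a huge corpus, this kind of procedure yields the input/target pairs used for pre-training.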
Luckily for us, this has already been done by the Pegasus developers, and we get to enjoy the fruits of their labor: the pre-trained model. We can now fine-tune this pre-trained model on our own data to adapt it to our use cases, and this can be done without breaking our backs while still obtaining close to state-of-the-art results. Let's take a look at a table to find out how much data we now need for fine-tuning. The developers of Pegasus tested the model against various datasets. For the first dataset, the previous state-of-the-art ROUGE score reported is 45. Varying the number of documents used for fine-tuning, we observe that as we increase the number to 10,000 documents, we are able to get a score quite close to the state of the art. In fact, with one thousand examples, we already have quite a good score. If we make the same comparison for many other datasets, we arrive at the same conclusion: with just 1,000 documents, we can already obtain quite comparable results. In fact, for some of the datasets, the score is even higher than the previous state of the art. This is absolutely fantastic, because now, instead of tens of thousands of training examples, we can effectively train a model close to the state of the art with just one thousand. To use an analogy, we do not
need to work like a horse and can instead ride on Pegasus to achieve great
results for abstractive summarization. Excellent! That is all good, but a practical
question is how do we code this? Well, thankfully we have Hugging Face, which provides an
NLP library for easy implementation of the model. You can visit Hugging Face's website to view
their documentation for the Pegasus model and they have also provided an example script to perform
summarization of a document. As for fine-tuning, they have a general guide that shows us how to do so, and we have adapted their example to suit Pegasus.
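Here is a simplified sketch of what such fine-tuning can look like (not our exact script); the checkpoint, dataset, and hyperparameters below are placeholders you would swap for your own.

```python
from datasets import load_dataset
from transformers import (PegasusTokenizer, PegasusForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainingArguments,
                          Seq2SeqTrainer)

model_name = "google/pegasus-large"  # placeholder checkpoint
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

# Placeholder dataset with "document" and "summary" columns; roughly 1,000
# examples, in line with the numbers discussed above.
dataset = load_dataset("xsum", split="train[:1000]")

def preprocess(batch):
    model_inputs = tokenizer(batch["document"], truncation=True, max_length=512)
    labels = tokenizer(batch["summary"], truncation=True, max_length=64)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="pegasus-finetuned",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    learning_rate=5e-5,
    logging_steps=50,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
model.save_pretrained("pegasus-finetuned")
```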
Check out our GitHub script via the link in the description box. And so, in summary, this is how Google's Pegasus helps us perform abstractive summarization. We hope this video has helped you, and please do subscribe to our channel, AI Tapas, if you would like more bite-sized feeds on AI like this video. Thank you and see you again!