Given a long document to read, our natural preference is to not read it, or at least to read just the main points. So rather than reading the full document, it would be great if we could have a summary. That will save us time and brain processing power. But of course, summarization can also be used in other applications. In a newsletter, we can automatically provide a summary for each article, rather than just its first X words, so that the snippet better represents the article. Or, given the transcript of a speaker, we can automatically generate a summary of the talk. These are useful applications made possible by summarization models, and in this video we will look at how Pegasus, a model by Google, can help us perform abstractive summarization. Before we start, there is a concept that we need
to clarify. You have probably noticed that the title of this video mentioned abstractive
summarization and not just summarization. Well, that's because there's another kind of
summarization, called extractive summarization, which literally extracts the important sentences from the article and combines them to form a summary. This is not something we will focus on now, although the concept will come up again later in the video. The type of summarization that we are focusing
on is abstractive summarization, which involves paraphrasing words and thus can potentially give
a more coherent and polished summary. This is obviously not easy to do, but let's start off from
a simple base. Essentially, what we want to achieve is to feed some text into a model and obtain a corresponding summary. The mainstream approach now is to use an encoder-decoder model, which is a form of sequence-to-sequence learning. That is to say, we are learning to convert one sequence of words into a different sequence of words, or in short, seq2seq learning. In such a model, the encoder first takes into consideration the context of the whole input text and encodes it into something called a context vector, which is basically a numerical representation of the input text. This numerical representation is then fed to the decoder, whose job is to decode the context vector and produce the summary, so there is a clean separation of tasks between the encoder and the decoder.
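To make this encoder-decoder split concrete, here is a minimal sketch using the Hugging Face transformers library (with sentencepiece installed); the publicly released "google/pegasus-xsum" checkpoint and the stand-in text are just illustrative choices, not a prescription.

```python
# A minimal sketch of the encoder-decoder flow, assuming the Hugging Face
# transformers and sentencepiece packages and the publicly released
# "google/pegasus-xsum" checkpoint (both are illustrative choices).
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

text = "Replace this with the long document you want to summarize."
inputs = tokenizer(text, truncation=True, return_tensors="pt")

# The encoder turns the input tokens into a numerical representation
# (one vector per token), i.e. the context that the decoder reads from.
encoder_outputs = model.get_encoder()(**inputs)
print(encoder_outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)

# generate() runs the full encode-then-decode loop and returns summary token ids.
summary_ids = model.generate(**inputs, num_beams=4, max_length=64)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

In practice you would rarely call the encoder yourself; generate() handles the whole encode-then-decode loop for you.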
Nowadays, state-of-the-art models also tend to use the transformer architecture, and if you would like to find out more about what a transformer is, I strongly encourage you to read the article titled "The Illustrated Transformer" by Jay Alammar; I have placed the link in the description box. Back to the model: what we have seen here is a high-level overview of the inference stage. During model
training, there are some nuances to take note of. First, for the training data, besides the
document text we also need a corresponding, good-quality summary for the model to learn to output. And second, during training, the reference summary itself is fed to the decoder so that the model can learn how to produce good summaries.
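As a rough illustration of this second point (not the Pegasus developers' actual training code), the Hugging Face API lets us pass the reference summary as labels: the model then feeds it to the decoder and returns the loss used for learning. The checkpoint and texts below are placeholders.

```python
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

model_name = "google/pegasus-xsum"  # placeholder checkpoint
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

document = "The long document text goes here ..."
reference_summary = "A good-quality, human-written summary of the document."

inputs = tokenizer(document, truncation=True, return_tensors="pt")
labels = tokenizer(reference_summary, truncation=True, return_tensors="pt").input_ids

# Passing labels makes the model feed the reference summary to the decoder
# (shifted by one position) and return the cross-entropy loss.
outputs = model(**inputs, labels=labels)
print(outputs.loss)    # an optimizer step on this loss would follow in training
outputs.loss.backward()
```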
Here comes one of our main challenges: collecting the training data. It is not so much the documents themselves that are the problem, although collecting those is not trivial either. The bigger problem is the reference summaries, which are hard to collect in sufficient quantities. The fact is, transformer models are quite data hungry and need a lot of data to train. How much is a lot? Hmm... well, for another model that also uses the transformer architecture, the smallest number of documents used for training was 90,000. Obviously, that is a lot of data for you and me to collect in order to reach state-of-the-art results. The million-dollar question is therefore: how can we get close to the state of the art without breaking our backs collecting that much data? And Google's answer to this question is... a heck of a lot more data. Well, okay, not really. Google's plan is to perform
pre-training of the model on a super large corpus: 350 million web pages and 1.5 billion news articles. With this pre-trained model, we can then perform fine-tuning on the actual data, and hopefully we will not need as much data for the fine-tuning task. For those of you who have heard of BERT and GPT, this idea of
pre-training probably sounds very familiar. But the question explored by Google's Pegasus
is what kind of pre-training specifically works well for abstractive summarization. Their hypothesis is that pre-training the model to output important sentences is suitable, as it closely resembles what abstractive summarization needs to do. How, then, can we automatically identify these important sentences? Recalling our example from the start of the video, this means we need to automatically extract the sentence in blue. To do this, the developers of Pegasus suggest using a metric called ROUGE. This metric is normally used for evaluating automatic summarization, and we can compute it for each sentence against the rest of the document and select the top few sentences with the highest scores. In our example, assuming we only want to select one sentence, we will take the sentence with the highest score as our target output for pre-training. The other thing we need to do is to mask the selected sentence in the original document, and this masked document then forms our input text. In this way, we can prepare millions or even billions of training examples for our pre-training.
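Here is a rough sketch of that data-preparation step (not the official Pegasus code): score each sentence with ROUGE-1 against the rest of the document using the rouge_score package, keep the top-scoring sentence as the target, and mask it in the input. The mask token and the toy document are illustrative.

```python
from rouge_score import rouge_scorer

MASK_TOKEN = "<mask_1>"  # Pegasus-style sentence-mask token (illustrative)

def make_pretraining_example(sentences):
    """Select the most 'important' sentence via ROUGE-1 and mask it."""
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    scores = []
    for i, sentence in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scores.append(scorer.score(rest, sentence)["rouge1"].fmeasure)

    best = max(range(len(sentences)), key=lambda i: scores[i])
    target = sentences[best]  # the pseudo-summary the model must generate
    masked_doc = " ".join(MASK_TOKEN if i == best else s
                          for i, s in enumerate(sentences))
    return masked_doc, target

doc = [
    "Pegasus is a model for abstractive summarization.",
    "It is pre-trained by masking important sentences and generating them.",
    "Completely unrelated filler text about the weather.",
]
model_input, target_summary = make_pretraining_example(doc)
print(model_input)
print(target_summary)
```

Run over a huge corpus, this kind of procedure yields the input/target pairs used for pre-training.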
Luckily for us, this has already been done by the Pegasus developers, and we get to enjoy the fruits of their labor: the pre-trained model. We can now fine-tune this pre-trained model on our own data to adapt it to our use cases, and this can be done without breaking our backs while still obtaining close to state-of-the-art results. Let's take a look at a table to find out how much data we now need for fine-tuning. The developers of Pegasus tested the model against various datasets. For the first dataset, the previous state-of-the-art ROUGE score reported is 45. Varying the number of documents used for fine-tuning, we observe that as we increase the number to 10,000 documents, we are able to get a score quite close to the state of the art. In fact, with one thousand examples, we already have quite a good score. If we make the same comparison for many other datasets, we arrive at the same conclusion: with just 1,000 documents, we can already obtain quite comparable results. In fact, for some of the datasets, the score is even higher than the previous state of the art. This is absolutely fantastic, because now, instead of tens of thousands of training examples, we can effectively train a model close to the state of the art with just one thousand. To use an analogy, we do not
need to work like a horse and can instead ride on Pegasus to achieve great
results for abstractive summarization. Excellent! That is all good, but a practical
question is how do we code this? Well, thankfully we have Hugging Face, which provides an
NLP library for easy implementation of the model. You can visit Hugging Face's website to view
their documentation for the Pegasus model and they have also provided an example script to perform
summarization of a document. As for fine-tuning, they have a general guide that shows us how to do so, and we have adapted their example to suit Pegasus.
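Here is a simplified sketch of what such fine-tuning can look like (not our exact script); the checkpoint, dataset, and hyperparameters below are placeholders you would swap for your own.

```python
from datasets import load_dataset
from transformers import (PegasusTokenizer, PegasusForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainingArguments,
                          Seq2SeqTrainer)

model_name = "google/pegasus-large"  # placeholder checkpoint
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

# Placeholder dataset with "document" and "summary" columns; roughly 1,000
# examples, in line with the numbers discussed above.
dataset = load_dataset("xsum", split="train[:1000]")

def preprocess(batch):
    model_inputs = tokenizer(batch["document"], truncation=True, max_length=512)
    labels = tokenizer(batch["summary"], truncation=True, max_length=64)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="pegasus-finetuned",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    learning_rate=5e-5,
    logging_steps=50,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
model.save_pretrained("pegasus-finetuned")
```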
Check out our GitHub script via the link in the description box. And so, in summary, this is how Google's Pegasus helps us perform abstractive summarization. We hope this video has helped you, and please do subscribe to our channel, AI Tapas, if you would like more bite-sized feeds on AI like this video. Thank you and see you again!