How to perform Abstractive Summarization using Google’s Pegasus

Video Statistics and Information

Captions
Given a long document to read, our natural preference is not to read it, or at least to read just the main points. So rather than reading the full document, it would be great if we could have a summary. That will save us time and brain processing power. But of course summarization can also be used in other applications. In a newsletter, we can automatically provide a summary for each article, rather than the first X number of words, to better represent the article. Or given the script of a speaker, we can automatically generate a summary of the talk. These are useful applications made possible by summarization models, and in this video we shall look at how a model by Google named Pegasus can help us perform abstractive summarization.

Before we start, there's a concept that we need to clarify. You have probably noticed that the title of this video mentions abstractive summarization and not just summarization. Well, that's because there's another kind of summarization called extractive summarization, which literally extracts the important sentences from the article and combines them to form a summary. This is not something that we will focus on now, although you will see this concept being brought up at a later part of the video. The type of summarization that we are focusing on is abstractive summarization, which involves paraphrasing words and thus can potentially give a more coherent and polished summary.

This is obviously not easy to do, but let's start off from a simple base. Essentially what we want to achieve is to feed some text into a model and obtain a corresponding summary. The mainstream approach now is to use an encoder-decoder model, which is a form of sequence-to-sequence learning. That is to say, we are learning to convert a sequence of words into a different sequence of words, or in short, seq2seq learning. In such a model, the encoder first takes into consideration the context of the whole input text and encodes it into something called a context vector, which is basically a numerical representation of the input text. This numerical representation is then fed to the decoder, whose job is to decode the context vector and produce the summary, so there's a good separation of tasks here between the encoder and the decoder. Nowadays, the state-of-the-art models also tend to use the transformer architecture, and if you would like to find out more about what a transformer is, I strongly encourage you to read the article titled "The Illustrated Transformer" by Jay Alammar; I have placed the link in the description box.

Back to the model: what we have seen here is a high-level overview of the inference stage. During model training, there are some nuances to take note of. First, for the training data, besides the document text we also need a corresponding summary of good quality for the model to learn to output. And second, during training the reference summary is fed to the decoder so that the model can learn how to produce good summaries.

Here comes one of our main challenges: collecting the training data. Collecting the documents is not so much the problem, although that is not to say it is easy. The bigger problem is the reference summaries, which are hard to collect in sufficient quantities. The fact is that the transformer model is quite data hungry and needs quite a lot of data to train.
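To make this encoder-decoder flow concrete, here is a minimal inference sketch using the Hugging Face transformers library, which we will come back to near the end of the video. The checkpoint name "google/pegasus-xsum" is just one published Pegasus model chosen for illustration; any seq2seq summarization checkpoint would show the same split between encoding the input and decoding the summary.

```python
# A minimal sketch of the encoder-decoder split at inference time.
# Assumes the Hugging Face transformers library and PyTorch are installed;
# "google/pegasus-xsum" is one published Pegasus checkpoint, used for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/pegasus-xsum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

document = (
    "Pegasus is an encoder-decoder transformer released by Google. It was "
    "pre-trained by masking important sentences in a document and asking the "
    "model to generate them, which makes it well suited to summarization."
)
inputs = tokenizer(document, truncation=True, return_tensors="pt")

# Encoder: turns the token ids into context vectors (one vector per input token).
with torch.no_grad():
    context = model.get_encoder()(**inputs).last_hidden_state
print(context.shape)  # (batch_size, input_length, hidden_size)

# Decoder, driven by generate(): consumes those context vectors and emits the
# summary one token at a time. Passing the raw inputs lets generate() run the
# encoder internally, which is the usual way to call it.
summary_ids = model.generate(**inputs, num_beams=4, max_new_tokens=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```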
How much is a lot? Hmm... Well, for another model that also uses the transformer architecture, the smallest number of documents used for training was 90,000. Obviously, that is a lot of data for you and me to collect in order to reach state-of-the-art results. The million-dollar question is therefore: how can we get something close to the state of the art without breaking our backs to collect that much data? And Google's answer to this question is... a heck of a lot more data. Well, okay, not really. Google's plan is to pre-train the model on a super large corpus: 350 million web pages and 1.5 billion news articles. With the pre-trained model, we can then fine-tune on the actual data, and hopefully we will not need as much data for the fine-tuning task. For those of you who have heard of BERT and GPT, this idea of pre-training probably sounds very familiar. But the question explored by Google's Pegasus is what kind of pre-training specifically works well for abstractive summarization, and their hypothesis is that pre-training the model to output important sentences is suitable, as it closely resembles what abstractive summarization needs to do.

How then can we automatically generate these important sentences? Recalling our example from the start of the video, this means we need to automatically extract the sentence in blue. To do this, the developers of Pegasus suggest using a metric called ROUGE. Essentially, this metric is used for evaluating automatic summarization of texts, and we can compute it for each sentence in the document and select the top few sentences with the highest scores. In our example, assuming we only want to select one sentence, we will pick the sentence with the highest score as our target output for pre-training. The other thing we need to do is to mask the selected sentence in the original document, and this then forms our input text. In this way, we can prepare millions or even billions of training examples for our pre-training. Luckily for us, this has already been done by the Pegasus developers, and we get to enjoy the fruits of their labor: the pre-trained model.
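As a rough sketch of this gap-sentence selection idea, and not the Pegasus authors' exact pre-processing pipeline, the snippet below scores each sentence against the rest of the document with ROUGE-1 F1 (via the rouge_score package), masks the top-scoring sentence in the input, and keeps it as the pre-training target. The "<mask_1>" string is only an illustrative placeholder for the sentence-mask token.

```python
# A rough sketch of gap-sentence selection for pre-training data, assuming the
# `rouge_score` package (pip install rouge-score). Not the authors' exact code.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def make_pretraining_example(sentences, num_to_mask=1):
    """Score each sentence against the rest of the document with ROUGE-1 F1,
    mask the top-scoring sentence(s) in the input, and keep them as the target."""
    scores = []
    for i, sentence in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scores.append(scorer.score(rest, sentence)["rouge1"].fmeasure)

    top = sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)[:num_to_mask]
    masked_input = " ".join("<mask_1>" if i in top else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(top))
    return masked_input, target

doc = [
    "Pegasus is a model for abstractive summarization.",
    "It was pre-trained on large collections of web pages and news articles.",
    "The weather was pleasant that day.",
]
masked_input, target = make_pretraining_example(doc)
print(masked_input)  # the document with the most "important" sentence masked out
print(target)        # the masked sentence, which the model learns to generate
```

Run over millions of documents, this kind of procedure yields the (masked document, gap sentences) pairs used for pre-training.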
We can now fine-tune this pre-trained model on our own data to adapt it to our use cases, and this can be done without breaking our backs while still obtaining close to state-of-the-art results. Let's take a look at a table to find out how much data we now need for fine-tuning. The developers of Pegasus have tested the model against various datasets. For the first dataset, the previous state-of-the-art ROUGE score reported is 45. Varying the number of documents used for fine-tuning, we observe that as we increase the number to 10,000 documents, we are able to get a score quite close to the state of the art. In fact, with one thousand examples, we already have quite a good score. If we make the same comparison for many other datasets, we arrive at the same conclusion: with just 1,000 documents we can already obtain quite comparable results. In fact, for some of the datasets the score is even higher than the previous state of the art. This is absolutely fantastic, because now, instead of tens of thousands of training examples, we can effectively train a model close to the state of the art with just one thousand.

To use an analogy, we do not need to work like a horse; we can instead ride on Pegasus to achieve great results for abstractive summarization.

Excellent! That is all good, but a practical question is: how do we code this? Well, thankfully we have Hugging Face, which provides an NLP library that makes it easy to implement the model. You can visit Hugging Face's website to view their documentation for the Pegasus model, and they have also provided an example script to perform summarization of a document. As for fine-tuning, they have a general guide that shows us how to do so, and we have adapted their example to suit Pegasus. Check out our GitHub script from the link in the description box.
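As a rough, illustrative sketch rather than the exact script linked above, here is what fine-tuning a Pegasus checkpoint with Hugging Face's Seq2SeqTrainer API can look like. The checkpoint name, dataset columns, and hyperparameters are placeholder assumptions; in practice you would swap in your own set of (document, summary) pairs, on the order of a thousand examples.

```python
# A minimal fine-tuning sketch using the Hugging Face transformers and datasets
# libraries. The tiny toy corpus, column names, and hyperparameters below are
# illustrative placeholders, not the settings from the linked script.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/pegasus-xsum"  # assumption: any Pegasus checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy corpus standing in for your ~1,000 (document, summary) pairs.
train = Dataset.from_dict({
    "document": ["A long article about solar power ...",
                 "A long article about electric cars ..."],
    "summary":  ["Solar power is growing fast.",
                 "Electric cars are becoming mainstream."],
})

def preprocess(batch):
    # Tokenize the input documents and the reference summaries (the labels).
    model_inputs = tokenizer(batch["document"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train = train.map(preprocess, batched=True, remove_columns=train.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="pegasus-finetuned",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    learning_rate=5e-5,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("pegasus-finetuned")
```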
And so this is, in summary, how Google's Pegasus helps us to perform abstractive summarization. I hope this video has helped you, and please go on to subscribe to our channel AI Tapas if you would like more bite-sized feeds on AI like this video. Thank you and see you again!

Info
Channel: AI Tapas
Views: 2,532
Keywords: artificial intelligence, natural language processing, abstractive summarization, pegasus
Id: naRdmLvlEzE
Length: 8min 39sec (519 seconds)
Published: Wed Feb 03 2021