AI Blog Post Summarization with Hugging Face Transformers & Beautiful Soup Web Scraping

Captions
"Nick, I'm going to need a summary of these papers for the Japan report by close of business today." "Uh... there's like 500 pages here." "Perfect, thanks! Knew I could count on you." Okay, we're going to need some help. Luckily, we can use Hugging Face Transformers for AI-based summarization. Let's do it!

Tired of reading a ton of blog posts? Well, in this video we're going to take a look at how we can use AI to summarize long blog posts. For this we'll be using Hugging Face and a number of natural language processing techniques. Let's take a deeper look at what we're going to go through. In order to summarize our blog post, we're going to use a library called Transformers, by a group called Hugging Face, and specifically their summarization pipeline. This allows you to pass through your blog text and have it summarized. Now, because there's a limit on how much text that pipeline can take, we need to do a little bit of processing to handle larger blog posts, but we'll get through that really easily.

Here's what we're covering in this video: we're going to set up Hugging Face Transformers, then use Beautiful Soup to scrape blog posts off the internet so you don't need to copy and paste anything. We'll chunk the scraped text into blocks of sentences, pass those to our summarizer to generate our summary, and then export the result to a text file so we can read it and use it wherever we need to. In terms of how it all fits together: first up, we install Hugging Face Transformers, which gives us a whole bunch of natural language processing capability. Then we scrape a blog post from the web using Beautiful Soup; I think we might take a look at Hackernoon and Towards Data Science as examples. Then, using the text we've grabbed, we chunk it into sentences, pass it through to our summarization model to generate our summary, and push it out to a text file so we can play around with it and post it wherever we need to. Ready to do it? Let's get to it.

Alrighty, so in order to summarize our blog post we need to go through six key steps. First, install Transformers and import a bunch of dependencies. Second, load our summarization pipeline. Third, get a blog post: we'll use Beautiful Soup to pull one down and pre-process it so we can pass it to our summarization pipeline. But before we get to summarizing, we're going to chunk our text into blocks. The reason we do this is that there's a limit on how much text the baseline summarization pipeline can accept. There are other models that can handle a lot more text, but some of them need quite a fair bit of memory on your GPU, which isn't commercially viable for a lot of users. So rather than spending thousands of bucks on a GPU, we're going to chunk up our text and summarize it in blocks. And last but not least, we're going to output the result to a text file.
To do all of this we're going to be using the Transformers pipeline, specifically the summarization pipeline. You can go through a really simple example, just load that pipeline and summarize straight away, and it's quite quick, but in this case we're going to build out a fully fleshed-out example. If you want to see a quick run-through, I'll include a link to a video we did on short summarization somewhere above and in the description below. And as always, all the code for this video is available via GitHub: look for the long-form-summarization-with-hugging-face repository and you should be able to grab this entire notebook pretty easily. But as always, we're going to go through it step by step and take a look at how it's done. One more thing to note: while we're doing this on blog posts, you could do it on a whole bunch of other types of text. If you wanted to summarize research papers or newspaper articles, you could definitely do that as well.

Alrighty, on to step one. First up, we install our core dependency, Transformers. To do that we've written !pip install transformers, which installs the Transformers library into our Python environment. Next we import our dependencies, and we've written three lines of code to do it. The first line is from transformers import pipeline; the pipeline is going to let us load our summarization model really easily. The second line is from bs4 import BeautifulSoup. Beautiful Soup is a library that makes web scraping easy, and we're importing it because we're going to programmatically grab a blog post from the web, bring it down, and work with it in Python, so no copying and pasting. The last library is requests, imported with import requests. Requests lets us make HTTP calls out to the web, so it will call out to our blog post and bring back the results, which we'll then hand to Beautiful Soup to extract the text. And that's pretty much step zero done.

The next step is to load our summarization pipeline, step number one. Alrighty, that's our summarization pipeline brought down and imported into our notebook. If you're doing this for the first time, it will download the model behind the pipeline, so it may take a little longer, but you don't need to do anything else apart from writing that one line. Looking at what we wrote: we created a new variable called summarizer to hold our pipeline, and we set it up by calling pipeline and passing the parameter "summarization".
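As a quick sketch, here are those setup steps in one place (the first pipeline() call downloads the default summarization model, which can take a while):

```python
# Install the core dependencies (run once, e.g. in a notebook cell):
# !pip install transformers requests beautifulsoup4

from transformers import pipeline   # ready-made NLP pipelines
from bs4 import BeautifulSoup       # HTML parsing for web scraping
import requests                     # HTTP calls to fetch the blog post

# Load the default summarization pipeline; downloads the model on first use.
summarizer = pipeline("summarization")
```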
The cool thing about the Hugging Face Transformers pipeline is that you can do a whole heap of really advanced, sophisticated natural language processing tasks just by importing the default pipeline. And you can also load a whole bunch of different models: say, for example, you wanted the T5-base model, or even the huge 11-billion-parameter T5, you could do that really easily. Now, I did test that big model out; it's 45 gigabytes, and you'd need a ton of VRAM to load it onto your GPU. But fret not, we're not using that model today; we're using one that's readily available and quite easy to work with.

Back to our notebook. Step one is done, so the next thing we need to do is get a blog post. I've got a couple of links we'll try out. First, we create a variable called url and paste in a link to the blog post we want to summarize. The one I picked is a Hackernoon article all to do with the GameStop short squeeze, which I thought was particularly relevant, but you could paste in a different link, and we'll try a bunch of others as well. The link will be in the description below if you want to try it out. We copy the link and paste it into our url variable; once we've finished writing the code, changing this url is all you need to do to summarize a different article or blog post. Now we want to set up our request and start processing and scraping our data.

Okay, that's our request done. The line we've written is requests.get(url), which goes out to that URL and grabs the entire webpage. I'm talking all the HTML, all the metadata, all the text; everything on that page is now inside the variable r. If we output r we get the response, and 200 means it was successful; if we type r.text, we see everything on that web page.
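Here's that request step as a sketch; the URL below is a hypothetical placeholder standing in for the Hackernoon GameStop article used in the video:

```python
# Hypothetical placeholder URL -- swap in the blog post you want summarized.
url = "https://hackernoon.com/some-article-slug"

r = requests.get(url)   # fetch the whole page: HTML, metadata, text
print(r)                # <Response [200]> means the request succeeded
print(r.text[:500])     # a peek at the raw HTML -- not very usable yet
```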
You can see there's a whole bunch of stuff in there, so it's not all that usable right now, but this is where Beautiful Soup comes in: we can use it to go through all of this HTML and pull out the specific tags we want. In this case, to get the data out of the blog post we want the title, and if I inspect that section of the page, you can see these are all divs with paragraphs inside them, so really we want to extract all of that. I've also seen that a couple of other Medium-based blogs use h1 tags for titles and subtitles, so what we're going to do with our Beautiful Soup web scraping code is extract all the paragraphs and all the h1 tags. Now that we've got our text, we're good to go for our scraping, so let's write the code to extract our blog post.

Okay, and that's really it for Beautiful Soup: it's gone through, parsed our HTML, and it should have all the results we need. We might need a little bit of additional pre-processing to string it into one big block of text, but we'll get to that in a second. If we take a look at our results variable (results, with an s), you can see we've extracted a whole bunch of text, and if you look closely you'll see our h1 tag containing the title: "Will the game stop with GameStop or is this just the beginning?". If we scroll down, we're grabbing the rest of the text too; our first paragraph reads "The GameStop squeeze on short sellers is an extraordinary event", which matches the article, so we've started to extract our text. What we actually want is just the text, with none of these p or h1 tags, in one big block we can pass to our summarizer; we'll do that next, but first let's look at the code that got us to this state.

We've written two pieces of code. The first line creates a new instance of BeautifulSoup, passing two parameters: the text we pulled down, r.text, and the parser we want to use, "html.parser". We store the resulting object in a variable called soup, and if you take a look at soup, it's holding all of our page content in a format we can search across. The next step is where we perform our search. Really, web scraping boils down to making a request to a web page, putting the response into a searchable format, and then searching for the specific tags or patterns we want within that block of text. The patterns we've identified are our h1 tags and our p tags: the h1 tags are the titles and subtitles, and the p tags are the blocks of body text. To find them we've written soup.find_all and passed the two tags inside square brackets, ['h1', 'p'], storing everything in a variable called results. Right now results is just a regular array with each tag as a separate item.
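A minimal sketch of those two parsing lines:

```python
# Parse the raw HTML into a searchable tree.
soup = BeautifulSoup(r.text, "html.parser")

# Grab the title/subtitle headers (h1) and the body paragraphs (p).
results = soup.find_all(["h1", "p"])
results[:3]  # a list of Tag objects, each still wrapped in its HTML tag
```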
They're all individual lines, so let's concatenate them into one block of text. Alrighty, that's our article ready for pre-processing. What we did here is write two lines of code. The first loops through each result in our results array (the one we got from find_all) and extracts just result.text, storing the values in an array using a list comprehension; that gets rid of all of the p and h1 tags, and the result lives in a new variable called text. The second line joins all of these together: we created a string containing a single space, called .join on it, and passed in text, which loops through each value in the text array and joins everything into one big string stored in a variable called article. So when we look at article now, we've got one big block of text, which is exactly what we want.

The next thing to do is start chunking up our text. We're going to chunk it into blocks of sentences, because, again, there's a limit on how much text you can pass to the baseline summarization pipeline at any one time. You can get around this with one of the larger models, but sometimes that's not technically feasible with the amount of VRAM on your GPU, so we'll chunk instead. First we split the article into individual sentences, since our chunks are going to be built on sentence boundaries. To do that, we first replace all of our full stops, exclamation marks, and question marks with an end-of-sentence tag; this makes the pre-processing a little easier because we'll still have our punctuation when we generate the summary. Okay, so that's our blog post broken up into sentences. If we take a look, we've got individual sentences inside our sentences array: sentences[0] gives us one sentence, sentences[2] gives us another, and you can see we've still kept our full stops and punctuation.
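Here's a sketch of the clean-up and sentence-splitting code just described; the literal '<eos>' string is simply a marker we split on:

```python
# Strip the HTML tags, keeping only the text of each h1/p element.
text = [result.text for result in results]

# Join every heading and paragraph into one big block of text.
article = " ".join(text)

# Tag sentence endings so we can split without losing the punctuation.
article = article.replace(".", ".<eos>")
article = article.replace("!", "!<eos>")
article = article.replace("?", "?<eos>")

# Split the article into individual sentences.
sentences = article.split("<eos>")
sentences[0]  # first sentence, punctuation intact
```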
The reason we replace the punctuation rather than splitting on it directly is that splitting on a full stop, exclamation mark, or question mark would throw the punctuation away, and the eventual summary would look a little weird without it. So instead of splitting on the punctuation symbols themselves, we append an <eos> tag to each of them: "!" becomes "!<eos>", "?" becomes "?<eos>", and "." becomes ".<eos>". To do that we used article.replace, which swaps one substring for another, three times, once per punctuation symbol, storing the result back into article each time. Then we split the article into sentences with article.split('<eos>') and stored the result in an array called sentences; taking a look at it, we've got all of our individual sentences, which makes them easy to work with.

Alrighty, the next thing is to actually chunk up our blocks of text. We want to limit each chunk we send to our summarizer to no more than 500 words, so we're going to loop through the sentences and keep a running count of whether the current chunk is under or over 500 words. Let's write this block of code, and then I'll take a step back and walk you through it. Alrighty, the chunking code is done, so let's look at what we wrote. First, we loop through each of our sentences with for sentence in sentences. The next line checks whether we already have a chunk in progress: if we don't, we fall through to the default branch, which takes the sentence, splits it on spaces, and appends the resulting word list to our chunks array, effectively breaking the whole sentence into its individual words. If we run this (after fixing a spot where we'd misspelled "sentences"), we can look at our chunks and see they're broken out into individual words; chunks[0] is a whole bunch of separate words. That's fine, because we're going to join them back together later, and working with words from the start makes it much easier to keep the running count and stay below our 500-word limit.
Now, assuming we do have a chunk in progress, which is what the check on the length of chunks being equal to current_chunk + 1 is telling us (our counter starts from zero, so this means a chunk already exists for the current index), we then check whether the current chunk plus the current sentence, appended together, would stay under our maximum chunk length of 500 words. If it would, we extend the current chunk with the sentence: we grab the sentence, split it into words with sentence.split(' '), and use the extend method to add them onto the existing array. That lets us keep appending sentences to the current chunk, bulking it up. If the current chunk plus the current sentence would go over 500 words, we create a new chunk instead: current_chunk is incremented by one, and we run the same sentence-splitting-and-appending line as the default branch. At this point each entry in our chunks array is a bunch of words, so we want to join them back into their component sentences. Alrighty, that's our chunking done. The last block of code loops through each of our chunks and joins it back together with the join method: we wrote for chunk_id in range(len(chunks)), which gives us an index into the chunks array, then joined the current chunk into one big string with ' '.join and wrote it back over the existing entry, so chunks[chunk_id] now equals that joined-together value. That gives us one chunk we can pass straight to our summarization pipeline, and if we look at another value, we've got another chunk of text, ideally under 500 words.
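Putting that walkthrough together, here's a sketch of the chunking code; the len(chunks) == current_chunk + 1 test is how we detect that a chunk is already in progress:

```python
max_chunk = 500      # word limit per chunk for the summarization pipeline
current_chunk = 0
chunks = []

for sentence in sentences:
    if len(chunks) == current_chunk + 1:
        # A chunk is in progress: extend it if this sentence still fits...
        if len(chunks[current_chunk]) + len(sentence.split(" ")) <= max_chunk:
            chunks[current_chunk].extend(sentence.split(" "))
        else:
            # ...otherwise start a new chunk with this sentence's words.
            current_chunk += 1
            chunks.append(sentence.split(" "))
    else:
        # No chunk in progress yet: start the first one.
        chunks.append(sentence.split(" "))

# Join each chunk's words back into a single string for the summarizer.
for chunk_id in range(len(chunks)):
    chunks[chunk_id] = " ".join(chunks[chunk_id])
```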
If we take a look at the length of one of those chunks... oh, that's counting characters, so if we split by word instead, it's 478, and the first chunk is 493. They're all below 500 words, so we've successfully taken this big blog post and chunked it up into blocks of 500 words. The next thing is to actually summarize our text, and it's all downhill from here. Alrighty, that's our summarization done, and it took one line of code. We grabbed our summarizer, the summarization pipeline we loaded right at the top, and passed it a number of arguments and keyword arguments. The first argument is our chunks, our big sentence blocks. The first keyword parameter is max_length, the maximum length of each summary (measured in tokens, though you can think of it roughly as words); here we're capped at 120. The second keyword parameter is min_length, the minimum length of each summary, and then do_sample, which we've set to False. We stored the result in a variable called res, and if we take a look at it, we've got a whole bunch of summaries; counting them up against our chunks, we've got eight chunks and eight summaries, because we're generating one summary per chunk. What it's doing is extracting the core sentences from each chunk. The first one reads "The GameStop squeeze on short sellers is an extraordinary event in markets, where at face value retail traders and investors have worked together in an attempt to put some of the largest Wall Street institutions out of business", which in and of itself is quite a good summary of what's happening with the short squeeze. Now we can do a little additional processing and combine all of these into one single summary. Alrighty, that's our summary in one block of text; you can see it opens with that same short-squeeze sentence, but this has generated a full summary for us, so rather than reading the entire blog article, we've got something nice and concise.
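The summarization call itself looks like this sketch; the video passes min_length without showing the value here, so 30 below is an assumption you can tune:

```python
# Summarize every chunk in one call; each chunk yields its own summary.
res = summarizer(chunks, max_length=120, min_length=30, do_sample=False)

res[0]  # {'summary_text': '...'} -- one dictionary per chunk
```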
If we wanted to, we could make the summary even shorter: rather than leaving max_length at 120, we could set it to, say, 80, and run it again. It takes a little time to re-run the summarization, but once it's done we get shorter blocks. Alright, that's our summarizer re-run; remember, we changed our maximum length to 80, so the summary for each chunk of text is now limited to 80. Taking a look, we've already got a shorter summary per chunk, and running our appending code generates a shorter combined summary. I don't know exactly how much shorter, but you can play around with this and make it a lot shorter if you want.

To combine the summaries, we looped through each summary inside the res array and extracted its summary text, because each result is stored inside an object. If we grab one value out of res, you can see we get back an object, and if we check its type, it's a dictionary. To pull a value out of it we access a key, and if we look at the keys, there's just one: summary_text. Using that key gives us the text. So what we're doing is looping through each value in res with for sum in res (so, for each summary in our results), accessing each dictionary's summary_text key, and then using the join method to stitch them all together, exactly as we did with our chunks. That gives us one summary we can output, which brings us to our sixth and final step: writing the summary out to a text file.

Alrighty, that's our text output. We took the same join line we used to concatenate the summaries, stored the result in a variable called text, and then used a standard Python with statement and a write call: with open('blogsummary.txt', 'w') as f, passing the 'w' flag so we can write to the file, and then f.write(text). That creates a new file called blogsummary.txt containing the summary we just generated. If we look at the folder we're working in, there are actually two files: the one we fumbled earlier because we typed .tx rather than .txt, and our proper summary, now sitting in a text file. So that shows you how to do it for one blog post, and that about wraps up how to take a blog post, chunk it up, and summarize it.
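A sketch of that combine-and-export step (the filename follows the video; note that sum here shadows Python's built-in sum(), as in the original, so feel free to rename it):

```python
# Each result is a dict with a single 'summary_text' key; pull those out
# and join them into one combined summary.
text = " ".join([sum["summary_text"] for sum in res])

# Write the combined summary out to a text file.
with open("blogsummary.txt", "w") as f:
    f.write(text)
```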
Now, if you wanted to, you could do this on a different blog post: all you need to do is replace the url we set up back in step two. Let's try that. We go to Hackernoon, which happens to be one of my favorite sites, and grab another blog post. Say, for example... I don't know, Docker? I'm kind of boring. Let's do this one: "Will institutional investment keep pouring into Bitcoin?". I've clicked the link, copied it, and pasted it over the url we set up originally, and now we just run through the code again. Hitting enter runs our request and returns our text; then we create our soup and search for our h1 and p tags, and you can see it's grabbed the text. Then we chunk it up again and run our summary. If we look at our chunks, we've now got our Bitcoin article chunked up; this one looks a little shorter than the short squeeze article, but that's cool, we can run with it. Running the summarizer, we've now summarized that blog post, and it's been condensed quite a fair bit: "When was the last time you heard of a decent crypto project? I mean a real one, which doesn't promise mountains of gold, great systems doing everything, and unrealistic ETAs." So you've got a bit of a summary of this blog post, and you can see how quickly we're able to generate these. If we output it, we might call this one bitcoinsummary, and taking a look, we've now got our Bitcoin summary written out as well. If you wanted to, you could build this into a pipeline for yourself, set up with a number of URLs, so that overnight, or as you wake up in the morning, you've got a whole bunch of summaries ready to read through to maximize productivity; there's a sketch of that idea right after these captions. But that about wraps it up. Thanks so much for tuning in, guys; hopefully you found this video useful. If you did, be sure to give it a thumbs up, hit subscribe, and tick that bell so you get notified when I'm releasing future videos, and let me know what types of blog posts you went about summarizing. Thanks again for tuning in. Peace.
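As flagged above, here's a hypothetical sketch of that overnight-pipeline idea. It wraps the whole flow into a function, reusing the summarizer, requests, and BeautifulSoup set up earlier; the URLs and filenames are placeholders, and the chunking is a simplified re-implementation of the logic above:

```python
def summarize_url(url, out_file, max_chunk=500):
    """Scrape a blog post, chunk it, summarize it, and save the summary."""
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    article = " ".join(tag.text for tag in soup.find_all(["h1", "p"]))
    for p in (".", "!", "?"):
        article = article.replace(p, p + "<eos>")
    chunks, current = [], []
    for sentence in article.split("<eos>"):
        words = sentence.split(" ")
        # Flush the current chunk when the next sentence would overflow it.
        if current and len(current) + len(words) > max_chunk:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    res = summarizer(chunks, max_length=120, min_length=30, do_sample=False)
    with open(out_file, "w") as f:
        f.write(" ".join(s["summary_text"] for s in res))

# Hypothetical reading list -- swap in real article URLs.
reading_list = [
    "https://hackernoon.com/article-one",
    "https://hackernoon.com/article-two",
]
for i, url in enumerate(reading_list):
    summarize_url(url, f"summary_{i}.txt")
```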
Info
Channel: Nicholas Renotte
Views: 2,625
Rating: 4.9776535 out of 5
Keywords: web scraping, python web scraping, web scraping with python, ai text summarization, ai text summary, huggingface transformers, huggingface text summarization
Id: JctmnczWg0U
Length: 33min 0sec (1980 seconds)
Published: Wed Feb 17 2021