Scrape and Summarize News Articles with Newspaper3k | Data Science for Media Bias Detection #4

Video Statistics and Information

Captions
In today's episode, I will be teaching you how to scrape and summarize news articles using the Newspaper3k Python package, all from scratch. Well, what are we waiting for? Let's get started.

Hello, my name is Rohak, and I'm the founder of EmpowerCode, helping you make a change with technology. Today marks the fourth episode of my course, Data Science for Media Bias Detection, where we will take our first steps in Python by scraping and summarizing news articles. Why the New York Times? Well, the New York Times covers all sorts of topics and offers different media perspectives towards events, a feature that's hard to find anywhere else. If we can build a short Python script that efficiently scrapes and summarizes a given news article, it would be an amazing way to sift through long, repetitive pieces of news.

In order to take our first steps toward accomplishing this goal, we need to get started with the Python IDE. Now, the last episode in my course took you guys through the installation process and setup of PyCharm, a popular open source Python IDE. If you haven't watched that one yet, I highly recommend you do so, as it would make the content in this episode a whole lot more manageable. Now, without further ado, let's get right into the code.

Awesome. Now I have opened up my newspaper scrape project in PyCharm, so let me create a new Python file. To do this, you can simply right-click our project folder at the top, select New, and from the drop-down menu select Python File. After naming my file, I can simply press Enter and watch my file come to life.

Now, since we are using Newspaper3k, which is a news extraction package, we need to import it first. So to do this, we simply type in from, then newspaper, which is our package, and inside we import the Article class, which we'll be using to scrape and summarize our news articles. Now, in case you're having difficulty installing the package, simply run the following command, pip install newspaper3k, on Anaconda or Terminal in order to access its functions. With
that aside, we can now start our newspaper scrape. First, let me create a function called summarize_article, which will store the main script that we'll be using throughout our code. Additionally, we can pass in the URL of a sample news article as a parameter. Now that we're inside our function, we can use the imported Article class and assign a variable, article, to a new Article object containing our URL. Let me show you how you can do this.

Awesome. So as you guys can see, we assign a variable article to a new Article object, and inside we passed in the URL of the news article that we want to scrape. Next, in order to see visible results and act upon our article, we need to perform some setup. So the first thing we're gonna do is actually download the article itself. Next, in order to perform the natural language processing and tokenization that we want to do, we need to actually parse our article to get it ready for extraction.

Now, this next step is super important. In order to extract and detect text and wording in our article URL, we need to download punkt, which is a sentence tokenizer that is very useful for extracting and detecting individual words or sentences in a large body of text. Essentially, punkt dynamically breaks up text into individual sentences, making it easier to extract and detect certain words and parts of speech in each sentence. So we can simply call the download function we used earlier, but this time, as a parameter, we can specify punkt, which is what we want to install. Finally, to complete our setup, let's call the nlp function on our article object, which allows for natural language processing on our article. So here, in the drop-down menu of fields and methods, we can simply type in nlp and we're ready to go.

Awesome. Now we're set up and ready to finally start seeing some visible results in our code. But before we do, we need an article. Because EmpowerCode revolves around technology, let's take a look at the technology section of the New York Times. Here I have found an article.
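The setup steps described so far (create the Article, download, parse, fetch punkt, run nlp) can be sketched roughly as follows. This is a minimal sketch, not the script from the video: the helper name prepare_article and its download_tokenizer parameter are hypothetical, and it assumes the punkt download is done through NLTK's nltk.download("punkt"), which the captions do not spell out.

```python
def prepare_article(article, download_tokenizer=None):
    """Run the setup steps from the walkthrough, in order, on an Article-like object.

    `prepare_article` and `download_tokenizer` are hypothetical names used
    only for this sketch.
    """
    article.download()  # fetch the page HTML
    article.parse()     # extract authors, publish date, text, and images
    if download_tokenizer is not None:
        download_tokenizer()  # e.g. fetch the punkt sentence tokenizer
    article.nlp()       # fills in article.summary and article.keywords
    return article


def summarize_article(url):
    """Build a Newspaper3k Article for `url` and prepare it for extraction."""
    # Imported here so prepare_article can be tried without the packages installed.
    # Assumption: punkt is fetched with NLTK's downloader.
    import nltk
    from newspaper import Article

    article = Article(url)
    return prepare_article(article, lambda: nltk.download("punkt"))
```

Calling summarize_article with an article URL needs network access plus the newspaper3k and nltk packages; the returned object then exposes fields such as authors, publish_date, top_image, images, and summary.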
The article is called "A Capitalist Fix to the Digital Divide," and it's centered around the role of bigger tech companies paying for the internet access of low-income Americans. It's interesting, practical, and definitely a great article to scrape. Now, if I scroll down here, the first thing we notice in our article is the author. How can we extract this information with Newspaper3k?

So now, in our print statement that will be containing the author of our article, we can simply access the authors instance variable inside of the Article class. If you guys remember, our article variable is an Article object, so we can interact with it the same way we would with an ordinary Python object. So once we typecast our result to a string, we can simply type in article.authors to access the correct field of the Article class. So now, after passing the correct article URL that we want to use into our function, if I click Run at the top here, we can see that it gives us the correct author, Shira Ovide. So if we check this against our actual article, we see the exact same result on our computer screen. How cool is that?

Next, let's repeat the process to access the publish date of the article. So here we have our print statement, like we did for the author, which will contain the publish date of our article. And so to actually get the publish date, we can typecast the result to a string, which you always want to do, and then we can type in article.publish_date. So now, if we run our program to see what we get, we see that we get a publish date. This says that the article was published on September the 22nd of 2020. Looking back on our actual article, we see the exact same result: published September 22nd, 2020.

But if you guys notice, this publish date that is outputted is formatted weirdly, and it's very hard to read. So the first thing we do is we actually assign the publish date to a separate variable. This is because we'll be using this variable later on to format our date correctly so that it is printed out to the
console in a readable format. So here, inside of our print statement, we can use the string format time method. In short, it is typed out as strftime. What this method does is convert date and time objects like the one we have here into a proper string, as specified by the format argument. So here, as you guys can see, I've now passed in a formatting argument. But what does this actually do? Let's take a look and find out.

So here, if you guys notice, I've used a lot of percent symbols, and they're followed by m, d, and Y. These stand for month, day, and year, respectively. So when we enter our date, these parameters will be replaced by the ones we have. So for instance, month will be replaced by 09, day will be replaced by 22, and year will be replaced by 2020, and our string will be formatted and readable. Awesome, guys. So as you can see on my screen, let me zoom in here, we can see that our date is formatted like you would see it in everyday life: 09/22/2020. Exactly how it should be.

Next, let's continue our extraction journey and grab the top image of our article, which is also known as the cover image. So here we have our print statement, which will give us the top image URL, as you guys can see. And to actually get this top image, we can simply access the top_image field of our article object. And so if we run this code, as you guys can see, it gives us a URL containing the top image. And so if we click this URL, as you guys can see, this image correctly matches the one that we see at the top of our article. Now that is really cool.

Now, what if we wanted to get all the images inside our article? How would we even go about doing this? Well, it's actually a whole lot easier than you think. We simply type in article, we type in our dot, and we access the images field of our Article class. So once we run our code, we see that it gives us a complete list of all the image URLs in our article. But if you guys notice, if we print these out to the console, we get a great sequence of images,
but they look way messy and unorganized. We really don't want this. So to fix this, we can simply use a for loop to print each image line by line. Since article.images returns a list of all the images, we can simply iterate through each image and print each one to the console. So first, we declare a new variable called image_string. We create this because we'll be appending the URLs of all our images to this string. So now, as you guys can see, we're iterating over each image in our article.images list. Now we simply append a new line and a tab to our image string. If you're familiar with programming languages like Java, these backslash characters are known as escape sequences: backslash n stands for a new line, and backslash t stands for a tab. Next, we append the image that we're currently iterating through to our image string. So now, after our for loop has ended, we can simply print our image string to the console and see how it looks. Awesome, that looks a whole lot better. As you guys can see, each image is printed on a separate line and is tabbed to indicate a sense of hierarchy from the "All Images" string above.

Now that we have extracted major metadata from our article, what if there was a way to summarize the entire article in just a couple of short sentences? Well, after typing some introductory print statements, it turns out there actually is a way to get the article summary from our news article. We can simply print out article.summary. It's that simple, guys. This is the power of Newspaper3k. Remember the punkt tool I downloaded earlier? Well, due to its ability to tokenize, or break up the text into individual sentences, we are able to pull the first five sentences easily and thus create a summary.

Now let's test out our code on a different article. Awesome. So as you guys can see on my screen, this new article is about the harmful side effects and consequences of ransomware. So now, if we replace our previous URL and paste in our new one, let's go ahead and run our script to see what we get. Wow,
check this out! Now we have our author, publish date, top image URL, all the images inside our article, and a quick article summary containing the first five sentences of our article. If I click on the top image, it takes me to this JPG file, and if I look at the actual cover image, the two match perfectly. You guys can also see that the quick article summary is the exact first five sentences of the actual news article that we are extracting. Now we have officially scraped and summarized news articles with the Newspaper3k Python package.

Awesome. We were able to use Python and a handy extraction package to scrape and summarize news articles in a matter of minutes. If you still have doubts or are confused, the script is stored on GitHub. Check the description to access the link to it. Each and every episode, we are getting closer and closer to our end goal, which is to use Python data science to detect news bias in the media. Join me, and together let's accomplish this goal. Thank you so much for watching. Take care, and I'll see you in the next episode.
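The output formatting described in the captions (the strftime date format, the newline-and-tab image list, and the final print statements) might look something like the sketch below. The helper names format_publish_date, format_image_list, and print_report are hypothetical, the "unknown" fallback and the sorting of image URLs are additions for predictable output rather than steps from the video, and the article argument is assumed to be an already prepared Newspaper3k Article.

```python
from datetime import datetime


def format_publish_date(publish_date):
    """Format a datetime as MM/DD/YYYY; the "unknown" fallback is an assumption."""
    if publish_date is None:  # some pages expose no publish date
        return "unknown"
    return publish_date.strftime("%m/%d/%Y")


def format_image_list(images):
    """Put each image URL on its own line, indented with a tab."""
    image_string = ""
    # Sorting is an addition for stable output; the video iterates directly.
    for image in sorted(images):
        image_string += "\n\t" + image
    return image_string


def print_report(article):
    """Print the fields covered in the walkthrough for a prepared article."""
    print("Author(s): " + str(article.authors))
    print("Publish Date: " + format_publish_date(article.publish_date))
    print("Top Image URL: " + str(article.top_image))
    print("All Images:" + format_image_list(article.images))
    print("Quick Article Summary:")
    print(article.summary)


if __name__ == "__main__":
    # Offline check of the date format, using the date from the first article.
    print(format_publish_date(datetime(2020, 9, 22)))  # 09/22/2020
```

Keeping the formatting in small functions makes it possible to check the date and image layout without downloading anything; print_report is the only part that needs a real article object.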
Info
Channel: EmpowerCode
Views: 934
Rating: 4.8709679 out of 5
Keywords: scrape and summarize news articles, scrape article, scrape website, python, scrape article python, NLP python, newspaper3k, newspaper3k tutorial, newspaper3k nlp, python nlp tutorial, natural language processing tutorial python, scrape news articles, newspaper3k python, newspaper3k demo, python newspaper3k
Id: 4pVXRC6ss94
Length: 13min 25sec (805 seconds)
Published: Sat Dec 26 2020