Web Scrape Google News with Python Requests and Beautiful Soup | Part 1

Captions
Welcome back, everybody, to another Python tutorial. In the last video I showed you how to get started with Beautiful Soup and scraping data from the internet. This video is going to handle grabbing a larger amount of data. To do it we're going to use Google News and the same technique from the previous video, and we're also going to write the data we collect to a CSV file. In a later episode we'll use that data to train an NLP model, which we can use to predict sentiment for the articles and descriptions we're gathering today. So let's get started.

I already created a new virtual environment, and we're going to have to download some packages. These are the exact packages I downloaded in the last episode, so if you're using the same environment you can skip this part. First we do pip install requests, then pip install bs4 for Beautiful Soup, and last pip install urllib3. It says "requirement already satisfied" for that one, I think because it was pulled in as a dependency of one of the other packages.

Let's create a new file and call it googlenews.py. We'll start with the same imports as the last tutorial: from urllib.request we import Request as well as urlopen, which is what lets us build out our headers; from bs4 we import BeautifulSoup; and last, just import requests. You may ask how we're going to write to the CSV file. We don't actually need pandas or any other external package to do that. I also just realized we need a parser library, so let's do pip install html5lib; that's the parser we'll be using with Beautiful Soup.

Now let me show you where we're going to be scraping data from: Google News. Let me type in google, since it's not my primary search engine, and search for a company name or any random news topic. Actually, let's pick something a little more polarizing, like "trump", just because there's a lot of news about him and much of it is strongly for or against him, so it should make dramatic data to train our model on. From here you can see the link itself has a few data points we can break apart. Right here it says q=trump, so the URL is carrying the search term we typed into Google. If we change it to q=biden, we now get a bunch of information about Biden instead. That's one way to control what we search for: manipulating the query directly in the link. You can also see that if we change the other parameters, say to collect data only from the past 24 hours, sorted by date or kept at relevance, the link changes as a result. Let's use that to limit the amount of data. We'll only go through about 10 pages, maybe a few more, and nothing past that, although theoretically you could keep paging until Google stops giving you data.
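For reference, here is a minimal sketch of building that search URL in code instead of editing it by hand in the address bar. The parameter names reflect Google's URL format at the time of the video ("q" for the query, "tbm=nws" for the News tab, "tbs=qdr:d" for past-24-hours results) and may change; the news_url helper is my own illustration, not something from the video.

from urllib.parse import urlencode

root = "https://www.google.com/"

def news_url(query, past_day=False):
    # "q" carries the search term; "tbm=nws" selects the News tab.
    params = {"q": query, "tbm": "nws"}
    if past_day:
        # "qdr:d" limits results to the past 24 hours; appending
        # ",sbd:1" would sort them by date instead of relevance.
        params["tbs"] = "qdr:d"
    return root + "search?" + urlencode(params)

print(news_url("trump"))
print(news_url("biden", past_day=True))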
Let's start now by creating a new variable. We'll call it root, and root will be equal to "https://www.google.com/". We want that because it never changes: we'll always be using google.com as the host, and we're just changing the search query and pulling different items from it. Then for the link, let's change the query back to trump, copy the URL, go into PyCharm, and paste it in as a new variable called link. Now we create a variable like we did last time called req, equal to Request(link), and we pass it headers, with "User-Agent" set to "Mozilla/5.0". Then we do webpage = urlopen(req), so the headers and the link are applied, and we call .read() on the result. If we print webpage real quick (I moved my terminal to the side), we get the raw HTML, and like we saw in the last video it's in a really messy format.

To fix that and make it a little more readable, we can use Beautiful Soup. We open with requests.Session() as c, and inside it we say soup = BeautifulSoup(webpage, "html5lib"), passing in the webpage along with html5lib, our parser library. Oops, I forgot to print it out again, so print(soup), and now you can see it's a little easier to read since it's broken up into smaller chunks. If my terminal were set up differently it would be a lot more readable, but I like having it on the side.

Now that we have all of this HTML, we can begin to select elements from the Google results page. To do that I'm opening up inspect element and hovering over the region we want to select. Right off the bat I can see the class we want. We're choosing it because each card, which contains the image, the title, and when it was published, has its own class attribute, and they're all equal to the same value, dbsr. So we can effectively run a for loop over every element on this page with that class. Back in PyCharm we write for item in soup.find_all("div", attrs={"class": "dbsr"}), since it's a div tag and the attribute is class equals dbsr, I'm pretty sure. If we print out every item (I forgot my colon at first), we should see the HTML for every tag with that class.

It's not printing out for me, though, and I think the issue is that it's not able to find that class. If we print soup and search the dumped HTML for dbsr with Ctrl+F, you can see it finds nothing. That means the class we took from the browser's inspector doesn't exist in the HTML Google actually served us, so we need to select a new element and make sure it appears in the HTML we received. A better way: looking through the served HTML, you can see there are links for the different articles, so we want to capture a tag that holds all of that information. Since nothing prints for the classes taken from the inspector, let's instead try pulling one directly from the served HTML.
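Putting the steps so far together, here is a sketch of the fetch-and-parse flow described above. I have left out the requests.Session() from the video, since it is never actually used for the fetch; the dbsr class is the one taken from the browser inspector, and as just noted, the loop finds nothing because Google serves different HTML to a plain urllib request than to a full browser.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

root = "https://www.google.com/"
link = root + "search?q=trump&tbm=nws"  # example query from the video

# Spoof a browser User-Agent so Google returns a normal results page.
req = Request(link, headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(req).read()

# html5lib is the parser library installed earlier.
soup = BeautifulSoup(webpage, "html5lib")

# "dbsr" comes from inspect element in the browser, but the HTML
# served to urllib never contains it, so this prints nothing.
for item in soup.find_all("div", attrs={"class": "dbsr"}):
    print(item)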
We can use the class kCrYT. The benefit of this one is that you can already see it in the HTML we received, which means it should print out, and you can see that happen right here. Looking at the output now, we're getting the different links for each source, and it seems like we're also getting the title and description. I'm going to leave off right here for this video. In part two we'll go over how to parse this data and then write it to our CSV file, so stay tuned for that; it will be dropping a day after the release of this video.
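For completeness, here is a self-contained version of where this part leaves off, using the working selector. kCrYT is an obfuscated Google class name that happened to appear in the served HTML at the time of recording; like dbsr, it can change without notice, so treat it as an example rather than a stable interface.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

link = "https://www.google.com/search?q=trump&tbm=nws"
req = Request(link, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(urlopen(req).read(), "html5lib")

# "kCrYT" does appear in the HTML Google serves to urllib; each match
# holds an article link, title, or description snippet to parse later.
for item in soup.find_all("div", attrs={"class": "kCrYT"}):
    print(item)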
Info
Channel: Stealth Rabbit
Views: 6,276
Rating: 4.9148936 out of 5
Keywords: Web Scrape Google News, web scraping tutorial, python web scraping, beautiful soup web scraping, web scraping data, requests web scraping, urllib3, pip install requests, requests python, python bs4, web scraping, data mining, beautifulsoup tutorial 2020, requests tutorial python, beautiful soup python tutorial, web scraping for beginners, webscrapping, webscrap, google news bot, google news, beautiful soup, bs4, urllib, python, google python, pip install request, python requests
Id: Hu9cgcdvt2w
Length: 10min 12sec (612 seconds)
Published: Tue Sep 01 2020