Scraping Google News the Easy Way with Python and pygooglenews

Video Statistics and Information

Captions
So there are many reasons why you might want to scrape the news, and where better to go than Google? I guess we go there for everything else, so why not this one. Today is a day for us in the UK where we found out the rest of our lockdown plans. Yes, we are still locked down and you can't do anything, realistically, so we were excited to find this out. It got me thinking: we could probably have a good way to scrape the Google News site, and we could create some kind of real custom feed, or aggregate it, or maybe even do some analysis on it. So I thought I'd create this video to show you guys how you can scrape the Google News website.

So this is it here, and the first thing you want to do is not try and scrape this page, because you're not going to get anywhere. It might work, but it'll be slow and arduous. What you want to do is come to this URL here. I'll link it down below for you, but basically I've created a search term here, which is lockdown, and because it knows I'm from the UK, I'm getting the UK results. I can't remember how many items there are, but there's quite a few: all the latest news items from Google that match this search.

So what we can do is copy this URL. I'm going to remove the end bit and copy just the bit that I need, then come over to our code. We are going to use requests-html for this, just because, and we're going to say from requests_html import HTMLSession, because we always want to use a session object. If you don't know what this is, I've got a video on sessions and why you should use them, so you should go check that out; I'll leave a link down below for you. So we're going to say our url is what I just copied here. You can see that this is broken down into the search and the q for the query, and I've put lockdown, which I'm going to leave on there. Now we can say s is equal to our HTMLSession, and r is equal to s.get with our url. If I just print r.html.html, that should give us all the data back. There, we can see that it's worked and we've got all this information back, so we just need to parse through this now.

To get the titles, let's do for title in r.html.find, and I believe this will work. We can just print title.text, run that, and we get all the titles back, so we can see them all here. You could do this for any other parts of the information, but this got me thinking: with this being so easily accessible, there's got to be a better way. And of course somebody has already created their own Python package for it. It always happens, and credit to this guy.

So I'm just going to delete all this, because we don't need it, and use this over here. I've got the GitHub page open here, and I'll go all the way to the top, and credit to this guy for making it: his GitHub and the GitHub link. Now, I've installed this already; I was playing with it just a minute ago, but apart from that this is almost completely fresh to me as well. You can see we're pip installing here, which I've already done, and we have a quick start that we can follow: top stories, stories by topic, and a query search. What I'm going to do is replicate what I just did, so we're going to copy the quick start, come back to our code, and paste that in there. Then we just want the docs back real quick for the query search. We're going to say search is equal to gn.search and then type our term in there. We can add a time limit as well, but I'm going to ignore that for a minute and just see how many results we get back. So our search is lockdown, which is what we just did, and let's see what happens when we print out search.

Now it's working. Okay, we got a lot of data back. Wow, a lot. And I'm guessing, just by looking at some of this, it's not geo-targeted at all: Washington Post, yeah, US responds. Okay, that's cool. Let's scroll down and see where we're at. Let's have a look at the dictionary, so we can do .keys and see what keys we get. Okay, so we've got feed and entries. Let's do search feed, so I'm just accessing that part of the dictionary, the feed key. Okay, so that's the one that explains where everything is. Let's do entries, and this should be the rest of the data. Yes, it is.

Do we have a title of some description in here? I'm sure we do. Let's do a for loop again: for item in search entries, print item. Let's run that. Oh, you have to have an in in your for loop, otherwise it won't work. Let's see what we get. Okay, cool, so we can see that's one whole entry there. We have a title, so we can access that key, and there we go: we've got a load more titles. We've got more information, I think, although it looks to be very similar; I'm guessing we're just not geo-targeted, right? That's cool though. So we can see that, really quickly and easily, with just this little bit of code we've managed to get all the titles back.

So I'm going to try and expand on that a bit, and we'll write something a bit more complicated that gets us the top x stories for whatever search term we put in. It looks like we've got quite a few, so let's see how many actually come back. Let's remove this, as we don't need it now, and print the length of search entries to see how many we return: 100.
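The inspection steps described above (checking the keys, looping over the entries, counting them) can be sketched roughly as follows. This is a minimal sketch, not the code from the video: `summarize_search` and `live_search` are hypothetical helper names, the `sample` dictionary is placeholder data rather than real headlines, and the live call assumes `pygooglenews` is installed and that `GoogleNews(country=...)` and `.search(term)` behave as in the package's quick start.

```python
def summarize_search(search):
    """Summarize a result dict with the 'feed' and 'entries' keys seen above."""
    entries = search["entries"]
    return {
        "keys": sorted(search.keys()),
        "count": len(entries),
        "titles": [item["title"] for item in entries],
    }


def live_search(term, country="UK"):
    """Run a real search. Needs network access and `pip install pygooglenews`."""
    # Imported lazily so the rest of this sketch runs without the package.
    from pygooglenews import GoogleNews

    gn = GoogleNews(country=country)
    return gn.search(term)


# Placeholder data standing in for a real result (not actual headlines).
sample = {
    "feed": {"title": "placeholder feed"},
    "entries": [
        {"title": "Placeholder headline one", "link": "https://example.com/1"},
        {"title": "Placeholder headline two", "link": "https://example.com/2"},
    ],
}

summary = summarize_search(sample)
print(summary["keys"])
print(summary["count"])
for title in summary["titles"]:
    print(title)
```

With the package installed, `summarize_search(live_search("lockdown"))` should give the same kind of summary for a real search; the count reported in the video was 100.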
I wonder if we get 100 each time. Let's put in something else, like football, and see if we get another 100 entries. Yes, we do, so it seems like we get 100 entries each time. Great. Let's go back to the docs. Okay, it talks about the class: language, and we can do country is equal to. Great, so let's localize this to the UK. Where was that going? Into here: country UK. Let's change it back to lockdown and see if we get another 100 results. We do. Okay, cool.

What we want to do is create something so we can quickly search for the top titles in Python. Let's leave it at country UK, remove that, and create a new function. That'll be get_titles, and we're going to give it a search term, so we'll just say search. Then we indent this and put our search term in here. Let's print them back out for now: news item is equal to search entries, and then for item in news item, print item.title. Let's check that works. All right, let's come out of there, and here we'll do lockdown and see if we get the titles back again. Great, there we go, all 100 of them or so. Let's change this to football; I can't think of anything else going on, my brain's frozen. There we go, a load there. Great.

If we print the item now, we can see what other bits of information we can get. We'll get the link as well, so let's print out the link to check that works. There we go. So now we can assemble the rest of our scraper. I'm just going to call this story; article or story, story will be fine. Now we create our dictionary: our title is equal to item.title, and then the link, item.link, because if this one interests us we might want to click on it. Then let's create a blank list, stories, and for each one we append our story to it, and at the end we return the whole list. So when we run this function now, it's going to return a list of all of them. If I change it to print here and run this, we can see the title followed by the link for each one of the stories. I'm going to change it off football for now; let's go back to lockdown, flavor of the moment. There we go, great.

Just to tidy this up, I should probably move this outside of that function in case we want to write any other ones. Let's run it again just to make sure we're still all good. We are. So that's a really cool way to do it, and I'm glad I found pygooglenews. I'll put a link to the GitHub down below so you guys can go check it out. It does a lot more things than this, but I love being able to get the data out of websites without having to screen scrape, if you like. If you can download the HTML, or get the raw data somehow, that's so much better and so much quicker. Imagine trying to scrape the front page of the actual Google News site compared to this. I've got 17 lines of code, which probably could be slimmed down quite a lot, just to get the title and the link for a chosen search term.

So thank you guys for watching. I hope you've enjoyed this one. I've got loads more web scraping content on my channel already, and more to come, so if you're interested in this sort of content make sure you subscribe, and it doesn't hurt to hit that like button down below as well. Thank you very much, guys, and I will see you in the next one. Goodbye.
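The function built in the second half of the video (take a search term, collect a title and link per entry, return the list) might look something like the sketch below. It is a reconstruction under assumptions, not the 17 lines from the video: the name `get_stories`, the optional `client` argument, and the `limit` parameter are additions for illustration, and `FakeClient` is a stand-in with placeholder data so the sketch can run without network access.

```python
def get_stories(search_term, client=None, limit=None):
    """Return a list of {'title', 'link'} dicts for a search term."""
    if client is None:
        # Assumes `pip install pygooglenews`; imported lazily so the
        # offline demonstration below works without it.
        from pygooglenews import GoogleNews

        client = GoogleNews(country="UK")

    search = client.search(search_term)
    stories = []
    # A limit of None slices the full list, so every entry is kept.
    for item in search["entries"][:limit]:
        story = {"title": item["title"], "link": item["link"]}
        stories.append(story)
    return stories


class FakeClient:
    """Hypothetical stand-in that mimics the shape of a search result."""

    def search(self, term):
        return {
            "feed": {},
            "entries": [
                {"title": f"Placeholder {term} story 1", "link": "https://example.com/1"},
                {"title": f"Placeholder {term} story 2", "link": "https://example.com/2"},
            ],
        }


stories = get_stories("lockdown", client=FakeClient())
for story in stories:
    print(story["title"], story["link"])
```

Passing the client in keeps the function easy to try offline; calling `get_stories("lockdown")` with no client would go to the real service instead, and `limit=10` would keep only the first ten entries.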
Info
Channel: John Watson Rooney
Views: 5,488
Rating: 5 out of 5
Keywords: scraping news articles python, scraping news websites, scraping news headlines python, web scraping news articles, news scraping using python, google news scraper python, news scraper python, how to scrape news articles, learn python, python web scraper, pygooglenews, scraping google news
Id: rQXL9A0ST5k
Length: 12min 4sec (724 seconds)
Published: Wed Feb 24 2021