Python Web Scraping with Beautiful Soup and Regex

Video Statistics and Information

Reddit Comments

Just use scrapy

6 points · u/Mr_Again · Sep 23 2018
Captions
What's going on, everybody. Today's video is Python web scraping with Beautiful Soup. Before I jump into the Beautiful Soup portion of it, I want to show you a quick and dirty version as well. But before we do any of that: what's web scraping? Web scraping is when you want to extract data from a website. Imagine you're checking out the leaderboards for your favorite video game and there's a table with, say, a thousand records of people's information: maybe their name tag, their score, their plays, and whatever else. Web scraping would be programmatically taking all of that data and packaging it up into something you can use, and I'll show you a couple of techniques you can use to go about that.

I built a file here for us to scrape data out of. This file is basically just a giant table with a bunch of data. A couple of the rows I've marked with the class "special", and those come into play later with Beautiful Soup. I also have a second table at the very bottom that's inside a div with the class "special-table"; again, this will come into play when we use Beautiful Soup. Besides this one file, we're also going to do some scraping on live sites, just to show you some variety.

The first example is just a quick and dirty scrape. It's really nothing more than using requests to request the data; here I provide the URL to that special HTML file that I have. The second step is supplying some regex that can extract data out of that HTML file. I don't think I mentioned it before, but note that this HTML file is full of names, emails, and phone numbers, and it's all just generated data; it's all garbage, none of it is real. All I've really done here is create two regexes: one that matches the phone number format for the data I found in that file, and one that matches just a really basic email format. It's not a great regex; I just threw it together real quick for the purpose of the video. We can test this by printing out those two things, the phones and the emails, then coming over to the terminal and running it. What we're looking at here is a giant array of emails, and if we scroll up enough you'll also see the array of phone numbers.

The upside to this method is that it's quick, it's easy, and it searches through the entire text; it doesn't matter if it's HTML, it could be anything. The downside is that it's kind of a dragnet method: it's going to get you everything, and occasionally you're going to get garbage back, things that matched your regex but aren't actually phone numbers. Just to show you that this works with really any file, I pulled up this site. It's just a golf site, and on the club contacts they always have their emails and sometimes phone numbers. If I copy this URL, drop it in place here, and rerun it, you'll see that I get all the emails and phone numbers that are present on the page. This is true no matter what site I use: I can hit back, pick a different one, say the second one instead, copy that, drop it in here, save, clear, rerun, and poof, I've got all the phones and emails. It's kind of basic, but that's the quick and dirty method of scraping.

Now on to the more sophisticated method, the one that uses Beautiful Soup. Using Beautiful Soup starts out a lot like the other one: you import requests so you can get the content of the site, and you import Beautiful Soup. Once you've got the content of the site, you can load it into Beautiful Soup, and what this does is take all of the HTML and represent it as the soup variable, which you can use to find various elements, iterate over elements, and things like that. We'll do a bunch of examples now.

Since this is less of a dragnet method and more of a specific, pick-the-exact-data-you-want method, you first have to analyze what the data you're scraping looks like. So I'll go over to our HTML file: we know there's a table, inside the table there's a number of tr elements, and inside each tr element there's a number of td elements. This is an important step in web scraping: you have to analyze the structure of the data you want to scrape first, because that's how you're going to instruct Beautiful Soup where to actually look for that data.

Our first example is going to be getting data simply by looking for each tr element. To do this you can use soup.find_all('tr'). Because this returns a list of tr elements, you can iterate over it, so you can do for tr in soup.find_all('tr'). Let's recap something important here: the soup variable is a representation of the entire HTML document, so when you tell soup to find all the tr elements, every tr is now a representation of that tr and everything beneath it. What that means is that just like I did soup.find_all('tr'), I can now do for td in tr.find_all('td'). We'll stop here, just print out td, and run it. Notice I got all the table data, name, email, phone, all the way down, but you can see that the td tags are still there, and that's because each td is still a Beautiful Soup element. If you want to extract the actual text out of the element, it's very simple: all you do is add .text, and when we rerun it, now we have just the data.

If we want a list of lists of all the data that's in each tr, we can modify this to use a list comprehension: values = [td.text for td in tr.find_all('td')]. Now I'll just print the values and check it again. Notice the empty list there: that's the header row, the one that has a tr element, but inside that tr are th elements rather than td elements. There's just a little bit more to do to get this into a real list: I'll create data and then data.append(values), and now we have a legit list here.

The next example is finding things that have a specific class applied. Remember when we were looking over this, I said there are some elements here with the class "special". What we're going to do is extract the data from these tr elements, but only if the class is "special". Doing that is almost exactly like the other one, so much so that I'll just copy and paste it. The only change is that in find_all you can specify which class you want to look for; in this case I'm specifying just "special". Then we'll output the resulting data and run it again. You can see it only extracted a few records, and that's because it only looked for the tr elements that have the class "special". If you come over here, you can see that Cullen is the first one that has the class "special" applied. If you wanted to look for specific IDs, you would just change class to id and it would work the exact same way.

Now, when you're web scraping for real, it's probably not going to be as simple as a page with a giant table on it; you're going to have to do some hunting for the data you want. In this file, the one other thing I added was a special element down here: a div with the class "special-table", and it has a table beneath it. Imagine you're on a page and you found a table of data that you want, but there are also other tables of data on the page. You can look in the source, find which class that div has, and then use that in Beautiful Soup to narrow down the scope of what you're scraping to just that div. This example starts out the same as the rest, but here's where the difference comes in. Notice that up until now I've been using soup.find_all, and that's because I wanted to find all of the tr elements. But if
you just want to find one element, you can use find instead: div = soup.find('div', ...). Just like in the second example, I can use a dictionary here to specify the class, {'class': 'special-table'}. Now div is a variable that represents that div with the class "special-table", and the table inside it. From here, extracting the data is the exact same as the rest, except that instead of soup.find_all I'm doing div.find_all, and then I'll just output the data so we can test it. You can see the results are Kennedy, Graham, and Aristotle, and if we look in our HTML file, we can see that those are the three names in that special table.

Hopefully now everybody can see the power of find and find_all. They basically let you select any element anywhere on the page and iterate over any child element, of any class name, of any ID, of any amount. Just these three examples alone should let you scrape a litany of the data out there.

So let's put everything we've learned into practice and scrape some data for real. I found this gaming site, and it has a leaderboard; I thought for this exercise we could extract the place, the username, and the XP. Let me arrange this so I can open it frequently. Remember, I said the first step to scraping data is to understand what the data looks like. I see the username "hydro" on the leaderboard, and I know that's going to be helpful for finding things, so I make this bigger. This is all the content for the site, and I need to get to wherever "hydro" is. Here it is. Now I just have to go up in this HTML structure and figure out where it starts. I've got span, div, div, so maybe this isn't it; I'm looking for more of a table. Here it is: I've got the place, the name, and the XP, so this is where I'm at. I just need to go up here and figure out where this begins: I've got td, tbody, table, and the table has the ID "leaderboard-table". Perfect.
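The two selection techniques just described, filtering find_all by class and using find to narrow the scope to one container before searching inside it, can be sketched roughly like this. The HTML snippet is a made-up stand-in for the demo file, not the actual file from the video:

```python
from bs4 import BeautifulSoup

# Stand-in for the demo file: a main table with some rows marked
# class="special", plus a second table inside div.special-table.
html = """
<table>
  <tr><td>Alice</td><td>alice@example.com</td></tr>
  <tr class="special"><td>Cullen</td><td>cullen@example.com</td></tr>
</table>
<div class="special-table">
  <table>
    <tr><td>Kennedy</td></tr>
    <tr><td>Graham</td></tr>
    <tr><td>Aristotle</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all with a class filter: only rows marked class="special"
special_rows = [
    [td.text for td in tr.find_all("td")]
    for tr in soup.find_all("tr", {"class": "special"})
]

# find returns the first match; use it to narrow scope to one div,
# then call find_all on that element instead of on the whole soup
div = soup.find("div", {"class": "special-table"})
names = [td.text for tr in div.find_all("tr") for td in tr.find_all("td")]

print(special_rows)  # only the rows with class "special"
print(names)         # only the names inside the special-table div
```

Calling find_all on div instead of soup is what limits the search to that one container; everything outside it is invisible to the query.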
That's exactly what I need: I can use Beautiful Soup to go directly to that table and extract the data inside it. As usual, I take the URL, drop it into my script, and parse the resulting data the same way I always have. The first step is getting the leaderboard, and remember, we said that was a table with the ID "leaderboard-table", so I do find('table', id='leaderboard-table'), and that should get me the leaderboard. I can verify that by printing the leaderboard and running it: I see all the data, I know it's the leaderboard, so I'm good so far.

Looking back at our data, table is the top-level element, and I want to skip over the thead and go straight for the tbody, so that's the next thing I write up here: tbody = leaderboard.find('tbody'). Again, I can print tbody just to verify that everything is still good, and it is; I'm still getting data. Now that we've isolated the data down to that tbody element, we can jump right into the actual tr elements: for tr in tbody.find_all('tr'). We can test that out, and we're still getting a bunch of data; I see some XP, so we're still getting the right stuff.

The next step is to extract the individual pieces of data. I want the place first, and I can see that it's in the zero-index td element, and I want the text from that, with all the whitespace around it stripped. So we write place = tr.find_all('td')[0].text.strip(), and then we'll just print the place to make sure we're on the right track. There we are: first, second, third, all the way to 25th.

Now let's analyze the second piece of data, which is the username. The username is buried here in the text: it's in the one-index td, inside the second anchor tag, and it's the text inside that. This is still pretty simple using the same stuff we've learned so far: username = tr.find_all('td')[1].find_all('a')[1].text.strip(), the stripped version with all the newlines and whitespace removed. Print out the username, and we'll see if we're on the right track. There we are: now we've got all the places and all the usernames.

Last is the XP. Remember, the place was in the first td, the username is in the second, the third is something we don't care about, and the fourth is the XP. This looks a lot like the place, so we just take the place line, copy it, change the 0 to a 3, and that's the XP. Run it again, and there we have it: we've got the place, the username, and the XP for everybody on the leaderboard.

Let's recap our code. We started by getting all the content from the site, then we loaded all of that content into Beautiful Soup and got a soup object. We then isolated just the content we wanted, which was the table with the ID "leaderboard-table", and we further narrowed it down to just the tbody within that table. Then we looped over every tr element and began to extract the pieces we wanted: for the place we found all the tds, took the first element, and took the text; for the username we looked in the second td, found all the anchor tags, looked inside the second anchor tag, took the text, and stripped it; and the XP was just like the place. That's it for web scraping: I showed you a quick and dirty way, and I showed you a more sophisticated way with Beautiful Soup. As always, if you have any questions, leave them in the comments, and get out there and scrape some data. See you in the next video.
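The full workflow from the recap can be condensed into one sketch. The HTML string below is a made-up miniature of the leaderboard page (the video fetched a live URL with requests instead), so the usernames and numbers are placeholders:

```python
from bs4 import BeautifulSoup

# Miniature stand-in for the leaderboard page; in the video this HTML
# came from requests.get(url).text rather than a literal string.
html = """
<table id="leaderboard-table">
  <thead><tr><th>Place</th><th>User</th><th>W</th><th>XP</th></tr></thead>
  <tbody>
    <tr>
      <td> 1st </td>
      <td><a href="#">badge</a><a href="#">
        hydro
      </a></td>
      <td>10</td>
      <td> 9001 </td>
    </tr>
    <tr>
      <td> 2nd </td>
      <td><a href="#">badge</a><a href="#">player2</a></td>
      <td>8</td>
      <td> 8500 </td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Narrow the scope step by step: the table by ID, then its tbody,
# which conveniently skips the thead header row
leaderboard = soup.find("table", id="leaderboard-table")
tbody = leaderboard.find("tbody")

data = []
for tr in tbody.find_all("tr"):
    tds = tr.find_all("td")
    place = tds[0].text.strip()                      # first cell
    username = tds[1].find_all("a")[1].text.strip()  # second anchor in second cell
    xp = tds[3].text.strip()                         # fourth cell
    data.append([place, username, xp])

print(data)
```

Searching inside tbody rather than the whole table is what avoids the header row here; otherwise the th-only row would produce an empty td list and the index lookups would fail.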
Info
Channel: Engineer Man
Views: 174,061
Rating: 4.9780254 out of 5
Keywords: python, beautiful soup, web scraping, engineer man
Id: F1kZ39SvuGE
Channel Id: undefined
Length: 14min 24sec (864 seconds)
Published: Sun Sep 23 2018