BeautifulSoup 4 Python Web Scraping to CSV Excel File

Captions
Hey guys, it's Ryan, and in this video we're going to scrape the web with Python and Beautiful Soup 4. As promised, I'm going to show you how to scrape socialblade.com, and once we've done that we're going to put the data into a CSV file so we can open it in Excel. With that said, let's jump straight into the tutorial.

Okay guys, so the first thing you want to do once you're on your computer is install Beautiful Soup 4. If you have the pip package manager installed, you just run pip install beautifulsoup4 and that will get it installed; I already have it, so it just goes through a quick reinstall. The next thing you want to do is make your Python file. In the terminal you can just run touch scrape.py and that will create the file for you, or of course you could create it normally as well.

With the file open, the first thing we want to do is import urllib2. To be clear, I'm using Python 2 here, not Python 3. If you are using Python 3 there is a different urllib, but I know Beautiful Soup works with Python 2, so I'm just going to keep using that for now. The other thing we want to do is import Beautiful Soup: from bs4 import BeautifulSoup.

Once we have that, we need to establish which page we're going to scrape, and for that we need the URL. So open up socialblade.com, go to the top 50 YouTubers, and sort them by video views. I want to get all these YouTubers, their names and their video views, and store that in a spreadsheet. As the first part of this process, we can copy this link and throw it in a variable; I originally called it page, but let's just rename it to url, since we'll want page for the loaded result.

Now we need to actually load the page. To do that, we create a web request: urllib2.Request, making sure Request has a capital R, and for now we'll just pass it the url. I'll show you in a second why this on its own won't work. Then I'll create page as another variable: urllib2.urlopen, passing it that request.

Then we need to create our Beautiful Soup object, so soup = BeautifulSoup, passing it the page, and this is where you pass in your parser. Beautiful Soup works with multiple different parsers, but the easiest one to use is just html.parser, so if you're not too sure about the differences between parsers, html.parser is a good bet. Perfect, so now that we have that, we basically have the web page scraped.
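Here is a minimal sketch of the setup described so far, assuming Python 2 with the bs4 package installed. The Social Blade URL below is only a placeholder for whatever top-50 link you copy from your browser.

```python
# Sketch of the initial setup (Python 2, as used in the video).
import urllib2
from bs4 import BeautifulSoup

# Placeholder: paste the actual top-50-by-video-views link copied from Social Blade.
url = "https://socialblade.com/youtube/top/50/mostviewed"

request = urllib2.Request(url)             # note the capital R
page = urllib2.urlopen(request)            # this fails with a 403 until we add a User-Agent later
soup = BeautifulSoup(page, "html.parser")  # html.parser is the simplest parser choice
```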
Hopefully, what we want to do now is see if we can find some tags in it. The one trick with Social Blade, and it's a good site to use as an example because it is tricky, is that their CSS doesn't really use classes; almost everything is inline CSS. I'm not sure if that's intended to deter people from scraping the website or why they made the page like this; it seems to me they probably just don't really know how to do web programming. But it's all these inline styles and barely anything has a class. You can see some elements use classes, and then there are all these inline styles, which is really weird to me, but it probably has something to do with the way the page is coded. In any case, there's no class on each of these rows, which is what I would typically use to identify them.

What we can do instead is search on the inline style attribute. You use soup.find if you just want to find one object, and soup.find_all if you need a list of objects. In this case I want to find every row in the table, and you pass in the name of the tag you want as a string; I'm looking for divs. Actually, my apologies, this is going to be a regular find, because the issue I had before was trying to find each row directly: they all have a slightly different background color. You can see this one is #fafafa and this one is #f8f8f8, so you can't pass a single one of those style strings in. What you want instead is to find the parent, then grab most of the list of its children and trim the extra ones off at the top. That's how we'll do it.

So first I'm just trying to find the parent with the inline style float: right; width: 50px. We pass attrs as a dictionary, with "style" as the key, and then you can just paste that style in as the value. You could put "class" there with a class name instead, or another way to match a class is class_="name", in which case you obviously wouldn't use attrs at all. But for whatever reason this site uses inline styles, so if you do come across a site like that, this is how you can deal with it. It's a pretty big pain, to be honest with you.

Then on that parent we call find_all to get all the divs under it, the divs that are its children, and one thing I'm going to do here is turn off recursion. Normally, find_all("div") would find every div that is a descendant of this div, including ones that are inside other divs. If I set recursive=False, it will only find the surface-level children: it gives us a list of just these row divs, as if we were looking at the tree without expanding any of the arrows, so it won't pick up any of the divs nested inside each row. That's exactly what we want, because normally it would find this div and then all the divs inside it too, but in this case we want a single row at a time so we can work with that object.

I'll call the result rows, my apologies, and then we can do a print rows just to see if everything is working properly. At this point I'm not expecting it to work, because of that issue I mentioned: we haven't supplied a user agent, so the site is just going to deny us. But I'll run it so you can see what that looks like.
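Here's a sketch of that row-finding step, assuming the soup object from above. The inline style string has to match Social Blade's markup exactly, so the value below simply mirrors what is read out in the video and should be copied from the page yourself.

```python
# Find the parent container by its inline style, then take only its direct children.
# The style string must match the page's markup exactly -- copy it from the browser's inspector.
table = soup.find("div", attrs={"style": "float: right; width: 50px;"})

# recursive=False limits the search to surface-level child divs (one per row),
# instead of every div nested anywhere inside the container.
rows = table.find_all("div", recursive=False)
print rows
```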
Perfect, so it's given us an HTTP error. Now, if we wanted to be really thorough script writers, we could actually catch this error with a try/except. That's not a bad idea if you're going to be using the script a lot, because every once in a while you will hit an HTTP error, but if you're writing a script to scrape data one time, the error catching really isn't necessary and just slows you down.

In this case we got HTTP Error 403: Forbidden, which is really weird because everything works fine in my browser. The difference is that urllib isn't passing any user agent. So what you can do is go search "what is my user agent" on DuckDuckGo or Google, and it will show you your entire user agent string; copy that bit, which in my case starts with Mozilla. Then, after the URL, we can pass headers to the Request, and the only one we need is that User-Agent, so we just throw our user agent string in there. Now the site won't know the difference between our script and our browser. Obviously there are other ways to tell a script from a browser, but for our purposes this is good enough to convince Social Blade that this is a browser. And now if we run it again, it actually shows us the data. It's a pretty messy printout, but it is what we need. Perfect.
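A sketch of the User-Agent fix, again assuming Python 2's urllib2. The header value below is just an example of a typical browser user agent string; paste in whatever your own browser reports.

```python
# Re-create the request with a User-Agent header so the site treats us like a browser.
headers = {
    # Example value -- replace with the user agent string your own browser reports.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
request = urllib2.Request(url, headers=headers)
page = urllib2.urlopen(request)
soup = BeautifulSoup(page, "html.parser")
```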
Okay, so now we've basically got all these rows, but we also got a few rows at the start that don't contain what we need: four rows we just want to cut off. We can use Python's list slicing with [4:], which takes index four and beyond. Remember, lists are zero-indexed, so we drop the first four, start with the fifth element, and, since we don't specify an ending point, run to the end of the list.

Then we say for row in rows. For each row we find, we want to get the username, the upload count, and the total video views. They're already ordered by rank, so I'm not going to worry about scraping the rank, but you could if you wanted to. Underneath each one of these rows, you'll see they continue their inline-style theme, which is super duper annoying, and you can see multiple elements even share the same style, so we'll use the same strategy plus list indexing to actually find what we're looking for.

The first thing we're looking for is the username. You can see there's a div with an anchor tag, and the name is stored as the text of that anchor. Since it's the only anchor in the row, we might as well just find the anchor directly: username = row.find("a").text.strip(). The strip removes any whitespace; it's never a bad idea to call strip on anything you scrape from the web, since you don't always know whether what you're pulling in is free of whitespace. And like I say, in this case it's the only anchor tag, which is why I can simply find "a" and pull its text to get the username.

The other thing is the spans with the style color: #555. I'm going to call it numbers: row.find_all with "span", and attrs with the style set to that color. That pulls in, I think, three elements, and then we can use indexing to pick out the exact data points. The first one is the number of uploads, so uploads = numbers[0]. The third one should be the video views, so views = numbers[2]. Then we can simply print username plus a space plus uploads plus a space plus views, and that should give us a good idea of whether the program is pulling the right data or not.

Let's run it and see what happens. We get a key error, excuse me, and that's because I wanted a find_all rather than a find, and I also want to add .text.strip() on those values. My apologies, I'm doing this a little differently than I had in my notes, but I think it's the better method. Perfect, and we got exactly what we're looking for: the username, followed by the number of uploads, and then the total number of views they have. We can see T-Series right at the top with around 13,000 uploads and over 68 billion views, which is pretty crazy. So yeah, that's awesome.

Now that we've got the data scraped, what we really want to do is put it in some sort of file so we can access it later, because just printing it in the terminal obviously isn't that helpful. In this case the file type I want to use is CSV. I could record it in any kind of file, but a CSV file is pretty simple both to write from Python and to work with after the fact. To do that, I import csv at the top, and as part of my initialization I open the CSV file: file = open, and I'll call it topyoutubers.csv, so comma-separated values, opened for writing. We also need a CSV writer, so writer = csv.writer, and we pass it the open file.

I'm also going to write a header row. When you pull a CSV file into Excel or any other program, it's nice if it has that header at the top; a lot of programs look for it, and if you're going to give this data to anyone but yourself, you'll want some way to identify what the numbers in the rows mean. The username is pretty obvious, but it's not obvious that the uploads are uploads or the views are views. So we call writer.writerow, which takes a list of the comma-separated values to put on the row, and give it Username, Uploads, and Views as our header items.

Inside the loop I'm still going to print, just so we can see the program is operating correctly, and after the print we do another writer.writerow with the same values. The one thing we need to do is encode the username as UTF-8, otherwise it will complain that ASCII can't encode some values. I still find some of the foreign characters don't come through too well, but that's a battle for another day. Then we call file.close() at the end of the script.

So now when we run it, we should see the same printout, but also a resulting file: topyoutubers.csv. We can go ahead and open that, and in my case it opens with Microsoft Excel, but you can use pretty much any program; you could even open it in Notepad and it would be pretty readable. We can see we have these header titles, so that's perfect.
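Putting the loop and the CSV output together, here's a rough sketch of what that part of the script could look like (Python 2, as in the video). The style string, the slice offset, the indices into numbers, and the filename follow the transcript, but the exact values depend on Social Blade's markup at the time.

```python
import csv  # at the top of the script with the other imports

# Open the output file and write a header row first.
file = open("topyoutubers.csv", "w")
writer = csv.writer(file)
writer.writerow(["Username", "Uploads", "Views"])

# Drop the first four rows, which don't contain channel data.
for row in rows[4:]:
    # The channel name is the text of the only anchor tag in the row.
    username = row.find("a").text.strip()

    # The numeric columns share this inline style; copy the exact string from the page.
    numbers = row.find_all("span", attrs={"style": "color:#555;"})
    uploads = numbers[0].text.strip()   # first span: upload count
    views = numbers[2].text.strip()     # third span: total video views

    print username + " " + uploads + " " + views

    # Encode the username so non-ASCII channel names don't break the CSV writer.
    writer.writerow([username.encode("utf-8"), uploads, views])

file.close()
```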
Obviously we can't put any styling on those titles, because CSV doesn't support those kinds of features; it's just a basic plain-text file format that Excel and other spreadsheet programs interpret, which is why Excel warns about potential data loss: if we tried to do something like bold this or make it italic, that couldn't be saved back into the CSV properly. But we have the data here in Excel, so we can do some further analysis on it.

So that's all I wanted to show you guys today. If you did enjoy the video, be sure to give it a thumbs up. Remember, I'll have the link in the description to my number one recommended Python book if you're looking to learn more Python. If you enjoyed the video and want to see more in the future, be sure to subscribe and leave a comment, and I'll see you guys in the next video.
Info
Channel: SyntaxByte
Views: 39,810
Rating: 4.8874826 out of 5
Keywords: python, web scrape, web scraping, beautifulsoup, beautiful soup, beautiful soup 4, beautiful soup 4 python tutorial, excel, csv, python to excel, python to csv
Id: JfU1G1Ug6-k
Length: 19min 16sec (1156 seconds)
Published: Tue Apr 23 2019