30 Days of Python - Day 12 - Web Scraping Box Office $$ Numbers - Python TUTORIAL

Captions
Hey there, welcome to day 12. In this one we're going to be scraping websites to extract data. You may not be familiar with what that is, but you've undoubtedly used the result of it. Look at something like Google: it scrapes data from websites. It opens up websites, crawls them, extracts the data it finds interesting, and turns it into a search engine. That's something Google has done very well, and nobody's even come close to them. So what we want to do here is learn the basics of web scraping and use some of the modern tools for executing it. The challenge is parsing the data, not grabbing it; grabbing the data is super easy because of Python Requests, and even without Requests it's not that hard to open up a web page.

Let's take a look at the website we're going to be using: BoxOfficeMojo.com. If you're not familiar with it, it lists the box office sales for any given movie. If you click on "Worldwide" you'll see the 2020 worldwide statistics, or whatever year you're watching this in; this probably won't change a whole lot in the future, although some parts of it might. The general concept is that we want a Python program to look for these numbers right here. A big part of the reason I want to do this: one, it's really useful to know how to do web scraping; two, IMDb itself doesn't have what's called an API. On a lot of services, down at the bottom you'll see something like "Developer" or "API"; an API essentially makes it really easy for Python to grab this sort of data. I definitely intend to cover working with an API as one of the days, so make sure you stick around for that. For now, I'm going to use Python to grab the data that's in here. It will certainly be helpful to understand some basic HTML and CSS to make this really effective beyond Box Office Mojo, but even if you don't fully understand them, you can still get a good amount out of web scraping.

So let's start on the code. I'll jump into VS Code, make a new directory in my project for day 12, and in there create a new file called scrape.py. I want to import requests, so first I'll make sure it's installed in my general Python 3 installation: python3 -m pip install requests. There's a good chance you already have it, since we've used it before. Before I write a bunch of code, I want to see the result of a request. I'll open the terminal, jump into Python 3, import requests, and grab our URL; I'll copy the entire thing and paste it in. One thing you can usually ignore in URLs is anything after a question mark. In some cases you don't want to lose those parameters, but here I definitely don't need the question mark or anything after it, so I'll strip that off and make sure the page still renders the same, and of course it does.
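As a quick sketch of that URL cleanup (the path is my reconstruction of the page used in the video, and the query string is a hypothetical example):

```python
# Everything from the "?" onward is a tracking parameter on this page,
# so we can drop it and the page renders the same.
url = "https://www.boxofficemojo.com/year/world/?ref_=bo_nb_hm_tab"  # hypothetical query string
url = url.split("?")[0]
print(url)  # https://www.boxofficemojo.com/year/world/
```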
Okay, so what I want to do is open up this page using Python, much like you would with a web browser. You'd say r = requests.get(url) and hit enter; if you check r.status_code you'll likely see 200, which is a success code. That's great. Then if we look at r.text, what we get is a bunch of HTML. If you're not familiar with HTML, it's the backbone of every single web page. To see the HTML for any given page, open your web browser and go to View > Developer > View Source. This works best on a desktop browser; mobile browsers aren't as good at showing source code. I'm in Chrome right now, but Safari, Firefox, and Opera all have a way to view the source; you might have to activate some developer tools to make it happen. Anyway, this is the HTML driving the page, and if I scroll to the bottom I'll see roughly the same content that our request returned. Some things might be missing because of JavaScript: JavaScript changes HTML pages, and the method we're about to use will not work on JavaScript-heavy pages. We have a method for those, and I will show you, just not today in day 12.

So we're able to open up this HTML; the real hard part is parsing it. Before the actual parsing, let's save this text into a file that we can open up again. I'll write def url_to_file(url):, and inside it r = requests.get(url); if r.status_code == 200, then html_text = r.text. We probably need a filename, so filename = "world.html", named for the end of this URL. To save it somewhere: with open(filename, "w") as f: followed by f.write(html_text), and I'll even return html_text in case I change this to something different later. In general you might want to save the resulting HTML; you don't have to, but I want to put into practice the things we've already learned. I'll call the function right inline just to make it easy. Now I'll exit the shell, cd into day 12, and run python3 -i scrape.py... and I get an error that the Response object has no such attribute, because I spelled status_code incorrectly. Don't you love spelling errors? With that fixed, I ran it, and looking at day 12 in the Finder, what do you know: there's world.html. If you open it in your web browser, the page is now on my local system, so everything about the data we need has already been saved. This is an interesting way to store data: if you don't know how to parse it yet, this is a really clever way to keep it around, relevant to whatever you're working on.
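Here's a minimal sketch of url_to_file as described, assuming the reconstructed Box Office Mojo URL from above:

```python
import requests

def url_to_file(url, filename="world.html"):
    # Fetch the page; on a 200 response, save the raw HTML locally
    # so it can be re-opened and parsed later without re-downloading.
    r = requests.get(url)
    if r.status_code == 200:
        html_text = r.text
        with open(filename, "w") as f:
            f.write(html_text)
        return html_text

url_to_file("https://www.boxofficemojo.com/year/world/")
```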
Another way to think about this is to grab the date and time. If I import datetime, then say now = datetime.datetime.now() and year = now.year, I can make the filename a little different: "world-{year}.html", using f-string substitution. Exit, scrape again, and now I have a year, or some sort of date component, in the filename. You can do a lot more, like day and month, pass it all into the filename, and save that locally. Once you do that, you can run this on all sorts of web pages and come back later to figure out how to parse the data, because it's stored and saved. We don't need to do that right now, but it's an interesting way to go about things.

What we really want is to find the data we need. In my case I've got this 2020 world box office data; it's already nicely structured, but I need to extract it and turn it into something like a CSV file, a comma-separated values file. So how do we find any individual element in here, like this entire table and nothing else? For this we'll use another package: python3 -m pip install requests-html. It's used in conjunction with Python Requests, or you can use requests-html on its own. The reason for requests-html is that it's much better than Beautiful Soup. If you've watched any of my other web scraping material, Beautiful Soup is something I've used in the past; requests-html is definitely the preferred method now, so make sure you install that one.

With this in mind, I'm going to rename the function url_to_txt, as in the HTML text, and add save=False as a parameter; if save is true I'll actually write the file, but my default is false. It is nice to know that we can, though. Next, now that we've got this URL text, I need to extract what's in it. Exit and re-enter the interactive shell; running the script fetches that URL for me, so I should see the HTML text. What requests-html allows me to do is say from requests_html import HTML, the HTML class, and pass that string of text into it. That will let me use the built-in features of requests-html to start parsing the standard HTML elements coming in here. Do keep in mind, if you want to get really nuts, you could learn to parse each individual item yourself and not rely on a package at all; I think that would be a really interesting exercise for those of you who want to get even deeper into this.
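Before moving on, here's roughly where the script stands, as a sketch of url_to_txt with the save flag and the dated filename (not a verbatim copy of what's on screen):

```python
import datetime
import requests

def url_to_txt(url, save=False):
    # Return the page HTML; optionally keep a dated local copy,
    # e.g. world-2020.html, via f-string substitution.
    r = requests.get(url)
    if r.status_code == 200:
        html_text = r.text
        if save:
            year = datetime.datetime.now().year
            with open(f"world-{year}.html", "w") as f:
                f.write(html_text)
        return html_text
    return ""  # non-200 responses yield an empty string (revisited later)
```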
So I'll say r_html = HTML(html=html_text): we pass in the HTML string, and it is a lot of HTML, via that variable. Now if we look at r_html we get an object. Notice it's not showing a URL; that's because I'm not passing one in. There is a way to work from the URLs themselves, but I won't do that here, because it takes something we know and adds an extra feature we don't necessarily need. What we can do is something like r_html.find("table"). Hit enter and it gives me a list of things matching the table element. If you're familiar with HTML you can search for all the different elements, h1, a, all of that, and go even deeper with nested finds, which we'll talk about in a second.

But how do you actually know what to look for? On the box office page, I'll right-click (or control-click) and choose Inspect; this is called inspecting the element. Now I can bring my mouse over any given element and it highlights what that element is on the page. In this case I've got a div with an id that appears to be the actual table we need, which I can test by hitting "Delete element", and what do you know, the table goes away completely. Refresh the page, inspect again, and navigate up. There's another element called table, but it doesn't seem to be the whole thing; the highlighted blue area is just the very top part. The only element covering the entire table is this one right here, the same one as before, so it's definitely the one. I also notice how they named their classes. Don't worry if you don't know exactly what this means, but you can see it says class= followed by a string, and the naming gives you an indication that this might be the entire table: it says imdb-scroll-table. That's not amazing, but it's pretty simple, so let's try it out.

Back in the code, I first have to redeclare r_html = HTML(html=html_text) and make sure the class is imported: from requests_html import HTML. This is how we declare it; it turns any HTML string into something managed by requests-html. Pretty cool. Now I'll add table_class and set it to that class string. I'm not making any assumptions just yet, so I'll grab the entire thing; this is the table class I'm going to attempt to find. Then r_table = r_html.find(table_class), passing in that string, and I'll print out what the result is.
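A sketch of that step, continuing from url_to_txt above:

```python
from requests_html import HTML

html_text = url_to_txt("https://www.boxofficemojo.com/year/world/")
r_html = HTML(html=html_text)   # wrap the raw string so we can query it
print(r_html.find("table"))     # every <table> element on the page
```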
Exit the old interactive shell, press up, run the new one... and I get an empty list. It's looking for that class string but not finding anything. Remember, if I change it to just "a" it will find something: exit, run again, and this time it shows all the links on the page; the a element is for links, an anchor tag. So we want to find one of the class-named elements, and with HTML classes the way you find them is with something called a selector. Like I said, my intuition is that it's this class right here, and I'll check the number of instances that match; if there's more than one, I have to go back to the drawing board. For classes, the selector (that's the technical term) is a period in front of each class name that's in there, something like .mojo-gutter.imdb-scroll-table. If you're looking for a specific id, you use a hash, so another potential option is table_class = "#table", using that id. If you're familiar with CSS, this is very clear to you; same if you know jQuery, or even plain JavaScript. It's pretty basic HTML stuff.

So run the interactive shell and scrape again, and what do you know: I have exactly one element in this list, as you can see from the enclosing brackets. This element is another requests-html class, just called Element, which means I can search inside of it as well. Another thing I can do is grab the value inside any given HTML element. Let's say: if the length of r_table equals 1, print r_table.text. Much like plain old Python Requests does r.text, we should be able to do the same thing with requests-html. Into the interactive shell again... and I get an error that a list has no attribute text. That's just a little mistake: r_table is a list, as we verified with the length check, so we want the very first position, the zeroth index. Back to the terminal, run again, and what do you know, it gives me all of that data; you'll see the same on your own machine. That's pretty cool, but it's not as structured as I'd hoped.

What I'm going to do now is turn this table into a list of lists. In other words, the very first item in my list of lists will be the header, and the second will be the actual first row of the table. This will become a CSV, a comma-separated values file, at some point.
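The selector syntax in code; the class name here is as I read it off the video, so verify it against the live page:

```python
# "." selects by class, "#" selects by id (standard CSS selector syntax).
table_class = ".imdb-scroll-table"   # class-based selector
# table_class = "#table"            # the id-based alternative mentioned above
r_table = r_html.find(table_class)
if len(r_table) == 1:
    print(r_table[0].text)  # .text lives on the element, not on the list
```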
To do this I'll say parsed_table = r_table[0], the zeroth element, and inside of here, rows = parsed_table.find("tr"). TR typically stands for table row in HTML. If I tab through the rows in the shell, I see that I've got my table rows, and it's showing all of them. One challenge, though, is that I don't know where JavaScript might be causing an issue with what I'm glancing at. So before I go hunting for those table rows, let's see what JavaScript is doing. Notice that inside my scroll table there are two, actually three, table elements that could cause potential issues for this web scraping. With the console open (remember, under View > Developer; you don't want View Source this time, you want the developer tools; I just toggled it closed, so let's open it again), press Command-Shift-P, or Control-Shift-P on Windows, and choose "Disable JavaScript". After JavaScript is disabled, refresh. We had three tables in here, maybe even more; now, breaking everything down inside that scroll table, I only have one table, at least as far as I can tell at a quick glance. That is the difference between having JavaScript enabled and not: it changes the parsed page quite a bit, even though my browser looks pretty much the same. I'll re-enable JavaScript now (Command-Shift-P, type "Enable JavaScript", refresh), but understand that this is the reason I'm doing what I'm about to do.

So again, I can now find the rows with "tr", or at least I hope I can. I'll print out rows, exit the interactive shell, run it again, and there are a bunch of table row elements. Right now I have a list of elements; what I want to turn it into is a list of lists. Let's print each one: for row in rows, print the row. Of course that shows the element itself, so print row.text instead, much like I did above (and I'll get rid of that earlier print, we don't need it any longer). Exit and run the scrape again, and notice it's back to what we saw before, when it was the entire table's text. Which is nice; it's getting a lot closer to what we need. But scrolling to the very top (let's expand this a little), the very first row appears to be my header. So let's break this down, and I'm just making educated guesses here: rows[0] is going to be my header, and then I want to iterate through all the others, one and beyond, which is rows[1:]. Now if I run this, it should keep my header row separate and only print the actual data rows. To make it nice and clear, I'll close out Python, open a new terminal, make sure I'm in the correct day (back into day 12), and run python3 -i scrape.py. This time I should not see the header if I did everything correctly, and scrolling up, I don't. I just see the very first data row, which coincidentally has a number of 1, because it's ranked, which is based on all this other data.
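In code, the row split looks like this (continuing from r_table above):

```python
parsed_table = r_table[0]
rows = parsed_table.find("tr")   # <tr> elements: one per table row
header_row = rows[0]             # educated guess: the first row is the header
for row in rows[1:]:             # every row after the header
    print(row.text)
```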
Cool. So I now have my rows, but I need to turn them into something, which means looking at the data a little closer yet again. The first one I want to parse is the header, so inspect this header row. In this tr (and I could use the JavaScript-off trick again, but I don't think I'll need it right now, because of how HTML is structured in general) I've got th elements; th is a header cell. If we scroll down a little inside the Elements panel, the data rows are tr as well, and there's also a th, but notice this one says display:none. If I delete that element, hey, what do you know, the columns shift; so that hidden one really is part of what we're working with, even if that seems confusing. No worries. Each data tr has td elements: typically in HTML, the header row has th cells (table headers) and the data rows have td cells, one cell per column for any given row.

So, the long-winded way of saying it: columns = row.find("td"). Now I can print them: for x in columns, print x, with a newline or two after each. I actually want the iteration number too, as in what position each column holds, and for that I can use something called enumerate. Enumerate gives me the iteration count as well as the item, so I can write for i, x in enumerate(columns). As kind of a cool thing to know, you could do the same with a manual counter: set i = 0 before the loop, do the same print with i and x, then i += 1 at the end. That does the same thing; enumerate is just a built-in function that handles it for you, as shown in the sketch below. Now that I've got that, exit out. This should give me each row (maybe we'll want to enumerate the rows as well), and then the columns in that row with their iterations. And there we go: opening this up, notice these are elements, not text yet, which is nice because it differentiates things a little, and they're numbered 0, 1, 2, 3, 4, 5, 6. Pretty cool.

The next thing, of course, is to turn those columns into text: x.text. And maybe we should be a little more explicit by renaming x to col, as in column. Exit out of this, oops, run it again, and now we see the actual values. So A Quiet Place Part II looks like it made a hundred and sixty-three dollars. Let's go into Chrome and scroll to the very bottom: A Quiet Place Part II, one hundred and sixty-three dollars. And what row number was it? It has a rank of 32, which we can tell by this right here; that was the final one. All right, cool. With this new knowledge we can get the header names themselves. I'll say header_names = [x.text for x in header_row], and since this variable is really the header row, let's just call it header_row. Now we iterate through all the columns of every row, which is what's really interesting here: I'll say row_data is equal to an empty list, and then row_data.append(col.text).
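Both loop styles side by side, as a sketch:

```python
for row in rows[1:]:
    columns = row.find("td")          # <td> cells within this data row
    for i, col in enumerate(columns): # enumerate hands back position + element
        print(i, col.text)

# The manual-counter equivalent of enumerate:
for row in rows[1:]:
    i = 0
    for col in row.find("td"):
        print(i, col.text)
        i += 1
```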
Okay, I'm going to get rid of those print statements for a moment, and along with row_data I'll make another list called table_data, also set to an empty list. Remember, I'm making a list of lists, so table_data.append(row_data). Finally I'll print out what table_data is; I'll actually declare table_data up top, and also declare an empty list for header_names, and print out both of those. So: it parses the table, looks for all our rows, assumes the very first row is a header row because of our earlier investigation (obviously this won't work everywhere), then loops through all the data rows, and loops through each row's columns, and each column goes into its own list at a specific position, an index that corresponds to its place inside row_data. All of those things should match up. Exit out of here, run again... and I get "Element object is not iterable". Of course. That's because I didn't do the thing I should have done on the header row; I missed one piece we did talk about, and it's th: header_names = [x.text for x in header_row.find("th")]. That's roughly the same as what we did below, just in one line. If you're more comfortable writing the for loop across multiple lines, by all means go ahead and do it. I realize this is maybe a slightly advanced syntax, but you've got to start seeing the real things as they happen.

Now if I open this up, here we go: I've got my header names, rank, release group, and so on. And looking at the very first item in the table data, I've got the rank, the name of the movie, and so on. It's actually giving me all the things I want, so let's test it out. Let's take the first element, table_data[0]: Bad Boys for Life. Okay. And header_names: there we go. So what's the release group, the movie name, of the table data at, let's say, position 5? This should give me a rank of 5, potentially 6 if we count our positions right, and then the name of that movie. Hit enter: The Gentlemen. So it's actually rank 6; remember, indexes start with zero. And the worldwide gross is a hundred seventeen million and some change. Checking rank 6 on the site: The Gentlemen, 117 million and change. Cool. I have now scraped this data successfully, even if it felt like a bit of a stretch. I certainly want to improve this over time, but before I do, I'm going to save it to a CSV file. I'll exit, clear the terminal, and install one more thing: python3 -m pip install pandas. Pandas makes it really easy to work with CSV files. There's a lot that goes into making pandas work really well that I'm not going to cover right now; instead I'm just going to pass the data we have into pandas.
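Before the pandas step, here's the parsing pulled together as one sketch, assuming rows from the earlier sketches:

```python
header_row = rows[0]
header_names = [x.text for x in header_row.find("th")]  # <th> cells hold the labels

table_data = []
for row in rows[1:]:
    row_data = []
    for col in row.find("td"):
        row_data.append(col.text)
    table_data.append(row_data)   # one inner list per movie, ordered like the header

print(header_names)
print(table_data[0])              # e.g. the top-ranked movie's row
```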
It's really just so we can save a file; that's all we're doing. We're not manipulating things, just saving the file itself. So in VS Code I'll do the conventional import for pandas: import pandas as pd. Before I finish this off, I do want to mention that there are other ways to write CSV files in Python, convenient ones and inconvenient ones. Pandas is by far the most convenient, and it will also set you up for doing a lot more with data, CSV files, and other data-science-related things in Python, even if you never intend to go there. I do think it's a really good thing to learn eventually, once you have a better grasp of Python itself. Now we declare what's called a DataFrame, typically denoted df: df = pd.DataFrame(...), the DataFrame class off of pandas. This is how it's written all the time; people rarely import the class directly. We pass in our actual data, in my case table_data, along with the argument columns=header_names. Then df.to_csv("movies.csv", index=False). By all means try index=True and see what the result is, but in my case I'm going with False. Pandas makes it really easy to write all of this, and we'll see that now. Run the scraping program again; no errors in my case, so make sure you copy it exactly. Looking at my project in the Explorer, let's see this in a program you might be familiar with, like Numbers or Excel: reveal in Finder and double-click the CSV file (CSV files typically open in Numbers or Excel), and what do you know, I've got a header and all of that data. If you had left the index in, you'd see a whole other column, most likely starting from 0 and going to 31.

So now I have a way to take that data and put it somewhere I want, but I'm not quite done; I want to make this a little better. Going back to the 2020 worldwide box office, I want to get 2019, and this is where URLs are often very, very logical: the URL says 2019. Let's try guessing, I don't know, 2015, and what do you know: the 2015 worldwide box office. This is where web scraping becomes very valuable. You might think, hey, I could just copy and paste the whole table into a CSV file; that would probably work, no big deal. But we don't have to, and we don't want to repeat ourselves over and over again. So let's turn these things into functions a little better. I have the first function, url_to_txt; the next will be def parse_and_extract(url):, let's call it that. Again I'll pass in the URL; the URL handling is the only thing I'll leave out, and everything else gets tabbed in. I'll also take a name, a filename base; in my case I'll call it "2020". This name will be what's used for my CSV file, which I want to wire in correctly.
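For reference, the pandas step from a moment ago, as shown:

```python
import pandas as pd

df = pd.DataFrame(table_data, columns=header_names)
df.to_csv("movies.csv", index=False)  # index=True would add an extra 0..31 column
```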
Not to worry, I will go back up and make sure I do that. Okay, so first let's make sure there is a table with the class we expect; once we have that table, we've got the parsing, then the for loop, and finally we come down here and use that name I'm passing in. I'll probably want a location for this too; typically you wouldn't store output in the root of your project, so let's call it data/. In this case I could be a little lazy and just type out "data" for that directory, but going back, we'd really want os.path.join to get the actual path, and to create that path so it's completely reusable. And let's be honest, we should probably do that. So I'll say path = os.path.join(BASE_DIR, "data"), then os.makedirs(path, exist_ok=True). That means of course we need to import os. Whenever you feel like being lazy, don't be lazy; that's my rule. So BASE_DIR = os.path.dirname(__file__), the directory name of this file, and there we go. With that, the actual final path will be file_path = os.path.join(path, filename). A big part of the reason I'm doing this is for you Windows users who are like, "hey, that's not how my paths are set up, what the heck, man." Okay, cool. Now it's a much, much more functional function: parse_and_extract calls the other function we already set up. Let's call it now: parse_and_extract(url). For the first one I'll grab the world URL and do 2020, so at the very end I'll pass name="2020"; let's bring the call down right above where we run things. Exit out of here, and I apologize if that was really fast for you; I'm assuming you've done all this in the other parts, and if you haven't, keep in mind we do cover it there.

Let's run it: python3 scrape.py... and I get "path is not defined". Yes, of course, because that should be a lowercase path. Try again, exit and run, and now if I go into my Explorer, into data, I've got 2020 there. Of course I want to test that the URL and the naming actually work; in other words, I want to try 2019, passing name="2019". Save, run again, and hopefully this gives me a file right next to the first one. Sure enough it does, and I can open it right inside VS Code: I see the header, and Avengers: Endgame is number one for 2019, no surprises there.

So now what I want to do is run through and scrape any given year, and maybe go back several years from it. I did import the datetime module for the current year, so I'm going to cut that out and paste it all down into a function I'll call run. In here I'll take a start_year, and also years_ago, and I want to set some default values: for years_ago I'll just say 10, so it starts ten years back, and start_year will default to None.
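Before finishing run, here's my reconstruction of parse_and_extract as it now stands, with the os.path handling folded in; the selector and URL pattern are assumptions carried over from earlier, and url_to_txt is the function sketched above:

```python
import os
import pandas as pd
from requests_html import HTML

BASE_DIR = os.path.dirname(__file__)

def parse_and_extract(url, name="2020"):
    html_text = url_to_txt(url)
    r_html = HTML(html=html_text)
    r_table = r_html.find(".imdb-scroll-table")
    if len(r_table) != 1:
        return
    rows = r_table[0].find("tr")
    header_names = [x.text for x in rows[0].find("th")]
    table_data = [[col.text for col in row.find("td")] for row in rows[1:]]
    path = os.path.join(BASE_DIR, "data")
    os.makedirs(path, exist_ok=True)             # create data/ if it's missing
    filepath = os.path.join(path, f"{name}.csv")
    df = pd.DataFrame(table_data, columns=header_names)
    df.to_csv(filepath, index=False)
```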
And I want to assert that start_year is an instance of an integer: assert isinstance(start_year, int). The reason, of course, is that we want to make sure whatever we're passing through here is an actual integer; otherwise it will raise an error for me. But before that, I can also say: if start_year equals None, just set it to the current year, whatever this year is. With that, I can replace our URL with an f-string and substitute in whatever that year is. Naturally I could also make sure it has a certain length: the length of start_year should equal 4, so the year is four digits long. And I could do additional things, like making sure it's a valid year, because the year 3000 is not a valid year. But that's okay, and it's even more okay because up here in url_to_txt, if the status code isn't 200, it returns an empty string. What will happen is that it looks for the table inside an empty string, finds nothing, and we only proceed if that table list has a length of 1. So we're pretty safe even if the wrong start year ends up in here. Oops, I should actually place start_year into the f-string, with the year there on the end of the URL.

Notice I haven't actually done anything with years_ago yet, but that's okay. I'll run python3 scrape.py, and it shouldn't do anything, since nothing calls run. So let's make sure I call it, in the main block: if __name__ == "__main__": run(). I'm not taking any arguments yet. Save, delete the data directory, run... and I get "object of type int has no len()". Of course it doesn't; we need to turn it into a string here. You can't count the length of a number; the only way is to turn it into a string first, and then you have the length. That's a good error to talk about. Run again, and as we see here, it created the file for the current year and all of that.

What I actually want is to iterate through all of this: take the start year, then use years_ago to subtract from it, since I've made sure it's an integer with four digits. So: for i in range(0, years_ago + 1). I also might want to assert that years_ago is an integer as well, just in case. With this loop I'll tab the body in, and each iteration uses whatever the current start_year is; at the end of the iteration I subtract 1 from it, so it goes through all the years_ago iterations, subtracting one each time, obviously starting with the initial start year no matter what that is. And I'll print out a string saying "finished" and maybe the year. Save it, and I'll run the scrape again.
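A sketch of run as it stands after those fixes (the year-substituted URL is my reconstruction of the pattern shown in the video):

```python
import datetime

def run(start_year=None, years_ago=10):
    if start_year is None:
        start_year = datetime.datetime.now().year
    assert isinstance(start_year, int)
    assert isinstance(years_ago, int)
    assert len(str(start_year)) == 4   # len() needs a string, not an int
    for i in range(0, years_ago + 1):
        url = f"https://www.boxofficemojo.com/year/world/{start_year}/"
        parse_and_extract(url, name=str(start_year))
        print(f"finished {start_year}")
        start_year -= 1                # step back one year per iteration
```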
Oops, that should be start_year; my mistake. Okay, let's try again, and now what it's doing is iterating through all of those years: it scrapes 2020, then 2019, then it'll be 2018, and so on, through all ten of those years. If we did everything correctly, we look in the data folder and, what do you know, there it all is. I could go one step further and pass in arguments coming from the command line itself, if we are so inclined, so let's do that. I'm going to stick with sys; there is another way to parse arguments that we haven't covered yet, so I won't talk about it just now. If you remember correctly, we use sys.argv, the list of arguments on the command line. scrape.py itself, the actual file we're running, is considered an argument, so we want argument 1, and I'll also grab argument 2. What to call them: year and... maybe start and duration, that's probably a little closer, or maybe count, as in the number of years. And I want to turn both into integers. A really simple way to set this up: try: start = int(start), with an except block that sets start = None; then try: count = int(count), and otherwise default count to 1. In run we'll take these two arguments and map them slightly: count becomes years_ago and start becomes start_year. And there we go; it has now finished all ten of those years again. So let's try: python3 scrape.py 2005 5. There are my two arguments. Hit enter, and what I should see is hopefully 2005 showing up as it prints "finished 2005", and there it is. Pretty cool.

So that is now scraping all of that data. Realistically, we're probably not going to need to scrape the historical data on a regular basis; it's more this year's data that we need, because 2019 is done and none of that data is really going to be updated. Although, if a year-old dataset is still being updated, as they calculate delayed reporting or fix reported sales numbers (that happens too), every once in a while you might periodically go back in time. But realistically, you'll probably want to change the default years_ago to just 1. So with that default, if I come in and just say 2020... oh, here's our first error: "list index out of range". That's saying, hey, you've got to make sure you're passing in an argument of some kind. So instead of reading the arguments blanket like that, I'll move the sys.argv access into the try blocks too, still converting to integers. This means I don't have to pass in any arguments at all, because it will fall back to the defaults. Run again, and now it does 2020, finishes 2020, and that should be it, right? Except in this case it also did 2019. Why is that? Well, hopefully it makes sense: it does this year and one year ago. If you change that count to zero, it's only going to go through one year.
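The argument handling, sketched with the argv access moved inside the try blocks as just described:

```python
import sys

if __name__ == "__main__":
    try:
        start = int(sys.argv[1])   # sys.argv[0] is the script itself
    except (IndexError, ValueError):
        start = None               # fall back to the current year
    try:
        count = int(sys.argv[2])
    except (IndexError, ValueError):
        count = 1                  # default: this year and one year ago
    run(start_year=start, years_ago=count)
```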
So again, with years_ago at zero, I get just 2020, and that will give me only 2020. And if I try a year that doesn't actually exist yet, like 2021, I'm probably not going to get any data here; right, it says "document is empty". Okay, so that gives me another thing to handle: the HTML document being empty. Instead of returning an empty string in url_to_txt, I'll just return None, and then in parse_and_extract: if html_text equals None, return. Right in the function itself, so it won't go through the rest of it. Save, run again... and there you go: "finished 2021" and so on. So "finished" should probably be more accurate about what actually happened here. At the end of parse_and_extract I'll say return True, as in it did finish, it went through the loop. And now I'll change that length check a little: if the length of the table is equal to zero (or rather, if it's not greater than zero), we'll just go ahead and return False, and yet again I'll return False up where html_text was None, or something along those lines. This means that down in run I can say finished = parse_and_extract(...), and then: if finished, print the finished message, else print something like "{year} not finished" (or "not found", "not done"). Save, run again, and there we go; that's probably a little better. And if we say 2018 again, this time it should go through fine and do all of the things it needs to for actually scraping that web page.
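For reference, the None/True/False plumbing described above, as a sketch:

```python
import requests
from requests_html import HTML

def url_to_txt(url, save=False):
    r = requests.get(url)
    if r.status_code == 200:
        return r.text
    return None                     # None now signals "no document" cleanly

def parse_and_extract(url, name="2020"):
    html_text = url_to_txt(url)
    if html_text is None:           # bail out before parsing an empty document
        return False
    r_html = HTML(html=html_text)
    r_table = r_html.find(".imdb-scroll-table")
    if len(r_table) == 0:
        return False                # no table found for this year
    # ... same parsing and CSV writing as before ...
    return True                     # only reached once the CSV is saved

# In run(), report honestly:
# finished = parse_and_extract(url, name=str(start_year))
# print(f"finished {start_year}" if finished else f"{start_year} not finished")
```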
As far as the code is concerned, there are a number of things we can consider here to improve it. Number one: perhaps we want to extract the table lookup into its own function, looking for the actual table, whatever that might be. Or even a function that extracts all of the table data along with the header and then does something with that data: actually saving it, storing it, something like that. Or instead of writing it to a CSV file, we could add it into a database of some kind, like a Django database, a Django model. Another thing we could do is reconsider how the data itself is handled. If you remember back to when we were doing all this, the header names have iterations: each index item for any given header corresponds to the same position in the row below it. So you could use key-value pairs instead of just appending data. What I mean is, in this inner loop, I'll even get you started: we'd say header_name = header_names[i], for whatever that index value is, and then row_data[header_name] = col.text. Now, the reason I didn't fully do this is that if you look at the header, the very top has a parenthesized "%" on two of these columns, meaning domestic % and foreign %. So what you're going to find is that your dict version won't match the list version. Let's rename it row_dict_data: the row_dict_data is not going to match the row_data. As you probably already know, this is a little redundant: if I set the same dictionary key twice, it's just overriding the other value, so it's not going to be accurate. Or you could even consider deleting that column altogether. That's actually a pretty good challenge for you to solve: changing those two headers into their corresponding domestic and foreign percentages.

Now, if you do end up using a dictionary, a row of dictionaries, which you absolutely can, you'd still want to append each one to a list for your DataFrame. So up top I'll call that one table_data_dicts, as in dictionaries, and down here I'd append row_dict_data to it, just like that. And for this DataFrame, instead of the list-of-lists version, we could then pass just that list of dictionaries, because of how cool pandas is: it can convert dictionaries really easily into DataFrames. Dictionaries can also be used to create Django models, or with SQLAlchemy; you can unpack a dictionary to match the table fields with the key-value pairs the dictionary has, which is also really cool. I'm going to leave these things commented out, so if you ever come back to this you'll see the comments, but I'm not actually going to show you exactly how to solve that problem. I think it's a really good challenge, and I'm hoping you're at a point now where you can actually address it. If you do take on that challenge, feel free to submit it on GitHub as a pull request and we'll take a look, or you can submit it as an issue.
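And a sketch of that dictionary-based variant, with the colliding "%" headers deliberately left unsolved, since renaming them is the challenge:

```python
table_data_dicts = []
for row in rows[1:]:
    row_dict_data = {}
    for i, col in enumerate(row.find("td")):
        header_name = header_names[i]
        row_dict_data[header_name] = col.text  # duplicate "%" keys overwrite each other here
    table_data_dicts.append(row_dict_data)

# pandas converts a list of dicts straight into a DataFrame:
df = pd.DataFrame(table_data_dicts)
```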
Okay, so that's it for day 12. I realize we did a ton here, and hopefully you got a lot out of it. The key takeaways: number one, you can always just start by saving the data and parsing it later. That's actually a really good method, because it's a lot easier to make sense of data once you have it than when you don't; of course that makes sense. And the more data you collect, the easier it's going to be to make sense of it later. So honestly, even if you stopped at saving and said, "okay, I don't know parsing yet, I just want to do some more web scraping," that's completely okay. The next thing: this is where some understanding of HTML and CSS comes in, for being able to actually parse out the correct data. But as you've seen, the tools for parsing are very simple, or at least they're simple now; they used to be more complex, and certainly even requests-html can get more advanced than this. This was just simple usage, but sometimes doing things simply is better, or easier, than doing them at the more complex level, and that was my hope here. That's also true of how I looped through all these rows: I made those lists specifically to keep it as simple as possible, and that's why I left the data dictionaries for later, for that exact reason. And of course the run part doesn't have to be done this way; you could pass the arguments directly into the extraction. I just did it this way because I know that, in general, I'm going to want to scrape multiple things, not just one, and this is a way to start thinking about how to do that, with a single function that can scrape and save everything. That part is really cool as well.

All right, well, thanks for watching day twelve. We went through web scraping because the website we looked at didn't actually have what's called an API. APIs make it super easy to grab the data that a website or service might be using, and that's actually how so many of the applications you know and love today are so powerful. Why do they have so much data? In some cases, yes, they use web scraping; in other cases they use third-party APIs, other people's data. Think about it this way: if Box Office Mojo, the website we used, had an API that just gave us this data, would we have had much to talk about? Well, probably not that much, because APIs make it unbelievably easy. And that's actually what we want to do very soon, so make sure you stay with us.
Info
Channel: CodingEntrepreneurs
Views: 20,595
Rating: 4.9846449 out of 5
Keywords: djangourlshortcfe2018, install django with pip, virtualenv, Django Web Framework (Software), Mac OS (Operating System), Python (Software), web application development, installing django on mac, pip, django, beginners tutorial, trydjango2017, install python, python3.8, django3.0, python django, web frameworks, install python windows, windows python, mac python, install python mac, install python linux, pipenv, virtual environments, 30daysofpython, beginner python, python tutorial
Id: 5u391bX9FVE
Length: 61min 41sec (3701 seconds)
Published: Fri Mar 27 2020