Python Automation: Downloading 95 Excel Files in Seconds

Captions
Hello and welcome back to another episode of Making Data Useful with Adam. I'm back here on data.gov.au, and I thought it would be a great opportunity to jump in, understand some of the data that's available, and go through a bit of an end-to-end analysis: from data collection, to transformation, to visualization. In this video we'll start with data collection and see how to overcome some of the challenges we face when collecting data. Not all data sources are created equal, and sometimes we have to deal with a lot of Excel files. I've got a really good example for you today.

I've navigated over to the Taxation Statistics 2019-20 dataset. It looks like there's some really interesting data here that we can slice, dice, put into different dimensions, and have some fun with. The key challenge is that it's almost 100 xlsx files. There is a download button we can click, but being data people we don't want to sit there clicking it 100 times; we want to automate that. So in this video we're going to download every single one of these files in an automated way. Let's get started.

We're going to take this particular data.gov.au URL. I'll create a new notebook and call it "data.gov.au auto downloader", a little bit of an original title there. Let's go ahead and import requests. If you're new to the channel, please subscribe; there's more content coming. If you haven't seen what's on my screen before, this is a Jupyter notebook. I've installed it via Anaconda, but you can install it through other means too. Jupyter notebooks are an awesome way to interact with Python, I'm a big fan of them, especially while I'm in discovery mode.
It's really helpful to be able to open a notebook, shift-enter through cells, create new ones, and explore. We're going to be using requests today; if you've seen any of my previous videos you know I'm a big fan of it, it's a great way to get information from the internet.

First I'm going to define the URL. We could make this more dynamic later and point it at different pages, but for now we'll leave it as is and not optimize too early. Next I'll create a variable called response, which holds the response object from requests.get(url). It's arguably poor practice not to pass headers (the defaults essentially announce "hi, I'm Python"), but data.gov.au is a data website, and data people are going to want to automate these things, so I doubt it will block us.

Shift-enter on that and let's check the response: we get a 200, which is awesome. Now .text should bring back the entire HTML of the page, but this one looks a little light on. What's happened? "For full functionality of this site it is necessary to enable JavaScript." Ah, the page needs JavaScript to render its content. I could fight the website and try to work around that, or I could just use Selenium: a web driver that boots up a real browser, which has JavaScript enabled, and navigates to the website for us. Not ideal if you want things fully automated and headless, but for our purposes it will be a bit of fun. Let's jump into Selenium's getting-started guide, grab its example, and talk through it.
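That first attempt can be sketched roughly like this. The dataset URL here is an assumption (use whatever page you copied from your address bar), and the little helper just checks for the "enable JavaScript" message we ran into:

```python
import requests

# Assumed URL of the dataset page -- substitute the page you are on
url = "https://data.gov.au/data/dataset/taxation-statistics-2019-20"

def looks_like_js_wall(html: str) -> bool:
    """True when the response is a 'please enable JavaScript' shell
    instead of real content, meaning plain requests is not enough."""
    return "enable JavaScript" in html

try:
    response = requests.get(url, timeout=10)
    print(response.status_code)               # 200 means the server answered
    print(looks_like_js_wall(response.text))  # True on the page in the video
except requests.RequestException as exc:
    print("request failed:", exc)             # e.g. running offline
```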
We're going to import the webdriver; this is the thing that actually drives Chrome. I prefer Chrome, and I also like webdriver-manager: Selenium works through a ChromeDriver binary that has to match your installed version of the Chrome desktop application, and webdriver-manager makes sure that's set up correctly. First, in a new cell, I'll run !pip install to make sure the packages are installed. Great, that's done. Then, following the docs: from selenium import webdriver, the webdriver-manager import, and driver = webdriver.Chrome(...), which does the automatic driver install. We don't need the find-element examples; instead we'll call driver.get() with our URL. Trust me, this is going to be fun. I missed a bracket at first, but after fixing that: "getting the latest driver... driver has been saved in cache". And as you can see, it's popped a window that says "Chrome is being controlled by automated test software".

Now that the page is open with JavaScript running, I should be able to grab the HTML. driver.source? Nope, that's wrong. driver.Source? Nope. Let's Google "selenium get page source"... and you know what's funny, I'm pretty sure there's a previous video where I've actually done this correctly. Here we go: how to get the entire page as the browser sees it.
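The Selenium setup from the docs looks roughly like this. It's a sketch assuming selenium and webdriver-manager are installed via pip; newer Selenium releases can also fetch the driver themselves, in which case a bare webdriver.Chrome() works too:

```python
def fetch_rendered_html(url: str) -> str:
    """Open Chrome via Selenium so the page's JavaScript runs, then
    return the fully rendered HTML (what driver.page_source holds)."""
    # Imports live inside the function so the script can still be
    # loaded in environments where Selenium is not installed
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get(url)
        return driver.page_source  # an attribute, not a method
    finally:
        driver.quit()              # always close the popped window

# Usage (opens a real browser window):
# html = fetch_rendered_html("https://data.gov.au/data/dataset/taxation-statistics-2019-20")
```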
It's page_source, good one Adam. It's an attribute, not a method, so we run driver.page_source and now we've got some HTML. I'll assign that to a variable: html = driver.page_source. Then driver.close(); double-check the docs, yep, that's right. The popped window is now gone, but what I'm left with is the HTML. This can all be automated further, by the way; there are ways to stop the browser window appearing at all so everything happens in the background.

Now I'll tidy my imports at the top and add one more: from bs4 import BeautifulSoup. For those who are new (again, please subscribe, it helps the channel), Beautiful Soup is an amazing web scraping package that lets you interrogate HTML really, really easily. There's a heap of different ways to use it; I'm probably a little old school, still using the find methods. So let's create a new variable: soup = BeautifulSoup(html). If I don't pass anything extra, it defaults to a parser, meaning it just figures out how to parse the HTML. Now if I go soup.title I've got the title element of the web page, and soup.title.text gives me its text. Cool. Going back to the web page, I'm almost impressed by how quickly we were able to switch over to Selenium. Actually, I take that back.
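Here's the soup step in isolation, with a tiny stand-in for the real page source so it can run on its own:

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source, so the parsing step runs by itself
html = "<html><head><title>Taxation statistics 2019-20</title></head></html>"

# Naming the parser explicitly avoids BeautifulSoup's warning; with no
# argument it picks a default parser for you, as in the video
soup = BeautifulSoup(html, "html.parser")

print(soup.title)       # the whole <title> element
print(soup.title.text)  # just its text
```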
I'm not impressed by myself; I'm impressed by Python and how quick and easy it can make things. It's a really powerful tool. Let's quickly recap. We import Selenium, whose main part is the webdriver that pops a browser window for us (webdriver-manager saves me the headache of matching ChromeDriver to my version of Chrome), and from bs4 we import BeautifulSoup, which we're about to dive into. I first define a URL, which is nice because I can structure this to accept any data.gov.au URL and download whatever files it offers. driver is the variable we assigned the browser object to; it goes off and gets the URL. Then we ask for the page source, assign it to our new variable html, and close the driver; that part of the process is done. Then we create soup, which is where we start to interrogate the page, the page being the html, which, if you recall, is the page source.

Now, the page lists a whole bunch of files and APIs, but it's nearly all files; I think there are only about five APIs, and I don't even know what they are. To work with this, I'm going to right-click and Inspect. What I'm looking for is the row that encapsulates both the title (I want to use that as the file name, just so I know what each file is actually about; I'm sure the file names aren't too terrible, but you can't be too careful) and the download link itself. Let's have a look: the title is a link into the file's page, but the actual download... ah, as you hover, it highlights different elements. So this element here, the
span, which is oddly classed "no-print". Inside it we have the dataset entry, and that has a file name: it says download, and "tax stats 2020 index". Not bad, I might accept that. Let's inspect a different one: if it looks like "individual finance ratios" we'll accept it... "ts20-financial-ratios-1-individuals". These files are actually pretty well named, so, call me lazy, I'm going to accept them.

To do this we'll make it really easy on ourselves: soup.find_all("a"). What I mean by that is every hyperlink, everything with an href, is going to be made available to us. Then, since I'm not interested in every single link on the page (there are a bunch of links here that aren't datasets), why don't we keep only the ones that start with the dataset part of the URL? Let me double-check the other links aren't formatted the same at the beginning, because that would ruin my plan; inspecting one, it's just /dataset, so I think this is the way to go. Give me everything that starts with the https://data.gov.au dataset prefix. Oh, I clicked it and it downloaded; that was a happy little accident. Let's copy the element... that's a whole thing, that's fine; we'll grab the prefix part, because it gives us the start of each download URL. To keep my expression from getting too long, I'll assign that prefix to a variable, itself a bit long, called start_of_url. Right now, the find_all call
results in basically a list (you can tell by the square brackets), a list of elements where each element is the Beautiful Soup representation of a link, every single link on the page. So I'm going to use a list comprehension: link for link in that find_all result. Run that and nothing actually changes yet, but we now have access to each element through the variable link. Now I can put an if in there: if link starts with start_of_url... and I get "'NoneType' object is not callable". That's annoying, but actually it's my fault: I can't call startswith on the link element itself, I have to access the href inside the link, so it's link["href"].startswith(start_of_url). Shift-enter: beautiful.

It's a bit tricky to read, but I really like list comprehensions; they keep everything down to a single line, and when one does get a bit long you can chop it up and put certain elements into their own variables so it becomes a lot more readable. So: start_of_url is the prefix, and I'm saying link for link in this find_all of links, if the link's href (which, if you look at the actual website, each <a> really does carry) starts with that prefix. That's something I love about Beautiful Soup: I ask for all the <a> tags, and then I can pull out href, or target, or rel, whatever I want, just by using square brackets, which I think is really handy. So now I have what I believe to be, and I'm very descriptive in my variable names, all_the_download_links. I think. Let's actually check by asking for the length of this: 98, which is pretty
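Putting the find_all and the comprehension together on a miniature version of the page (the prefix string is an assumption copied from the inspector; the live page's markup will differ):

```python
from bs4 import BeautifulSoup

# Three links standing in for the ~98 on the real page
html = """
<a href="/about">About</a>
<a href="https://data.gov.au/data/dataset/abc/download/ts20index.xlsx">Index</a>
<a href="https://data.gov.au/data/dataset/abc/download/notes.docx">Notes</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Prefix shared by the download links (assumed; copy yours from the inspector)
start_of_url = "https://data.gov.au/data/dataset"

# Keep only hrefs that start with the dataset prefix; using .get()
# rather than link["href"] also survives <a> tags with no href at all
all_the_download_links = [
    link["href"]
    for link in soup.find_all("a")
    if link.get("href", "").startswith(start_of_url)
]
print(all_the_download_links)  # the two download links, not the /about one
```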
decent, on the assumption all of them are xlsx files. Hang on, the page said 95 but I've got 98, so let's have another look at this thing. I'm also probably not going to want the whole Beautiful Soup element; I want the href out of each one, so let's tweak the comprehension ever so slightly to ask for link["href"]. Shift-enter... why isn't that working? I'm expecting it to return just the href, but it's not. Adam, think about this... oh, I didn't re-run the cell. So there's a hot tip: if you're looking at the data and it seems wrong, make sure you shift-enter again, because stale output will tell you something different.

Now that the first pass is done, let's pick the first element and use a bit of string slice notation. Something as simple as [-4:], the last four characters through to the end, gives us "xlsx"; make it [-5:] and we can ask does it equal ".xlsx", and it says True. It's very similar to the first pass, and I could fold it in with an and, but the data isn't huge, so it might be easier on our eyes as a second pass: link for link in all_the_download_links (no more [0], since we're talking about all elements), if the last five characters equal ".xlsx". Anything that's True will be returned. Before I had 98 links; I might still have 98, to be honest... I've got 95, which is the expected number. That tells me there were a couple of files that are
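The second-pass filter, shown on two sample links. `link[-5:]` slices the last five characters; `str.endswith(".xlsx")` is the more idiomatic spelling of the same test:

```python
# Two sample links standing in for the 98 collected from the page
all_the_download_links = [
    "https://data.gov.au/data/dataset/abc/download/ts20index.xlsx",
    "https://data.gov.au/data/dataset/abc/download/notes.docx",
]

# link[-5:] is the last five characters, e.g. ".xlsx" or ".docx"
xlsx_links = [link for link in all_the_download_links if link[-5:] == ".xlsx"]
print(xlsx_links)  # only the Excel file survives
```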
not xlsx, but .xlsx is what we're looking for, so for now we'll leave it at that. Like I said, this could quite easily be combined into one comprehension, the startswith condition and the slice condition joined by an and; let's actually do that, rerun the whole thing, and test it. This is very strange: by this point I've usually hit a number of errors, so I'm kind of freaking out, but now we have our list of links, which is really cool.

So what do we do with that information? We actually want to go ahead and download every single one of these files, because that's what we're here to do: download 95 Excel files. We'll use a for loop: for link in all_the_download_links. As a test, let's print the first link and put a break straight after it. That gives us the first URL, which is really cool. With this I'll do some testing just outside the loop, and that's a neat thing about Python for loops: even variables assigned inside the loop keep their last value after it ends, so link is still available outside. Now, earlier we tried requests to download the HTML page, which didn't work because the page needed JavaScript turned on, so we fell back on Selenium. But for the files themselves, I think we can use requests to download the binary object directly. Let's start with response = requests.get(link)... after a couple of typos on my part, we get a 200. Awesome. Now if I was to ask for .text, I would
get some very weird output, the same thing you'd see if you tried to open an Excel file in Notepad: a whole heap of binary data rendered as text. It isn't text, it's an Excel file; in fact it's an xlsx file, and we know that because we filtered on .xlsx. So instead we ask for .content, which gives us the raw binary, and from that we can save the file to disk.

First we need a file name. Looking at a link, the path segments are separated by forward slashes, so we can split the link on "/". That gives us a list, which is really handy, and we know for a fact that the very last element of that list, index [-1], is the actual file name. I'm having way too much fun. So: file_name = link.split("/")[-1]. We've now got the two key components, a file name and the binary object via response.content, so we can write the file: with open(file_name, "wb") as f, that's write-binary, then f.write(response.content). Come on Adam, use your brain... there we go. That's written the file to the file system. To quickly test it, we can import os and run os.listdir(), and I can see the file now exists. Let's pull that one out
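The single-file download, wrapped into a function. This is a sketch: the raise_for_status call and the timeout are additions beyond the video's bare requests.get:

```python
import requests

def download_file(link: str) -> str:
    """Fetch one file with requests and write its raw bytes to disk,
    returning the file name it was saved under."""
    file_name = link.split("/")[-1]        # last path segment is the name
    response = requests.get(link, timeout=30)
    response.raise_for_status()            # fail loudly on a 404 etc.
    with open(file_name, "wb") as f:       # "wb" = write binary
        f.write(response.content)          # .content is bytes, .text is str
    return file_name

# Usage (hypothetical link):
# download_file("https://data.gov.au/data/dataset/abc/download/ts20index.xlsx")
```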
for a moment, because what I'm looking to do is import pandas as pd and call pd.read_excel on it. Paste the file name in just for testing, shift-enter, and it should return a DataFrame of the first sheet, which it did, which is awesome. The first sheet looks like a table or an index of things, so that's really exciting. What I can now do is really just loop through and download all of them, so let's package it all up into one neat little loop: for link in all_the_download_links: response = requests.get(link), file_name = link.split("/")[-1], with open(file_name, "wb") as f: f.write(response.content). That's really neat. To make things a bit more fun we'll also print each file name as we go, then power through them. It's really quick, so I'm going to sit back, maybe fast-forward this a little, and I'll see you guys shortly.

Alright, that was really quick. If all has gone to plan, when I import os and run os.listdir() again there's now a bunch of files there, which is fantastic. And that is how we go ahead and download a heap of Excel files. I don't want to drag this video on for too long, so I'm going to make a few follow-up videos on this exact project; please subscribe if you haven't already, and if you have, thank you so much. Drop a comment to let me know what you want to see from this ongoing little series on these datasets. I'm thinking we go end to end: data collection, data cleaning, munging, transforming, visualization, and even a look at other tools like Power BI or something similar, prepping the data so we can build a cool dashboard. Leave a comment, let me know what you want to see, and I'll drop this
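The whole loop, packaged up. The placeholder link is hypothetical; in the notebook, all_the_download_links comes from the earlier filtering step, and the try/except is an addition that keeps one failed download from stopping the other 94:

```python
import os
import requests

# Placeholder -- in the notebook this list comes from the soup filtering
all_the_download_links = [
    "https://data.gov.au/data/dataset/example/download/ts20index.xlsx",
]

for link in all_the_download_links:
    file_name = link.split("/")[-1]
    print(file_name)                      # progress as we go
    try:
        response = requests.get(link, timeout=30)
        response.raise_for_status()
        with open(file_name, "wb") as f:
            f.write(response.content)
    except requests.RequestException as exc:
        print("skipped:", exc)            # carry on if one file fails

print(os.listdir("."))                    # the downloads should now be here
```

Once a file is on disk, pd.read_excel(file_name) hands back a DataFrame of its first sheet, which is a quick way to confirm the download is a valid workbook.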
URL into the description so you can check it out and see if there's anything really interesting you'd like covered. For the next video we'll definitely be going into how to automate the discovery of all this data: we've now got 95 spreadsheets, which isn't a great place to be, but I'm already noticing file names that make a lot of sense to group together. How do we do that, or at least how do we detect what metadata is in them, and work through it from there? So yeah, drop a comment and let me know what you want to see. I'm really excited to start digging into this data; I think it's got some really interesting insights. Even if tax data isn't the most fun subject for some, it's quite interesting to see the different ways we can slice, dice, and cut it, because the skills we pick up here can be transferred to anything. So again, thank you so much for watching, I really appreciate it, and if you're not a subscriber, please hit that subscribe button, it really helps the channel. Talk to you soon.
Info
Channel: Make Data Useful
Views: 8,087
Keywords: python data science, python data science project, python data science interview questions and answers, python data science course, Python automation, Learn Python
Id: jVz8mBRPOmY
Length: 24min 49sec (1489 seconds)
Published: Sun Oct 23 2022