How to scrape map data with Python | How to make money with Python Episode 5

Captions
Hey everyone, thanks for tuning in to episode five of Making Money with Python. Today we're going to learn all about how to get the underlying data that's available in a Google map. We're going to be using the Python requests library, Beautiful Soup, and a little bit of pandas as well.

Before we start, a huge thank you to all my subscribers and everyone leaving comments. I do have that big hairy audacious goal to get to 1,000 subscribers by the end of 2020. We've just kicked off August and I'm already over 600, so fingers crossed I can make it. But yeah, thank you once again, I really appreciate it.

Let's get down to business. What do we have here? "Scrape data from map points on a map on a website." Cool. "Build web scraper app in Azure or AWS to collect the underlying data from the following map into a table." Interesting, they're specifying Azure or AWS, so it sounds like they want it cloud hosted. I'd probably just pop that into something like PythonAnywhere on a schedule. "Collect all details including company name, address, description, website, email, etc., and the label information such as No Walls MFR, Walls MFR, etc." Cool.

So this is the website we've got open, Allianz Housing Innovation. Alrighty, here's the map, looking good. It looks like a Google map, but it says ZeeMaps, which is interesting, and already it's firing off and giving us some data. Now, obviously this web page isn't refreshing, so there's a good chance this is coming through as network data. If you're familiar with my previous episodes, what we do here is press F12, open up our Network tab, and we're most interested in this XHR and Fetch filter. These are all the sort of private API calls that websites make that allow you to get new information without actually refreshing the page.

Let's have a little scroll around. Let's start by refreshing the page. All right, give it a refresh. What's happening? Cool, a couple of things are happening. I like the idea of a get, get, "get all" even, that sounds good: zeemaps.com, legends, get all legends. Ah,
okay, this is interesting. If I scroll down, wow, look at all that network traffic. This might be a bit hard to navigate, so let me just make this a bit smaller. Here's what I think are the legends, and we'll see if we can turn them all on. This "get all legends" call is quite an interesting one: c1, id 103, r false, value "Walls MFR"; c2, c2, 1, false, "No Walls MFR". What I'm thinking is this might be a bit of a translation table. Okay, let's get to number eight, yep, and then number nine. It looks like number nine has no value, and none of them have a value beyond that point. So straight away I'm thinking this API call might be useful in a moment, when we get the main set of data, to be able to translate it. And this id here, I imagine, is probably the id of this particular map installation from ZeeMaps.

So where to from here? There's a couple of things we want to do. We obviously want to understand where the data is coming from, and we want to get hold of these ZeeMaps legends. And I've just noticed that every time I hover over a little map icon, this thing goes absolutely crazy and starts spitting out API calls. Oh great, which looks to be the underlying data of each of those map points. Okay, breathe, Adam, I'm getting all excited again. I think we're in good shape, so let's just go ahead and give this a red hot crack.

Let's start off by getting this legend, just for funsies. The easiest way to do that is right click and Copy as cURL. Then we go to our trusty curl converter, one of my favorite websites, link in the description, and let's have a look at what this does. If you've seen my previous episodes, I talk a bit about how curl works and how this website works, but in short it takes a curl command, which you'd typically run in the terminal on Linux or a Mac, and translates it into a Python request. And this website is great if you're working under
different languages: it'll translate it into different languages as well. So I'm going to copy the code, starting from import requests, and pop that into our Jupyter notebook.

Now that we've got that in the Jupyter notebook, the first thing I'm going to attempt is to take off the cookies, to see if we can get this thing to run without them. So cookies, yeah, we don't want cookies. Now if you look at this request, you'll notice it says verify=False. That's purely because it's using an http site, not https. We can test that and see if we can just add https to this website: https colon slash slash, and no, it can't be reached. Weird. So the Alliance Housing Innovation website only allows http. Already that's a bit sus to me, but that's cool. So we've got the headers, we've got these parameters, and this is just passing in the g, whatever that is. I assume that is that instance of the map, so it's the one that's unique to Allianz. zeemaps.com, legends, get all the legends. Sounds funny.

All right, so Shift+Enter on that. The first thing we're going to do is have a look at that response, and what we're looking for is a 200, which we have, which is fantastic. We are expecting some JSON data, but let's ask for the text first. We know it's text because it does have these single quotes around it. But because that does look like JSON, we can ask for the JSON data, and now what we have is a nice diction... sorry, we have a list, almost got that wrong. We know it's a list because we have square brackets around it, and if we ask for its type, we can see it's a list.

Now what's great about this list is it has all the data we would ever want. Everything's false, so I'm going to ignore that. We have a bit of an id, and then the id has a value, and that value appears to be the actual name. The id appears to be repeated, and I'm not entirely sure what these numbers are about, but I think once we collect the main data we'll understand that. What we might
do, just to make our lives a little bit easier, is create a very quick function. Okay, define a function, and we'll just call it get_legends. I'm doing this because I want to be able to divide up our code and maybe call upon this at a later point, and for now I don't want to get too much code all kind of everywhere. So we just make a function that gets the legends. What we're actually returning is the response JSON, so we just say return response.json(). Shift+Enter on that, and all that means is I can now simply call upon this function, which is this repeatable code, by putting brackets after it. And we might give the result a name, so we'll call this one legends. Oops: legends is equal to get_legends(). And now when I ask for legends, the variable, I have all that data available to me. So one more time: we've taken that GET request, super simple, we've wrapped it up in a function, and we're just returning the output of that. Then down here, all we're doing is creating a variable and saying, hey, inside this variable just give me whatever's returned from this function, which in this case is this list.

Now that we have the legends, let's go and get the actual data. Straight away I'm envisioning that there's probably going to be three main API calls: one to get the legends, one to get the main sort of data, and then the final one is for each of those data points, since when you hover over we need all that detail that comes with it. What we might do is clear that and give the map a bit of a refresh, and then have a look at what the data coming back looks like. So there's one here called get annotations, and this one here looks quite interesting, emarkers. There's that same sort of g equals "three eight" number which we had before, and we've got a k equals. Let's go ahead and just copy that out and see what we're working with. Same as before, really simple: Copy as cURL, pop it into our curl
converter, and we're going to do a very similar exercise to a moment ago. We're going to ignore those cookies. We don't need to import requests again because we're doing it at the top. Here we've got our headers, which are very similar, in fact the headers are probably identical, but for now we're just going to wrap them up into the function just in case they are a little bit different. We're going to get rid of these cookies. verify=False, okay. g makes sense to us now, we know that g is the sort of mapping instance. k "regular", not entirely sure. e "true", not entirely sure. And dc, I imagine, is probably some sort of zoom level, maybe. We can play around with the map and see what it does differently, but let's go ahead and test out this API call.

Shift+Enter on that. First thing we're going to do, similar to before, is check the response: 200, which is fantastic. Then we're going to ask for the JSON data, which might be a bit large, but let's have a look at what it does. Okay, straight into it. Again we're getting a list, and it appears we're getting the address, or part of the address, all the sort of address information. And if you think back to the legend, it feels like we might be getting something new, whether it be b or f. Again, we can play around with this data, do some reverse engineering, and work out what each of these ids means. Now this is really good data, very structured data, and each of them appears to have an id.

So what I might do is again wrap this up in a function. We've got a function called get_legends, so let's create a function called, it's going to sound really lame, get_map_data. And of course we've got a def, so def get_map_data. We don't need to pass anything into this function, we are just going to run it as is. Four spaces in, or in this case a tab. Tabs aren't always the best in Python, four spaces is recommended. And we're going to return
exactly the same as before, but this time around obviously we are querying the API a little bit differently. So what that looks like is return response.json(). And so what we have now is the ability to say something like map_data is equal to get_map_data(), which is our function. Shift+Enter on that. What we're doing here is anything that is returned from this function will be passed into our variable called map_data. So if you look at map_data now, it's just a really nice, big, long list. The length of that list is 991 map points. Now, if that was exactly one thousand, I would think, okay, maybe we need to zoom in, zoom out and find more. But 991, I feel like that is just all the data.

Now what do we need to do next? We've got a couple of functions, and we can just pop these up here. Alrighty, so we just run them next to each other, that's okay. What we've got here is import requests, a get_legends function, and a get_map_data function. What do we need to do next? For each of the points on the map, we want to get more detail. So how do we do that? Let's go ahead and look at the map. Keep a close eye on these requests here as I hover over one of these points, maybe that one there. Boom. Straight away, do you see that? We then got all of the detail in that pop-up. This includes some really rich information, and some of that rich information is actually in HTML. So what we might need to do is use a package like Beautiful Soup to go and extract some of those key values from the HTML.

But first let's figure out how we can get all 991 pieces of information. To do that, very similar to before: right click, Copy as cURL. Noticing a theme here? Okay, pop it into our Python requests, and let's go ahead and have a look at the response. Similar to before, we are going to get rid of any sort of cookies. The website seems to function okay without them. The parameters here look
interesting. Now when I use the curl converter, whenever I see any sort of forward slashes I usually get rid of them, and the reason for that is they usually break my queries. So what do we have here? We have g, we've got the "three eight" number, and it appears twice in the list. I'm not entirely sure why, but it is very consistent. There's the "three eight" number there, ending in nine triple one. That's from get_map_data, and you can see three eight three nine triple one there. So for whatever reason, in this particular GET, the more detailed version of the data, it would appear that it is a list and it's appearing twice. I'm not going to change that. It seems to work, so we'll leave it as is.

What we'll do is test this out, though. So let's Shift+Enter on this one and, you guessed it, similar to before, let's check our status code. Fantastic, we've got a 200, which is very good news, and if I look at the JSON data, it's very rich. Now the challenge we're going to face with this particular one is obviously that we're going to have to hit this API 991 times, and each time we are going to have to replace that id. The good news is we know how to get those ids: they come from get_map_data. So the next step for us is to pretty much wire this all together. We've got this really nice query where basically you pass in the id of the map location and it spits back this really rich information for us.

So what we'll do is wrap that up in a function, get, maybe, ad details. I'm using the word ad because I see here they've got ad, so we'll call it get_ad_details. This time around we're not just going to put open and close brackets, we are going to pass in the ad_id. Now ad_id is a variable name that I've just made up, which is locally scoped to this function. All that means is ad_id is available within this function but isn't available outside of it, so keep that in mind. And all we're going to do is pass in ad_id here. Now there are a couple of
different ways to do that. My favorite way in the more recent versions of Python is to use an f-string. If you watch my previous videos, I go into the detail of f-strings, but all that means is we're going to pass in the ad_id. To demonstrate that really quickly: if I had a variable called ad_id and it equaled 123, I can then create an f-string which passes in 123. Alrighty, so nice and easy. Let's go ahead and make sure we get that response, which is, you know what, we'll just copy it from here. Cool.

All right, so what we'll do is bring all of our functions together. We've got get_legends, get_map_data, and get_ad_details. Now get_ad_details requires an ad_id, which gets passed in here through an f-string. To bring this all together, let's start with get_map_data. So we'll say map_data is equal to get_map_data(). That's running now, it's collecting 991 pieces of information, and if I do len we can see that is 991, and we can see what that data looks like, which is this data set here. Now each of those has an id, and from the id we want to be able to store the ad details. We're not going to go too deep and clean up the data as we go, we're going to chunk it all out.

To do that, we are going to do a simple for loop. So for item, and again item is just a name I've provided, you can call it whatever you want, for item in map_data. What we're really after here is the id. So the id is, what's it called, it's called id. Make sure you put your single quotes around that. And what we might say, just for this one, is id is equal to... now id is a built-in name in Python, so we'll go _id is equal to item['id']. So we're just asking for this id here, and to prove that we're collecting that correctly, we'll just go ahead and print the id. Now I've got an error. This always happens. Oh my god, I do
this every time: the 'in'. Oh, I should get a t-shirt: don't forget to put the word 'in' in there. Fantastic, so it looks like we've got 991 unique ids, and what we want to do is query all 991 to get all the ad details. Now in the interest of time, we're not going to do all 991 for this video. Let's just grab the first 20, so we slice 0 to 20, and what we really want to do is just get the ad details for each.

First of all, we are going to create ad_details_list, which is, you guessed it, an empty list. Why do we create an empty list? Because we're going to fill it with each of the ad details. How we do that is really simple. We have this function available to us now, get_ad_details, so inside our little for loop we're going to say ad_details is equal to get_ad_details, and we know we have an id available now, so let's pass that into our function. That's what it's asking for here, and it gets referenced here. Then we're going to say ad_details_list, oops, didn't mean to press enter there, ad_details_list.append, and what are we appending to the list? We are appending the ad_details. Okay, Shift+Enter on that. "name 'get_ad_details' is not defined". Ah, Shift+Enter on this, now it is defined. And we might just put a print statement in here to show that we are going through each id: print(_id). Shift+Enter on that, and off we go. I'm going to speed up this part of the tape. Sort of "tape", Adam.

Okay, that ran nice and quickly, that was great. So now if I look at my ad_details_list, what I actually have available to me, Shift+Enter on that, is, you guessed it, every single ad detail that is available to us. But how do we make this data useful? How do we actually do something with this information? What we're going to do now is break out as much detail of this ad detail as possible. I'm saying "detail" a lot. And the first sort of challenge
I see is this t key. What I mean by that is, if I take the very first item in our list, so the very first ad detail, I just get a single entry, and looking at that I'm thinking, okay, let's isolate t. Shift+Enter on that, and what I'm noticing here is this looks a lot like some HTML, which it is. Looking at this HTML, I feel like we can definitely get some detail from it. For example, we should definitely be able to get out the URL of the website, and ideally we should be able to get some of the description. Something interesting about this HTML: it would appear that the description doesn't have any sort of class. There's class bold, there's some stuff assigned to it, so it can be a bit tricky, but let's really break this down.

To do this, we might say something as simple as for ad in ad_details_list, and I'll put the word 'in' there this time. For now we're just going to print the ad, and then we're going to say break: just stop there, don't iterate any further, just go for the very first iteration. Shift+Enter on that, and now we're working within a loop, which is a really nice place to be, because once you take the break off you're going to go through all the ads. Okay, so let's put that break back on and get down to business. In there we have got that t, so what we might say here is ad['t'], and we might just take off that print for now and call that html. And just so we're clear on what we're actually dealing with here, we'll go ahead and print that. To print something you need to put the word print in there. There we go, now we are printing out some HTML.

I'm thinking we put this through Beautiful Soup. To do that, we might give ourselves another cell, pop it at the top, and run our imports at the very top. We've imported requests, but we also want to import Beautiful Soup, so we're going to say from bs4, oop, import BeautifulSoup. Shift+Enter on that, and now we have the Beautiful Soup library available to us.
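The fetching steps described up to this point (wrap each copied request in a function, pull the ids out of the map data, then request the details for the first 20) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code typed in the video: the host, the endpoint paths, the "ids" parameter name and the MAP_ID value are placeholders, and fetch_json is a hypothetical helper added so the loop can be tried offline with a stand-in fetcher.

```python
# Placeholders only: the real host, paths and map id come from the
# "Copy as cURL" output and are not reproduced here.
BASE_URL = "http://example.invalid"
MAP_ID = "0000000"  # stands in for the `g` parameter seen in the Network tab


def fetch_json(path, params):
    """Hypothetical helper: GET a path with query parameters, return parsed JSON."""
    import requests  # imported here so the loop below can run offline with a fake fetcher

    response = requests.get(f"{BASE_URL}/{path}", params=params, timeout=30)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.json()


def get_legends(fetch=fetch_json):
    # Path name is an assumption
    return fetch("legends", {"g": MAP_ID})


def get_map_data(fetch=fetch_json):
    # Path name is an assumption
    return fetch("markers", {"g": MAP_ID})


def get_ad_details(ad_id, fetch=fetch_json):
    # The f-string drops the id into the query value; the "ids" name is assumed
    return fetch("details", {"g": MAP_ID, "ids": f"[{ad_id}]"})


def collect_ad_details(map_data, limit=20, fetch=fetch_json):
    """Fetch details for the first `limit` map points, one request per id."""
    ad_details_list = []
    for item in map_data[:limit]:
        _id = item["id"]  # `_id` avoids shadowing the built-in id()
        ad_details_list.append(get_ad_details(_id, fetch=fetch))
    return ad_details_list
```

With real endpoints filled in, collect_ad_details(get_map_data()) would play the role of the loop above. Passing the fetcher as an argument is a design choice made here for testing, and it also gives one place to add a delay between the 991 requests.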
So to turn our HTML into a Beautiful Soup object, which will then allow us to extract data from this big blob of text, the simplest way is to create a variable, any name of your choosing, but a lot of people work with the word soup. We're going to say something as simple as soup equals BeautifulSoup. The first thing we need to pass in is some HTML, and the second thing is we do need to specify the parser. Nine times out of ten you are going to be working with html.parser, so stick with that for now and you should be fine. Now all that's happened is we have this soup object available to us. So I could say, hey, go ahead and print the type of object you are, and you can see here it's a class of bs4.BeautifulSoup. All that means is I could do some cool and funky stuff like soup.find('a'). Let's print that out and have a look at what we get back. Let's get rid of the html print for now. soup.find('a'): straight away we're getting back this hyperlink, and if you wanted to get the href, which is the website listed on this ad, it's as simple as square brackets and 'href'. Shift+Enter on that, and now we have the website.

Now why that gets so fun is because I could take off this break, Shift+Enter on that, and we got an error, straight into it. The fourth one has an error because there is no website listed. What we need to do in that case, and I won't digress too much, is just go, hey, let's try and get the website, but in the case of an exception, why don't we go ahead and just pass. Shift+Enter on that, and what this is showing us is, out of the 20, all the ones that do have a website. The try and except functionality in Python is great. You'll find this in lots of programming languages, but the reason I love it is because you can just try something, and hey, it might not work for every single example, but in those that it does work
for, we do have a result. Okay, so let's turn that off for now and focus on getting some of this detail out. We have the soup object, and straight away we have a website available to us, which is this bit of code here. In each loop, one of my favorite things to do is create a small dictionary which is going to store all this information, and then just assign values to it. What I mean by that is we're creating an empty dictionary, then we're creating a key called website, and we're making it equal to what we pulled from the soup object a second ago. Let's have a look at what happens when I Shift+Enter on this. Make sure we get our break back in there, so we only do one iteration. Now when I look at my data dictionary, I have website, and I have a website available to me.

So why don't we go ahead and print that HTML, and let's work through it together. We've got the website out of there. The other thing that the person was requesting was the phone number. Okay, so there's a span with class phone, and there's a phone number there. To get the phone number, I can say soup.find, and we're looking for a span, but a span with a particular class. If I just did find all span, so find_all, you'll see that I found a whole bunch of spans, one of which has the word phone in it. So what we can do there is very simply put some curly braces after that and pass in a key-value pair of class and the class we're looking for, phone. There it is. Now, I know there's only one span with class phone, so I can actually get away with using the find functionality. And you'll notice here I'm actually collecting some of the HTML as well, so what I do in that case is just add .text, and now we have the phone number. Obviously we can clean this further, we can get rid of some of those brackets and the spaces and the dashes, but for
now that should be enough. To add that into our smart little mix here, I'm going to say, you guessed it, the "phone" entry in data is equal to that one. Shift+Enter, and again if I look at my data dictionary, what do I have? I now have a website and I have a phone number.

Okay, so what we've got next is this interesting one where the label "description" is within the span tag, but then the actual description itself is kind of like a sibling of that, off to the right. As everything appears to be kind of in a span, what we can get away with is something as simple as soup.find_all, and we'll use the underscore version. We're going to find all the spans, and here we have description, phone, there's the phone number, so that's not going to help us, services, primary structural material. So I think it's going to give us everything we kind of want. Then what we might say is for item in that, and let's go ahead and just print that item for now. For me, I'm going to ignore this phone number one, because that's just going to put us off a little bit. What we might be able to get away with is something as simple as, I'm just thinking out loud here so bear with me, item.text. If I was to go ahead and print that, item.text should give us: description, phone, services, primary structural material. Cool. But what I really want is the data that sits over to the right. One way I could do that is to say print item.next_sibling. Shift+Enter on that, and it's going to give us everything we kind of want.

The best way in my mind to structure this up, and again this is me thinking out loud, which is a dangerous game, is to dynamically build out a little dictionary. So we might go ahead and say something like data_dictionary, which is a data dictionary, which takes in a key, and I think in this example we'll just let the data tell us the key, so description, even this phone
number one, we can delete that later, and then the value of that would actually be this item.next_sibling. If I go ahead and run that, Shift+Enter, what I end up with is the data dictionary which, whilst the data is quite messy, has some very good stuff in it. It's got the website, which we got before, it's got the phone number, which we got before, and then it's grabbed this description for us, and it's done some really cool stuff. Now I'm quickly noticing that there is this annoying little character, so one quick way to get rid of that is to simply say, hey, we're going to replace that with nothing. Okay, and that errored out for some reason. Why did that error out? Let's think about it. Is it this? It could be this. Let's shift that for a sec. Yeah, it was that. Let's convert this to a string. I think maybe this is still a Beautiful Soup object, so let's go ahead and convert that to a string, and once it's a string then I should be able to replace. Yeah, beautiful. Cool, so the description's looking really nice. I sometimes like to do a .strip() at the end just in case there's white space.

There's a couple of things we don't really want, so what we're doing is adding to this dictionary, and I'm just wondering... I actually think we'll just add it all, even this sketchy, dodgy data, and then later we'll do the final selection of what we really want out of this. To do that, let's go ahead and grab that little for loop, which is going to sit under our main loop, and run that again. Shift+Enter, and Shift+Enter, and we're going to have a look at what this looks like. Now this is all very good, but we do need to make sure that we can run this across all 20 of our examples. So let's take this break off and Shift+Enter on that, and straight away we've got the error, which is a good thing for us, and it's telling us that the website's not available. We've encountered this one before, so to overcome
that on the website one, we will just go ahead and say, hey, give it a try, otherwise, if you are going to hit the except, why don't we, just so we've got consistency in our data, make the website equal nothing for now anyhow. Shift+Enter again. Same with the phone number, so let's go ahead and fix that. One more time we'll go try, and we'll say try and get the phone number, but if you can't, that's okay, we're not angry, not even disappointed, that's fine, just have a phone number and we'll make the phone number equal nothing. Shift+Enter on that, Shift+Enter, Shift+Enter, there we go. So that's all good and well.

We are going through and creating a data dictionary, but each loop it's going back to being emptied out. So what we need to do is save each of those iterations. How we do that, let's think about this: we've got the ad details, so what we might say is ad_html_details is equal to a blank little list, and then we'll just append to that at the bottom here. So .append, and ad_html_details. Shift+Enter on that, and what we have available to us now is... nothing. What have I done? Absolutely nothing? Are we sure? Let's go ahead... I don't know what I've done. Oh no, that's not right at all. I see, okay, I've literally appended the list to itself. That's completely wrong, ignore that step. We should be appending the dictionary. Shift+Enter on that, Shift+Enter, and it looks like we have successfully done that. Nope. Oh no. Oh my god. Okay, so this is great.

In many of the cases we do have a number of the data attributes that the client requires to be able to complete this job. But to completely bring this all together, we are going to build out a nice spreadsheet of data as a final output. To do that, we've got a couple of different things at our disposal. We have two main lists. We have a list here which is the ad
details list, which has all the ad details, and in there it has the HTML. And then we have our new ad_html_details. Alrighty, ad_html_details, very straightforward, similar sort of thing. Now I'm expecting them both to be the exact same length and in the exact same order, and the reason I'm expecting that is because one was created from the other. So we can actually go ahead and combine those two dictionaries, and the easiest way to do that is something as simple as for ad, html in, it's going to sound confusing, but we're going to zip those up. We're going to zip them together, so we say zip of ad_details_list and ad_html_details. And we're going to say something like final_output. final_output is a terrible name for a variable, don't do that. Make sure we've got that little guy there. final_output is equal to, and we're going to say star star ad, comma, star star html. And as always we'll put a break here, just to make sure we've got our code right and everything's working. When we look at our final output, what that does is bring together all of the keys and values from the main dictionary along with all the keys and values from our HTML dictionary. Now this is probably a good point to say that we could potentially remove this t from here, but just in case there are other attributes that we may have missed, it's always nice to have the data for later analysis.

What I'm also noticing is we have this key here called ad, which then has another dictionary: country, city, street, postcode, state. We could ultimately add this to the final output, and the easiest way to do that is to then say final_output is now equal to, let's bring together our final_output, so star star final_output, and let's go ahead and star star the ad, which is this one here, or we can reference final_output and we're looking for the ad key. So let's go ahead and Shift+Enter on that, and Shift+Enter again, and all that
means now what we have is we have this ad which is almost like nested in this dictionary just available to us in the main dictionary so what we might also do while we're here is we genuinely do not need this ad at all because we're taking 100% of its dictionary and putting it in the main one so what we could probably do after that is say something like hey final output we don't want the ad anymore in there so we just go ahead and we can just say something as simple as delete that so shift enter shift enter and so what you'll notice here is at the top the ad is now gone there's no ad in there and all the details of the ad are available down here all right so where are we at so we're now at the point where we're pretty close to wrapping this up uh we should be able to make a sort of csv output from this i'm going to take the breaks off and we may get some errors especially if some of these keys aren't available so let's shift enter on this and see what happens uh nothing cool that's probably a good thing so the final step now is we have this final output which generates a final dictionary of each of the items in our list so i'm going to call this one something like master list is equal to a blank list and what we're going to do is at the very end we're going to say masterlist.append and we're going to append the final output so shift enter on that and what we end up with is this master list which contains all of those dictionaries for us to enjoy okay so where to from here so we've got our master list so what we can actually do and obviously the data does need a lot more cleaning uh and we are running a bit short on time for this video so what i'll do is i'll import pandas as pd and pandas like i mentioned in my previous videos is a really great data science sort of data analytics package and what we can do is we can create a data frame because what we might say is add data frames add df is equal to pd dot DataFrame case sensitive and we'll pass in our master list
which is this one here shift enter on that data yeah you go you've got to spell things right mr nem there's a data frame shift enter on that and now when i look at the ad data frame what we now have available to us is we have the longitude the latitude we have a whole bunch of information all the different titles these phone numbers are just an absolute mess we need to get rid of those that's for sure but at the very least we have a whole heap of data we can then build out and clean up so i'm thinking maybe we'll do a data cleansing video next on how to really clean this data up but as far as extracting the data from the map i'm pretty pleased where we got to so we were able to get the legends which we didn't actually need to use in the end we got all the map detail and we got all the ad details as well and from the add details we were able to extract the html and from the html we're able to get things like the phone number and website and generate some of those key value pairs based on the next sibling along with the main block of text as well so this is probably a good point to wrap things up um i do want to just say a huge thank you to all my most recent subscribers and everyone who's dropped me a note in the comments i really appreciate it i think it's absolutely unbelievable that uh you know 600 plus people have decided to subscribe to my channel uh and as most of you already know i do have a big hairy audacious goal to get to 1 000 subscribers by the end of 2020. i'm feeling quietly confident about that but if you can help me out if you haven't already subscribed please please give it a consideration uh no pressure uh and look thank you so much i will be creating videos several times i'm gonna say a week maybe fortnight um but yeah look forward to your feedback in the comments have a good day
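The merge steps described in the video (zip the two lists together, combine each pair with star star unpacking, lift the nested dictionary up into the main one, delete the nested key, then append to a master list) could look roughly like the sketch below. This is a minimal reconstruction from the spoken description, not the code shown on screen. The sample records, the function name `merge_records`, and every field name are made up for illustration. The nested key is assumed to be spelled `add`, since the captions render it as both "add" and "ad".

```python
# Hypothetical sample data: the real records come from the map's network calls.
ad_details = [
    {"id": 1, "title": "Example Co", "lat": -33.87, "lng": 151.21,
     "add": {"country": "AU", "city": "Sydney", "street": "1 Example St",
             "postcode": "2000", "state": "NSW"}},
    {"id": 2, "title": "Sample Pty", "lat": -37.81, "lng": 144.96,
     "add": {"country": "AU", "city": "Melbourne", "street": "2 Sample Rd",
             "postcode": "3000", "state": "VIC"}},
]
# Values parsed out of each record's html, in the same order as ad_details.
ad_html_details = [
    {"phone": "02 0000 0000", "website": "example.com"},
    {"phone": "03 0000 0000", "website": "sample.example"},
]


def merge_records(details, html_details, nested_key="add"):
    """Merge paired dictionaries and flatten one nested dictionary.

    nested_key is an assumption: the captions spell it both 'add' and 'ad'.
    """
    # zip() silently stops at the shorter list, so check the lengths first.
    if len(details) != len(html_details):
        raise ValueError("expected both lists to have the same length")

    master_list = []
    for ad, html in zip(details, html_details):
        # Keys from html win if the same key appears in both dictionaries.
        final_output = {**ad, **html}
        # Guard against records that lack the nested dictionary.
        if nested_key in final_output:
            final_output = {**final_output, **final_output[nested_key]}
            del final_output[nested_key]
        master_list.append(final_output)
    return master_list


master_list = merge_records(ad_details, ad_html_details)
```

Passing the resulting list to `pd.DataFrame(master_list)`, as the video does, gives one row per dictionary and one column per key. From there `DataFrame.to_csv` would be one way to produce the csv output mentioned earlier.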
Info
Channel: Make Data Useful
Views: 12,550
Rating: 4.9906759 out of 5
Keywords: how to make money using python, .py, best first language, best languages, earn money online, how to learn to code, how to make money as a programmer, how to make money coding, learn python, learning python, py, python, python datetime, python for beginners, python programming, python projects, python scraping, python scripting, python tutorial, python3, pythonnbeautifulsoup, pythonrequests, top programming language
Id: zesUhmT7Oz0
Length: 37min 24sec (2244 seconds)
Published: Sun Aug 02 2020