Intro to Web Crawling & Scraping in R

Captions
Okay, I think it's recording, so today we're going to talk about web crawling and scraping. If there are questions, please just use your mic; it's hard to follow the chat as well.

A lot of people in econ like web crawling and web scraping because there's a lot of data online that doesn't come as a downloadable CSV. You have to go out, get it, and essentially build a data set for yourself, and this is one of the major tools for doing that. I think it's super important for economists to know, and it's a fun activity as well.

First, the difference between a web crawler and a web scraper. A web crawler is pretty much what Google does: it visits websites and indexes them — "okay, this is a shoe website, I'll put it under that category so when people search for shoes they can find it." Web scraping is the process of actually getting the data off of a website: once you have pages indexed or saved, you say "on every one of these pages there's some data I want," and you go and scrape it off the page.

Before we get there, we have to learn a few things: how to work with strings, how to write loops, a few packages, and a little HTML. Strings come first, because HTML is mostly just a bunch of text, so we need to be able to parse and manipulate strings as well as URLs.

There are a few really important string functions you might need. The first is paste(), which does what you'd expect: you give it two strings, or vectors of strings, and it smushes them together. For example, if I have the string "Thom" and the string "Yorke" and I paste() them, I go from two elements to one string. Notice that even though neither string contains a space, whenever paste() sees arguments separated by a comma it fills in a space between them. Maybe that's not what you want — with URLs, say, you can't have spaces and you might want dashes instead — so you can specify the separator: with sep = "-" the result is "Thom-Yorke". And if you don't want any separator at all, because you want the string to show up exactly the way you typed it, use paste0(). The zero just means the separator is nothing: no space, no dash, the pieces are smooshed together exactly as they are.
I usually use paste0() because I don't want to have to remember that there's a space, unless I'm combining a bunch of different things and actually want a separator. That's for pasting individual character elements; it also works on vectors. If I have c("Thom", "Jonny") and c("Yorke", "Greenwood") and I paste them with sep = "_", it pairs them up element by element and I get "Thom_Yorke" and "Jonny_Greenwood".

That's for combining strings. If you want to take a piece of a string, use substr(). It takes three arguments: the string you're starting with, a start position, and a stop position. So substr("Thom Yorke", 1, 4) means start at the first character and stop at the fourth, so it returns "Thom".

If you want to find and replace, use gsub(). It finds every instance of the first argument and replaces it with the second. If I take radiohead, my vector of "Thom Yorke" and "Jonny Greenwood", and substitute underscores for o's, all the o's are gone, replaced with underscores.

Maybe I don't want to replace anything; maybe I just want to know whether a string contains some substring. For that there's grepl() — the name comes from grep, as in regular expressions. It tells me, for each element of radiohead, whether an "m" exists in it: TRUE for "Thom Yorke", FALSE for "Jonny Greenwood". If I search for "y" instead, both come back TRUE.

The next one is regexpr(), short for regular expression. Instead of returning TRUE or FALSE, it gives you the position of the match in the string. If I search for "m" it returns 4 for "Thom Yorke" — it prints some other attributes too, but the position is the part we care about — and since there's no "m" in "Jonny Greenwood" it returns -1 to let you know it couldn't find one, rather than confusing you with a zero or some other number. Searching for "y" works the same way: 6 for "Thom Yorke", 5 for "Jonny Greenwood".
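Here's a quick sketch of those string functions, using the same Radiohead example, with the output in comments:

radiohead <- c("Thom Yorke", "Jonny Greenwood")

paste("Thom", "Yorke")                # "Thom Yorke"  (space filled in by default)
paste("Thom", "Yorke", sep = "-")     # "Thom-Yorke"
paste0("Thom", "Yorke")               # "ThomYorke"   (no separator at all)
paste(c("Thom", "Jonny"), c("Yorke", "Greenwood"), sep = "_")
                                      # "Thom_Yorke" "Jonny_Greenwood"

substr("Thom Yorke", 1, 4)            # "Thom"  (characters 1 through 4)
gsub("o", "_", radiohead)             # every "o" replaced with "_"
grepl("m", radiohead)                 # TRUE FALSE  -- does an "m" exist?
regexpr("m", radiohead)               # 4 -1        -- position of the match, -1 if none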
The last string function I want you to know is strsplit(). It cuts your string at every point it sees whatever character you give it. For example, it takes my radiohead vector and chops each element every time it sees a space. If I run that, it gives me a list: my two-element vector becomes a list with two elements, and each element of the list holds the pieces of that string — "Thom" and "Yorke", because I split on the space. If I split on "e" instead, "Thom Yorke" just becomes "Thom York" (the trailing e goes away), and "Jonny Greenwood" becomes "Jonny Gr", an empty string, and "nwood". A list like that is a little awkward to work with, so we can use unlist(), which takes the list, in order, and turns it back into a plain vector. We started with a list holding four pieces in total; after unlist() we get "Thom", "Yorke", "Jonny", "Greenwood". Are there any questions about strings? We're going to use this a lot for URLs and for parsing.

All right, now loops. There are a few kinds of loops in R, most notably the while loop and the for loop; we're only going to talk about the for loop. It's a block of code that gets executed a given number of times. For example, suppose I wanted to send a personalized email to every student in a Micro 101 class — that would be about 200 emails. Instead, I can write the code once inside a loop and let it run that many times. Say I want to print "hello world" four times: the index i starts at 1, the body runs, then i becomes 2, the body runs again, then 3, then 4, and it prints four times.

That version ignores the index variable, but we can also use the index inside the loop. Suppose I have names — "Alex", "Brad", "Brian", and "Adam" — and I want to say hello to each of them. I can write for (name in names): the first time through, name equals "Alex" and the loop prints paste0("Hello ", name, "!"), so "Hello Alex!"; then it comes back around, name becomes "Brad", it prints "Hello Brad!", and so on. Your code executes four times even though you only wrote it once. Any questions about loops? They can be a little tricky when you're first learning them, but I suspect some of you have seen them before.
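Those two loops look like this:

for (i in 1:4) {
  print("hello world")                  # runs four times, ignoring i
}

names <- c("Alex", "Brad", "Brian", "Adam")
for (name in names) {
  print(paste0("Hello ", name, "!"))    # "Hello Alex!", "Hello Brad!", ...
}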
Now if statements: we're going to let our code make decisions by introducing a little logic. I'll write the same for loop, but maybe I only care about greeting the people whose names start with "B". I take substr(name, 1, 1) — remember names was Alex, Brad, Brian, and Adam — and compare the first letter to "B". If that's TRUE, the body of the if runs; if it's FALSE (Alex starts with "A"), the loop never even looks inside the if statement and just moves on. So even though I loop over all four names, it only prints twice, for Brad and Brian.

Maybe I don't want to ignore Alex and Adam — maybe I want to say something different to them. For that I add an else. The else is a catch-all: if the if branch runs, the else doesn't, and vice versa; only one of them can happen, and once one executes the loop continues on. So if the name starts with "B" it says "Nice to meet you", otherwise it says "Hi": "Hi Alex", "Nice to meet you Brad", "Nice to meet you Brian", "Hi Adam".

You can also chain multiple conditions with if, then else if, then else, so you can really customize the decisions the computer makes for you: if the first letter is "B", say nice to meet you; else if the name equals "Adam", say hi; else if it equals "Alex", say goodbye. And if you want something to execute every single time, just put it at the end of the loop body instead of repeating it in every branch — why write it three times when you can write it once? The output looks like "Bye Alex", "okay, on to the next person", "Nice to meet you Brad", "on to the next person", and so on.
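A sketch of that loop with the full if / else if / else logic (the exact greetings are just illustrative):

names <- c("Alex", "Brad", "Brian", "Adam")
for (name in names) {
  if (substr(name, 1, 1) == "B") {
    print(paste0("Nice to meet you, ", name))
  } else if (name == "Adam") {
    print(paste0("Hi ", name))
  } else {
    print(paste0("Bye ", name))
  }
  print("Okay, on to the next person.")   # runs every time, whichever branch fired
}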
Now we can combine if statements and loops in one particular case with ifelse(). We used this before when we were generating data and conditionally coloring plots, but I want to say a bit more about it. Because R is already vectorized, you generally don't want to write loops in R — you want to avoid them as much as you can because they're pretty slow. Sometimes they're unavoidable; for some of the scraping stuff, for example, you have to loop through every URL you have. At the same time, loops are very easy to read, because you can see what happens each time through, so it's a trade-off between readability and performance. What ifelse() does is essentially loop through your vector and perform an if statement on every single element, and it's way faster than a for loop with an if statement nested inside.

In the first example I check the first letter of each name: if it equals "B" I put a 1, otherwise a 0. Notice I'm now passing in the entire names vector — there's no index — so it checks each element and outputs a whole vector. And you can nest these ifelse() statements: if the first letter equals "B", give me a 1; otherwise, check a second condition. That second condition looks at the last letter, using nchar(), which returns the number of characters in each name — "Adam" is four letters long, so it plugs a 4 into both the start and stop of substr(). If that last letter equals "m", put a 2, otherwise put a 0. So Brad and Brian get a 1, Adam gets a 2, and Alex gets a 0. Any questions about loops, if statements, or strings? Are we all good?
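The vectorized version looks like this:

names <- c("Alex", "Brad", "Brian", "Adam")

# 1 if the name starts with "B", 0 otherwise -- no loop needed
ifelse(substr(names, 1, 1) == "B", 1, 0)
# 0 1 1 0

# nested: 1 if it starts with "B", 2 if it ends in "m", 0 otherwise
ifelse(substr(names, 1, 1) == "B", 1,
       ifelse(substr(names, nchar(names), nchar(names)) == "m", 2, 0))
# 0 1 1 2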
Okay, we're going to move on to crawling soon, but I want to show you one more package first. It's mostly cosmetic, but it makes your code a lot easier to read, especially when you're writing what can end up being relatively complex code. The package is called magrittr, and all it does is let you use this little thing called a pipe, %>%. Say I create 100 random numbers and do something convoluted to them: take the mean, raise e to that mean, take the square root of that, and print it — a stack of functions built on top of each other. The output is around 0.9, but the number itself is immaterial; what I care about is the flow of how you write it. If this were longer or more complicated it would be really hard to read, because you have to think almost backwards, inside out. With magrittr you can rewrite it: take x — I think of it like a waterfall — it falls into mean(), then into exp(), then into sqrt(), then into print(). To me that's easier to read, though it takes some practice if you're used to nested functions, and it gives the exact same output. There are a few other tricks in magrittr, but I'll hold off on those.

One important nuance: the pipe always puts the left-hand side into the first argument of the function on the right. For rnorm(), a single argument is the n: rnorm(100) gives me 100 random numbers, rnorm(10) gives me 10, and the mean of 0 and standard deviation of 1 are preset defaults. So 10 %>% rnorm() puts the 10 into that first slot. Now suppose I don't want it in the first slot — suppose I want four random numbers with a mean of five. If I put a period where the piped value should go, magrittr sees the 4 in the first slot and drops the 5 into the slot I marked, which here is the mean. So that's four random numbers with mean 5 and standard deviation 1. Likewise you can put the period in the third argument — say four random numbers with some mean and a standard deviation of 5 — or use it in every argument at once; you just have to put the periods where you want the value to land.
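A sketch of the pipe, including the period placeholder:

library(magrittr)

x <- rnorm(100)
print(sqrt(exp(mean(x))))                # nested version: reads inside-out

x %>% mean %>% exp %>% sqrt %>% print    # pipe version: same result, reads left to right

10 %>% rnorm()                           # goes into the FIRST argument: rnorm(10)
5 %>% rnorm(4, ., 1)                     # period picks the slot: rnorm(4, 5, 1)
5 %>% rnorm(4, 0, .)                     # rnorm(4, 0, 5)
5 %>% rnorm(., ., .)                     # rnorm(5, 5, 5)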
Okay, so what is HTML? HTML is pretty much what runs the internet — it's the computer code behind websites — and a super basic HTML document should look pretty similar to something we've already talked about: a list in R. I can write roughly the same thing in R: make a list called html with one element called head containing "my first web page", and another element called body which is itself a list containing a heading and "this is a paragraph". If I print that, it looks pretty similar to the HTML — a little different, because they're different languages — but the point is you should think of HTML as a list: a list of elements, where each element might itself contain multiple elements. It's not always a flat table.

Now, crawling. To crawl just means you visit a website, have the computer read the HTML, and save it to your machine. That's all it is. Before we crawl, though, know that it can get you in trouble: some websites will block you if you hit their servers too quickly. There's a file called robots.txt that tells you what you may crawl, what you may not, and how fast you can do it. For example, the Knicks data I gave you came from a site called Basketball Reference. The site itself is players and stats, but if I type /robots.txt after the domain — most good websites have one — it gives a bit of information about what they want from you. It lists user agents: AhrefsBot is disallowed from pretty much everything, so that bot is not allowed on their site; Twitterbot, I think, is also shut out of everything. The part that concerns us is "User-agent: *" — every other user — which disallows certain paths: something with blazers, something with dump (I don't know what those folders are), and anything with gamelog, which apparently you're not supposed to scrape. You technically can scrape all of it; it's just that if their servers see you doing it a lot and they don't like it, they'll stop you. Also important is the Crawl-delay line: they're requesting that you wait three seconds between each call. I'm going to ignore that for now, but if you're a better person than I am, you'd have all of your scripts wait three seconds between calls.

Here are some other examples. The CDC's robots.txt — pretty topical right now — says "do not index the following URLs," and one bot, Roverbot, has apparently been a bad dog, because they won't let it crawl anything at all. Nike's is kind of funny: scroll to the bottom and they've drawn a little swoosh in it. Some sites have genuinely neat robots.txt files — I think TripAdvisor's says something like "if you're a human reading this page, we probably want to hire you," since they want search-engine-optimization people who scrape a lot of data. I mention all this because you should check. I was scraping an NYPD site one time, they requested a 10-second crawl delay, I totally ignored it, and my IP address got blocked — I couldn't access the site from my apartment at all, though I could go to a coffee shop or change my IP and it was fine. So if you're scraping a lot of stuff, check the robots.txt and make sure what you're doing is aboveboard.

Okay, let's set up our script. As I've said before, this is pretty much how I start every script: load the packages I'm going to need. rvest is going to do 99% of the heavy lifting for us — it's the package that lets us scrape. magrittr is what we just talked about, to make the code easier to read (maybe you'll agree with me by the end of this). And jsonlite lets us read JavaScript object notation: anything that comes back as JSON is in a slightly odd list format we're not used to, and jsonlite reads it very quickly. Then I set my working directory to a scraped_html folder inside my empirical workshop folder — it should be empty right now — and that's where I'll dump everything we scrape today. So: set that as the working directory and load the packages.
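As a reference, the setup block looks roughly like this; the folder path is just whatever the scraped_html folder is on your own machine:

# rvest does the heavy lifting, magrittr gives us the pipe,
# jsonlite reads JSON returned by APIs
library(rvest)
library(magrittr)
library(jsonlite)

# folder where crawled pages get dumped -- adjust the path for your computer
setwd("C:/empirical_workshop/scraped_html")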
When you load the packages you may see a message about a second package being loaded in the background; that's fine.

Okay, now we're going to scrape the 2019 Knicks roster. Here's the page on Basketball Reference: a bunch of player data — Kadeem Allen, his number (0), position, height, weight, birthday, nationality, experience, college — this is the table I think I gave you as a CSV before. And notice the hyperlinks in it. This is what we're going to scrape, or rather crawl first. The first thing I do is grab the URL: I copy it from the browser and save it in R as link. Then I put the link into read_html(). With the pipe operator, link %>% read_html() is exactly the same as read_html(link); there are a bunch of ways to write it, but I'll use the pipe and save the result as my_html. If I print my_html, it has a head and a body — just like the toy example — so we know it's an HTML object. Then all we do is put it into write_html(), and now my folder has an HTML document called knicks_2019.html. If I open it, it looks almost exactly like the live page, with a few differences: the live page has ads, my copy doesn't; and if I click on one of the links in my copy — say Tim Hardaway's page — it says not found. Your copy is limited to exactly that page: when you crawl, you're only getting a snapshot.

That can be good and bad. It's good if, say, a website updates every day but doesn't keep historical data: you can crawl it every day and build your own time series, or your own cross-section over time. It's bad if you think this is the page you need and then realize you actually wanted player information too, and you have to go back and scrape everything again. But the big advantage is that you have a saved snapshot, so you don't have to keep revisiting the site — which matters if crawling it is tedious or there's a long required wait, like that NYPD checkbook site that blocked me; if I'd crawled it once, I'd have had all that information saved.

Now, if we want to crawl multiple pages, we use a for loop. I'll say for year in 2010 through 2019. Luckily this URL is really clean — the only thing that changes from season to season is the year — so I can paste the year into the link, read it, write it out, and change the name of the output file each time; a sketch of the whole loop is below.
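A sketch of that crawl, assuming the team-page URL pattern shown in the video; note that write_html() comes from xml2, which older versions of rvest attach for you (if it isn't found, load xml2 as well):

# crawl a single page and save a snapshot of its HTML
link <- "https://www.basketball-reference.com/teams/NYK/2019.html"
link %>% read_html() %>% write_html("knicks_2019.html")

# crawl the 2010-2019 rosters in one loop
for (year in 2010:2019) {
  link <- paste0("https://www.basketball-reference.com/teams/NYK/", year, ".html")
  link %>% read_html() %>% write_html(paste0("knicks_", year, ".html"))
  Sys.sleep(3)   # respect the 3-second crawl delay requested in robots.txt
}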
If I run that, you'll see the files populate in my folder one by one until I have all ten. Now suppose this were a site that was stricter about its robots.txt and I wanted to adhere to its crawl delay: I can just tell my script to sleep for three seconds between calls with Sys.sleep(3). If I rerun it, the files roll in a lot slower, because now we're respecting their server load.

That's great, but this is also a very basic website — the data is just always sitting there for you — and a lot of websites are much harder than this. Here's a good example: a page that tells you how many people are on government websites right now. Suppose I want to scrape it every hour and keep a time series that the site itself doesn't keep. I could try exactly what we just did — read_html(), write it out, call it example.html — but if I open the saved copy, it looks different, and where the visitor count should be there are just "..." placeholders. The page says about 400,000 people are on government websites, but my snapshot only captured the dots. If you reload the live page and look quickly, you can actually see the dots for a moment: the site first loads a shell, and then, after the page is loaded, JavaScript and API calls fill the numbers in.

So what you have to do is dig around and figure out where the data is actually coming from. I'll open the developer tools — Ctrl+Shift+I in Chrome — go to the Network tab, and reload: a whole bunch of requests come down once you visit the page. I'll filter by type and look at the XHR documents, clicking through each one to see where the number comes from. One of them has an active-users field — 436,374, exactly the number shown on the page. I right-click that request and choose Copy link address, which gives me a URL that's a little different from the page's URL, and if I open it I get a weird-looking thing. That weird-looking thing is called JSON: a special format for data being passed around by APIs.
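A sketch of reading that kind of endpoint with jsonlite — the URL here is a stand-in; copy the real one out of the Network tab as just described:

# the visitor count is filled in by an API call, so read the JSON endpoint
# directly instead of the page shell
x <- fromJSON("https://analytics.usa.gov/data/live/realtime.json")  # illustrative URL
str(x)                    # it comes back as a regular R list
x$data$active_visitors    # the number shown on the page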
So instead of scraping the page with read_html(), we use fromJSON() from the jsonlite package. Point it at that link and it converts the JSON into an R list; save it as x, and inside x there's a bunch of stuff, but x$data$active_visitors is the number we wanted. If you were scraping or crawling a site like this — one you knew was dynamically generated, not like Basketball Reference where everything is already in the HTML — then instead of crawling the page you'd crawl these JSON files, and figure out a way to index them on a schedule.

The last way into a website is through a login. We'll use Stack Overflow, a pretty helpful coding site with a list of recent questions. We want to scrape some of that, but first we have to log in — I'm apparently already logged in because of cookies, but logged out it looks like a sign-in page. We have to actually fill in the form and submit it, and for that we use html_session() and then find the HTML forms on the page. A form is anything you fill out and submit: this page has two, a search form with its text input, and the login form with a bunch of hidden fields plus an email box, a password box, and a submit button. I save the forms, keep only the second one (the first is just the search bar and I only want to log in), and use set_values() on the unfilled form: where it says email I put my email, and where it says password I put "welcome to chili's", which is my password for this site. One odd step: for some reason the submit element on this form wasn't identified as a button, so after a lot of googling I had to set it as the button manually — that may be fixed by now. Then you just submit the form, and from that point the session is logged in and you can go back to the page you wanted and see the questions instead of the sign-in screen. I won't walk through the rest, but you can try it yourself — I don't care if you have that password; it's the only thing I use it for. That's a lot, and you shouldn't expect to absorb it all; if you are, that's pretty impressive. Any questions or concerns about crawling?
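Roughly what that login code looks like, using the pre-1.0 rvest interface from the video (newer rvest renames these to session(), html_form_set(), and session_submit()); the email and password are placeholders, and which form is the login form depends on the page:

my_session <- html_session("https://stackoverflow.com/users/login")
forms      <- html_form(my_session)
login_form <- forms[[2]]          # forms[[1]] was the search bar; [[2]] the login form

filled_form <- set_values(login_form,
                          email    = "you@example.com",
                          password = "welcome to chili's")

# if the submit button isn't detected automatically, you may have to set it by hand,
# as described above
logged_in <- submit_form(my_session, filled_form)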
Question: how do we show all of the elements of a web page again, in that side panel? On a PC it's Ctrl+Shift+I in Google Chrome, or you can right-click anywhere and hit Inspect, and the panel pops open — I'll talk more about this in a second. F12 works too, at least on Windows, maybe on Mac as well (on my machine it just changed the volume).

Okay, now let's talk about actually scraping: if there's something on the page that we want, how do we get to it? Back on Basketball Reference, here's the Knicks roster from 2019. There are several tables on the page, but we only want this first one. This part is important, a little tricky, and takes practice. Pick some element in the table — I usually pick something in the middle but toward the top, like this "SG" cell — right-click it, and hit Inspect. This time the panel looks different from before: the HTML is expanded way out, and that exact element is highlighted, with the corresponding row on the page staying highlighted too. Now I look at where I am and walk up the tree. This cell is a td, which stands for table data. Going up, it sits inside a tr — a table row. Up again and I see all the rows of the table, inside a tbody, the body of the table. Keep going: there's the header, the column group (which doesn't even look like it shows anything), a caption, and finally the table element itself. I want the whole table — the header as well as the data — so I click on the table element, right-click it, and choose Copy, then Copy selector. That gives me essentially the name of that element on the page: "#roster". The Basketball Reference people make it really nice and easy — you'd expect this table to be called roster, and it is.

Because I've already crawled this page, I don't have to go back to the internet version; I can use my local copy. So I read_html() the saved knicks_2019.html file and save it as my_html. It has a head and a body, but I'm looking for that one node, so I pipe my_html into html_nodes(), which asks: given this HTML, which node do you want? I tell it I want "#roster", and it comes back with the table node — a sortable stats table with id "roster" and its data attributes.
If we look at the page source, it's exactly the same element. And because that node is a table, I can just say: convert it to a regular table with html_table(). Run that and — bam — we have data. That's the Knicks roster in R, pulled from the website with nothing more than that "#roster" selector. Sometimes crawling is actually harder than scraping — you have to figure out logins or dynamically generated pages — but once you have the page, you can get pretty much anything out of it this way. From there we can turn it into a data frame with as.data.frame(), and I think this is a good example of why magrittr is nice: you build the whole thing up line by line instead of nesting it inside out. I'll save the result as x, and if I open x it's pretty much exactly the data set you had before — this is how I made it. Any questions about that process? All good?
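Putting those pieces together, the whole scrape of the saved page is just:

# pull the roster table out of the saved snapshot
my_html <- read_html("knicks_2019.html")

x <- my_html %>%
  html_nodes("#roster") %>%   # the CSS selector copied from the inspector
  html_table() %>%
  as.data.frame()

head(x)   # player, number, position, height, weight, birth date, ...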
Okay, now multiple pages. Just like when we crawled, we write a loop. First I list my files with list.files() — you can pass it a directory or a path, but called as-is it lists everything in the current folder, including things I don't want. There's an argument called pattern: if I say pattern = "knicks", it only returns files whose names contain that, and I save those as files. I also create an empty list called roster, so roster starts as a list of zero and files is my vector of ten file names.

Here's what the loop does. To see it, set file to the first element, files[1], which is the 2010 file. I read_html() that file — from my hard drive now, not even online — look for the "#roster" node, convert it to a table, convert that to a data frame, and now it's in a friendly, usable form, which I save as temp. But notice: you can tell it's the Knicks roster because you already know, but nothing in temp itself says which team or which year it is. So I add a file column, setting it to the file name, because the file name carries both the team and the year — imagine you had every NBA team and needed to know who was on which roster in which season. I pretty much always do this, because my file names are usually descriptive; however you do it, you want to be able to uniquely identify each table and track where it came from. Then I save temp into the list: the length of roster starts at zero, so roster[[length(roster) + 1]] <- temp puts it in the first slot, then the second the next time through, and so on. Run the loop, give it a second, and roster is a list of ten data frames.

Now I want to smush those together. Normally, with two data sets, we'd just rbind() them, but here we need to rbind all ten, so we use do.call("rbind", roster) and save the result as final_roster. final_roster has about 190 observations — all the players from the 2010 file, then 2011, 2012, 2013, and so on — and if you sort it you can see, for example, that Amar'e Stoudemire was on the Knicks for several seasons. There's another way to write the loop, using a numeric index from 1 to the length of files instead of looping over the file names directly; it works exactly the same, you just have to index into files each time, and you can look at that version on your own. The whole thing is sketched below.
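A sketch of that multi-file loop, assuming the files were saved with the knicks_YEAR.html names from the crawl above:

# read every saved Knicks file, stack the rosters, and track the source file
files  <- list.files(pattern = "knicks")
roster <- list()

for (file in files) {
  temp <- read_html(file) %>%
    html_nodes("#roster") %>%
    html_table() %>%
    as.data.frame()
  temp$file <- file                        # file name carries the team and the year
  roster[[length(roster) + 1]] <- temp
}

final_roster <- do.call("rbind", roster)   # ten data frames -> one with ~190 rows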
Okay, extracting hyperlinks. If you look at the roster table on the site, each row has two hyperlinks — Kadeem Allen and Arizona, for example — but in final_roster there's nothing about either of those links. A lot of the time, getting these links is really the first step of a crawl: maybe I don't care that Tim Hardaway was on the Knicks; I want his stats for every season, and those live on his own player page. Team pages are easy — there are only 30 teams and I can loop over the years — but there are thousands of NBA players and I don't know all of their unique URL handles, so I have to use this table to find them.

I start the same way as before, except from the internet this time (it doesn't really matter): read the page, save it as my_html, grab the "#roster" node, and confirm it's the same table we've been seeing — still nothing obvious about Kadeem Allen's or Arizona's URLs in the table output. So I go back to the browser and inspect that exact cell. Going up a level, this cell has a little arrow I can expand, and inside it is an a tag — the link — and if I hover, I can even see his URL appear. So within that roster node I'm going to search again, this time for a nodes: my_html %>% html_nodes("#roster") %>% html_nodes("a"). A bunch of stuff comes down — 43 of them — and inside each you can see href, which is where the hyperlink lives. The first looks like allenka01, so that's Kadeem Allen's player link, then his Arizona link, then Ron Baker and Wichita State, and so on: two links per row. To pull out just the link part I use html_attr("href"). These are only half the link — everything after the domain — but that's easy to fix later with paste0().

The problem is the college links are mixed in with the player links, and suppose we only care about the players. Save the vector as links, then subset it with square brackets using a string function that checks whether a substring exists — and that function is grepl(). The player links all contain "players" (the college links follow a different pattern), so grepl("players", links) gives TRUE, FALSE, TRUE, FALSE, roughly alternating since there are two links per row, and links[grepl("players", links)] keeps only the player links, which I save as knicks_player_links. Now I've successfully pulled out the URLs for all of these players — the code is sketched below.
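In code, the link extraction looks like this (my_html is the roster page read in above):

# grab every <a> tag inside the roster table, then keep only the player links
links <- my_html %>%
  html_nodes("#roster") %>%
  html_nodes("a") %>%          # 43 anchor tags: players and colleges mixed together
  html_attr("href")            # just the href, e.g. "/players/a/allenka01.html"

knicks_player_links <- links[grepl("players", links)]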
That might not be enough, though. The links are useful on their own as unique identifiers — Billy Garrett isn't that unusual a name, so maybe there are multiple players with it, and the URL handle tells them apart — but maybe you want to go to each player's page and grab something off it, like his Twitter handle. First I paste the front half of the URL onto knicks_player_links so they're full addresses (I think the relative links already start with the slash), then I start with the first one, Kadeem Allen. This is how I'd build the loop: take his URL, read_html() it, and now his page is sitting in my_html. To get to the Twitter link, I do the same trick as before: right-click it, Inspect, Copy selector, and drop that selector into html_nodes(); then html_attr("href") gives his Twitter URL, and if you follow it — hoping he hasn't changed it — there's Kadeem Allen. From there you could go collect tweets and so on.

In the full script I do it a little more carefully: on each player's page I grab all of the links in that header area, keep only the ones that contain twitter.com, and — because some players probably don't have a Twitter — check that the result has length greater than zero before saving it. Run that loop over every player link and you end up with a Twitter handle for each player that has one. Any questions about that? I'm throwing a lot at you, but the point is less that you absorb it all in the moment and more that you can come back and see how to jump from page to page.
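A rough sketch of that player-page loop; the domain prefix and the keep-only-twitter.com filter follow the approach described above rather than the exact script:

# visit each player's page and pull a Twitter link if one exists
knicks_player_links <- paste0("https://www.basketball-reference.com", knicks_player_links)

twitter <- rep(NA, length(knicks_player_links))
for (i in 1:length(knicks_player_links)) {
  player_page  <- read_html(knicks_player_links[i])
  player_links <- player_page %>% html_nodes("a") %>% html_attr("href")
  handle <- player_links[grepl("twitter.com", player_links)]
  if (length(handle) > 0) twitter[i] <- handle[1]   # some players may not have one
  Sys.sleep(3)                                      # be polite between requests
}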
I have a little more for you: error handling. When you're doing really large crawls, it helps to anticipate where your program is going to break — and this isn't just for crawling. If you're writing anything that handles user input, people type the wrong thing all the time, and you want to check for that. For example, if I try to crawl the Knicks roster for 3019, that year doesn't exist, so R throws an HTTP error 404 — and sure enough, typing 3019 into the browser gives a 404 page too. Maybe your list of years includes one where the NBA didn't play, and you want your code to run overnight without dying on one broken link. For that there's tryCatch(). It tries a bit of code, and if it works, great — the result gets saved. If there's an error, you can print your own custom message and run some fallback code, for example return 0, so that instead of an html object you get a single zero that tells you it didn't work. If I run it with 2019 it works, because we already know that link is good; if I change it to 3019, it prints my message, there's no error, the script keeps moving, and my_html is just 0. This is helpful when you can anticipate where things might break, but I wouldn't start throwing tryCatch() around all of your code — sometimes there are good reasons your code should break, like you actually made a mistake.

Two more things, really quickly. The first is RSelenium. Sometimes, no matter how hard you try, these scraping methods just won't work. Selenium — a lot of people use it in Python, but there's an R version whose maintainership keeps changing — literally simulates a browser session: it opens a tab, you tell it to type in basketball-reference.com, you tell it where to move the mouse and what to click. At that point you're really building a web-browsing robot rather than just reading HTML or logging in. It's cool, but it's slow and not always reliable, so I'd really try not to use it unless there's no other way. Finally, PDFs. People always ask whether you can scrape PDFs, and the answer is yes, but it might just be easier to do it by hand. With pdftools, each page comes back as a single character element — all the text on page one between one pair of quotes — and you have to parse all of those strings and rely on patterns. I've scraped a bunch of PDFs that were relatively easy because they had a consistent layout, and even then there were rows with missing names where I had to work out that extra whitespace meant the name was absent. It's really tedious and annoying, and I'd stay as far away from it as possible.

That's all I have for web scraping and crawling. I think Adam is going to go through some more of this with you, though it'll be largely a rehash of what I did. Are there any questions, or does anybody have a website they've been wanting to scrape for a long time? I can give you a head start.
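A minimal tryCatch() sketch for the broken-year case:

# try to read a page that may not exist; don't let a 404 kill the whole script
my_html <- tryCatch(
  read_html("https://www.basketball-reference.com/teams/NYK/3019.html"),
  error = function(e) {
    print("404 error -- try again!")   # custom message instead of a crash
    return(0)                          # my_html is just 0 when the page is missing
  }
)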
Someone asked whether Adam's session, which is listed as text analysis, is the same idea. No — text analysis is completely different. He does a lot of real estate research, so with housing listing data, for example, he'd take the descriptions of the houses, analyze the text, and put it into a model. Next week, though, I think he's doing a web scraping session; he uses a slightly different package than rvest, so his code looks a little different, but it's equivalent, and it should still be helpful because he'll show you more.

Another question: I've seen papers — by Pablo Barberá, I think — where he scrapes basically all of Twitter to find who follows whom, and he makes it sound simple; do you know anything about that, or where to start? Scraping Twitter is not easy. I tried it for fun at one point: as you scroll, more tweets load, but the tweets above disappear, so you'd have to crawl the page repeatedly and save a copy for every scroll. There are Twitter APIs, though, that make this way easier, and an API is probably much better for what you're trying to do — scraping won't let you pull hundreds of thousands of tweets in one go. So don't scrape Twitter; it's too difficult, and I think the API is what he used.

Another question: what would a website look like if you couldn't crawl or scrape it — is it possible for every website? You can always do it; there aren't sites where it's flat-out impossible. They might make it really difficult — those CAPTCHA, click-here-if-you're-human checks are surprisingly good at keeping robots out — but every website can be scraped, because all scraping does is take a snapshot and copy down the HTML code that's presented to you. You could even do it by hand: go to any page, open Inspect, copy the element, paste it into a notepad, and you have the entire HTML of that website — you've just scraped it manually. So even when a site makes it hard for robots, a human can always do it.
Someone shared a COVID-19 website with a table at the bottom they'd like to copy, but it seems to be dynamically generated — there's a "show 20 countries" control they'd have to tell the scraper to expand, each row looks generated on the fly, and the code is hard to read. Here's what I'd try: reload the page and watch the Network tab to see whether there's some API being called to fetch the data. If there are no XHR requests, the site isn't pulling from an endpoint — it's all built in JavaScript, which is really difficult to scrape; that's the situation where you'd need something like RSelenium, because you need an actual browser experience to trigger the JavaScript. That one's a tough case, but most websites aren't like that — some are, but not the majority.

If there are no other questions, that's all I have. There's no homework or worksheet for this one, but there's a ton of information here that you can come back to. The only thing that isn't on the page is how to find an element's selector by inspecting it; everything else should be there, and you can always email me with problems or questions. I'll hang around a bit if anyone wants to talk about a website they have in mind.

One last question: I may have covered this when talking about PDFs, but what about downloading a PDF to a file programmatically? You can definitely do that — it just makes it a two-step process instead of one. If the link points at a PDF, you can use download.file(): give it the URL and a destination file name — it errors if the destination file is missing, so pass something like "example.pdf" — and it downloads straight into your folder. Look up download.file() and you'll find what you need.
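A minimal sketch (the URL is a placeholder):

# download a PDF (or any file) straight to disk; destfile is required
download.file("https://example.com/some_report.pdf",
              destfile = "example.pdf",
              mode = "wb")   # "wb" keeps binary files intact on Windows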
Info
Channel: Alexander Cardazzi
Views: 451
Rating: 5 out of 5
Keywords: Web Crawling, R programming, Web Scraping
Id: cbcISjisOKs
Length: 80min 29sec (4829 seconds)
Published: Sun Aug 23 2020