Python Tutorial: Web Scraping with Requests-HTML

Hey there, how's it going everybody? In this video, we're going to be learning how to scrape websites using the Requests-HTML library. Now, I've done a video on web scraping before using BeautifulSoup, which is one of the more popular tools out there, but Requests-HTML is a newer project written by Kenneth Reitz. He's the same person who wrote the Requests library, and he has a history of writing libraries that are easy to use and pretty intuitive, so I figured we'd give this library a look to see how we can scrape some web data.

Now, if you don't know what it means to scrape websites, basically this means parsing the content from a website and pulling out exactly what you want. So, for example, maybe you want to pull down some headlines from a news site, or grab some scores from a sports site, or monitor some prices in an online store or the stock market, or anything like that. So to show you an example of this, let's take a look at the finished product that we'll be building in this video, and then we'll actually learn how to build it. Now, this is going to be very similar to the script that we built in the BeautifulSoup video, but I'm also going to show how we can do some more advanced parsing, such as grabbing dynamic data generated by JavaScript, how to make asynchronous requests, and things like that.

So if I go over here to my personal website, I have a home page that has posts of my most recent videos. Every post that I have has a title with this big heading tag here, then some text with a summary of the video, so a description of the video, and then I have the embedded YouTube video here. So let's say that we wanted to write a scraper that would go out and grab this information. We want to grab the post titles, the summaries, and the links to the videos, and just ignore all this other stuff here. So to show you what this would look like, let me run the finished script that we'll be writing in this video so that you can see what something like this can
do and what it's capable of, and then we'll actually learn how to build it. So I have a script here, and I'm just going to run this. It's called cms_scrape.py, so if I run that script with Python, then you can see that it went out to my website and pulled down a bunch of information from my home page. These are the latest videos that I have: it grabbed the title here, it grabbed the description, and it also grabbed the link to the YouTube video. Now, not only did it print this information out here within the terminal, but if I go to the directory where this script lives, it also created this CSV file. If I open up the CSV file (I'm on a Mac, so this is going to open in Numbers by default; yours might open in Excel if you're on Windows), we can see that it pulled down this information within a CSV file too. So we have the headlines, the summaries, and then the video link over here. Now, we could reformat this a little bit to look better, so I will set the column width here to 300 points instead, and then I'll also turn on word wrap. Okay, and now that's a little bit easier to read. So this is what you can do with a script to scrape websites like this.

Now, if we were to try to parse this information with Python alone, then we'd probably run into a lot of issues, but luckily we have libraries like Requests-HTML that make parsing out all this information a lot easier to do. So let's go ahead and get started and see how to do this. First, we need to install Requests-HTML, so I'm going to pull my terminal back up here. We can do this with a simple pip install: it's just pip install requests-html. Now, I already have mine installed here, but yours would install there. Once we have that installed, let's look at a very basic example to get us started, and then we'll work up from there.

Now, you don't have to be extremely familiar with HTML in order to scrape websites, but it definitely helps. Basically, HTML is structured in a way
where all of the information is contained within certain tags, so if you're familiar with XML, then it's very similar to that. I have an extremely simple example here, a basic HTML file, and I have it open here in my browser. We can see that this small example just has one big header that says Test Website, and then we have two large links here for Article 1 and Article 2, with a summary underneath each of those articles. Then we also have a footer down here, and I also have something that says this text is generated by JavaScript. So this is how you're probably used to seeing browsers display HTML, but in the background the source code looks a bit different.

Let me switch over to the HTML code of this simple website to see what this looks like. I have this open in Sublime Text. I'm going to make this half screen here, and actually let me make this website a little smaller so that we can see a bit more of this HTML and how it's structured. So we have these tags throughout our document. There are opening tags, which are surrounded by these angle brackets here. For example, if I look at the head, then we can see that this, in these angle brackets, is the opening tag, and the closing tag is the same thing except it has a forward slash right after the opening bracket. So this is the opening head tag and this is the closing head tag, and these tags can also be nested.

So if we want to find our article headlines and our article summaries, then we can look down here in the body tag. Within the body tag we have our Test Website heading here, and then we have a div tag right below this, and that div tag has a class of article. Those classes are mainly used for CSS styling and can also be used within JavaScript to identify specific elements. So within that div tag with the class of article we have our article information: we have an h2 tag here, which is the heading of our article, and within that h2 tag we have an anchor tag
which is a link. So this is a link to the article 1 HTML page with the text of Article 1 Headline, and then below that we just have the paragraph tag with the summary of our article. So we can see how the structure of these websites can look confusing, but when you dig down a bit and actually look at the structure, it usually repeats. For example, our second article here has exactly the same structure as the one above it: we have a div with a class of article, an h2 heading, and this one links to the article 2 HTML page with Article 2 Headline as the text, and then below that we have a paragraph tag with a summary for article 2. So these are very similar. We have our article 1 div here and our article 2 div here.

So let's use this very simple example to see how we can parse out information using Requests-HTML. Let's say that we wanted to parse out the article headlines and also the summaries for our website, and nothing else. So in this example it's just going to be the article 1 headline and its summary text, and then the article 2 headline and its summary text. I'm going to open up a blank file here, and I just called this file rhtml-demo, and now we can parse the HTML within this script.

Now, we can parse HTML in multiple ways. We can either use an HTML session to go out and pull HTML from a website, and we'll see how to do that in just a minute, but we can also just parse HTML directly, and I have some saved as a file on my machine, so that's what we're going to do right now. For now, we're just going to parse this simple.html file, and that file is located in the same directory as this script. If you'd like to download this simple.html file to follow along, then I'll leave a link to this code in the description section below. Okay, so now let's go ahead and open this file and pass it into Requests-HTML. First, we're going to want to import the HTML class from requests_html, so at the
top here I'll just say from requests_html, so that's an underscore. Whenever we pip installed it, it was a dash, but when we're actually using the module it's an underscore. And now we can say import HTML. Okay, and now let's open the HTML file and pass the contents of that file into the HTML class. So we'll say with open, and this file was called simple.html, which is what this site is over here, and this HTML file is saved in the same directory as my demo script. So now I can open that as html_file, and within this with block I'll say source is equal to html_file.read(), so we're grabbing the contents from that file. Now let's pass that into the HTML class, so I'll say html is equal to HTML, with the html argument equal to this source, which is the contents of that HTML file. Now, if working with files is new to you and you want to know more about how we opened and read the contents of that HTML file, then I do have a video that goes into a lot further detail on working with files, so I'll leave a link to that video in the description section below if any of you would like to learn more about that.

Okay, so now we have an instance of this HTML object here. First of all, we can access the HTML of this object just by printing out the html attribute, so I'll just say print html.html. This is the html attribute of the instance that we just created. If I save that and run it (I'll make this a little larger here), we can see that we have all of the HTML from that simple website. Now, if we just wanted the text from the HTML without the tags, then we can use the text attribute, so I'll say print html.text. If I save that and run it, then we can see that now we just have the text of the website with the tags taken out. Now, the dynamically generated data comes out a bit weird at the bottom here. If we look at this, this is the text from our website that says this text is generated by JavaScript. Now we're going to take a look at
that further in just a bit, but let's not worry about it right now. Okay, so now let's find out how we can parse out the information that we want from this HTML. Let's say that we wanted to grab the title of our HTML page. If I look at the title tag of this HTML up here at the top, then we should get this, where it says Test - A Sample Website. In order to get the title, we can simply do something like this: I'm going to say match is equal to html.find, and with this find method we just want to find title. Okay, and now we can print out that match. If I save that and run it, then we can see that it prints out this list of elements.

Now, I'm not sure how many people are familiar with CSS selectors, but those are how you specify what you want to find, and I like that a lot because it's something that a lot of developers might already be familiar with. So what we passed in to find is the CSS selector that we would use to find that title. When we printed out that match, it gave us this list of elements, and this list only has one element, so that's probably our matched title. So instead of printing out the whole list, let's print out only that first element. I'm going to print out match and access index 0 to print out the first one, and I will run that, and now I just have this element.

With these elements we have access to a lot of the same attributes and methods that we had before. We could find additional elements within here if this were nested, we can view the HTML, or we can simply view the text. So if I wanted the HTML, then I could just print the html of that element. If I save that and run it, then we can see that we get the HTML of that title tag. And if we only wanted the text without the HTML tags, then we can use text instead. So instead of printing html I'll print text, save that and run it, and we get the title of our website. Now, instead of accessing that first element of that matched list,
we can instead just tell our find method that we only want the first match, and it will just return that first match instead of a list. So instead of accessing index 0 here, I'm just going to remove that, and in our find method I'm going to say first is equal to True. Now that match won't be a list of elements; it'll just be the first element that gets found with that search. If I save that and run it, then you can see that we get the exact same thing. Now, like I said, the find method uses CSS selectors, so if we wanted to get an element by a certain ID, then we can use the pound sign. For example, if we wanted to grab the div with the ID of footer, then I could simply say pound sign footer (#footer), and now it's going to return the element that has the ID of footer. If I save that and run it, then we can see that we get that footer information.

Okay, so now let's see how we get those article headlines and summaries from our HTML. Most of the time you aren't going to want to look through all of the HTML of the page, because with larger sites it can definitely just be a mess of code. So here's a nice little trick. If we go back to our browser here (and I'm going to make this just a little bit larger for a second), in order to dig down into the HTML and find exactly where our article headline and summary are, we can just right click on what we want. I'm using the Chrome browser here, but other browsers have this feature as well. So I'm going to right click on that headline and click on Inspect, and that will allow us to inspect this element. So this inspector popped up here on the right side. Let me see if I can make this text a little bit larger so that we can see. Now we can see that we have all of our HTML here, but the element that's highlighted is the one that we right clicked and inspected. Whenever I hover over something here, it actually highlights that in the browser. So if I hover over the h2, then it shows me all of the h2; if I hover over
the article, it shows me all of the article, and so on. If I go to the body, then it highlights everything. So this is a nice way to narrow down exactly what it is you're looking for. Like we saw before, our article headlines are within a div with a class of article, and then we have an h2 tag and then an anchor tag. So first let's grab the article div. I'm going to close this and go back to the scraper here. Now, instead of match I'm going to call this article, and in the find method I'm going to find a div with a class of article. To do that you can say div.article, and if you're familiar with CSS selectors, that's how we would select it in CSS as well. If I save that and run it, then we can see that it grabbed the first div with the class article and printed out its text, which gives us the headline and the summary within that div.

So now that we have that first article, we can search this element just like we searched the entire HTML object. Let's say I wanted to access the headline and the summary individually. To do that, I'm going to overwrite that part there and say headline is equal to article.find, and within the article I want to find the h2, and I'll also say first is equal to True. Now I'm going to copy this line and paste it in below to also grab the summary, so the summary is article.find, and we want to find the first paragraph tag. Now let's print those out individually. Let me give some space here, and I will print out that headline and also print out the summary. If I save that and run it, then you can see that we have those matched elements. If I just wanted the text, then I could either say headline.text here, or I could even just add it on to the end of my query. So I'll say .text after that find, and .text after that find, save
that and run it, and now we can see we got the article 1 headline and the summary for article 1. Okay, so now that we have this information from one article, we can most likely reuse this to parse the information from all of the articles. So let me change the first search here so that I'm finding all of the divs with the class of article, not just the first one. I will rename this, so I'll say articles is equal to html.find with div.article, and I'm going to take away that first equals True, so now that should return a list of articles. Now let's loop over those articles and reuse the same code that we had before to access the headline and the summary. Right underneath here I'm just going to say for article in articles, and then we will use this same code to find the headline and the summary. Let me spread that out a little bit, save this and run it, and now we can see that we have the headline and the summary for both of those articles instead of just the first one. Now, those are bunched together, but if I put a blank print here, then it'll spread those out a bit, and now we can see we have those spread out.

Okay, so now we've got the headline and the summary for every article in our simple HTML file, so that's good. We're starting to see how this would be useful for getting information from websites, so now let's do something similar but with an actual website. I have my personal website pulled up here in the browser that we saw before (let me make this a little larger), and like I said, I have a list of posts here, and all of these posts have a title, a summary, and then an embedded video. So let's say that we wanted to write a script that would go out and grab those titles, summaries, and links to those videos. First things first, we're going to need to import a different class from requests_html. When we're grabbing data from a URL, we need to use something like an HTMLSession. So let me make this a little larger here, and up at
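Pulling the file-based steps above together, a minimal sketch might look like the following. It assumes simple.html sits next to the script and that each article follows the div.article / h2 / p structure described above. The function names extract_articles and load_and_extract are hypothetical helpers for illustration, not part of Requests-HTML.

```python
def extract_articles(html):
    """Return (headline, summary) text pairs for every article div.

    `html` is assumed to be a requests-html HTML object, or any element
    exposing the same find(selector, first=...) API.
    """
    results = []
    # Without first=True, find() returns a list of every matching element.
    for article in html.find('div.article'):
        headline = article.find('h2', first=True).text
        summary = article.find('p', first=True).text
        results.append((headline, summary))
    return results


def load_and_extract(path='simple.html'):
    """Read a local HTML file and extract its articles."""
    # Imported inside the function so extract_articles stays usable
    # even where requests-html is not installed.
    from requests_html import HTML

    with open(path) as html_file:
        source = html_file.read()
    return extract_articles(HTML(html=source))
```

Calling load_and_extract() and printing each pair, with a blank print() between them, should give the spaced-out headline and summary output described above.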
the top, instead of just importing HTML, I'm also going to import HTMLSession. Okay, and now let's get the source code from my website. I'm just going to comment out this with open block that we had before, because I'm going to do one more thing with this file later, and I'm going to overwrite everything else that we had previously. Now, to get the source from my website, I'm going to say session is equal to HTMLSession(), and then I can say r is equal to (that'll be for response) session.get, and I want to get the home page of my website, so I will just copy that URL and paste it in there. That uses the Requests library to get a response from our website, and this r variable is going to be the response object. If you are familiar with the Requests library, then you can pretty much use this like you would use a response from that library. We could check the status code, get the content in bytes, get the content in Unicode, all kinds of different stuff. If you'd like to see more of what you can do with the Requests library itself, then you can watch my video where I go into more detail about these request and response objects. I'll leave a link to that video in the description section below if anyone is interested.

But what we're interested in for this video is the html attribute, so I'm going to print r.html. If I save that and run it, then we can see that the html attribute gives us access to the HTML object for that website. That HTML object is just like the HTML object that we interacted with a second ago; it's the same as when we set html equal to HTML with the source. So we can use that to find what we're looking for on my website. Just like before, let's start off by grabbing one video's information first, and then we'll loop through to get the information for all of the videos. To grab the first headline and snippet, let's inspect my website and see what the structure looks like. So I'm going to
go ahead and enlarge this here. Just like I did in the simple example, I'm going to right click on this headline and then go to Inspect, and now we can see that our article headline is in an h2 tag with a class of entry-title. Okay, so that's how we're going to find that. If I right click on the description here and inspect that, then we can see that this is a paragraph tag inside of a div with the class entry-content. Now, both the heading and the summary are inside of this article tag here, and this article tag is for one post. If I hover over that, then you can see it just highlights that first post but not the second one, and if I hover over the second one, then it highlights that post.

So first let's just grab that entire first article, which contains all of the information that we want. To grab that first article in the source code, let me go back to our script. I'm going to overwrite this print statement and say article is equal to r.html.find, and we are going to find that article tag. I just want the first one for now while we're messing with this, so I will say first is equal to True, and now I can print out article.html. I'll save that and run it. Okay, and this gives us all of the HTML of that first article.

Now, I've actually liked using this library more than BeautifulSoup, but there was one thing I liked about BeautifulSoup that I haven't been able to find in this library, and that's some sort of prettify or pretty-print method. I don't think that this library has anything that will print this HTML out nice and neat, but if it can do that and I've somehow missed it, then just let me know in the comment section below, because I think that would be some good functionality. So this isn't the neatest HTML here, but if we read through it, we can see that it's the HTML from that first article. Just to see this a bit better, I will go ahead and format this so that we can read it. I have an
online formatter pulled up here in my browser, and I'm just going to use this to pretty up our HTML. I'm going to paste that into the HTML input part and then click on Beautify, and you can see that over here it formats it nicely. Now I'm going to copy that prettied-up HTML, paste it in here in Sublime, and also set the syntax to HTML. Okay, so now we have our pretty-printed HTML.

Now that we have the HTML for this article, we can figure out how to grab what we want. We want to grab the headline, summary, and YouTube video link from this article, so let's start with the title. Like I said before, this is in this h2 tag with the class of entry-title, and then we have this link here. The link is a little long (actually, this would probably be more readable if I turn on word wrap), but we can see that this is the link here and the text of that link is the article headline. And actually this link has its own class, so that makes it a little easier on us: it says entry-title-link. So let's just grab the text of the element with that class. I'm going to go back to our script here, and now instead of printing out article.html, I'm going to use it to find our headline. I'm going to say headline is equal to article.find, and within that article we want to find a class of, and I can actually just go back to the HTML here so I don't screw it up, entry-title-link. I will copy that and paste it in, and we just want to grab the first result, so we'll say first is equal to True, and we just want to grab the text from that match. Now if I print out headline, save that and run it, then we can see that we got the headline from the first post.

Okay, so now that we've got the title of my latest post, let's get the summary text for that post. Let's go back to the HTML for our article here, and let me scroll down until I see what looks like the description, and this is it right here. So our summary text is within this paragraph tag
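The session-based steps so far can be sketched roughly as below. The URL is left as a parameter because the site address is not spelled out here, and the selector .entry-title-link is written with hyphens on the assumption that the class read aloud as "entry title link" is hyphenated in the markup. The names headline_of and first_headline are hypothetical.

```python
def headline_of(article):
    """Return the headline text of one <article> element."""
    # first=True returns a single element instead of a list.
    return article.find('.entry-title-link', first=True).text


def first_headline(url):
    """Fetch a page and return the headline of its first post."""
    # Imported here so headline_of can be reused without the dependency.
    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get(url)  # behaves like a Requests response object
    # r.html is the same kind of HTML object as HTML(html=source).
    article = r.html.find('article', first=True)
    return headline_of(article)
```

Keeping the per-article lookup in its own small function is only a design choice of this sketch; it makes the same lookup easy to reuse once the script loops over every article.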
and that paragraph is inside of a div with a class of entry-content, so I'm going to copy that. Back in our scraper, let's just copy this headline line and paste it in, and instead I'm going to call this summary. That summary is going to be, if I paste in that entry-content, a class of entry-content and then the paragraph tag within entry-content. Now if I save that and run it, then we have our headline here, and right below it (it's a bit bunched together) is the description of that first post. Again, the syntax that we're using inside of this find method is the same syntax that you would use in CSS, so if you're not familiar with how that works, that's where it comes from.

Okay, so lastly we need to get the link to the video for that post. This one is going to be a little more difficult, but I wanted to show you this because sometimes parsing information can be a bit ugly and require you to take several steps before getting your final desired result. If we look back at the HTML of the article, let's see if we can find where this video is. It's actually down here within this iframe; this is that embedded video. So let me grab the HTML of that iframe so that we can see just that. Back in the scraper, I'll call this vid_src, and it's equal to article.find (I'll just copy that from above). We want to find the iframe within that article, and I will say first is equal to True. Now I'm going to print vid_src.html, and I'm going to comment out these other print statements for now. I'm going to save that and run it. Okay, so this is the HTML for our embedded video. If we inspect this iframe and look at the src attribute, it has a link to the embedded version of the video, but it's not a direct link to the video itself. We can see it goes to youtube.com/embed/ followed by this video ID here, and
then it has a long string of parameters after that. If you know how YouTube videos work, they all have an ID for the video, and the ID for this video is right here: it's everything before this question mark. The question mark in a URL specifies where the parameters start, so it's not actually part of the ID. With that ID we can create a link to the video ourselves, so we need to parse that ID from that URL.

First we need to grab that URL from the iframe, which is in the src attribute, and this is pretty simple to do. Instead of using the html of this vid_src element, we can access the attributes by saying vid_src.attrs, for attributes. If I run that, then we get a dictionary of the attributes for that iframe element. To grab the source, we can just access that like any other Python dictionary, so I'm going to access the src key of that dictionary. Now I'll run that, and we can see that it gives us the entire YouTube URL for that embedded video. I'm going to move the attrs and src key up onto the previous line so that we have our URL in a single variable. What I mean by that is I'm going to cut out the .attrs and the src key there and put them up here on the previous line, so this vid_src variable is equal to this link. If I run this, we should get the same result.

Okay, so now that we have this URL, we're going to have to parse this URL string to grab the ID of that video, and we'll break this up into several lines. First, we can see that our video ID, which is right here, comes directly after a forward slash, so let's split our string on forward slashes. I'll just remove that print statement there, and I'll say vid_id is equal to vid_src.split, and we're going to split our URL on forward slashes. Now I can print this vid_id. If I save that and run it, then now we have a list of values from
our string that was split on that forward slash. If you've never used the split method on a string, basically, like I said, it just splits the string into a list of values based on the character that you specify. So now we can see that our URL is broken into several parts based on where the forward slashes were. If we look at the items of our list here, we're looking for our video ID, which is right here, so which index is that located at? This is index 0 here, so 0, 1, 2, 3, 4: that is at index 4 of this list. So let's specify that we only want index 4 of that list. Right after the split I'm just going to access index 4 there, save that and run it, and now we have just that video ID along with the other URL parameters, but we still don't have the ID by itself.

Like I said before, the question mark specifies where the parameters for the URL begin, and the video ID is before that, so if we do another split on the question mark, it should separate those out. I'll go to a new line so that we aren't making this one too complicated, and I will say vid_id is equal to vid_id.split, and we want to split this value on a question mark. If I run that now, then we can see that we have two items: the first is our video ID, and the second value in our list is all of the URL parameters. But we don't care about those parameters; we just want the video ID, so we want to grab the first element of the list, which is at index 0. I'll put an index of 0 there, save that and run it, and now we have our video ID.

Okay, so I know that was a lot of parsing, but sometimes a website's source code just doesn't have the information that you want in the most accessible way, so I wanted to show you how you might go about getting the data that you want. Okay, so now that we have that YouTube ID, we can create our own YouTube link using that video ID. The way that YouTube links are formatted is like this. Let me make this a little
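The two splits walked through above can be sketched in plain Python. The embed URL below is a made-up example with a placeholder ID, used only to show the shape of the string; the function name video_id_from_embed is hypothetical.

```python
def video_id_from_embed(vid_src):
    """Extract the video ID from a YouTube embed URL."""
    # 'https://www.youtube.com/embed/<id>?<params>'.split('/') gives
    # ['https:', '', 'www.youtube.com', 'embed', '<id>?<params>'],
    # so the ID plus its parameters sit at index 4.
    vid_id = vid_src.split('/')[4]
    # Everything after '?' is URL parameters; keep only the part before it.
    vid_id = vid_id.split('?')[0]
    return vid_id


# Hypothetical embed URL, used only to exercise the function.
example = 'https://www.youtube.com/embed/abc123XYZ?version=3&rel=1'
print(video_id_from_embed(example))  # abc123XYZ
```

Index 4 only holds for URLs shaped like this one. A URL with a different number of path segments would need a different index, which is one reason this kind of string parsing tends to be fragile.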
smaller here so that we have some more room. I'm going to remove that print statement there, and now I'm going to set a YouTube link variable, and I'm just going to make this an f-string. The way that YouTube links to videos is like this: http://youtube.com/watch, and now for our YouTube parameters we want a question mark, v equals, and v is going to be equal to the video ID. Since we're using an f-string, we can just put that in there. Now, I'm using f-strings to format the string, but those are only available in Python 3.6 and above. If you're using an older version of Python, then you can use the format method, and I have separate videos on both f-strings and string formatting if anyone needs to see how to do that, so I'll leave links to those videos in the description section below. If I print out that YouTube link that we just created, we can see that we have a link here that should go to our video. If I copy this and paste it into my browser, then it should open up that video. Okay, let me pause that. So we can see that it opened up that video in the browser using the link that we created, so that's perfect.

So now we've scraped all of the information that we want from that first article. Just like in our earlier example, now that we've got that information for one article, we can loop over all of the articles and get that information for all of them. To do this, I will come up here to the top where we found that first article, and instead I'm going to call this articles and take out this first equals True, because we don't just want the first article now, we want all of them. Now we can say for article in articles, and we can reuse all of the same logic that we used before, but we'll just put it inside of that for loop. So now, for each article that we found, we are parsing out the headline (and let me print that out), we are parsing out the summary, and I will
uncomment out that print statement and then we're doing all of this parsing here to also grab a youtube link so now and also let me put a blank print statement here at the bottom so that we have some separation between these articles okay so now if I run this then we should be able to scroll up here okay so now we have all of that information this is the first article here here's the headline here is the description and here is the youtube link and it should do this for every one of the articles on the site or at least on the home page okay so that's good but sometimes you're going to run into situations where you might be missing some data and if that happens then it could break your script to scrape the website now maybe you're pulling down a list of items and one is missing an image or you know something that you thought would be there so to show what this looks like I'm going to go to one of my older posts that doesn't have a YouTube video associated with it and see what happens so blog posts from a long time ago don't have videos so if I go back here let's say I'm going to close down the inspection now I think on let's say I think it's like page 14 or so of my website let me go to page 14 here so if I scroll down here a little bit okay so I have an old post here from 2014 where I made a doll bed for my niece and just wrote you know a quick article about how I threw that together but the post doesn't have a YouTube video associated with it and there are a few more of my old posts like this as well so let's see what happens if I try to scrape page 14 here of my website so I'm going to copy this URL and go back to our script and the website that we're getting I'm going to paste in and that now we're gonna try to scrape page 14 so I'm going to save that and run it and if I do that if I scroll up here we can see that it got the headline and summary and link for the first post on that page but for the doll bed post we have a trace back here and it's saying that it 
can't actually access the attributes of the vid source and that's to be expected because there is no video in that post so when we try to access the attributes of that video then it's not going to have a value and it's going to throw an error so to fix this we can just put in a try except block that will check for errors and if it does you know run into an error parsing a video then it will just skip that part of the post so to do this I'm going to make my output a little smaller there so to do this I'm just going to create a try except block where we are trying to parse out the video and in the try section of this I'm going to take all of our code that parses the video and creates the link and I'm going to paste that into the try section of the try block and for the exception if it runs into an exception then I'm just going to say that the youtube link is going to be equal to None now again if you're unfamiliar with try/except blocks and would like to see a more detailed video about that concept then I do have a video specifically on that so I'll leave a link to that video in the description section below if anyone would like to learn more about that ok so now if I rerun this same code from before if I look at the output now then whenever it gets to the post without a video then we can see that it just prints None instead and then it moves on to the other posts now that didn't print a summary either and the reason for that is because the first paragraph of that post is an image instead of the summary so you know we're not going to mess with that so instead let me go back to the home page and rerun that code okay and the code is still working for our homepage as well okay so now that we've scraped the information that we wanted you can save this in any way that you'd like right now we're just printing this out to our terminal here and maybe that's fine for you but you can also save this to a file or a CSV or whatever you'd like so for
example real quick let's say that we wanted to scrape this page and save it to a CSV file so we've already done the hard part of getting the information that we want so to save it to a CSV file we could simply come up here to the top and I'll say import csv and then here towards the top of our file before our for loop we can just open up a CSV file now I'm not gonna go into as much detail here but we could use a context manager here but the way that this is currently set up I think it's just a little bit quicker and easier to just open up the file directly so I'm gonna say csv underscore file is equal to open and I'm going to open a new file I'm just going to call this cms underscore scrape dot csv and we want to open this in write mode and now we can write some header lines to set up our CSV file so I'm gonna say csv underscore writer is equal to csv dot writer and we want to pass in that CSV file and now we can write our header lines into our CSV file so basically what we're going to be putting into the CSV so I'm going to say csv underscore writer dot writerow and we want to pass in a list of what we want to write so first we're going to write the headline second we're going to write the summary and third we're gonna write the video that's all the information that we parsed from those posts now again I'm not going into as much detail here but if you've never worked with CSV files before and don't know exactly what's going on here then I do have a separate video that goes into deeper detail into CSV files so I'll also be sure to put a link to that video in the description section below as well but for now within our for loop where we're getting our scraped information let's just write that information to our CSV file so at the very bottom of our loop I'm going to say csv underscore writer dot writerow and we want to write the headline and the summary and the YouTube
underscore link okay and lastly here on the outside let me close the output there so we can see a little bit better lastly since we didn't use a context manager for the CSV file we need to close our file at the end of the script so I'm going to say csv underscore file dot close ok so now whenever we run our script it should create a CSV file so let me go back to my Finder window here and this is the CSV file that I created at the very beginning of the video to show you what our final product would look like let me delete that and now let me run our script that we just wrote so that is running it printed out all this information in the terminal but also if I open finder back up then we can see that that CSV file was created again so this is the old one here let me close that down and this is the new one that we just created so again let me format this a little better here I will set the columns to a size of 300 and turn on word wrap so now we can see that we have all of that data that we scraped available here in a spreadsheet so now it's a bit easier to read all of that information that we pulled down so that's kind of cool that we wrote a script to go out to that website parse out all of that information and save it to this CSV file so this would be extremely useful for data collection or putting together reports or whatever it is that you need ok so everything that we've done in this video so far is stuff that we also did in the beautiful soup video but let me show you a couple of extra things that we can do with requests-html that we didn't do with beautiful soup so first of all it's a common thing to just want to get all of the links on a site so perhaps you're writing a crawler and want to visit each page on a site or something like that well that's so common that there's actually a links attribute in the HTML object that has a set of all the links on a page so let's go back to our script here and see what this would look like now I'm going
to close out our output there I'm just going to comment everything out here and also I'm going to comment out our CSV stuff I'm just going to work with this website data for now so let me paste that in okay so like I was saying it's so common to want to just get all of the links on a website that requests-html makes this very easy for us so to do this I can simply print r dot html dot links so if I save that and run it then this gives us a set of all the links on a website now this is a bit difficult to read so you can actually loop over that as well so I could say for link in r dot html dot links and for every link we can just print out that link so if I save that and run it then we can see that now this prints out those links in a format that's a bit easier to read now I don't think that I have any relative links on my page but if you do have relative links and want to instead get absolute URLs for your links then instead of using links here you could use absolute links so absolute underscore links if I save that and run it this is pretty much going to look the same but if you have relative links then it prints out the absolute path for those links instead of the relative path now another really cool feature with requests-html is the ability to grab text that's dynamically generated by JavaScript now I don't think something like beautifulsoup has a way to do this out of the box so let's take a look at an example so if you remember that first HTML file that I had saved on my machine had some text that was dynamically created with JavaScript so let's go back and take a look at that again so I'm gonna open up the HTML and I will also open that up in the browser as well so let me resize this so we can see here at the bottom of the page it says this is text generated by JavaScript so if I look in my browser and inspect this text so let's see what this looks like then it just looks like text and a paragraph tag within the footer whenever I actually inspect that in the
browser but if we actually look at this in the HTML let me make this a bit larger here so that we can see if we actually look at this in the HTML then we can see here's our entire footer here we don't actually have that text in our footer but I have some JavaScript here at the bottom of the page that adds that text to the footer but the page actually has to render and run that JavaScript before the text gets added to the footer so usually libraries have trouble getting dynamic data like this but we can do this with request HTML so first let's look at the result before we render the page so I'm going to make this large again and go back to our scraper here and I'm going to uncomment out the part where we were working with that simple HTML document and I'm just going to comment out everything beneath it so let me put in some blank lines here so we have some space so like I was saying let's look at the result that we get before the page is rendered so I'm gonna say match is equal to HTML dot find and I'm going to find the footer and I will say first is equal to true and now I'm going to print that match dot HTML so let me save that and run it and if we run that then we can see that the dynamic text isn't included in our response here but if we want to render the page in order to get the dynamic data then as simple as coming up here underneath the HTML here and just simply saying HTML dot render so now the first time that you run this it might need to download a few things from chromium in order to use this functionality so don't worry if you see it downloading some stuff here but I'm going to run this and I've already run this command once on my machine so it doesn't need to go out and download anything but you might see yours downloading something at this point but once it's downloaded and you rerun it then we can see that the text generated by JavaScript is now included here in our footer so I think it's really cool that HTML or that request HTML allows us to do 
this out of the box okay now the last thing that I want to show you with this library is its ability to do asynchronous requests now if you don't know the difference between synchronous and asynchronous basically when you make a synchronous request you have to wait until you get a response back before you continue on with your script but when we make an asynchronous request we can move on and do other stuff in our script while we wait on the response so I have another file pulled up here and I'm going to show you both a synchronous version and an asynchronous version of getting responses from a certain website so let's look at the synchronous version first and we can time this so here I have a script that goes and does some synchronous requests to some websites so the website that I'm using for this test is called httpbin.org and it was written by the same person who wrote the requests and requests-html libraries so it's a cool website that allows you to test different requests and responses and for this example I'm going to use a route that simulates a delayed response depending on the number that we pass into the URL so in this synchronous version we're just going to these URLs and printing out the results so we're going to a route that is going to be delayed by one second here another route that's going to be delayed by two seconds and a third route that is going to be delayed by three seconds and I'm starting a timer before the first website and then getting the time after all of the websites have printed their responses to see how long that took now since this is running synchronously it's going to go make the first request and wait for a response then make the second request and wait for a response and then make the third request and wait for the response so we can imagine that this should probably take a little over six seconds since one plus two plus three is equal to six and those are all of our delays so if I run this then we can see that we got the first website back
now the second and now the third and it tells us that it took about six point six seconds so that's about what we would expect okay so now let's take a look at the asynchronous version of this same test so I have this different file open here called async snippets so we can see that there is a bit more code here but let's look through this here so first we're importing AsyncHTMLSession and then towards the middle of our script here we're creating an async function for each website that we want to visit and I'm not going to go into async code in this video that topic probably requires several videos just on its own but within these async functions we're saying await when getting these URLs and then returning the result so if I scroll down here a little bit then we have this line here results is equal to our async session dot run and we are running all three of these async functions here get delay 1 get delay 2 and get delay 3 so that is going to go and get the responses from all of those sites but it's going to do this asynchronously which means that as it's waiting on one response it's going to go ahead and move on and try to get a response from the next one and once it has responses from all of those we're looping over the responses right here and printing out those results so in this case we're just printing out the URLs and then we are ending the timer here and printing out how long it took in total so in this case we can imagine that the execution time is going to be around 3 seconds because when it makes the request with a one-second delay it's going to move on to the next without waiting for a response like it did synchronously so basically it will make requests for all of them around the same time and then manage results as they come in and since our longest request should take only about three seconds then that's how long it should take us to get some results back so if we run this then let me wait for the results here okay so we can see that it took
just a little over three seconds so that's about half of the time that it took to run synchronously so if you have a lot of websites to parse then doing it asynchronously could save you a ton of time so if you can imagine you had to crawl you know ten different APIs that took three seconds each to compute a response then if you did that synchronously then that could take over 30 seconds but if you did it asynchronously then it would take around three seconds so it's definitely something to think about if you're doing something like that okay so I think that is going to do it for this video hopefully now you have a pretty good idea for how you can go out and scrape information from websites now one thing I do want to mention is that if you want data from a large website like Twitter or Facebook or YouTube or something like that then it may be beneficial for you to see whether or not they have a public API so public APIs allow those sites to serve up data to you in a more efficient way and sometimes they don't appreciate it if you try to scrape their data manually they'd rather you use an API instead but it's usually those larger websites that have public APIs like that so if you want data from a smaller website then you'll usually have to do something like we did here now I also want to point out that you should be considerate when scraping websites computer programs allow us to send a lot of requests very quickly so be aware that you might be bogging down someone's server if you aren't careful so try to keep that in mind so you know after this tutorial try not to go out and hammer my website with a ton of different requests through your program and that goes for other websites as well and some websites will even monitor if they're getting hit quickly and can block your program or IP address if you're hitting them too fast some websites will actually try to block bots completely but that's another good thing about requests-html is that it spoofs a user agent
for us to make it seem like a real web browser so it's hard for people to tell that you're using a program in this case okay but other than that if anyone has any questions about what we've covered in this video then feel free to ask in the comment section below and I'll do my best to answer those and if you enjoy these tutorials and would like to support them then there are several ways you can do that the easiest way is to simply like the video and give it a thumbs up and also it's a huge help to share these videos with anyone who you think would find them useful and if you have the means you can contribute through patreon and there's a link to that page in the description section below be sure to subscribe for future videos and thank you all for watching
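As a recap of the parsing and CSV steps walked through above, here is a minimal sketch. It rests on assumptions that are not in the video itself: the embed URL is taken to look like https://www.youtube.com/embed/VIDEO_ID?params, the helper names extract_video_id, build_youtube_link and write_posts are made up for illustration, and the sample posts are placeholders rather than real scraped data. It also uses https for the link and catches only the two exceptions this sketch can raise.

```python
import csv
import io


def extract_video_id(embed_src):
    """Pull the video ID out of an embed URL (assumed shape:
    https://www.youtube.com/embed/VIDEO_ID?version=3&rel=1)."""
    # Splitting on "/" gives
    # ['https:', '', 'www.youtube.com', 'embed', 'VIDEO_ID?...'],
    # so the ID (still followed by its parameters) sits at index 4.
    vid_id = embed_src.split("/")[4]
    # Everything after "?" is URL parameters; keep only the part before it.
    return vid_id.split("?")[0]


def build_youtube_link(embed_src):
    """Return a watch link, or None when there is no usable embed URL."""
    try:
        return f"https://youtube.com/watch?v={extract_video_id(embed_src)}"
    except (AttributeError, IndexError):
        # AttributeError: embed_src is None (a post without a video).
        # IndexError: the URL does not have the assumed shape.
        return None


def write_posts(posts, csv_file):
    """Write a header row, then one row per (headline, summary, embed_src)."""
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["headline", "summary", "video"])
    for headline, summary, embed_src in posts:
        csv_writer.writerow([headline, summary, build_youtube_link(embed_src)])


# Placeholder data standing in for what the scraper would collect.
sample_posts = [
    ("Sample post with a video", "A short summary.",
     "https://www.youtube.com/embed/abc123XYZ_-?version=3&rel=1"),
    ("Sample post without a video", "Another summary.", None),
]

# An in-memory buffer keeps the sketch from touching the disk.
buffer = io.StringIO()
write_posts(sample_posts, buffer)
print(buffer.getvalue())
```

To write a real file instead, something like `with open("cms_scrape.csv", "w", newline="") as f: write_posts(sample_posts, f)` would work; the with statement is the context-manager form mentioned in the video, so no explicit close is needed.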
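The timing argument from the asynchronous section (total time is roughly the longest delay rather than the sum of the delays) can be tried without any network access. This sketch deliberately swaps in the standard library's asyncio and asyncio.sleep as a stand-in for the delayed HTTP responses, so it is not the requests-html code from the video; the function names fake_request and fetch_all are hypothetical and the delays are scaled down to tenths of a second.

```python
import asyncio
import time


async def fake_request(delay):
    # asyncio.sleep stands in for waiting on a delayed HTTP response.
    await asyncio.sleep(delay)
    return delay


async def fetch_all(delays):
    # gather starts every coroutine and waits until all of them finish,
    # returning the results in the same order as the inputs.
    return await asyncio.gather(*(fake_request(d) for d in delays))


delays = [0.1, 0.2, 0.3]

start = time.perf_counter()
results = asyncio.run(fetch_all(delays))
elapsed = time.perf_counter() - start

# Run one after another these waits would add up to about 0.6 seconds;
# run concurrently the total should be close to the longest wait.
print(results, f"{elapsed:.2f} seconds")
```

The same reasoning is why the three httpbin.org requests in the video finish in a little over three seconds instead of a little over six.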
Info
Channel: Corey Schafer
Views: 115,648
Rating: 4.9549837 out of 5
Keywords: python, requests-html tutorial, requests-html, requests, python requests, python requests-html, web scraping, scraping, python html, python html parsing, web crawler, string parsing, parse html, python csv, corey schafer, python3, programming tutorial, requests library, python web scraping, python tutorial, python web scraping tutorial
Id: a6fIbtFB46g
Length: 56min 26sec (3386 seconds)
Published: Mon Mar 11 2019