Comprehensive Python Beautiful Soup Web Scraping Tutorial! (find/find_all, css select, scrape table)

Captions
Hey, how's it going everyone, and welcome back to another video. We've got a fun one in store today: we're going to go through all sorts of things related to web scraping in Python, specifically using the Beautiful Soup library. A couple of logistical things before we begin. First, if you haven't already, and you've enjoyed any of my videos, it would mean a lot to me if you subscribed. Second, I want to thank everyone who responded to my post asking what should be included in this video. I wish I had thought of this before I made that post, but thank you to the person who responded "everything, we'll just do everything in this video." Having thought about it a bit more with all your feedback, we're probably going to make a couple of videos on the topic of web scraping. This first one will really be about the basics: HTML and CSS, and what web scraping actually is. Then we'll dive into the Beautiful Soup library and learn the building blocks we need from it, and the final part of the video will be a section of exercises where you can test out your skills; I think that should be a pretty fun section. There's a timeline in the description, or attached to the video, so let's jump into it.

Before we get into web scraping, it's important to know how web pages on the internet actually work. Any site we go to, whether that's YouTube, Amazon, or Wikipedia, is composed of some combination of HTML and CSS. HTML is the markup language that defines the structure and content of a web page. Here we have my YouTube page, and the first thing to understand is that we can actually see the HTML source code. This is possible in pretty much any browser by right-clicking and choosing "View page source." The YouTube page we're looking at right now is built from all the code you see here, and don't worry if it looks a little intimidating to start; we'll begin with much simpler examples of HTML.
Another important thing to know: if we're looking for the specific HTML that represents a specific element on the page, say my subscriber count, we can use not "View page source" but "Inspect." This is a separate view of the HTML, and as I hover over items in it, we can see them highlighted on the page: my name, the subscriber count. I can even edit this within the browser. Let's say I wanted 100 million subscribers; I'm coming for PewDiePie now. As you can see, I've edited the code, and the page says one hundred million subscribers. Note that I didn't actually change the code on YouTube's servers; if we refresh the page, it unfortunately goes back to the real number. So those are two good tools to know about as we get into web scraping.

The core of web scraping is using Python (or another language) to programmatically look through HTML source code and pull out only the things we want: to scrape web pages for the elements and information we want to collect. One example would be scraping my YouTube home page and grabbing all the titles of my videos. That's exactly the kind of task web scraping is for: I don't want to manually go through and write down all those video names, I want Python to do it for me. All right, let's start moving toward the code.

I want everyone to navigate to the page keithgalli.github.io/web-scraping/example.html. Once you load that up, you'll see a very, very simple example of an HTML web page; there are really only about six elements on it. Once again, we could right-click and choose "View page source,"
but what will probably be easier is to use Inspect instead, since this page is small enough that we can see pretty much all of the code. We have a head section; the title inside it, "HTML Example," is what appears in the top left. Then we have the body, which is everything actually on the page, and I'll unfold it all so we can see it in its entirety. We start with an h1 tag, which is a large header; then a paragraph; then an a tag, which is a link pointing to another web page ("a more interesting example"). Below that we have a smaller header, denoted by h2, and some more paragraph text, this time using italics, which is an i tag. Then another header of the same size and some more text. One thing to note, because it's something we'll search for as we start scraping, is that this last paragraph has an additional property: id="paragraph-id". But this is a basic page, so let's load it into Python. Open up your preferred editor; I'm going to be using a Jupyter notebook through Google Colab.

The first thing we want to do is load the necessary libraries. Google Colab already provides these for me, but you might need to pip install them. The first one to import is the requests library, which lets us load the web pages we were just looking at; if you don't have it, you might need to run pip install requests, but I already have it, so I can just import requests. The second thing to import is the Beautiful Soup library itself. This is a slightly more complex line, but I like to import it like this: from bs4 import BeautifulSoup as bs. If you don't have this one, you'll probably need pip install beautifulsoup4, or possibly pip3 install beautifulsoup4.
We can run that, and then let's load our first page, the one we were just looking at. First we load the webpage content with the requests library: r = requests.get("https://keithgalli.github.io/web-scraping/example.html"). Once we have that, we want to convert it to a BeautifulSoup object: soup = bs(r.content), where r.content is the actual HTML that comes back from the request. Finally, let's print out our HTML with print(soup) and see what happens. Cool, this is our page. One thing to note: instead of just printing everything raw, we can add a little snippet, soup.prettify(), which formats the output in a more readable way, so you can see from the indentation exactly what level each element is at and which elements are nested inside which. That's our page, and everything we were looking at before is all here.

All right, let's now start scraping with the Beautiful Soup library. I think it will be useful to look at some of the Beautiful Soup documentation as we go; as I walk through its features, it will get easier and easier to look through that documentation yourself. There's a lot of useful stuff in there, and it's actually not that large a library: there's some initial navigation at the start of the page, then the installation instructions. I'll link this documentation, which also covers how to import the library, in the description.
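The loading steps just described can be sketched as a runnable snippet. One caveat: the fallback HTML below is my own approximation of the example page's structure (pieced together from what's described in this video), used only so the code still runs without network access; it is not the page's exact markup.

```python
from bs4 import BeautifulSoup as bs

# The page used in the video; if it can't be fetched (e.g. no network),
# fall back to an inline snippet approximating its structure.
URL = "https://keithgalli.github.io/web-scraping/example.html"
FALLBACK_HTML = """
<html><head><title>HTML Example</title></head>
<body>
  <div align="middle">
    <h1>HTML Webpage</h1>
    <p>Link to more interesting example:
       <a href="https://keithgalli.github.io/web-scraping/webpage.html">check it out</a></p>
  </div>
  <h2>A Header</h2>
  <p><i>Some italicized text</i></p>
  <h2>Another header</h2>
  <p id="paragraph-id"><b>Some bold text</b></p>
</body></html>
"""

try:
    import requests
    r = requests.get(URL, timeout=10)   # load the webpage content
    r.raise_for_status()
    content = r.content                 # raw HTML bytes from the response
except Exception:
    content = FALLBACK_HTML

soup = bs(content, "html.parser")       # convert to a BeautifulSoup object
print(soup.prettify())                  # indented, readable HTML
```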
The first thing we're going to look at is find and find_all; you'll see them in the docs under navigating using tag names. I think these are what I use most frequently in Beautiful Soup, so they're a useful starting point. Remember, soup is all of this HTML here. Let's say we wanted to grab just these h2 elements. Very easily, we can write first_header = soup.find("h2"): we pass the tag we're looking for into the find command, in this case the h2 tag. If I run this and print first_header, we get a header. Note that find returns only the first element matching the description we passed in. The other useful command, which I honestly use far more often than plain find, is find_all: headers = soup.find_all("h2"). The syntax is exactly the same, but instead of stopping at the first element, it builds a list of all the h2s; even if there's just one match, it still returns a list of that one element. Now let's print out headers. Because I'm using a Jupyter notebook, I can either do print(headers) or just type headers as the last line, and we see a list of both headers. That's a very simple example of grabbing something from our page, but we'll get more complicated as we go.

The next thing is that we can pass in a list of elements to look for. Let's say that in addition to the h2 tags, we also wanted the h1 tags, any kind of header element. Same as last time, first_header = soup.find(...), but now we pass in a list instead of a single object:
["h1", "h2"]. Printing first_header, we see "HTML Webpage," which matches the first header because the h2s come after the h1. Note that the order in the list doesn't matter; whatever you put in, it finds the first occurrence of any of those items. I swapped the order and ran it again, and we get the same result. Likewise, if we call find_all with that list, we get both the h1 tag and the h2 tags: headers now contains three tags, because we're including h1s and h2s. That's another useful thing to know, and we're going to keep building up our intuition for find and find_all; I'd say this is the most important function in the Beautiful Soup library, and we can get more and more sophisticated with how we use it.

The next thing to look at is that we can pass attributes to find and find_all. For example, if I run paragraphs = soup.find_all("p") and print it, we see three different elements. But say I just wanted the paragraph with the id "paragraph-id". We can pass a second argument, attrs, as you'd find in the documentation if you looked up find_all. attrs takes a dictionary mapping the property we're looking for, in this case id, to the value we want, "paragraph-id". Running that, we now get a list containing just that single paragraph. Note that if this weren't a valid id, we'd get nothing back.

Let's keep building up these building blocks. Another thing that's really useful when you're trying to get specific elements on a page is that you can nest find and find_all calls.
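Putting the find and find_all pieces together — a single tag, a list of tags, and the attrs dictionary — here's a minimal self-contained sketch; the inline snippet mirrors the structure of the example page as described above rather than its exact markup:

```python
from bs4 import BeautifulSoup as bs

html = """
<body>
  <h1>HTML Webpage</h1>
  <h2>A Header</h2>
  <h2>Another header</h2>
  <p><i>Some italicized text</i></p>
  <p id="paragraph-id"><b>Some bold text</b></p>
</body>
"""
soup = bs(html, "html.parser")

first_header = soup.find("h2")             # first matching tag only
headers = soup.find_all("h2")              # list of ALL matching tags
any_header = soup.find(["h1", "h2"])       # first h1 OR h2 (list order irrelevant)
all_headers = soup.find_all(["h1", "h2"])  # every h1 and h2

# attrs: filter by a property, here the paragraph with a specific id
paragraphs = soup.find_all("p", attrs={"id": "paragraph-id"})

print(first_header, len(headers), any_header.name, len(all_headers), paragraphs)
```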
What I mean is that we can say something like body = soup.find("body"). Looking at our HTML up here: we have the head, but suppose we only wanted the stuff in the body. We start with body = soup.find("body") and print it out. Now let's say we wanted just this div; a div is basically a container in HTML. So div = body.find("div"): within just the body, we're looking for a div. This is very helpful for narrowing down where you're scraping from when you have a really, really big page, like that YouTube page. Printing the div, we now have just the contents of that container. Finally, say we wanted just the header from that: header = div.find("h1"), print it, and there we go.

One additional thing that's useful before we move on to the next function: we can search for specific strings in our find and find_all calls. (I'll quickly re-print our soup first; oh no, what happened there... okay.) Let's say we wanted to find any paragraph with the text "Some"; we really have "Some italicized text" and "Some bold text." We can do this with soup.find_all on the paragraph tag, passing one more argument. The documentation Colab links to is a little outdated here: the argument used to be called text, but in Beautiful Soup 4 it's now string. So paragraphs = soup.find_all("p", string="Some"), print paragraphs... and it's blank. Why? Think about what that paragraph text actually is: it isn't just "Some," it's either "Some bold text" or "Some italicized text." If I search for the full string "Some bold text," now we find it. That's not ideal in my opinion; you usually don't want to have to match an exact string, you'd rather match a specific word or two.
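The nested-calls pattern and the exact-string pitfall can both be seen in a few lines (again using an inline snippet shaped like the example page):

```python
from bs4 import BeautifulSoup as bs

html = """
<body>
  <div>
    <h1>HTML Webpage</h1>
    <p>Link to more interesting example</p>
  </div>
  <p><b>Some bold text</b></p>
</body>
"""
soup = bs(html, "html.parser")

# find calls can be nested to narrow down a big page step by step
body = soup.find("body")
div = body.find("div")
header = div.find("h1")

# string= matches the tag's text EXACTLY, so a partial word finds nothing
no_match = soup.find_all("p", string="Some")         # [] -- partial text
exact = soup.find_all("b", string="Some bold text")  # full string matches

print(header, no_match, exact)
```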
This becomes really useful when we combine it with the regex library. If I import re, which is the regular expressions module, and pass string=re.compile("Some"), it will now (if I type it right... what did I do wrong? Oh, I had an extra character in by accident) look for "Some" anywhere in the string. Another place regex is particularly useful: find all headers that contain the word "header," noting that our headers use different capitalization. If I write headers = soup.find_all("h2", string=re.compile("header")), those are both h2 elements, but we only get one result back, because that only matches the lowercase one. Since this is a regex, though, I can write re.compile("(H|h)eader") to match a capital or lowercase H, and there we go, we get both. That's useful too, and I think that's all we need to know about find and find_all.

The next piece of functionality we'll get into is pretty similar to find and find_all: the select method in Beautiful Soup. This is really about selecting elements the way you would select them in CSS. I haven't talked much about CSS yet, so as a quick introduction: if we go to the more advanced example page I've linked, the one we'll do a lot of the exercises on (it's a little "about me" page), and view the page source, the stuff at the top is the CSS. It's basically how we style specific elements of the HTML. We'll see more of that in a bit, but the point is that Beautiful Soup mimics that same ability to select elements the way CSS does. I think the best place to start seeing what you can do is the CSS Selectors Reference page.
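Here's the regex version of those two searches as a self-contained sketch (the header text follows the capitalization described above, which I'm taking from the video's narration):

```python
import re
from bs4 import BeautifulSoup as bs

html = """
<body>
  <h2>A Header</h2>
  <h2>Another header</h2>
  <p><b>Some bold text</b></p>
</body>
"""
soup = bs(html, "html.parser")

# re.compile lets string= match a pattern anywhere in the tag's text
bolds = soup.find_all("b", string=re.compile("Some"))

# the two headers differ in capitalization, so match either case
headers = soup.find_all("h2", string=re.compile("(H|h)eader"))

print(bolds, headers)
```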
I'll link that reference page in the description, but basically it shows the different ways we can select elements in HTML: .intro selects all elements with class "intro"; a pound sign, like #firstname, selects the element with id "firstname"; we can also just pass an element name; we can nest things, like all paragraphs within a div ("div p"); and "div + p" selects all p elements placed immediately after a div. There's a lot of useful stuff here; you can also match on specific attributes, for example if there were a certain URL you were looking for. You'll see how useful this page is as you watch me use select in action.

Let's start by just selecting all the paragraph tags on our page: soup.select("p"). Printing the result, you can see this is very similar to find_all("p"). One thing that's really useful with this method is using those paths. Looking back at our HTML (maybe it will help to print some of it again, so I'll add another code cell with print(soup.body), which is a nice shorthand for getting just the body, and prettify it), let's say we wanted to grab only the paragraphs that are inside divs. I can do soup.select("div p"), and now we have only this one right here. What else can we do? Say we wanted all the paragraphs preceded by an h2: paragraphs = soup.select("h2 ~ p"), with this squiggly tilde, which gets the paragraphs that come after an h2.
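Those selectors can be tried out on an inline snippet shaped like the example page:

```python
from bs4 import BeautifulSoup as bs

html = """
<body>
  <div>
    <p>Paragraph inside a div</p>
  </div>
  <h2>A Header</h2>
  <p><i>Some italicized text</i></p>
  <h2>Another header</h2>
  <p><b>Some bold text</b></p>
</body>
"""
soup = bs(html, "html.parser")

all_paragraphs = soup.select("p")      # like find_all("p")
div_paragraphs = soup.select("div p")  # only paragraphs nested inside a div
after_h2 = soup.select("h2 ~ p")       # paragraphs that are later siblings of an h2

print(len(all_paragraphs), div_paragraphs, after_h2)
```

Note the tilde (`~`) matches any later sibling; the related `+` selector matches only the element immediately following.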
When we say "after," that means on the same level; you can see the nested paragraph is excluded. Let's check that it gets what we hope: "Some italicized text" and then "Some bold text." It did exactly what we wanted; that's awesome. Let's do some more. It's also useful to grab specific elements by id. Say we want the bold element inside the paragraph with id "paragraph-id": bold_text = soup.select("p#paragraph-id b"), where b is the bold-text element. Printing that, we get the bold text. So you have a lot of options with this. I'd say select is very helpful when you're trying to navigate a specific path, and as you get more and more practice with Beautiful Soup, you'll get a feeling for when to use select and when to use find and find_all; you can always go back to that reference page to see what's available. One slight bummer is that some of the selectors toward the bottom of that page aren't actually supported in Beautiful Soup, but I think all the common ones at the top are.

One final thing worth mentioning is that you can run nested select calls. If I write paragraphs = soup.select("body > p"), the > means I want paragraphs that are direct children of the body. That rules out the paragraph inside the div, since it's not a direct descendant, and gives me the other two. Then I can iterate: for paragraph in paragraphs, I can make nested select calls on each one, for example paragraph.select("i") to look for an i tag inside it.
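The id selector, the direct-child selector, nested select calls, and the attribute selector mentioned in passing can all be sketched together (the align="middle" attribute is taken from the narration):

```python
from bs4 import BeautifulSoup as bs

html = """
<body>
  <div align="middle">
    <p>Inside the div</p>
  </div>
  <p><i>Some italicized text</i></p>
  <p id="paragraph-id"><b>Some bold text</b></p>
</body>
"""
soup = bs(html, "html.parser")

bold_text = soup.select("p#paragraph-id b")  # <b> inside the <p> with that id
direct = soup.select("body > p")             # direct children only: skips the div's paragraph
middles = soup.select("[align=middle]")      # match by attribute value

# select can be nested: run a further select on each result
italics = [p.select("i") for p in direct]

print(bold_text, direct, middles, italics)
```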
I'll print the results: as we can see, we get the paragraphs, and as we iterate through the two items in the list, the first one does have an i element, so we can select it and print something out; the second has no italics, so we just get an empty list. I'm going to quickly paste in one more thing to show that I could grab the div with align="middle" using an attribute selector. But let's move on to getting different properties out of the HTML.

As a first example, one thing we often want is the string within an element: I don't want the whole header tag, just its text. If I do header = soup.find("h2") and print header, it gives me the full tag, but header.string gives me just that text. Nice thing to know. However, watch what happens with the div: div = soup.find("div"). If you print the div (I think print(div.prettify()) makes this a little clearer), we have all of this; now if I call print(div.string), it says None. The issue with the div, and why it can't print out all the text in the tag, is that it's not clear whether it should print "HTML Webpage" or the text in the paragraph; because this div has two elements at the same level as children, it doesn't know what to return. If you ever run into this problem where string is None, there's another built-in Beautiful Soup method called get_text, which is very useful for bigger objects: it collects all the text inside, in a recursive manner. Now we see "HTML Webpage," "Link to more interesting example," and so on, including that link. So if a tag has multiple child elements, use get_text; otherwise string works fine.
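The .string versus get_text() distinction in a runnable form:

```python
from bs4 import BeautifulSoup as bs

html = """
<body>
  <div>
    <h1>HTML Webpage</h1>
    <p>Link to more interesting example</p>
  </div>
  <h2>A Header</h2>
</body>
"""
soup = bs(html, "html.parser")

header = soup.find("h2")
print(header.string)    # works: the tag has a single text child

div = soup.find("div")
print(div.string)       # None: two children at the same level, ambiguous
print(div.get_text())   # recursively concatenates all text inside the tag
```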
So that's getting the string. What else can we do here? I think it's useful to grab this link and know how to get its href, so let's get a specific property from an element. We can do soup.find("a") to find the link, and note that if we print it out, we get the whole a tag; we just want the href, because that's the actual URL we would use. link.href doesn't work, but what we can do is use bracket syntax on that link tag: link["href"], and as you can see, we get just the link. You can use this in other ways too: if I grab paragraphs = soup.select("p#paragraph-id") and print it, and we just wanted the id off that element, we could do paragraphs[0]["id"] (indexing first, because select returns a list). You can pass any property name in the bracket syntax; that's another useful thing.

All right, the final topic before we get into the exercises is some code navigation. I'm going to try to go through this section pretty quickly. The first thing to know about is path syntax. We have our soup object, as we've seen before, and there are shorthands: I can do soup.body to get just the body, and I can keep chaining, .div to get just the div inside the body, then .h1 to get just that header, and then .string to get just the text off that header. Path syntax is good to know.
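Bracket-syntax property access and path syntax, sketched on an inline snippet (the URL is just the example page's companion page, used as a stand-in):

```python
from bs4 import BeautifulSoup as bs

html = """
<body>
  <div>
    <h1>HTML Webpage</h1>
    <p><a href="https://keithgalli.github.io/web-scraping/webpage.html">a more interesting example</a></p>
  </div>
  <p id="paragraph-id"><b>Some bold text</b></p>
</body>
"""
soup = bs(html, "html.parser")

# bracket syntax pulls any attribute off a tag
link = soup.find("a")
print(link["href"])

paragraphs = soup.select("p#paragraph-id")
print(paragraphs[0]["id"])        # select returns a list, so index first

# path syntax: chain tag names as attributes to walk down the tree
print(soup.body.div.h1.string)
```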
The other thing that's good to know really comes down to three terms: parent, sibling, and child. What they mean is clearer with a prettified print of the body. When we look at our body, we have this nested structure, and these terms all relate to it. This div's parent is the body, because the div is nested within the body; likewise, the div is the body's child. Now look at the div and the elements on the same level as it: the next element on that level is this h2. Elements on the same level are considered siblings. To see what we can do with these terms, Beautiful Soup offers several things. On the left side of the documentation, under navigating the tree, it talks about contents, children, descendants, and parents; the really useful pieces, I think, are the function calls right around there: find_parent, find_parents, find_next_siblings, find_previous_siblings, find_all_next, and find_all_previous. These commands can be pretty useful when you want just a subset of elements. Let's do one quick example with find_next_siblings (find_next_sibling, without the s, is like find; with the s it's like find_all). Grab div = soup.body.find("div"). As we saw before, the siblings of that div are the h2, the paragraph, the other h2, and that paragraph: four elements in total. So div.find_next_siblings() should give a list of four elements, and it does: a header, a first paragraph tag, a second header, and a second paragraph. We could accordingly do some additional processing on those siblings. For all those other terms I mentioned, the documentation also has ways to access them, so
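The find_next_siblings example just described, as a self-contained sketch:

```python
from bs4 import BeautifulSoup as bs

html = """
<body>
  <div>
    <h1>HTML Webpage</h1>
  </div>
  <h2>A Header</h2>
  <p><i>Some italicized text</i></p>
  <h2>Another header</h2>
  <p><b>Some bold text</b></p>
</body>
"""
soup = bs(html, "html.parser")

div = soup.body.find("div")
siblings = div.find_next_siblings()   # everything after the div on the same level
print([tag.name for tag in siblings])
```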
you can find all parents, find all next, and so on. Useful things to know about, and you can look into the documentation if you need a specific one. I honestly don't find myself using these functions nearly as much as find_all and select, but they're good to know. All right, let's get into the exercises.

For the exercises we're going to use the page keithgalli.github.io/web-scraping/webpage.html, and I'll link it in the description. As a reminder of what it looks like: it's an "about me" page, a fun little thing I put together, and we're going to grab specific elements from it. To help you do that, I recommend right-clicking and choosing Inspect; note that if you open up the body, you can see exactly how every one of these elements is structured in the HTML, and that's going to help us scrape specific things out. Here's how this will go: I'll present a task on that web page, and the way I think you'll get the most out of this section is to pause the video each time I present a task, try to solve it on your own, and resume when you're ready to see the answer, or at least how I solve it; there are multiple valid answers to all of these tasks. This will really let you practice your skills and drill down into the library.

Before I present the first task, let's load the web page, basically the same way we loaded that other example page; if you haven't already, make sure you've imported requests and the Beautiful Soup library. The URL now ends in webpage.html, and the one other thing I'll do is give the soup object a different name, webpage, and then we can print webpage.prettify() to see the loaded page.
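Here's that loading step as a runnable sketch. As before, the fallback HTML is my rough stand-in for the relevant part of the page (its socials list, as described later in the video), used only so the snippet runs without network access:

```python
from bs4 import BeautifulSoup as bs

URL = "https://keithgalli.github.io/web-scraping/webpage.html"
FALLBACK_HTML = """
<body>
  <ul class="socials">
    <li class="social"><a href="https://www.instagram.com/keithgalli/">Instagram</a></li>
    <li class="social"><a href="https://twitter.com/keithgalli">Twitter</a></li>
  </ul>
</body>
"""

try:
    import requests
    r = requests.get(URL, timeout=10)
    r.raise_for_status()
    webpage = bs(r.content, "html.parser")
except Exception:
    webpage = bs(FALLBACK_HTML, "html.parser")

print(webpage.prettify()[:500])   # peek at the start of the (long) page
```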
As you can see, it's a lot more text than before. So, the first task is to grab all of the social links from the web page, and to make this a little more interesting, I'm going to say you have to do it in at least three different ways, because ultimately we can select items in many different ways. Not only does it have to be done three ways, but at least one way has to use find/find_all, and at least one has to use the select method. Back on the web page, what we're trying to grab is all these social links right here. Feel free to pause the video, try this on your own, and resume when you're ready to see how I'd go about solving it.

To start, what's probably the simplest thing to try right from the get-go is to see what happens when I select all of the a elements, because a elements are the links on the page. When I do that, we get some stuff: our socials are in there, but we also get all this other stuff in addition to them, so this isn't the best approach. What else can we do? Let's go back to the web page, and remember we can use Inspect. If we start inspecting these social elements, we see that they all live inside an unordered list, a ul with a class name of "socials." If we grab that, it's pretty easy to get the links from there. So how can we do that? We want the ul with class "socials"; remember, a pound sign would be for an id, but a dot is for class names: webpage.select("ul.socials"). What does that give us? Okay, cool, that gives us what we're looking for, but now we just want the a elements within it, so webpage.select("ul.socials a"). Run that line, and we have a list of the a tags; now we just want the actual URLs.
I'll call them actual_links, and this can be done with a list comprehension: actual_links = [link["href"] for link in links], because the href is where the actual URL is stored. Print actual_links, and look at that, we got it. Cool, that's one way.

For the second way, let's try to use find. Once again, a natural starting point is to find just the first a tag and print it. We get my YouTube channel; that's not what we're looking for. But we can do the same kind of thing as before: find that ul element. webpage.find("ul") alone, nope, gives us the fun facts list, but if we also pass the attrs dictionary with class equal to "socials" and print it, we get the tag we're looking for. So let's copy in that earlier comprehension and loop over the links again... and we get an error: "string indices must be integers." Let's print links again. Okay, I see the issue: we used find, so this isn't a list, it's a single tag element. What we can do is call find_all("a") on it to get a list just like we had before. I'll call the tag ulist, for unordered list, and then links = ulist.find_all("a"). So that's our ul, and now we're finding all the links within it. Print links, cool; copy in the comprehension, and we have our second way of grabbing these links. We got it.

Now we just have to do one more. Let's go back to the web page to figure out a nice way, looking at the Inspect tool again. What I see is that the individual list elements each have a class of "social." So I can do something like links = webpage.select("li.social a"), selecting the a tags within those individual list items.
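All three ways side by side, as a self-contained sketch; the inline HTML approximates the page's socials markup (the real page has more links and surrounding content):

```python
from bs4 import BeautifulSoup as bs

html = """
<body>
  <a href="https://www.youtube.com/kgmit">YouTube</a>
  <ul class="socials">
    <li class="social"><a href="https://www.instagram.com/keithgalli/">Instagram</a></li>
    <li class="social"><a href="https://twitter.com/keithgalli">Twitter</a></li>
  </ul>
</body>
"""
webpage = bs(html, "html.parser")

# Way 1: select via the ul's class, then the a tags inside it
way1 = [link["href"] for link in webpage.select("ul.socials a")]

# Way 2: find the ul by class (a single tag), then find_all its a tags
ulist = webpage.find("ul", attrs={"class": "socials"})
way2 = [link["href"] for link in ulist.find_all("a")]

# Way 3: select via the individual list items' class
way3 = [link["href"] for link in webpage.select("li.social a")]

print(way1)
```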
social a"), we should, I think, get the same links as before, and look at that, we do. So we really just grabbed the individual list elements instead of the entire unordered list of links, and as a result we have our third and final way: we copy that over and we get everything. Alright, the next exercise is going to be to scrape the table that is included on that web page. If we go back to the page and scroll down, I actually included a table of my MIT hockey stats, very fun stuff. I figured this was a fairly simple, straightforward table that would be fun to scrape. I initially took this table from a site called Elite Prospects, but I simplified it a little bit. So if we want to scrape this, I think the first thing we should do, and actually, feel free to pause the video if you haven't yet and then resume when you're ready for the solution. Alright, I think the first thing we want to do is inspect this table and just see what we're working with, and to get the entire table we see that we can grab it by its class, "hockey-stats". So let's do that in code: let's say table equals webpage.select, and we're grabbing the table with class "hockey-stats". Let's see what we have for table, and it looks like we have everything there. And just because we don't want this wrapped in a list, we can do select and just grab the first element, which is the only element, and then we get the table as just a tag. Alright, next it's really a matter of, well, I think the best way to scrape a table is to load it into a DataFrame in pandas. So let's import pandas: import pandas as pd. Now, how do we shape this table to actually go into that pandas DataFrame? For something like this you might be able to do it off the top of your head, but this is something that I would usually Google or look up on Stack Overflow. So let's do that: "how to scrape a table using beautifulsoup", we could look up something like that, take the first
stack overflow post and kind of look through it to see if it's what you're looking for. So this person is trying to scrape a table, and someone responded with how you can do that, and the one thing I see with this response is that it's scraping the table, but it's printing it out as a string; it's not putting it into a pandas DataFrame. So what I think we should actually look up is "scrape a table into pandas dataframe", and I'll also include "beautiful soup" as another keyword in this search. "Scrape tables into dataframe with beautiful soup": that looks good, it's got a decent number of upvotes, and let's see what the answer is. "Try this": that looks pretty straightforward, so I'm gonna copy this code and utilize it within our own. I'm just going to paste it as reference up here and try to mimic the behavior with our table. The first thing is that our columns will ultimately come from the table header, so I think the first thing we should do is try to grab all of these table heads. We could do that with, I'll say, columns = table.find_all("th"), and maybe, just to be careful about scope, let's first do table.find of the table head and then do .find_all of the table heads within that. Let's see what we have for columns: cool, we get a list of what we're expecting. Now, if we wanted just the column names, we could do a list comprehension, which would be c.string for c in columns. Let's print out the column names, and look at that, we get all of them. This one looks a little bit weird, so we might ultimately get rid of it, and we also have these duplicates over here, which might cause problems in pandas, but we'll cross that bridge when we get there. We have the column names; next we really need to copy this code here for the table rows. How do we get the table rows? Well, we're going to go into the table body, so the rows
are going to be equal to table.find, we want the table body, and then we want to find all of the table rows. And just to look at the table again to see how it's laid out, which I think is helpful: we have table heads here, and that's all within the thead; in the tbody you see we have these table rows, and inside of those you have all these table datas. So we're going to find all of the table rows, and then we're going to do a bunch of processing on the table data within those rows. We find all the table rows, and then we can basically copy this code, so let's paste that in here: for tr in table_rows, and I'll call this table_rows just to mirror the syntax, then find all of the table data and take the text of each td. I'll do .string just because that's the most up-to-date syntax; actually .text is totally fine too, you can use either .text or .string. So row equals td.string for td in that row's table data, and l.append(row), and I'll also include that l that I didn't include up here: that's the empty list we're basically adding all the row details to. After we run that, let's see what happens if we print out l, and look at that, it looks pretty good. Let's just do l[0], the first row. It looks pretty good except for the fact that a bunch of the entries are just newline characters. So how do we strip out newline characters? That would be another Google search: "strip out whitespace and newline characters python". Okay, cool, it looks like the .strip method will strip any whitespace, so I want to try doing td.string and then .strip as well, and then see what our row array looks like. Hmm, what happened there? Run that again, str of the string, then .strip... what happened... oh, look at that, we did it. I guess you couldn't call strip on just that string, but once we converted it into an actual Python string object it was a lot more friendly, and that looks like a pretty clean row. So now what we'll do is merge this into
the DataFrame, so df = pd.DataFrame(l), and the columns argument is going to be the column names. Now let's print out our DataFrame: df.head()... come on... oh, that looks good, I love it. Oh no, it looks like some things are missing. What's missing here? I guess because some of the cells had nested elements, .string didn't work; we might have to use get_text. Let's see if that fixes things... oh, look at that, it looks pretty good. Yeah, some of those tags have nested elements, so .string returned None, but that looks pretty good, and we can go ahead and do pandas-type stuff on this. I could do something like df["Team"], and we see we get each of those, or I could do df.loc of df where Team is, cool stuff, where Team is, I guess, not equal to "Did not play". What happens if I do that? Look at that, we filtered by that, and then maybe we would want to do .sum or something like that to get the totals. Hmm, that doesn't look right, but we scraped the table. I'm not going to go into the details; you might have to convert some of these columns to different types. It also might be worth not including these last columns, or just changing them to have slightly different names, because if I did df["GP"], I don't know what would happen here... yeah, I guess it gives me both of those columns, but it makes things weird because we have duplicates, so you might want to rename them so that each column name corresponds to only a single column in the DataFrame. But that's more of a pandas question than a Beautiful Soup question. And just so you guys can give me a hard time in the comments: if we look at the table again, you see how all of this postseason stuff is empty. Yeah, unfortunately, in my four years of playing, I guess I played five years, I don't know why 2013-2014 is missing, but in my five years of playing I never made the postseason, so there are a bunch of blank spots on that side of the table. But yeah,
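The whole table-scraping flow just walked through can be sketched like this. The inline table is a cut-down stand-in for the hockey-stats table (the "hockey-stats" class name and the thead/tbody layout are taken from the inspect step; the values are illustrative), and get_text(strip=True) replaces the str(...).strip() and .string steps in one go:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline stand-in for the hockey-stats table (far fewer columns/rows than the real one)
html = """
<table class="hockey-stats">
  <thead>
    <tr><th>S</th><th>Team</th><th>GP</th><th>G</th></tr>
  </thead>
  <tbody>
    <tr><td>2014-15</td><td><a href="#">MIT</a></td><td>17</td><td>3</td></tr>
    <tr><td>2015-16</td><td><a href="#">MIT</a></td><td>9</td><td>1</td></tr>
    <tr><td>2016-17</td><td>Did not play</td><td></td><td></td></tr>
  </tbody>
</table>
"""
webpage = BeautifulSoup(html, "html.parser")
table = webpage.select("table.hockey-stats")[0]

# Column names come from the <th> tags inside <thead>
columns = table.find("thead").find_all("th")
col_names = [c.string for c in columns]

# One list per row; get_text handles cells with nested tags (like the <a>),
# where .string would return None, and strip=True removes stray whitespace
table_rows = table.find("tbody").find_all("tr")
l = []
for tr in table_rows:
    row = [td.get_text(strip=True) for td in tr.find_all("td")]
    l.append(row)

df = pd.DataFrame(l, columns=col_names)
print(df.head())
print(df.loc[df["Team"] != "Did not play"])
```

The swap from .string to .get_text(strip=True) is exactly the fix for the missing values: .string is None whenever a cell contains child tags, while get_text concatenates all nested text.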
that's scraping the table. Alright, next exercise: we're gonna grab all the fun facts that use the word "is". Going back to the page, up near the top we have these fun facts. Let's read through them: you know, I owned my dream car in high school; everyone's like, kind of a baller. If you click on this footnote, though, you get some details: this might not be everyone's idea of a dream car, because it was actually a minivan, but it was an awesome minivan. Middle name is Ronald, very fun. Never had been on a plane until college: the first time I was ever on a plane was for my freshman year at MIT, on the cross-country trip, and I was given a fun haircut before that trip. Next fun fact: Dunkin Donuts is better than Starbucks. Very, very important, I need everyone to know that, you gotta support Dunkin. And some other things. So we're grabbing all of the fun facts here that have the word "is" in them. Let's do that. With webpage, I think what we're gonna have to do is find the fun facts; we see it's a class, so this is going to be very similar to the social media links. Let's grab the unordered list with class "fun-facts": I'm gonna say facts = webpage.select("ul.fun-facts"), and then I'm gonna grab all the list elements from that, and that should give me something pretty good. Look at that, we got all the list elements. Now we just need to filter those list elements, so I'm going to do find with the string argument equal to, well, we're gonna have to use regex again; it should already be imported from before, but in case it isn't, you can import re again and do re.compile("is"). Let's see what happens when I do facts_with_is equals that... "'list' object has no attribute 'find_all'", so we're gonna make this a list comprehension: fact.find, and we don't need find_all because there's only a single string in each of these, for fact in facts. That worked, so now let's see what happens if we print out facts_with_is, with the Nones and everything. Cool, this looks pretty good. I think it looks
like only the first and third didn't have "is" in them, and we can confirm that that's right: the first doesn't have "is", the third doesn't have "is", and all the other ones have "is", as we can see here. So the last step of this would be getting rid of the Nones. You can just do another list comprehension if you want: I'll say facts_with_is equals fact for fact in facts_with_is if fact, because None is a falsy value, so the None entries won't pass the condition. Let's see what happens now... look at that, I think we got it. But note that we're really close except for the fact that some of these had italicized elements in them, and right now, the way we're doing this, it's stripping out the rest of that text. So what we're going to need to do is, on the string element that we're grabbing, call fact.find_parent to get the element that's directly above it. If we run this, we see that now we get everything, and we could even go ahead and do .get_text on the find_parent result, and that should give us just what we're looking for for the fun facts. Look at that. So that was actually fairly tricky, with this nuance at the end; this was kind of a fun little exercise. Alright, the next exercise is: how can we go to this web page and download one of these images? We have the image of me, and then we have some little images of Italy that I took last year when I made a trip there. So this is Lake Como, this is Florence, and this is a sunset over Riomaggiore... I can't say it, I'm gonna botch anything I say here, but it's in the Cinque Terre. Cinque Terre... any Italians watching this video are gonna be real pissed, but I had a great time at all these places. Let's try to download one of these images using web scraping and some other libraries. So that's the task: try to do that, pause the video, and then resume when you're ready. Alright, because I am using Google Colab right now, instead of running
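Before moving on, here's a compact sketch of that fun-facts exercise. The list contents are illustrative stand-ins (not the real page's facts), and the `<i>` tag plays the role of the italicized text that tripped up the plain string match:

```python
import re
from bs4 import BeautifulSoup

# Inline stand-in for the fun-facts list; the <i> tag mimics the nested
# markup that made the bare string match drop part of a fact
html = """
<ul class="fun-facts">
  <li>Owned my dream car in high school</li>
  <li>Middle name is Ronald</li>
  <li>Never had been on a plane until college</li>
  <li>Dunkin Donuts coffee is better than Starbucks</li>
  <li>A favorite book series of mine is <i>Ender's Game</i></li>
</ul>
"""
webpage = BeautifulSoup(html, "html.parser")

facts = webpage.select("ul.fun-facts li")

# find(string=...) returns the matching text node, or None if the fact
# doesn't contain "is"
facts_with_is = [fact.find(string=re.compile("is")) for fact in facts]

# Drop the Nones, then climb to the parent element and take its full text,
# so facts with nested tags aren't truncated at the tag boundary
facts_with_is = [f.find_parent().get_text() for f in facts_with_is if f]
print(facts_with_is)
```

The find_parent/get_text step matters because the regex match is a text node: without it, a fact like the last one would come back cut off right where the italics begin.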
this code here in my Google Colab notebook, I'm actually going to use a local Sublime Text file to do this downloading. The starter code here is really just getting that same webpage as before; I only wanted to redo it because this is Sublime Text, but as you can see, I ran this code and all the stuff that was there before is still available. Now let's go ahead and grab an image, and ultimately get the source for an image so that we can download it. If I inspect these pictures, we see we have "images/italy/lake_como.jpg", so this is a relative path, which means we really need to take the current path of the webpage that we're looking at and then add this onto it to download the image. That's good to know. And this image is inside of a div with class "row" and a div with class "column", so I could probably do something like webpage.select("div.row div.column img"), because we want the images within those. Let's see what happens if we print that out, and look, we get just the images that we're looking for. Now we need to basically append the relative path onto our URL, so I'm going to pull out our URL and say url is equal to just this directory; this is kind of our base path. So now, if we change things up a bit, the page we load would be equal to url plus "webpage.html". And for simplicity's sake we'll just grab the image of Lake Como, so I'm going to say our images are equal to webpage.select, we're going to just grab the first image, and we will want to download that. So we need to get its URL: we'll say image_url equals image[0], then we'll get the source attribute for that. Let's just print out the image_url... "image is not defined", it's images[0]... okay, so we have this, that's what we just printed out. So we need to append this onto our base, so our full URL is now going to be full_url = url + image_url, and now we just need to download that. Well, I think this is something that will be helpful to Google, so I'm
gonna just say "python download image using url", and we get a Stack Overflow post right here, "save image from URL". Let's see what we've got: "sample code that works for me on Windows", this looks pretty straightforward, but I want to see what other replies there are... and this one's even shorter, and I like shorter, so we're gonna try this one in our code. It uses requests, and we've already imported requests, so this is basically just making another request. We're going to need to use the full URL, so let's paste this code in and change it to use our full_url. We're going to get the content: this is just like getting the webpage content, but now we're getting the image content at that URL. Run that. Then, with open, and we can name this whatever we want: I know that this is going to be Lake Como, and we'll open it in write-binary mode. Let's just confirm... yeah, this is a JPEG image, so we can use that extension, and then write the image data. That should be good, and it's going to be saved wherever we have this code locally. So I'm gonna run this... I think it ran, and I'm gonna confirm: over here I opened up the folder that I had this download-image file in, and as you can see, I can open up the image of Lake Como locally. That's pretty cool, we just downloaded that, and we get the full quality here. Wow, looking at this again, this was just such a beautiful spot. I definitely recommend traveling not only to Italy but checking out Lake Como; it was so relaxing, so pretty, and I'm missing this right now, being in the middle of the pandemic. But okay, that was downloading an image, so we're done with that exercise. Alright, final exercise before we conclude this video: it's going to be solving the mystery challenge. If we go one more time to this web page and we look at the bottom, there's a bunch of links here, and if you scrape just the paragraph tags with the ID "secret-word" from all of these links, and you do this in order, each one of these files
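Here's a sketch of that image-download flow. The base URL, class names, and file paths are assumptions taken from the walkthrough, and the actual requests call is left commented out so the sketch runs without network access:

```python
import requests
from bs4 import BeautifulSoup

# Base directory of the page we scraped; "webpage.html" hangs off this
# (URL assumed from the walkthrough)
url = "https://keithgalli.github.io/web-scraping/"

# Inline stand-in for the image grid markup described above
html = """
<div class="row">
  <div class="column"><img src="images/italy/lake_como.jpg"></div>
  <div class="column"><img src="images/italy/florence.jpg"></div>
</div>
"""
webpage = BeautifulSoup(html, "html.parser")

images = webpage.select("div.row div.column img")
image_url = images[0]["src"]   # relative path, e.g. images/italy/lake_como.jpg
full_url = url + image_url     # absolute URL we can actually request

def download_image(image_url, filename):
    # Fetch the raw bytes and write them out in write-binary mode
    img_data = requests.get(image_url).content
    with open(filename, "wb") as f:
        f.write(img_data)

# download_image(full_url, "lake_como.jpg")  # would save the file next to this script
print(full_url)
```

A plain string concatenation works here because the src is a simple relative path; for messier cases, urllib.parse.urljoin is the more robust way to resolve a relative path against a base URL.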
is going to have exactly one of these secret-word IDs, and just to show what a file looks like, it looks like this. If you scrape them all and grab just the paragraph with the correct ID, you'll ultimately get a fun secret message. Alright, so how are we going to do that? Well, let's look at what these links look like, inspect them... alright, they're similar to the image from the last exercise in that they're relative paths, so we can probably utilize some of that previous code, and we'll have to use requests again to dive into each of them. And if we look at an actual file, we see that these paragraphs all have IDs that look like the secret-word ID but aren't quite the one we're looking for, and exactly one of them has the specific ID we want, so we're gonna scrape for that. Alright, let's do this. First off, let's grab the elements that we need and see what they are. Let's inspect this: we have a paragraph, we have a div, these divs with class "block". This looks like what we need to find, and we need to get the links out of there. So what I'm going to do is a select here: I'm going to say files = webpage.select("div.block a"); we select those, and hopefully nothing else on the page matches, and that grabs the links from those block divs. Let's see what files gives us... look at that, it looks like it's all the files we want, one through ten, in order. So now we just want to get the relative paths. I'll say relative_files equals, and we're going to get the href attribute of each file. I think "file" might be a special word in Python, so I'm just gonna say f: f["href"] for f in files. Print out relative_files, and remember you can pause the video at any point if you don't want to watch me solve this, but look, that gets us just the relative paths. And then, from our previous example, we should go ahead and kind
of copy some of this code. So I'll say url equals this, our base URL, and then for f in relative_files we want to construct the full URL: full_url = url + f, so that would be this base URL plus this relative path. Then we're going to want to load that page, so we'll do requests.get(full_url) and page equals that, and then we'll want to load it into Beautiful Soup: bs_page = BeautifulSoup, passing in the page. And then, within the Beautiful Soup page... let's just look at one of these pages first, so we'll do bs_page.body, print that out, maybe prettify it, and then break out of the loop so this only runs once. "Object of type Response has no len()"... or is that the issue... oh okay, we have to do page.content here. Okay, and now we get the page, cool. So in that page it's just a bunch of paragraph tags, so if we go ahead and do bs_page.find of the paragraph tag, we want to pass in attrs equal to the ID "secret-word", and that should get us the secret word for that file. Let's run this again, since it looked like the page was loaded in properly... look at that: "Make". Okay, so that's the secret word for the first file, within the tag. If we wanted just the string, we'd say secret_word_element = bs_page.find of that, and then the secret word is going to be secret_word_element.string. Then we print the secret word just to make sure it works for one file: "Make", cool. So now we're gonna remove the break and see what it prints out. Now it's going to iterate over all the relative file paths, add each one to the URL to get our full URL, and ultimately, hopefully, this will give us our secret message and we can be done with the video. Let's run it... what's it gonna say... oh wow, look at that, look at this secret message: make sure to smash that like button and subscribe! That's all we're gonna do in this video, everyone. Hopefully you enjoyed this; hopefully you liked learning a little bit about what web scraping is, then you learned about the building blocks, and then we did a bunch of exercises to really drill down those skills. If you did enjoy this video, yeah, it would mean a lot to me if you smashed that like button and subscribed. Also feel free to check me out on the other socials, Instagram and Twitter; I do appreciate when people follow me there, and I think it's a good way for me to show my personality a bit on those other platforms, so I post some cool stuff in those places. Do I have anything else? Yeah, I guess the only other thing I want to mention is that I'm going to try to do some follow-up, more complex examples of web scraping in the future, maybe a real-world web scraping video, and I also want to dive into not just Beautiful Soup: I would like to look at Selenium and Scrapy, so I might do that in future videos too. But feel free to let me know in the comments if there's anything else you'd like to see. Alright, once again, that's all we're doing in this video. Thank you everyone for watching, this has been a fun one for me. Peace out!
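The whole mystery-challenge loop can be sketched offline like this. The three inline documents stand in for the ten real challenge files (their paths, decoy IDs, and words are made up here), and the real network call is shown only as a comment:

```python
from bs4 import BeautifulSoup

# Stand-ins for a few of the ten challenge files: each has decoy paragraphs
# with similar-looking IDs and exactly one <p id="secret-word">
fake_files = {
    "challenge/file_1.html": '<p id="secret-wrd">Baseball</p><p id="secret-word">Make</p>',
    "challenge/file_2.html": '<p id="secret-word">sure</p><p id="secret-wrd">Banana</p>',
    "challenge/file_3.html": '<p id="secret-word">to</p>',
}

url = "https://keithgalli.github.io/web-scraping/"  # base URL assumed
relative_files = list(fake_files)  # in order, like the hrefs scraped from div.block

secret_message = []
for f in relative_files:
    full_url = url + f
    # On the live site this would be:
    #   page = requests.get(full_url)
    #   bs_page = BeautifulSoup(page.content, "html.parser")
    bs_page = BeautifulSoup(fake_files[f], "html.parser")

    # Grab the one paragraph whose id is exactly "secret-word"
    secret_word_element = bs_page.find("p", attrs={"id": "secret-word"})
    secret_message.append(secret_word_element.string)

print(" ".join(secret_message))  # → "Make sure to"
```

The page.content fix from the walkthrough matters here too: BeautifulSoup wants the response body (bytes or text), not the Response object itself.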
Info
Channel: Keith Galli
Views: 131,605
Rating: 4.974308 out of 5
Keywords: Keith Galli, MIT, programming, python, python 3, data science, data analysis, web scraping, beautiful soup, beautiful soup 4, beautifulsoup, bs4, scrapy, selenium, data science python, find_all, CSS select, HTML, requests library, beautiful soup library, python programming, how to scrape website, pandas, select, soup, scrape a table, web scrape, web scraper, web crawler, crash course, machine learning, real world, exercises, practice, beginner, what is web scraping, AI, regex, re library
Id: GjKQ6V_ViQE
Length: 73min 3sec (4383 seconds)
Published: Sat Jul 11 2020