Intro to Web Scraping with Python and Beautiful Soup

Video Statistics and Information

Captions
Hello, ladies and gentlemen of the internet. My name is Phuc Duong, senior data engineer for Data Science Dojo, and I'm here to teach you how to web scrape with Python.

In front of you is a website that employs web scraping. It scrapes the storefront of a site called Steam. Steam sells video games, and the cool thing about Steam is that they do flash sales every day, so the user has to come back every day and study the page: what is a good deal, what is not a good deal? That's a lot of information — it's how they've gamified shopping online. There's a website that scrapes Steam's front page in real time, shows you the best deals, and ranks them.

A lot of people ask me, "How do I get all of my data?" In the absence of an API, web scraping is a very important tool for a data scientist and a data engineer to know, because the entire internet becomes your database. I can scrape any storefront — Nordstrom, Macy's — study the sales, scrape reviews. I can scrape baseball stats for baseball players in real time. Wikipedia is also a good place to scrape: for example, the infobox frame for the Harry Potter character Ron Weasley is very standardized, so I could write a scraping script and loop over every single Harry Potter character very quickly and create a data set. Today we're going to learn how to do that.

Today I'm on Windows. You can normally install plain Python if you're on Linux, but if you're on Windows I highly recommend installing Anaconda instead. If you go to Google and type in "Anaconda", it should be continuum.io, and you can download it for your operating system. Next, I'll be using a text editor called Sublime Text, so go to Google, type in "Sublime Text", and install that — I like using Sublime Text 3.
That's where you get those tools. Be warned: if you're using Anaconda, it's actually a pretty big download, around 500 megabytes.

Now I'm going to open up my command line. For those who don't know: on Windows, if you go to any folder, hold down the Shift key, right-click, and choose "Open command window here", that opens the command line for you, and that's where you can work with Python. If you type in `python` here, and you've installed either Python or Anaconda, the interpreter prompt should show up. Notice I'm using Python 3.5 with Anaconda. If I do a very quick 2 + 2, it should equal 4 — that's how I know I'm inside the console. If I press Ctrl+C, it exits the console and I get back to the Windows command line.

Next I'm going to install a package called Beautiful Soup. That's the package we're going to use to web scrape, and it's very powerful; I encourage those of you who want to go beyond this introduction to learn it in depth. All you have to do is `pip install bs4` — bs4 stands for Beautiful Soup 4. So Beautiful Soup has been installed. How do I know? If I type `python` and then `import bs4`, it should just not error. That's how I know the package is installed and ready to go.

Next, I need a web client. Beautiful Soup is a way to parse HTML text — that's its whole job, traversing HTML text within Python. I still need a web client to actually grab something from the internet, and in Python you do that with a package called urllib. Inside urllib
there is a module called request, and inside that module is a function called urlopen. I know that's a lot to take in, but settle down, we're going to do it step by step. I'm going to do a quick all-in-one-line import: `from urllib.request import urlopen as uReq`. So I'm calling the package urllib — if you're on Python 2, this is a different package, called urllib2. Notice I'm importing only what I need: I don't need all of urllib, just the urlopen function out of the request module, and it brings in its dependencies as well. I give it a short name, uReq, because I don't want to type urlopen every time — that's how I tend to do things.

I can also modularize the import of Beautiful Soup the same way: `from bs4 import BeautifulSoup as soup` — capitalization matters here, capital B in Beautiful and capital S in Soup — so I don't have to spell out BeautifulSoup every time I use the package.

This is me working in the console, just playing around. If you want, you can start typing it into a script. In this case I have Sublime open; I press Ctrl+Shift+P to open the command palette and type "Set Syntax: Python". Beautiful. Now I can run the same commands in here: if I select a line in the command line and hit Enter, that copies it, so I can paste it into my script. There you have it — the first two lines of the script. Beautiful Soup is going to parse the HTML text, and urllib is going to grab the page itself.

But what do we want to web scrape? Well, I like graphics cards, so I'm going to scrape graphics cards off
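The two imports described here look like this as a minimal sketch — the alias names `uReq` and `soup` follow the video's convention:

```python
# Import only the urlopen function from urllib.request,
# aliased to uReq so we don't have to type urlopen every time.
from urllib.request import urlopen as uReq

# Import the BeautifulSoup class (capital B, capital S),
# aliased to soup for the same reason.
from bs4 import BeautifulSoup as soup
```

On Python 2 the first import would come from the separate urllib2 package instead.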
newegg.com. Some of you might know it — it's basically Amazon, but for hardware and electronics. I'm going to type in "graphics cards", and a bunch of graphics cards show up in my search results. It would be nice to tabularize this and turn it into a data set. And notice: if a new graphics card is introduced tomorrow, or ratings change, or prices change, I just run the script again and it updates whatever I loaded the data into — a database, a CSV file, an Excel file, it doesn't matter.

In this case I'm going to grab this URL — that's all: copy the URL and paste it into my script as `my_url = "..."`, with the URL as a string. I'll actually run it in my console first; when I'm web scraping I like to prototype in the command line so I know the script is going to work, and once I know it works I paste it back into Sublime. So that's my_url: I've created a variable and placed the URL string into it.

Now I'll open up my web client. I'll call uReq — remember, I wrote `from urllib.request import urlopen as uReq`, so I'm really calling the function urlopen inside the module request inside the package urllib. I'm going to throw my URL into it. What this does is open up a connection, grab the web page, and download it. It's a client, so I'm going to call it uClient:
`uClient = uReq(my_url)`. It may take a moment depending on your internet connection, because it's actually downloading the web page. Okay, it's done. Now I want to do a read: `uClient.read()`. But if I just call read, it dumps everything out right away and I can't reuse it, so before it gets dumped I want to store it in a variable. Since this is the raw HTML, I'll call it page_html: `page_html = uClient.read()`. I could show you its contents, but depending on how big the HTML file is it could crash the console, so I'll show it to you once it's inside Beautiful Soup — bear with me. And as with any web client, since this is an open internet connection, I want to close it when I'm done: `uClient.close()`.

Knowing that all of these lines have worked so far, I can copy them into my script and add some documentation: this opens up the connection and grabs the page, this offloads the content into a variable, and this closes the client.

The next thing I need to do is parse the HTML, because right now it's one big jumble of text. I'll call the soup function I imported earlier — remember, `from bs4 import BeautifulSoup as soup` — so calling soup() calls the BeautifulSoup constructor from the bs4 package. I pass it page_html, and as a second argument I have to tell it how to parse it, because it could be an XML file; in this case I tell it to parse it as HTML. And I need to store the result in a variable,
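The open–read–close pattern from this step can be sketched as follows. To keep the example self-contained (and avoid hitting newegg.com), it fetches a `data:` URL, which urlopen also supports, instead of the real search page; with a real URL the steps are identical:

```python
from urllib.request import urlopen as uReq

# Stand-in URL; in the video this is the Newegg search-results URL.
my_url = "data:text/html,<html><body><h1>GraphicsCards</h1></body></html>"

uClient = uReq(my_url)      # opens the connection and downloads the page
page_html = uClient.read()  # dump the raw HTML (as bytes) into a variable
uClient.close()             # always close the open connection when done
```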
otherwise it gets lost. I'll call it page_soup. I know it's kind of weird that they call it a soup, but it's standard notation now: when you say "soup", people understand that the data type is derived from the Beautiful Soup package. So this line does my HTML parsing.

Now, if I look at `page_soup.h1`, I should see the header of the page — it should say "Video Cards & Video Devices" — and sure enough, it grabbed that header. Just for good measure, let's see what else is on the page: maybe there's a `p` tag in there. `page_soup.p` gives "Newegg.com - A great place to buy computers". I thought that might be at the very bottom of the page, but it may be something hidden — just a tagline. Either way, I know I'm on the right page.

Now we need to traverse the HTML. What I'm going to do is convert every graphics card I see into a line item in a CSV file. Now that I have a Beautiful Soup data type, I can traverse the DOM elements of this HTML page. Let me show you quickly: if I inspect the element on this page and find the body tag — yes, it starts with a body — I can do `page_soup.body`, and I can keep chaining dots from there, because the body tag goes further down into an `a` tag or a `span` tag. If I type `page_soup.body.span`, I find that span with class "noCSS skip-to". Awesome. Next, let me make this HTML a little bigger
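Parsing and the dot-style traversal shown here can be sketched with a tiny inline page standing in for Newegg's HTML — the tags below just mirror what the video inspects:

```python
from bs4 import BeautifulSoup as soup

page_html = """
<html><body>
  <span class="noCSS">Skip to: Content</span>
  <h1>Video Cards &amp; Video Devices</h1>
  <p>Newegg.com - A great place to buy computers</p>
</body></html>
"""

# Tell BeautifulSoup to treat the text as HTML (not XML).
page_soup = soup(page_html, "html.parser")

print(page_soup.h1)         # first <h1> tag: the page header
print(page_soup.p.text)     # text inside the first <p> tag
print(page_soup.body.span)  # dot notation walks into nested tags
```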
so you can see it. I'm in Chrome, but you can also use Firefox's Firebug to inspect the HTML elements of the page. I'm going to select the name of this graphics card and inspect its element; it jumps me directly to this `a` tag. But I want to grab the entire container the graphics card is in, because I know that container holds other goodies: the original price, the sale price, the make, the review rating, and the card image itself. HTML is a nested tagging language, so I can walk outward until I find the element that contains all of this. Notice that this div with the class "item-container" houses everything about the item.

So the plan is: I'll write my script to parse one graphics card first, and once that works I can loop through all of the item-container divs and parse every single graphics card into my data file. For that, I want to grab everything that has this class. On my page_soup there is a function called findAll — capital A. What do I want to find? All divs that have the class "item-container". So I call findAll, pass the tag name "div", then a comma and a dictionary object: the key names the attribute you're looking for — here it's "class"; if it were an id you'd put "id" — and the value is the class name I paste in, "item-container". I'll store the result in a variable called, I guess, containers — named for what the class holds. I'll copy this into my script as well,
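The findAll call described here can be sketched like this, with two stand-in item-container divs (the class name "item-container" is what the video finds in Newegg's markup):

```python
from bs4 import BeautifulSoup as soup

page_html = """
<div class="item-container"><a class="item-title">Card A</a></div>
<div class="item-container"><a class="item-title">Card B</a></div>
<div class="other">not a product</div>
"""

page_soup = soup(page_html, "html.parser")

# Find every <div> whose class attribute is "item-container".
# The second argument is a dict of attribute name -> value.
containers = page_soup.findAll("div", {"class": "item-container"})

print(len(containers))  # how many product containers were found
```

On the real search page this length matches the number of product tiles; the video finds 12.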
and hopefully it works. I'll add a comment: "grabs each product". Notice that even though I'm writing this for graphics cards, I'm betting Newegg has standardized its HTML enough that I could parse any product page on Newegg just by rerunning the script.

So let's check containers: `len(containers)` tells me how many things it found — it found 12 objects, which matches the twelve graphics cards on the page. Let's look at the first one: `containers[0]` should show me the HTML for the first card. I'm actually going to copy this out into a text file and read it there, because sometimes when you load a page, some content is post-loaded via JavaScript — some things will show up in the browser that won't be in the scraped HTML, and vice versa. Just to be sure, I paste it into Sublime (Ctrl+N for a new file) and figure out what's actually in there. It's not very pretty, so I set the syntax to HTML — better, but still not pretty — and then I use an external service called JS Beautifier: you paste in ugly code and it turns it pretty, adding spacing and indentation where it's needed. There we go — everything is nicely spaced and indented.

Now let's read what's actually in this thing. Going through it, there are some pretty useful items: the items have ratings, and there's a product name — we definitely want to grab the product name.
Let's see — there's its brand. Notice that they put the name of the brand in the image: the image itself shows "EVGA", but that's a picture, so I can't parse what it says unless I use image recognition. However, the image's title attribute encodes the brand as text for us, which is very convenient — that's something we want to grab.

I also want to be sure I grab things that are true of every item; otherwise I'll run into corner cases needing if-else statements. Notice that this one item is special: it doesn't have any Newegg reviews. So if I wrote something to parse reviews, I'd need an if-else statement, or a try/except catching an index-out-of-range error. It doesn't even have the number of reviews. I'll let you handle scripting that — I'm going to scrape the things that are present in all of them. All of them seem to have the brand and the product name, so I'll scrape those; and they all seem to have shipping, so I'll grab shipping to see what each one costs. Not all of them have a price — for some you have to add the item to the cart to see the price. Once you learn how to scrape one field, it's really the same for all of them; if you want to loop through everything, you just write those if-else statements to catch the cases that aren't there.

So now, if I do `container = containers[0]`, I'm throwing the first container into a variable called container, because later I'm going
to write a for loop that says "for container in containers". Right now I'm prototyping the body of the loop before I build the loop itself — I want to make sure it works once first. This container holds a single graphics card.

Let's see what's in here. `container.a` brings back exactly what I thought it would: the item image link. The item image isn't that useful to us by itself. The title inside it is something we could redeem, but it seems we can also grab the brand further down, which I think is the more efficient way, since that's what the customer sees when they visit the page. So instead of `.a` we'll do `.div` — we'll jump from this `a` tag directly into this div. I push up in the console and type `container.div`, which jumps me into this div and everything inside it. Boom.

I'll assume this is the right one — I know reading raw HTML tends to be hard because it hurts your eyes unless you know HTML very well, but it's something you get used to. I'm in this div, and I want to go into another div called "item-branding": so `container.div.div`. And inside that div there's an `a` tag that contains something we want — the make of this graphics card: `container.div.div.a`. There we have it: here's the href of the link. What I'm grabbing is this EVGA element — notice when I hover, it's a clickable link, and that link is this element right
here. But what I really want is the title of this link. So what do I want? `container.div.div.a.img` — I want to grab the image tag inside it. Notice I'm just chaining these handles, referencing them as if it were a JSON structure. Now I'm inside the image, and I need to grab its title. The title is an attribute inside the image tag, and you reference an attribute as if it were a dictionary: `["title"]` gives me "EVGA". Now that I've prototyped that, I can add it to my script.

Inside my script, this is where I can write that loop I was preparing for: `for container in containers:`. It will loop through, and `container.div.div.a.img["title"]` will be assigned to the brand — the make. That's the first thing I grab: who makes this graphics card.

What else do I want to grab while I'm inside this loop? Let's grab two more things, just so we have a really good CSV file — a CSV file with one column seems a tiny bit pointless. Next I want to grab the name of this graphics card, which is right here: it's embedded within this `a` tag, which is embedded within this div, which is embedded within this div. So in theory, `container.div.div.a` should bring it out — but it doesn't; it brought out the item brand instead. The item-brand `a` tag is not the one we wanted; dot traversal is having trouble finding this particular `a` tag. So what I want
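Grabbing the brand from the image's title attribute, as described here, can be sketched like this. The nesting mirrors the div → div → a → img path the video walks; the class names and "EVGA" are just stand-ins for Newegg's real markup:

```python
from bs4 import BeautifulSoup as soup

page_soup = soup("""
<div class="item-container">
  <div class="item-info">
    <div class="item-branding">
      <a href="https://www.newegg.com/EVGA">
        <img title="EVGA" src="evga-logo.png">
      </a>
    </div>
  </div>
</div>
""", "html.parser")

container = page_soup.find("div", {"class": "item-container"})

# Dot notation drills in: first inner div -> its first div -> <a> -> <img>.
# Tag attributes are then read like dictionary keys.
brand = container.div.div.a.img["title"]
print(brand)  # EVGA
```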
to do instead is a findAll targeting the exact class I want: find me all the `a` tags with the class "item-title". So: `container.findAll("a", {"class": "item-title"})` — the tag name "a", then a dictionary object saying look for the class "item-title". This gives me back a data structure with everything it found — hopefully only one thing, so we don't have to loop over it. I'll assign it to a variable: title_container. Looking at title_container, I should have what I'm looking for — beautiful, the name of the graphics card is in there somewhere. I'll copy this into my script so I can run it later.

Going back: title_container isn't the actual title yet; I still have to extract the title out of it. Notice it prints inside square brackets, which means it's inside an array — or, since we're in Python, a list. So I index with `[0]` to grab the first object. And inside that first object — it's not inside the `i` tag, it's actually the text inside the `a` tag — so `.text` should get me what I want. Yes: `title_container[0].text` gives me exactly what I want. I'll place that in my script and call it the product name: `product_name = title_container[0].text`. So now I've got the brand — the make of the graphics card — and the name of the graphics card, and next we can grab shipping, because shipping seems like something else they might all
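Pulling the product name out with findAll, indexing the list, and taking `.text`, as described here, can be sketched as (the card name is a made-up example):

```python
from bs4 import BeautifulSoup as soup

container = soup("""
<div class="item-container">
  <a class="item-brand" href="#">EVGA</a>
  <a class="item-title" href="#">EVGA GeForce GTX 1070 Graphics Card</a>
</div>
""", "html.parser")

# findAll returns a list, even when only one tag matches the class.
title_container = container.findAll("a", {"class": "item-title"})

# Index into the list, then take the text inside the <a> tag.
product_name = title_container[0].text
print(product_name)
```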
have. So let's figure out where this shipping information lives and how much shipping costs — I think some of them ship at different prices; yep, this one is $4.99 shipping. I need to find all `li` elements — li stands for list item — with the class "price-ship". I copy that class name and do `container.findAll("li", {"class": "price-ship"})`, and this will hopefully give me a shipping_container — there should only be one tag in this thing that has shipping in it. (Oh, I need to close that function's parenthesis.) Looking at shipping_container, you can see it gives me back an array of the things that qualified — in this case only one thing came back. So I do the same thing I did earlier: reference the first element, and then the text: `shipping_container[0].text`. But this brings back a lot of empty space — notice there's a carriage return and a newline before and after the text. I want to clean that up, because I just want the text, so I call `.strip()`: strip removes whitespace, newlines, and all that good stuff before and after, so it just says "Free Shipping" now. I'll grab this and throw it into my script as well.

So now I've grabbed three things. I also need the findAll calls from earlier — if I scroll up a few times I can find them — so the shipping_container lookup goes in here too, with its findAll closed properly. Three things: the product name, the brand, and the shipping. Cool. This is ready to be looped through, but before that I want to print it out. This is
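The shipping lookup with its whitespace cleanup can be sketched as (again with stand-in markup using the "price-ship" class the video finds):

```python
from bs4 import BeautifulSoup as soup

container = soup("""
<div class="item-container">
  <ul><li class="price-ship">
      Free Shipping
  </li></ul>
</div>
""", "html.parser")

shipping_container = container.findAll("li", {"class": "price-ship"})

# .text keeps the surrounding newlines and spaces; strip() removes the
# whitespace before and after, leaving just "Free Shipping".
shipping = shipping_container[0].text.strip()
print(shipping)
```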
where I'll show you why Sublime is my favorite editor: it does multi-line editing. I enter three blank lines, copy my three variables — copy, copy, copy — paste them in, and format them nicely. I'm going to print all of these out to the console just so I can see them: I copy each variable name, put it in quotes as a label, then use + for string concatenation, so it prints each of the three things for me — the brand, the product name, and the shipping. Before I throw this into a CSV file, I want to make sure this loop works.

I'll save this web scraping tool — I'll call it my_first_web_scrape.py. If I go to that folder, the file should be there; I right-click and open up another console (notice I had a console before, but that one is running Python — I want this new one). I'm inside the file path that contains the script, so all I need to do is tell it to run Python and execute this script: `python my_first_web_scrape.py`. I hit Enter and — look at that — it went through the loop and grabbed every graphics card for me.

All I have to do now is throw this into a CSV file, and then I can open it in Excel. Let's finish up our code — I don't really need the console prototype for this part, because I know the script works now. To open a file you use the simple open() function, and I need a file name: `filename = "products.csv"`. I want to open that file name, and I need to instantiate a mode — in
this case "w" for write: I'm going to open a new file and write into it. I'll call the file handle f — the normal convention for a file writer is f.

First I want to write the headers, since a CSV file usually has headers: `headers = "brand,product_name,shipping\n"`. I'll call it product_name rather than name, because if I load this into a SQL database later, "name" is a keyword in SQL. I also need the newline at the end, because CSV rows are delimited by newlines. Then `f.write(headers)` writes the first line as the header.

Next, every time the loop runs, I want it to write to the file. So in addition to printing to the console — which I'll let it keep doing — I add an `f.write()` of the three things: brand, product name, shipping, concatenated together with commas in between.

But let me double-check something first — are my strings clean? No, they are not: the product names have commas inside them, and that's going to create extra columns in my CSV file. So before I write the product names, I need to do a string replace: call the replace function so that every comma is replaced with something else. I like to use a pipe, but you can delimit with anything you want — you're the programmer, you can do whatever you want as long as it doesn't error. And don't forget: each row also needs to end with a newline. So every time it loops through, it grabs and parses all of the
data points and writes them to the file as a line. Once it's done looping, I have to close the file with f.close() — if you don't close the file, you can't open it elsewhere, because only one thing can have a file open at a time.

So I run the script again — I just push up to rerun it, but you have to save the script first, so Ctrl+S to save. I get a syntax error: oh, I forgot a concatenation — I need to add a + there. Fixed. I run `python my_first_web_scrape.py` and it went through. After running that script, it scraped everything, printed everything to the console, and, more importantly, wrote everything to the CSV file like I told it to. If I open the file right now, you can see it has scraped the entire page and thrown every product in as a row in the CSV file.

You can go ahead and scrape the other details as well, like whether or not an item has a sale price, or what the image tag might be. And if there are multiple pages — on Amazon, for example, there are multiple pages of products — you can start looping through them: usually up in the URL there's a "page=" parameter, so you can write a loop that requests page 2 instead of page 1.

That concludes today's lesson on how to web scrape with Python. I hope you learned a lot and had fun doing it. Now I really want to hear from you: did you enjoy this kind of video? Do you want more coding videos, more data science videos? And if there's a better way to code something, let me know — I'm always happy to hear from you. What do you enjoy? I want to make this content for you. I'll see you later, and happy coding.
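Putting the whole walkthrough together, the finished script looks roughly like this. It's a sketch of the video's my_first_web_scrape.py: the live fetch with uReq is shown in comments, and the example runs against a small inline stand-in for Newegg's HTML so it stays self-contained; against the real page the parsing and CSV-writing steps are the same.

```python
from bs4 import BeautifulSoup as soup
# For the real page you would fetch the HTML first:
#   from urllib.request import urlopen as uReq
#   uClient = uReq(my_url)
#   page_html = uClient.read()
#   uClient.close()

# Inline stand-in for Newegg's search-results HTML.
page_html = """
<div class="item-container">
  <div class="item-info"><div class="item-branding">
    <a href="#"><img title="EVGA"></a>
  </div></div>
  <a class="item-title">EVGA GeForce GTX 1070, 8GB GDDR5</a>
  <ul><li class="price-ship"> Free Shipping </li></ul>
</div>
"""

page_soup = soup(page_html, "html.parser")  # does my HTML parsing
containers = page_soup.findAll("div", {"class": "item-container"})  # grabs each product

filename = "products.csv"
f = open(filename, "w")  # "w" opens a new file for writing

headers = "brand,product_name,shipping\n"  # CSV rows end with a newline
f.write(headers)

for container in containers:
    # Brand comes from the title attribute of the branding image.
    brand = container.div.div.a.img["title"]

    title_container = container.findAll("a", {"class": "item-title"})
    product_name = title_container[0].text

    shipping_container = container.findAll("li", {"class": "price-ship"})
    shipping = shipping_container[0].text.strip()

    print("brand: " + brand)
    print("product_name: " + product_name)
    print("shipping: " + shipping)

    # Commas inside a product name would create extra CSV columns,
    # so replace them with a pipe before writing the row.
    f.write(brand + "," + product_name.replace(",", "|") + "," + shipping + "\n")

f.close()  # close the file, or nothing else can open it
```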
Info
Channel: Data Science Dojo
Views: 1,298,174
Rating: 4.9174485 out of 5
Keywords: beautifulsoup web scraping, Beautiful Soup, beautifulsoup, python, web scraping, Data Collection, Code, python scraping tutorial, python (programming language), website scraping, csv sublime text, scrape and download as csv python, web scraping python, python web scraping, web scraping with python, python scraping, python web scraping tutorial, web scraping python beautifulsoup, web scraping using python, beautifulsoup python, web crawler python, beautiful soup, beautifulsoup4
Id: XQgXKtPSzUI
Length: 33min 30sec (2010 seconds)
Published: Fri Jan 06 2017