Day 25: Web Scraping on Javascript Driven HTML using Python - Part 1

Video Statistics and Information

Captions
All right, so we left off where we were able to scrape a page that was loaded synchronously with a request. If I click on something and the page reloads, the browser reloads, all that stuff, that's a synchronous request. An asynchronous request, on the other hand, is something that loads typically through JavaScript using what's called Ajax. It's an asynchronous request to the server, which means that content loads dynamically. It's similar to when you're on YouTube and you click the subscribe button, like on joincfe.com/youtube: clicking that subscribe button sends an asynchronous request, assuming you're logged in. That's also true for a lot of things that we have on our site as well. What we want to see is how these asynchronous requests work and how you actually scrape data from a page that uses them.

A really good example of this is Chris Burkard's website. He is a very talented photographer whose work I admire a lot. I would say that what we're doing here is really for educational purposes; I do not advocate just stealing an artist's work. In this case it's just because he has such a perfect site to learn and understand this, at least at this time. And of course, if you're watching this and for some reason his site has changed or this functionality has changed, the idea here will still hold, so you can definitely still try what we do on other sites.

So instead of scraping data, we're gonna actually scrape images. We're gonna get direct images, like the slideshow that you see working here. I must say that this is not gonna work for all slideshows, and it's not gonna work for every sort of animation. That's not really the point. The point is that when I refresh this page, it loads, and then it's loading, it's loading, and now it's still loading more stuff. Notice that it wasn't completely loaded all at once. That was all JavaScript related, and that's how it's working. And if I inspect the element here, or the page, we can see that a lot
of things are changing as time goes by. So I refresh again, see how it is, and it's loading all that stuff in. That's kind of the point, and that's what we're gonna be working on here. It has nothing to do with scraping from a slideshow; that's not the goal here. The goal is to scrape from something that's using JavaScript and Ajax versus just a regular request.

So what would this look like if we did the intuitive thing, which is using Python requests and BeautifulSoup for this URL? Well, let's go ahead and jump into Sublime Text. We're going to create a new document in here, so Command+N is the shortcut for that, and of course I'm gonna save this into our project, which we have on our desktop inside of scrape. I'm gonna make a new folder in here, call it 25, and we'll call the file js_scrape.py.

Okay, so in here this is what we're gonna be doing. We should already have a lot of these things imported. I'm gonna go ahead and show the sidebar so it just comes in a little bit. All right, so we're gonna go ahead and import requests. This is nothing new; this is from the previous days of everything that we've done so far in web scraping. Then we go ahead and import BeautifulSoup, and then we are gonna set our url. I'm setting our url to chrisburkard.com; coming back in here, I just copy that whole thing. Then we're gonna do a web request. I'm just calling the variable web_r (of course it's an arbitrary variable name) and I'll do requests.get(url). Then we're gonna do web_soup equals BeautifulSoup, and this is going to be web_r.text, and we're gonna be using the HTML parser. Oops, I need a comma there. So html.parser, and then we are just going to go ahead and print out web_soup.find_all, and I'm gonna do img. img of course is the image tag, so in HTML it is img, then src, and then equals some web source, right? So that's the
actual HTML tag we're calling here; that's what we want.

All right, so now that I've got this, let's go ahead and jump into my terminal. Let's make sure that I have everything installed by doing pip freeze, and I see that I have Beautiful Soup and requests installed, so I can actually run this. But I'm gonna go ahead and do it inside of the 25 folder, and we're gonna clear this out, and I'm just gonna run Python (of course we're using Python 3.5 here), and I'm gonna go ahead and bring in all of this stuff.

What we see as a result: we have some images coming here. There's only a few images. If we did a bunch of background research on this, that is, if we went through the source itself, it's probably not gonna have only a couple of images, especially when we inspect it. Maybe in View Source it won't have only one or two, but as we see here there's not that many. So let's actually just look at the length of this. If I just did len of this whole thing, I could find out how many images there are, and there's only four. That doesn't seem right. In fact it's not: as we've been talking, more than four have come up. His logo is an image, and all of these down here are probably images too. So we need to figure out how we're gonna actually grab all this stuff. But the main thing isn't grabbing all of the images; again, that's not the goal. The goal is just to use our intuition to see that, hey, this is JavaScript, we need to do something different.

The thing that we're gonna be doing differently is that, instead of using Python requests solely, we're gonna use something called Selenium. Selenium is a Python library or package that allows us to run a web driver. This means that we can run something like Firefox or Chrome to actually load a web page and do all the things that we expect a web page to do, whereas Python requests is not like that. There might be something that you can actually do with Python requests like what we're about to do, but
in my research Selenium was the best choice for what we're about to do. So basically, we're gonna be opening up Firefox through Selenium. Selenium is gonna open up Firefox, and Firefox of course is a browser that you're going to want to download if you don't have it already, so definitely download Firefox so this can work. Windows users, I think it might be just slightly different for you, but I also think that pip should work. If it doesn't, definitely check out the documentation; it's fairly straightforward on what you need to do. If people are still having trouble, please let us know in the comments below, and all of you Windows users, please comment together so then we know that this has been a bit of a challenge for Windows users. But what we're about to do shouldn't be a challenge; it should work just fine.

So we're gonna go ahead and install Selenium with just pip install selenium. Oops, I want to exit out of Python first. So I'm gonna clear this out, we're gonna install Selenium, and now what I want to do is actually get the HTML in a different way. Right here is the HTML text, right? That's actually getting the HTML that we worked with; more specifically, web_r.text is the actual HTML. We can print that out if we wanted, but what I'm gonna do instead is use the web driver from Selenium to open up Firefox. Then we're gonna open up a URL, which is this same URL, and then we're gonna see what that document is by executing a JavaScript statement for what we want to see.

So let's go ahead and do driver equals webdriver.FireFox, not "Fire X Fox" but just Firefox, so webdriver.FireFox. Oops, I need to import Selenium first, so from selenium import webdriver, and we're gonna bring this up to the top here. So we've got our Firefox web driver here, and let's actually see what this does. I'm gonna go ahead and just import it, and of
course I need to jump into Python. Let's go ahead and import it and run that driver, and let's see what happens. So we've got nothing called FireFox, so let's try that one more time with the second F lowercase, because I think that's actually how Firefox spells it. We try again, and this time I see that it loaded. Notice that Firefox actually opened. There's no web address, there's nothing there, but Firefox is open.

So let's add a web address. I'll do driver.get (notice it's get, just like a GET request would be) and I'm gonna do http://www. and I'll just do codingforentrepreneurs.com/, and we'll go ahead and press Enter. Notice the browser changed, right? So Selenium is driving this browser for us; it's actually going to our website and showing us what's going on in that browser.

Now that's not what we want. We don't want to use Coding for Entrepreneurs; instead we are gonna use Chris Burkard. Like I said, Coding for Entrepreneurs is also using JavaScript at this time, but we're gonna go with something that's probably a little bit prettier than what we've got. So let's go ahead and open up Chris Burkard's website. Notice how it's loading in the background, and it is running from Python. So this is another way you can run a browser from Python, which is actually pretty cool. That means you can do all sorts of stuff here as well. And again, notice I'm just writing in code, but as you can see you can run functions, you can do all sorts of things to make your browser behave a little bit differently. So if you're a business and you want to have your website running a certain way, you could build a Python function that actually does that; this is how you would do it.

Anyway, let's get back to the task at hand. I'm gonna go ahead and grab the HTML here. I want to say html equals driver.execute_script, and I'm actually gonna write the script on our page here so we don't
actually have to type it out. I also want to make it so you guys can go ahead and copy this, but it's really simple: it's return document.documentElement.outerHTML. What that's doing is making a JavaScript call to get the document's HTML, that is, the HTML of the document as the browser has it after the JavaScript has run. So let's go ahead, and I'm just gonna press Enter there, and we're gonna copy this now, and if I press Enter that should be fine. Notice we're still using driver here; driver is the same thing right here, because we set that variable. If you changed the driver variable name, that's fine, you can absolutely do that. But now that we've got this HTML, I can type out html, and I see that I have the opening html tags, and I can scroll all the way to the top and see the rest of the HTML.

So now that I have the HTML, what can I do? Well, I can use BeautifulSoup again. I'm gonna go ahead and say py_soup, as in Python, but what's driving this now, instead of Python requests, is actually Selenium, so let's call it sel_soup. I'll use this html now, paste that in here, and I'll use sel_soup in here. We want to make sure that we import these other things as well, so import BeautifulSoup again and let's get sel_soup in here. Just like what we did before with the images, we're doing the exact same thing. I'm just gonna copy this one, find_all img, and press Enter. Whoops, it's not web_soup anymore, so let's bring it down as sel_soup. We try that one more time, and there we go: we've got a bunch of images in here. If I do the length of this again, I see now that the number has doubled. Great, so that means that I actually have some images that I can work with.

Since I have images, I can run a loop and grab those images, really simple. I'm going to do images equals an empty list, and we're going to iterate through the results. So I'll say for i in sel_soup dot
find_all img, so the same call as up here, basically. We're gonna loop through all those. I'm gonna print out i first, because we want to see what comes back from that. I already know what comes back from it, so I'm gonna go ahead and say src equals i, and we want to get the dictionary value, so the actual key-value pair: I'm using the key of src, and that will give us the value that it stands for. Then we'll just do images.append(src). After that's done, we're going to print out images. This is not going to be a whole lot different than what we see here, although it will be slightly different. So let's go ahead and do this: just copy these, paste them in the terminal, and we print out images. This is what we see: we now see URLs that are coming through here, right? So we're actually seeing the sources coming through. We're going to go ahead and stop right here and pick it up in the next one.
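The steps dictated in the captions can be collected into one short script. What follows is only a sketch under a few assumptions, not the exact file typed in the video: the function names fetch_rendered_html and image_sources are made up for illustration, the sample HTML is a hypothetical stand-in so the parsing step can run without a browser, and img.get("src") is swapped in for the i["src"] lookup described in the video so that an img tag without a src attribute is skipped instead of raising a KeyError.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Open `url` in Firefox through Selenium and return the rendered HTML.

    Hypothetical helper: needs Firefox and the selenium package installed.
    """
    # Imported lazily so the parsing helper below works without Selenium.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # The JavaScript statement from the video: the document as the
        # browser currently holds it, including script-added elements.
        return driver.execute_script("return document.documentElement.outerHTML")
    finally:
        driver.quit()  # always close the browser window


def image_sources(html):
    """Return the src value of every <img> tag in `html` that has one."""
    soup = BeautifulSoup(html, "html.parser")
    images = []
    for img in soup.find_all("img"):
        src = img.get("src")  # None when the tag has no src attribute
        if src:
            images.append(src)
    return images


# Hypothetical stand-in for a rendered page, so this part runs offline.
sample_html = """
<html><body>
  <img src="/static/logo.png">
  <img alt="slide that has no src attribute">
  <img src="https://example.com/photo-1.jpg">
</body></html>
"""
sources = image_sources(sample_html)
print(sources)
```

Against a live page the two helpers would be chained as image_sources(fetch_rendered_html(url)), with url set to the site being studied. Depending on the Selenium version, a geckodriver binary may also need to be available for Firefox to start. Note that driver.get returns once the page's load event fires, so content that arrives through later Ajax calls may need an explicit wait, which the video does not cover in this part.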
Info
Channel: Python Codex
Views: 38,144
Rating: 4.8664045 out of 5
Keywords: python, python programming, python language, python code, python for, python loop, python mysql, python script, python input, python examples, python with, python super, python course, python time, learn python programming, time module in python, python tutorial for beginners, python full course, python training, python tutorial, python for beginners, Web Scraping on Javascript Driven HTML using Python, Javascript Driven HTML using Python
Id: vcnomT0CP0Y
Length: 13min 48sec (828 seconds)
Published: Tue Nov 14 2017