Scrapy Splash for Beginners - Example, Settings and Shell Use

Video Statistics and Information

Captions
In this video I'm going to show you how you can use Scrapy-Splash to scrape dynamic websites and JavaScript-rendered content with Scrapy. My last video was on a basic Scrapy project, so if you haven't seen that and you're interested, go ahead and watch it. I also have another video that shows you how to set up Splash itself; I'm not going to cover that here, this video is just the integration between the two. I have Docker Desktop running with Splash up, and if we come over to the browser, the Splash welcome page confirms it's working. I also have a Scrapy project started: if I run the tree command you can see the default files. You'll also need to run pip install scrapy-splash, which is what we're going to use to make the two work together.

There are a few things we need to change in the settings, so I'll do those first, for those of you who just want to see that bit; then I'll show you how to use the Scrapy shell with Splash, and then we'll do a quick demo project too. Looking at the documentation I have open, we've already done the pip install and Docker is running (the docs also show a different way to run Splash; it's up to you how you want to get on with it), and it says we need to add a set of configuration values to our Scrapy settings. My settings.py is under the splash demo project folder, so I need to cd into that directory, and then I'll use vim to edit settings.py. If you have a default, mostly commented-out settings file like I do here, and you're only using Splash's middlewares like I am in this project, you can put them all at the top of the document, just underneath the bot name.
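The handful of values the documentation asks for end up looking like this in settings.py. This is the set of names and priority values documented in the scrapy-splash README; the URL assumes a local Splash container on port 8050, so adjust it to match your own instance:

```python
# settings.py -- scrapy-splash integration, per the scrapy-splash README.
# Adjust SPLASH_URL to wherever your Splash instance is listening.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```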
Under BOT_NAME in settings.py, I'll give myself a little bit of room, add a comment that says "splash setup", for lack of a better name, and scroll down so we're in a part of the screen we can all see. Then I'll copy in each of the settings the documentation lists. The first one is SPLASH_URL: I know my instance is on localhost port 8050, so I'll write that in. It's probably going to be the same for you, but check your own Splash URL; it's the one I showed you in the browser a minute ago. Next I'll copy in the downloader middlewares (the formatting goes off because I'm pasting directly into vim, and we'll just have to get over that for the moment), then the spider middlewares, then the DUPEFILTER_CLASS, and finally the HTTPCACHE_STORAGE. I'll save and close the file, and that's all we need to do in the settings for the moment.

Before I get on with a slightly larger project, I want to show you how we can use this with the shell, so I'll run scrapy shell; this is how we would normally interrogate and parse the response from a web page. I've got a website here that won't normally let you scrape it, because all the products are loaded dynamically: if you check the source, none of the product information is there. I'll copy the URL, and inside my Scrapy shell I'll do fetch with that URL and hit enter. It says we've got a 200 response, which is good, and if I type response we can see the response object.
If I do response.css("title::text").get(), we get the text of the title back, so all good so far; and we're not actually using Splash yet. But if I inspect the page and try to get, say, a product name, which is in an h4 tag, and run response.css("h4"), we get lots back; the first one is just "search results". It's easier to look at the text, so let's do response.css("h4::text").getall(), with the two colons before text. We can see that all of these are parts of the website, but none of them are product names, despite the fact that, looking at the page (sorry, it's too small for you to see, my bad), the actual products are in h4 tags.

So now I'm going to use the shell with Splash. I have Splash running, which I just showed you, so I can redirect my request through Splash to get the rendered response back into the shell. Let's do fetch again, but this time we give it our Splash URL, localhost:8050, the one we just put into our settings file, then the render.html endpoint, which is the point we want to hit, then url= and the site's URL. When I hit enter, the first thing we get is a 404, because Scrapy is trying to crawl robots.txt on our localhost, which isn't there; but then the real response comes back. Again, response.css("title::text").get() returns the title, which is good. I can't scroll back up to the earlier command, so I'll type it out again, not a problem: response.css("h4::text").getall() for every single h4. This time, underneath everything we had before, we can see the product names as well.
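The fetch we just ran hits Splash's render.html endpoint, which returns the page's HTML after JavaScript has run, with the target page passed as a url parameter. Here's a small sketch of how that fetch URL is put together; the target address is a made-up stand-in for the shop page in the video:

```python
from urllib.parse import quote

# Matches SPLASH_URL from the settings file
SPLASH_URL = "http://localhost:8050"

# Placeholder for the dynamically loaded shop page from the video
target = "https://example-store.com/products"

# render.html returns the page HTML after JavaScript execution;
# percent-encode the target so its own "://" doesn't confuse the query
fetch_url = f"{SPLASH_URL}/render.html?url={quote(target, safe='')}"
print(fetch_url)
# http://localhost:8050/render.html?url=https%3A%2F%2Fexample-store.com%2Fproducts
```

In the shell you would then run fetch(fetch_url). The 404 seen first in the video is harmless: it's just Scrapy looking for robots.txt on localhost.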
They're all there. We actually picked up some other stuff at the bottom, but basically this is showing us that Splash is rendering the page and working for us; the difference between the two fetches is getting the dynamic content back or not.

From here I'm going to quickly pull out some information for each one of these products, so we can turn this into more of an actual scraper. I'll inspect a product link and copy its class. The class has lots of spaces between the words; in a CSS selector you can just put a dot in place of each space, so let's do response.css with the a tag and a dot in each gap, and see if that works. Okay, we got loads of data back, good. Let's take the first one with .get(): somewhere in there is the SKU, there's the link, and somewhere in here there should be the price as well. It's all over the place, but now I know that's where the products are, so I'm going to assign it: products = response.css(...), because this is the line we're going to put into our Scrapy project. Now I can do products.css("h4::text").get(), since h4 is where the name was, with the double colon on text because that's what we want, and that's the first product's name. Let's try to get the price: it's in a span tag with a class of price, so products.css("span.price::text").get() gives us the price.

I'm going to leave the shell open for the moment, and we'll move over to VS Code and start writing our scraper. All your spiders need to go into the spiders folder, so inside there I'm going to create a new file and call it beer_spider.py, and now we can go ahead and construct our spider in there.
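Before writing the spider, it's worth seeing what those h4::text and span.price::text selectors are actually pulling out. Scrapy's response.css does this for us, but here is a stdlib-only sketch of the same idea against a made-up HTML snippet (the product names and markup are invented for illustration):

```python
from html.parser import HTMLParser

# Invented sample markup mimicking the shop page's product cards
SAMPLE = """
<div class="product">
  <h4>Punk IPA</h4>
  <span class="price">£2.50</span>
</div>
<div class="product">
  <h4>Hazy Jane</h4>
  <span class="price">£2.75</span>
</div>
"""


class ProductParser(HTMLParser):
    """Collects text inside <h4> and <span class="price"> tags,
    mimicking response.css("h4::text") and css("span.price::text")."""

    def __init__(self):
        super().__init__()
        self.names, self.prices = [], []
        self._target = None  # list currently collecting text, if any

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            self._target = self.names
        elif tag == "span" and ("class", "price") in attrs:
            self._target = self.prices

    def handle_endtag(self, tag):
        self._target = None

    def handle_data(self, data):
        if self._target is not None and data.strip():
            self._target.append(data.strip())


parser = ProductParser()
parser.feed(SAMPLE)
print(parser.names)   # ['Punk IPA', 'Hazy Jane']
print(parser.prices)  # ['£2.50', '£2.75']
```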
Let's collapse the file tree down a bit and make the editor bigger. The first thing we want to do is import scrapy, then create a class for our spider. I'm going to call it BeerSpider, make it inherit from the Spider class, scrapy.Spider, and give it a name: "beer", why not. Now I'll write a function so we can actually get the data. I'm calling it start_requests, and it takes self, because it's part of the spider class we've created. I'll say url equals the page we're looking at, then yield scrapy.Request with url=url and callback=self.parse, which we're going to write just now. So: def parse(self, response), taking self and the response from our request. Now I'll go back to my shell and grab the line where all of our products are, copy it in here, and then say for item in products to loop through each one and get the information out. There's the price selector, so I'll paste that in for the moment and format it in a second, and there's the name selector, so let's put that in too. Then we want to yield our results out of here as a dictionary: "name" for the name and "price" for the price. And that should work for us. Ah, sorry, I missed something: inside the loop these selectors need to run on item rather than response, because we're looping through each product. So that's a basic Scrapy spider.
If I were to try to run this, we're probably not going to get anywhere, because I haven't told it to use Splash yet. So let's check I'm in the right folder, yes I am, and do scrapy crawl beer. We'll just run it and see the errors, because it won't find anything, and then we'll run it again with Splash added. Here we go: it finished, we got the request, but there was no information there; it didn't actually get anything out. So the last thing we need to do is change a little bit of our code so that it knows to look to Splash for the response first. If we come back to the documentation we were looking at earlier, we can see down in the example that we need to import SplashRequest from scrapy_splash and then change our Request to a SplashRequest. I'm just going to copy that in, then copy the import as well and put it just underneath the scrapy import. If I save that and move back to the terminal and run the spider again, we should get some different output this time, because it's going to be using Splash to render the page and therefore finding the products. There we go, we can see them all just there, so that works perfectly. If I run it again and give it -o for output with a beer.json file, we can see the products in JSON format.

So essentially that's it, guys: make sure you have Splash set up (again, I've got another video for that, link down below, and it's really simple), then add in the middlewares and the settings, and change your Request to a SplashRequest. If I come back to VS Code, we can see the beer.json file there; it has some strange characters in it that we'd need to get rid of, but there's the data. Pretty straightforward. So thank you very much for watching; that's going to do it for this one. If you liked this video, give it a like and drop me a comment. If you're interested in this sort of thing, there's loads more Scrapy and loads more web scraping to come, as well as loads more on my channel already. Thank you very much, guys, and I will see you in the next one. Goodbye!
Info
Channel: John Watson Rooney
Views: 10,121
Rating: 4.9861112 out of 5
Keywords: scrapy splash python, scrapy splash example, scrapy splash shell, scrapy splash tutorial, scrapy splash javascript, python scrapy splash tutorial, scrapy splash settings, scrapy with splash, scrapy splash windows, scrapy-splash tutorial, scrapy-splash example, scrapy splash, scrapy tutorial for beginners, scrapy and splash, scrapy dynamic web pages, scrapy example, python coding, web scraping, python web scraping, learn web scraping with python
Id: mTOXVRao3eA
Length: 14min 10sec (850 seconds)
Published: Wed Dec 16 2020