Scrape Javascript with SPLASH - how to install and get started with Splash

Captions
In this video I'm going to show you how you can install Splash on your Windows PC so you can easily and quickly scrape dynamic websites and JavaScript websites using Python. Splash is a lightweight browser from Scrapinghub that has an HTTP API built in: we send requests to it, it renders the page for us, and it returns the information back. What this means is that we can quickly and easily put it into our Python code, and with a few minimal changes we can use Requests and Beautiful Soup to scrape almost every website.

The main thing I think that puts some people off learning how to use Splash, especially beginners, is the fact that you need to install and run Docker to do so. Docker can be quite daunting to some people. It is used really widely and is really popular, it's not difficult to learn, and what we need to do with it is really simple. So I'm going to show you how you can use the Docker Desktop app for Windows to create a container which we install Splash into with just a few commands. We only need to use the terminal once, and that's pasted in, so it's nice and easy. Let's get into it: I'll show you how to install everything and how to start with a basic scraper using Splash.

As I said, to install Splash we need to install Docker. I'm on the Splash documentation page right here, and it gives us some installation instructions for installing on Linux or OS X. If you're not on either of those operating systems and you're on Windows, we can actually just install Docker Desktop, which I find is the easiest way to do it and to manage it. I think this works on Mac as well. So go ahead and come to this URL (I'll link them all down below) and click on download for Windows. I've downloaded this already, so I'm just going to run it and install the Docker Desktop app. I'm going to leave the "install required components for WSL 2" option selected because I do use that, and I'm going to let that run. This can take a little while, so I'm going to let this go and
I'll come back to you. Now it has successfully installed, go ahead and click close, and we just need to run it on our system, so I'm just going to hit run. The first time you run it, it takes a little while; it's going to do some setup, and that's actually happening behind my camera. Once that's done we can open up Docker and start using the container and installing Splash.

So now we've got Docker running, we've got a couple of options here: we can either go through the tutorial or we can skip it. I'm going to skip it because I know that I don't need to do it, and that brings us here. It says we've got no containers running. What we need to do is come back to the installation instructions for Splash and copy this install command here. This is actually a Linux command, so I'm only going to copy the second part, and then I'm going to run it in PowerShell. You can run it in Terminal too on Windows if you want to; I think most people have PowerShell, so I'm just going to use that instead. I'm just going to paste that in and let it run. It's going to do some downloading and installing; again, this might take a couple of minutes, so once this is done I'll come back.

That has installed successfully, so I'm going to go ahead and close that out and come back to Docker, and we can see under Images we now have scrapinghub/splash as an available image on the disk. If we hover over it we get this run command. We want to click on that, and we do have optional settings that we want to use: under container name I'm just going to write the word splash, and for the port you can choose whichever one you want. The default it suggests is 8050, so I'm going to use that one. Then we can click run and it will confirm to us that it is running. What we can do now is actually find this: because this is running on our localhost, we can access it via our browser. We can click on this button here; I'm just going to do it in Chrome. I'm going to go and type in
localhost:8050, which was the port we used, and that loads up this page. This means everything's working fine and Splash is available. This gives us a lot of options here: we can read the docs, we can see some examples, and we can actually do some basic rendering on here to see how things work. I like to do the "screenshot of multiple pages" example with website URLs. Then if I hit the Render me button, what it's going to do is render those pages, take a screenshot of each of them at this moment in time, and make it available to us to see, which is quite a cool way of showing how it works.

So basically what we can do is now use this instance of Splash that we've created and are running in our Docker container to render the pages for us. We can stop our service as well by coming back to the container and clicking stop, and that will shut it down, so if I try to run this now it's not going to find anything. If we click to start it back up again it will just start back up, and it's come back straight away because we entered the details in here.

The main thing that we want to use this for is to render JavaScript websites for us, and we want to be able to use it within our Python code. We can do this really simply. We're still going to use Requests and Beautiful Soup, but we're going to need to change a couple of little things so that we put Splash in the middle. Basically our request is going to go from our Python code to the Splash HTTP API, that's going to do its thing, and then it's going to return the information to us. I'll do a demo of that for you now.

In this file we're going to import requests, and I'm going to set a URL. I'm just going to use Amazon for this. I'm going to go ahead and pick an Amazon product, and any will do as a demo just so we can see it working. Actually, we can use this search term here, so if we put that in our URL
and if we were to do it normally, we would do r = requests.get and we would give it the URL. If I print r.status_code for this, you might be tempted to think this is working, but we get a 503 (Service Unavailable), and here that basically means we've been blocked. If I print r.text, we'll see somewhere in this text it says "are you a robot" and "sorry, we just need to make sure you're not a robot".

But to access this page via Splash, we need to send our request to Splash and then give it the URL that we want it to render for us. To do that, instead of putting our URL here first, we want to put in the localhost URL for our Splash service, which is http://localhost:8050, with 8050 being the port, and the service we want to use is render, so we just type /render.html. After that we need some parameters, so we're going to do params= and put in a dictionary. Our first key is "url", the URL that we want to scrape, and because I've got that saved up here as url, I can put url in again. The next one we're going to use is called "wait", and I'm just going to put 2 in here; that is how long, in seconds, Splash waits after the page has loaded before it returns the result.

Now if we go and print r.text we should hopefully get a lot more information back that resembles the actual content of the page, and it does appear to do so: somewhere around here there's some actual HTML code that looks like the page itself. So what I'm going to do is quickly import Beautiful Soup, from bs4 import BeautifulSoup, and I'm going to create a soup object: soup = BeautifulSoup (could never spell this word) of r.text and then "html.parser". And now we're going to print (my microphone's in the way, that's why I can't type) soup.title.text. Now if we run that we should get the actual title of the page back, and we can see that matches our browser up here.

So what we've successfully done is this: we've installed Docker and
we've installed Splash, we've started the Splash service in the Docker container, and we have accessed it through our Python code to render the pages for us. This will do JavaScript, and you can do image blocking, ad blocking, the whole lot if you do the configuration right. We can see how easy it is now: whenever we want to scrape dynamic pages and JavaScript, we can just start up our Docker container, start our Splash service, and send the request off to Splash to render the page for us. And because it's basically built by Scrapinghub to render pages for scraping, it's super quick and it's exactly how we would want it to be.

Obviously the downside is that we are still sending the requests from our IP, so we can still get blocked, but we can rotate through proxies quite easily. We can do proxies with Splash as well; I'll show you that next time.

If we just quickly come back to our Docker container, we've got some information here: we can see stats of how much it's using, and we can see the logs, and the logs will show you all of the information, so you can actually see what you've been doing and what you've found, which is quite cool.

So that's it for this video, guys. I hope you've enjoyed it. I definitely recommend, if you haven't already, giving this a go, so let me know in the comments how you get on, what your thoughts are, how you're finding it, and also let me know what projects you're working on. Like the video if you like it, and don't forget to subscribe for more content. There's lots of web scraping content on my channel already and there's more to come. Thank you for sticking around for this one, and I will see you in the next one.
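
For reference, the request described in the video can be sketched roughly as follows. This is a minimal sketch, not the code shown on screen: it assumes Splash is running locally on port 8050 as set up above, and the helper names (splash_params, splash_request_url, fetch_title) are made up for illustration. The "url" and "wait" parameters and the /render.html endpoint are the ones mentioned in the video.

```python
from urllib.parse import urlencode

# Assumption: Splash is running locally on port 8050, as in the video.
SPLASH_RENDER = "http://localhost:8050/render.html"


def splash_params(target_url, wait=2):
    """Query parameters for Splash's render.html endpoint (illustrative helper)."""
    # "url" is the page Splash should render; "wait" is the number of
    # seconds Splash waits after the page loads before returning the HTML.
    return {"url": target_url, "wait": wait}


def splash_request_url(target_url, wait=2):
    """Equivalent full URL, handy for pasting into a browser to check the service."""
    return SPLASH_RENDER + "?" + urlencode(splash_params(target_url, wait))


def fetch_title(target_url, wait=2):
    """Render target_url through Splash and return the page title, or None."""
    # Third-party imports are kept local so the two helpers above
    # can be used without Requests or Beautiful Soup installed.
    import requests
    from bs4 import BeautifulSoup

    # The request goes to Splash, not to the target site directly.
    r = requests.get(SPLASH_RENDER, params=splash_params(target_url, wait))
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    return soup.title.text if soup.title else None
```

With the Splash container started, calling fetch_title with the page you want to scrape should return the rendered page's title; if the container is stopped, the call raises a connection error from Requests instead.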
Info
Channel: John Watson Rooney
Views: 9,067
Rating: 4.9283581 out of 5
Keywords: splash python tutorial, web scrape javascript, python web scraping, scrape amazon, python tutorial, web scraping, scraping api, splash, scraping with splash, what is splash, render js, render javascript, learn python, python coding, programming tutorial, web scraping lesson, how to scrape dynamic sites, learn splash
Id: 8q2K41QC2nQ
Length: 10min 27sec (627 seconds)
Published: Sat Oct 31 2020