Run Your Web Scraper Automatically Once a DAY

Captions
In this video I'm going to show you how you can run your Python web scrapers in the cloud. We're going to use a cron job on a Linux server and set it to run our code at a specific time every day. The code I'm going to run is a basic web scraper that goes out to a website that lists new products every day, scrapes that information, and emails it back to us, but this could be any scraper script that you've written.

There are a couple of things we need first. We're going to be using a Linux server running in the cloud, so there are a few Linux commands to go through, but I'll run through them all with you. You also need a way to get your code onto that server; the best way is Git and GitHub. If you have a GitHub account you can use that, and if you don't you can copy and paste the code across, but that's nowhere near as easy or convenient.

The first thing we want to do is create a DigitalOcean account. DigitalOcean lets us set up a droplet, which is basically our Linux server in the cloud. These aren't free; however, if you use the link in the description you'll get $100 of credit that runs out after two months, so that gives you two months' worth of testing and messing around without paying anything, which is quite cool. There are other services like this too, or you can use a Raspberry Pi if you've got one handy; that works just as well, and the newer ones are better, obviously.

Once you log in (you can see I've already got a droplet running here), we're going to create a new one: click Create, then Droplets, and once that loads select Ubuntu on a basic plan. Even when it's paid, it's only five dollars a month, so if you have lots of scrapers running constantly this could be a good option for you. I'm going to select London because I'm in the UK, and we'll authenticate with a password, so choose a root password, which you'll need to remember to log in with, and make sure you abide by its rules. Give it a name; I'm calling this one yt-demo, but call yours whatever you like, and hit Create.

While that's booting up, I can show you my repo on GitHub. This is the code we just saw, and it's what I'm going to git clone onto the droplet. We're going to need a few Linux commands, and we're also going to need SSH to open a secure shell connection to the droplet. I like to use the Windows Terminal; inside it you can use either WSL (the Linux shell) or PowerShell if you don't want to install WSL on your system. The Windows Terminal from the Microsoft Store in Windows 10 just looks nicer and is easier to use than the default one.

The droplet has finished booting and gives us an IP address, which we can copy. Back in the terminal (I'm using PowerShell, because I suspect most of you will too), type ssh for secure shell, then root, because we're logging into the root account of the droplet, then @ and paste the IP. Hit yes to get through the host-key prompt; that's fine, it just adds the server to our known hosts. Type in your password, and if you get it right, we're in.
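As a quick sketch of that connection step (the IP address below is a placeholder; use the one DigitalOcean shows for your droplet):

```
# Connect to the droplet as root; replace 203.0.113.10 with your droplet's IP
ssh root@203.0.113.10
# The first time you connect, type "yes" to accept the host-key fingerprint,
# then enter the root password you chose when creating the droplet.
```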
There are two things we always need to do first with a new droplet. We need to update the system, so run apt update; apt is the package manager on Ubuntu, if you didn't know, and that should run nice and quickly. We just want to make sure we start off with an up-to-date system before we do anything else. It's come back and said 56 packages can be upgraded, so then we run apt upgrade. This will take a little longer; there are 174 megabytes of upgrades to do, so I'm going to let it run and come right back.

That's finished updating, which is great. I'll hit Ctrl+L to clear the screen (or you can just type clear and hit Enter). The next thing is to check the Python version on our droplet. If you type python it will probably tell you the command can't be found; that's because we're on Linux, so we need to type python3, and we can see we're running Python 3.8.5, which is fine. Next we need pip installed, because we'll need to install the Python packages our scraper depends on. To do that, run apt install python3-pip and hit yes. I'll have all of these commands written out somewhere below, either on my GitHub or in the description, so you don't have to memorize them as I type them out super fast.

Now we can think about getting our code from GitHub onto the server. The easiest way is git clone, which we'll do in a minute, but first a quick word about requirements. If you were smart (unlike me) and built your script inside a virtual environment, you'd have pip installed your packages into that environment; you can then run pip freeze, output it to requirements.txt, and upload that to GitHub. When you pull everything down onto the server with git clone, you can install from requirements.txt and it will pull in everything you need. I didn't do that, for, um, reasons, so we're going to pip install what we need manually. That's pip3 (remember, on Linux it's python3 and pip3) install; I can't quite remember what we need, but it's requests, BeautifulSoup and pandas, so install requests, pandas and beautifulsoup4 and let that run.

That's done, so clear the screen again. The next thing I want to do is clone the repo, so I'll go back to its main page on GitHub and copy the URL. We want to make sure we're in the home directory on the server. If you type ls here you won't find a lot; type cd followed by two dots to go back up, and now we're in the root of the filesystem. If we do ls we can see all the main Linux folders; it doesn't matter if you don't understand all of them, all we need to know is that we want to be in the home folder, so cd into home, run ls, and we can see there's nothing there, which is good.
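Roughly, the setup commands from this section look like this; the package list is the one mentioned above (requests, pandas and BeautifulSoup), and the requirements.txt line only applies if your repo actually includes one:

```
# Bring the fresh droplet up to date
apt update
apt upgrade

# Check the Python version and install pip for Python 3
python3 --version        # or just run python3 and read the banner
apt install python3-pip

# Either install the packages the scraper needs by hand...
pip3 install requests pandas beautifulsoup4

# ...or, if you exported a requirements.txt from your virtual environment
# (pip freeze > requirements.txt) and committed it, install from that instead:
# pip3 install -r requirements.txt
```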
Now we git clone the repo: type git clone and paste the URL, and that clones everything into our home directory. Run ls and there's our lovingly titled project. If I go into that folder we can see two files, the README and the .py file. If I run python3 followed by the .py file, we get an error, and I did that on purpose: when you upload your code to GitHub, you do not want to put any passwords in there. When your code needs credentials, as this one does for my email account, put them in a separate file; that way the secrets stay out of GitHub, the repo can be public, and you just recreate the file manually on the server.

If I come back to my code, you can see that right at the top I have import creds. creds is another .py file, and all it contains is a password variable with the password as its value. I chose not to upload that creds file to GitHub, so my repo can be public and no one will see my password, but it means I need to recreate the file on the droplet for the script to work. I'm going to use nano to do that; nano is a text editor that's pretty much always installed on Linux servers (you can use vim if you prefer). So I run nano creds.py and in there I replicate what I had before, except with my real password this time, which unfortunately I'm not going to share with you. With that saved, ls shows we now have a creds.py file.

Clear the screen and run the script again, and hopefully we get no errors. I didn't actually have any output from this Python file, but my phone is about to buzz... and there it goes, I just heard it: I've got an email, and it came from this script, so now I know it works on this system. If it doesn't work for you, you'll need to do some debugging to find out why; maybe you don't have the right packages installed, or there are other errors or issues to resolve before you carry on.
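A minimal sketch of that keep-the-credentials-out-of-the-repo step; the placeholder password value and the heredoc (instead of typing into nano) are just for illustration:

```
# Recreate the untracked credentials file on the server
# (nano creds.py works just as well; this heredoc is simply non-interactive)
cat > creds.py <<'EOF'
# creds.py - never committed to GitHub (keep it out with .gitignore)
password = "your-real-email-password"
EOF

# The scraper itself only does `import creds` and reads creds.password,
# so the public repo never contains the secret.
```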
Now that that's done, we can move on and have a look at cron. To get to the cron job list, the command is crontab, so type crontab -e and it will create one for us. This file holds all the cron jobs the system will run for our user. Since it's the first time, it asks you to select an editor; it says nano is the easiest, and that's very true, so hit 1 and the crontab opens. The comments in it do explain how to create a cron job, but I'm going to come over to another website, crontab.guru, which is really handy for seeing what a schedule actually means. Each of the five stars represents, in order, the minute, hour, day of the month, month, and day of the week, and what you put in each field determines how often your script runs. Five stars followed by the path to your code means it runs every minute, forever; we don't want that permanently, but I'll use it in a moment for testing. What I'm going to finish with is "run this code at 8am every day": put an 8 in the hour field and crontab.guru says "at every minute past hour 8", which isn't quite what we want, so change the minute field to a zero and it reads as eight o'clock in the morning, every day. If you click the random button a few times you can see all sorts of different schedules, and you can click around to figure out what each field means; you can even use name tags there. But let's go back to 0 8 * * *, which is the one we'll use at the end.

Now we want to test that our cron jobs are working properly. I'm going to exit out of this and create a new file in the same directory called test.py, and in it I'll just put a simple print statement that says "if you can see this it's working", then exit and save. If I run python3 test.py, you can see we get our output. We can use this in our crontab to check that we're pointing at the right place and everything's working, by redirecting the output to a file and then checking that file; you'll see what I mean in just a second.

Run crontab -e again and come back down to our five little stars. There are two ways to do this: you could put a shebang line at the top of your code that tells the server to run it with python3 and then make the file executable, but the way I prefer, which is simpler, is to put the path to the Python interpreter first and then the path to the script you want to run. On a Linux machine the Python path is /usr/bin/python3, and then we need the complete path to our script, which starts with /home because we cloned into the home folder earlier, then the repo folder (the whiskey cron job one, I think) and test.py. Even though that file prints output, we won't actually see it anywhere, so we want to send it to a file we can look at. Two greater-than signs (>>) will append the output to a file, so we'll send it to the home folder as cron.log. So all we've done is: the path to Python, the path to our script, and >> to send the output to /home/cron.log. Save and write it out, go to the home folder, and wait a minute; hopefully a log file turns up with the output in it.

I waited a minute, and if we type ls we can see we have cron.log. Using the cat command, which basically spits everything in a file out to the screen, we can see our script's output. Now that I know this is working, I can point the crontab at my real Python script. If it doesn't work for you, there are a couple of things I'd recommend checking: first, that your Python script is somewhere in the home folder, because the crontab is user specific, so you want to be in your user's home folder; and second, that you've entered the five stars, the Python path, and your file path correctly. Mine works, so I'll come back to crontab -e, change test.py to the scraper file (I think it was called the new whiskey one), set the schedule to 0 8 * * *, and drop the output redirect because we know it works. Exit and save, and there we go: at eight o'clock every morning that file is going to run, scrape the data, and email it to my phone.
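For reference, the two crontab entries described above would look something like this inside crontab -e; the repo folder and script names are placeholders, since the exact paths are only spoken aloud in the video:

```
# Test entry: run every minute and append the script's output to a log file
* * * * * /usr/bin/python3 /home/your-repo/test.py >> /home/cron.log

# Final entry: run the real scraper at 08:00 every day (no log needed)
0 8 * * * /usr/bin/python3 /home/your-repo/scraper.py
```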
So that's it, that's how you do it. You need some basic Linux commands and a little SSH, and as I said, I'll have everything written up in the description so you can see all the commands I used and work through it yourself. Don't forget you can run this on any Linux machine: if you have a spare computer you want to use as a server in your house, you can install Ubuntu on it and this will work exactly the same way, or you could use a Raspberry Pi, just make sure it's one of the newer ones, because the first ones are really, really slow. Or, if you want to do it all in the cloud, DigitalOcean has the $100 credit going at the moment if you use my link (that would be really nice of you), and if you carry on with it and like it, it's only five dollars a month. There are other options as well, like Linode, or if you want something more managed you could use PythonAnywhere, but that's entirely up to you; this is just the way I do it. Hopefully you enjoyed this video and found it useful. Give me a thumbs up, comment to let me know what you thought, and consider subscribing; there's lots of web scraping content on my channel already, with more web scraping, Python and all that good stuff still to come. Thank you very much, and I'll see you in the next one. Bye!
Info
Channel: John Watson Rooney
Views: 4,547
Rating: 4.9658117 out of 5
Keywords: cronjobs, run web scraper online, cronjob python, cron job linux, cron job basics, crontab example, crontab examples in linux, crontab how to, crontab in linux, crontab not running, crontab on ubuntu, cron jobs python, cron job tutorial for beginners, cron job to run python script, cron job tutorial for beginners python, crontab python script, python cron job, cronjob web scraper, python web scraping, python tutorial, code tutorial, scrape daily
Id: VztRqRXeyn0
Length: 16min 12sec (972 seconds)
Published: Wed Nov 18 2020