How to Web Scrape in the Cloud (Easy Way)

Video Statistics and Information

Captions
Hey guys, how's it going? In today's video I'm walking through a subject a lot of you have reached out to me about. There was a lot of interest in the web scraping tutorials I did, particularly on books.toscrape.com, which is the sandbox site we use for web scraping. One of the things people asked was: if I want the script running in the background, I can either run it on my own computer or on a Raspberry Pi, but how do I get it running in the cloud so I don't have to rely on any computing at home?

So today I'm going to walk you through how to scrape as many pages and as many entries as we want off this website, completely in the cloud. We're going to build the code, deploy it to Heroku, run it on Heroku, set our own custom schedule, and have it email us a CSV file with all the information we scraped. Let's get started.

This is a sample from one of the scripts I actually created, and when you think about the applicability of this, it's great for machine learning projects as well. There are so many APIs out there, and in a future video I'll show you how to take information from an API and dump it into a database that you can later use for machine learning projects. For today I'm just going to show you how to email the file to yourself; in another video we'll talk about putting it in a database and doing some machine learning against it.

Let's quickly walk through the code. I'm not going to go through all of it, just some of the blocks. I put this on GitHub, and this time I was very meticulous about including instructions on the different files you'll need, how to run it, what to edit, and what to type into your command line to get it up and running on Heroku. So I'll whiz through this, but all the instructions are there and the repo is linked in the description below.

You're going to need a few different files. The way my directory is set up, I've got an api_key file, a Procfile, a requirements.txt, and scraper.py; those are the ones you'll really need (don't worry about the README, that's just from my GitHub page). I'm also using a couple of modules I haven't used before. One of them is SendGrid, which lets you send email through an API, so that's one of the APIs you'll need. Go to sendgrid.com to set it up; their free service is pretty generous: roughly 40,000 emails for the first 30 days and then 100 emails per day forever after that, which serves my purpose just fine. Sign in, or create an account if you don't have one.
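For reference, the api_key file is just a small config module holding the SendGrid key (generating it is the next step below) and the two email addresses. A rough sketch, with placeholder names that are my own and not necessarily the ones used in the repo:

# api_key.py -- hypothetical layout; variable names are placeholders, not necessarily those in the repo
SENDGRID_API_KEY = "SG.xxxxxxxxxxxxxxxx"   # the key generated in the SendGrid dashboard
TO_EMAIL = "you@example.com"               # where the scraped CSV should be sent
FROM_EMAIL = "you@gmail.com"               # a real sender address; SendGrid checks this, so no spoofing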
Once you're in, go to Email API on the side, then Integration Guide, and choose the Web API. Pick your language (in this case I'm going with Python), give the key a name, and it will generate the key. That's what you copy and paste into the file called api_key. In that file you put your API key, the email address you want to send to, and the email address you're sending from. Be aware that you can't do any kind of spoofing with the from address, because SendGrid has checks built in, so use something like Gmail, which seems to work just fine. In my case I'm going to send it to a burner email address so you can see the messages arrive: I'll grab a burner address, copy it into my "to" field, populate the rest, and as I upload everything to Heroku you'll see it all come to life.

Now let's walk through the script quickly. I've already done a web scraping video on this particular website using this exact script, which I'll link above and below, so I'll go through it quickly. It scrapes around ten pages and pulls a few different fields: the title, the price, the star rating, and the URL of the actual book. The site we're scraping is books.toscrape.com, so we'll be getting the title, the price, whether it's in stock or not, and how many stars it has out of five. That's what the whole first part of the script does: it scrapes ten pages at about twenty items per page, so roughly 200 items.

Then we get the absolute path of the output and put it in a directory called csv_files. Once that's done, we open the file, encode it, and use the SendGrid API to send the message: the subject is "Your file is ready", the body says the attachment is your scraped file, and we parse the attachment and send it off once the scraping is done. One thing to keep in mind: when you run this locally it runs on your computer's local time, but on Heroku it runs on UTC, so make sure you do that conversion. I've left all of the scheduling options in the script, and you can uncomment whichever one suits your needs.

When you scrape something like Amazon or Walmart or any other big online retailer, this is essentially how you'd do it. While I don't condone scraping the heck out of everything on their websites, this is basically how you'd set up something that automatically goes to a site, scrapes whatever you want, and sends you a file, whether that's JSON or CSV or just an email, and it can run every five minutes, every five hours, every day, whatever you want.
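To make that walkthrough concrete, here is a minimal sketch of the two halves the script combines: scraping the book fields into a CSV and emailing it through SendGrid. It assumes the requests, beautifulsoup4, and sendgrid packages and the placeholder names from the api_key sketch above; the actual scraper.py in the repo will differ in its details.

# Illustrative sketch, not the exact scraper.py from the repo.
import base64
import csv

import requests
from bs4 import BeautifulSoup
from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import (Attachment, Disposition, FileContent,
                                   FileName, FileType, Mail)

from api_key import FROM_EMAIL, SENDGRID_API_KEY, TO_EMAIL  # hypothetical module from the sketch above

# Scrape ten pages (about twenty books per page) from the sandbox site.
rows = []
for page in range(1, 11):
    url = f"http://books.toscrape.com/catalogue/page-{page}.html"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for book in soup.select("article.product_pod"):
        rows.append({
            "title": book.h3.a["title"],
            "price": book.select_one("p.price_color").text,
            "stars": book.select_one("p.star-rating")["class"][1],
            "url": "http://books.toscrape.com/catalogue/" + book.h3.a["href"],
        })

# Write the roughly 200 results to a CSV file.
with open("scraped.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "stars", "url"])
    writer.writeheader()
    writer.writerows(rows)

# Base64-encode the CSV and send it as an email attachment via SendGrid.
with open("scraped.csv", "rb") as f:
    encoded = base64.b64encode(f.read()).decode()

message = Mail(
    from_email=FROM_EMAIL,
    to_emails=TO_EMAIL,
    subject="Your file is ready",
    html_content="Attached is your scraped file.",
)
message.attachment = Attachment(
    FileContent(encoded),
    FileName("scraped.csv"),
    FileType("text/csv"),
    Disposition("attachment"),
)
SendGridAPIClient(SENDGRID_API_KEY).send(message)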
But again, keep in mind that when you're making frequent requests to someone else's server, you may want to look at the request headers you're sending, because you'll want to mask things like your IP address and your browser. That's not covered in this one; this is just showing you how to get the pipeline running.

Now I want to upload this to Heroku and get it running. If you don't have a Heroku account, go to heroku.com and sign up for a free one; you're given a specific amount of resources on the free tier, which you can check on their website, but it's free to join. The other thing I want to tell you is that when you're building something like this with a scheduler, Heroku wants you to use one of their add-ons. If I search for "scheduler" and try to provision Heroku Scheduler, even though it says it's free, it asks me for a credit card. The method I'm going to show you does the scheduling without needing a credit card at all, which is a bonus.

From here on we'll do everything on the command line. I have detailed instructions on my GitHub page, so let's bring that page up; these are all the instructions you need and we're going to follow them verbatim. Remember, the email address in my to and from fields is just a temporary burner address. If you don't already have the Heroku CLI set up, the installation page inside your Heroku account will tell you how to install it; to confirm it's installed, type heroku and you should see its help output.

First, run heroku login. It says "press any key to launch a browser", opens a browser, and asks you to log in; I'm already logged in, so once that's done it shows "logged in as" my account name. Next I change into the local directory on my machine where the project is stored: I type cd and drag the folder path in, and you can see the files sitting in there. Then I run git init, which initializes a git repository. Now I want to point this at my Heroku app, so I copy and paste the command from my instructions, and it says the app was not found, because we haven't actually created it yet; I wanted to show you that on purpose. So go into Heroku, go to Personal, and create a new app; I'll call mine "my scraping app". Heroku shows you all the instructions you need there as well, but I've also got them on my GitHub page.
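One more note on the scheduling before we deploy: since the whole point is to avoid the paid Heroku Scheduler add-on, the script does its own scheduling in Python. A minimal sketch of that pattern, assuming the schedule package (an assumption on my part; the repo may do it differently), keeping in mind that Heroku dynos run on UTC:

# Sketch only; the real script wires this around its own scrape-and-send routine.
import time

import schedule


def scrape_and_email():
    # Placeholder for the scrape-and-send routine sketched earlier.
    pass


# Pick whichever cadence suits you and leave the rest commented out.
# Remember that Heroku runs on UTC, so convert from your local time.
schedule.every().day.at("09:07").do(scrape_and_email)
# schedule.every(1).minutes.do(scrape_and_email)   # what the demo switches to, so emails arrive quickly
# schedule.every(5).minutes.do(scrape_and_email)
# schedule.every().hour.do(scrape_and_email)

while True:
    schedule.run_pending()
    time.sleep(1)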
Let's go back and use the actual, proper app name this time; now the remote is set up, which is great. Next I run git add, which says I want to add everything in my folder, then I commit a version with git commit; you can call the message whatever you want, and I'll just say "v1" for version one. Though I realize I forgot one thing: I need to fix the schedule, because I had it set for 9:07, which was about ten minutes ago. I'm going to schedule it to run every minute instead, so you can see the emails coming through; I'll change that line, and now it will run the job every minute, and I'll show you how to start and stop it. Not a big deal: I just go back, run git add again, commit again, and it says one file has changed.

Finally, I push it to Heroku. It compiles everything, and because I have requirements.txt in there it recognizes that this is a Python app; it says "Python app detected" and then installs all the dependencies I've outlined in requirements.txt. That requirements file is also on GitHub, so make sure you use it. I often have people emailing me saying "I've used the code but it's not working for whatever reason"; I point them to requirements.txt and a few minutes later I generally get some kind of thank-you response. So it's very important that you follow the requirements file, because it makes sure you're using the right modules to make these specific things work. You can also run this in a virtual environment on Heroku if you want, which lets you run multiple applications with different dependencies; I've chosen not to for demonstration purposes, since I've only got the one application and I'm on the free tier without many resources anyway, so it's okay for now.

Now it says the push is done. This next step is not explicitly listed in the Heroku instructions, which is why I put it on GitHub: you have to set a dyno, which means telling Heroku to allocate a resource to make this thing run. Right now I'm not going to get any emails to my temporary address, but as soon as I scale up, it will assign one worker, and we'll wait about a minute. Going back to my files, there's something called a Procfile, and what the Procfile does is tell Heroku that I want to assign one worker and have it run python scraper.py, which is what my script is called. The other thing is that if I want to see whether it's actually working, I can open up my logs by typing heroku logs, and that will tell you if there are any issues. Let's go back; no email yet, so let's hope it runs fine.
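For reference, the Procfile is a single line of configuration, and the dyno is started, watched, and stopped with the usual Heroku CLI commands (shown here as I understand them from the walkthrough):

# Procfile -- tells Heroku to run the scraper as a "worker" process
worker: python scraper.py

# Start one worker so the scheduled job begins running:
#   heroku ps:scale worker=1
# Stream the logs to check for issues:
#   heroku logs --tail
# Scale back to zero workers when you want it to stop:
#   heroku ps:scale worker=0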
Now that it's executed, it actually ran the script in the background, and you can see something has arrived that says "Your file is ready", which is exactly what we defined: the subject is "Your file is ready" and the body says the attachment is your scraped file. When I click on it and go to the attachments, there's a file called scraped.csv, which is what I named it in the script, and when I open it we should get a total of roughly 200 rows; 200 rows exactly, in fact, and that's exactly what we get. This is going to keep running, so I'm going to let it run one more time so you can see: go back to the inbox and you'll see another email pop up after about a minute. This was the first one, so we'll let it run for another 60 seconds or so; if we see a second one, we know the script is running, and then I'll show you how to shut it down so you're not burning through your resources. Let's give it about another minute... and there you go, a second one popped up. So this will keep running every single minute, and again, like I said, you could have a scrape file running against Amazon or some other large retailer, but for now, that's how the script works.

The last thing I'll show you is how to shut this down so you're not continually running it and eating up your resources. Go back to your command line; even if I close the log output, the job is still running in the background, because I've only closed the log stream. All I have to do is scale the worker back down to zero, which means assigning nobody to it. I scale it down, and after this I shouldn't get any more emails at all; that was probably the last one that trickled through, because I've essentially said I'm assigning no more workers to it.

And that's basically it; that's how you would run something like this. It's very straightforward to do on Heroku, and I've given you all the files you need on GitHub, so make sure you go check it out. Build something, send me an email, show me; I'd love to showcase it. This is the kind of cool stuff you can do, and like I said, it's free, because we're not putting any credit cards into Heroku; we ran our own custom scheduler, and you can schedule it however and whenever you want, which is the awesome part. I'll leave this open for another minute or two to show you that there won't, in fact, be any more files coming through. If you did like this, please consider liking and subscribing, and make sure you go to my website to sign up for my newsletter. I don't put everything up on GitHub, and I don't put everything up on YouTube; I have other places where I publish, and I'll send you an email with that information if that's something you're interested in. So again, hopefully you liked this video; if you did, please consider liking and subscribing. Take care, have a great day, and Happy New Year!
Info
Channel: SATSifaction
Views: 9,782
Rating: 4.9326601 out of 5
Keywords: web scraping, heroku, sendgrid, python programming, sendgrid api, web scraping python, beautifulsoup, beautifulsoup web scraping, heroku scripts, web scraping tutorial, web scraper python beautifulsoup, web scraping cloud, web scraping heroku, web scraping to excel, heroku deployment, sendgrid tutorial, which programming language to learn first
Id: qquCAgwvL8Q
Length: 15min 36sec (936 seconds)
Published: Mon Dec 30 2019