Scrapy Course – Python Web Scraping for Beginners

Captions
Dive into the world of Python web scraping with this Scrapy course for beginners. Learn to create Scrapy spiders, crawl websites, clean and save data, avoid getting blocked, and deploy your scraper to the cloud, all using Python and Scrapy. Joe Kearney developed this course; he is an expert in web scraping and the co-founder of ScrapeOps.

Hey everyone, welcome to the freeCodeCamp Scrapy beginners course. My name is Joe. I'm a co-founder at ScrapeOps, a platform that gives developers the tooling they need to monitor, schedule and run their web scrapers, and also a co-founder of the Python Scrapy Playbook, a complete collection of guides and tutorials that teaches you everything you need to know to become a Scrapy developer. In this freeCodeCamp Scrapy course we're going to take you from complete Scrapy beginner to being able to build, deploy and scale your own Scrapy spiders so you can scrape the data you need. You can find the code and written guides for this course on the Python Scrapy Playbook freeCodeCamp course page, so you can easily follow along with this video; the link is in the description.

We've broken the course down into 13 parts, which cover all the Scrapy basics you need to go from never having scraped before to being a competent Scrapy developer. If you want to dive deeper into any of these topics, check out the Python Scrapy Playbook itself or the ScrapeOps YouTube channel linked in the description. There you'll find guides that go further into the topics covered here, plus more advanced Scrapy topics such as scraping dynamic websites and scaling your web scrapers using Redis.

So what is Scrapy? I think the best summary is on the scrapy.org website itself: Scrapy is an open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way. It's an open source Python framework that anyone can use, and it makes scraping websites much, much easier. It helps you retrieve a page's HTML, parse and process the data, and then store that data in the file formats and locations you want.

The next question you probably have is: why choose Scrapy, and what does it offer over anything else? Some of you might already have done a bit of scraping with Python, using plain Python Requests to fetch a page and then parsing the response with something like Beautiful Soup. That's perfect if you're doing very simple scraping, you only want a couple of pages, or it's a one-off site; by all means use Requests and Beautiful Soup for that. But for anything from small to medium to large scale, it's much better to use something like Scrapy, because it comes with a load of features built in that you don't have to worry about: data extraction from the HTML using CSS selectors, automatic data formatting into CSV, JSON, XML and many other formats, saving directly to S3 buckets or your local machine, and middlewares and pipelines for saving the data into databases. So much of that is already taken care of for you.
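To make that comparison concrete, here is a minimal sketch of the Requests plus Beautiful Soup approach mentioned above, pointed at the books.toscrape.com practice site used later in this course (the selectors are assumptions based on that site). It is fine for a one-off page, but retries, concurrency and exporting are all left to you:

```python
# A minimal one-page scrape with Requests + Beautiful Soup: no retries,
# no concurrency, no export handling - you would have to add all of that yourself.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://books.toscrape.com/")
soup = BeautifulSoup(response.text, "html.parser")

for book in soup.select("article.product_pod"):
    title = book.select_one("h3 a")["title"]           # full title lives in the title attribute
    price = book.select_one(".price_color").get_text()
    print(title, price)
```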
It also takes care of things like automatic retries: if you request a page and it comes back with an error, Scrapy will retry it for you, and you don't have to write any of the logic that goes into that. It looks after concurrency too, so you can scrape anywhere from one page up to thousands of pages at the same time using the same piece of code. With all of that taken off your plate, you can focus on what you actually want to do: this is the data I want from the website, and this is the format I want to save it in.

The other great thing is that because Scrapy is an open source framework, there are thousands of developers all over the world who have built great plugins and extensions for it. Almost every question you might have has probably already been answered, so it's very easy to find answers on places like Stack Overflow for the use cases you'll be going through, and if you can't find your question it's easy to ask a new one. So I'd really recommend Scrapy for anything beyond a very simple use case; if you're looking to scrape any kind of website, Scrapy is a great place to start your scraping journey.

This course is delivered through video, which you're watching right now, but each section also has an accompanying article. For example, part two has a full article with all the commands you need to run, code snippets and more in-depth explanations, which makes things much easier if you prefer to read and work through things step by step. We'll also have all the code we use in an accompanying Git repo, so if you want to jump into a part further down the line you can download the code at that point in time and follow on from there. Hopefully that makes your learning that bit easier.

Here's what we're going to cover. We're in part one now. Part two is setting up your virtual environment and installing Scrapy on your computer. In part three we look at how to create a Scrapy project. Part four is creating your first Scrapy spider: getting the HTML from a page, extracting what we need from it, and navigating through different pages. In part five we look at crawling through multiple pages, and in part six at how to clean the data you've just scraped using items and item pipelines. In part seven we look at saving the data to files and databases and all the different ways we can do that. Part eight covers everything to do with headers and user agents and how we can use them to bypass restrictions websites might put on us when we try to collect data. In part nine we look at rotating proxies and proxy APIs and how they can help us get around being blocked. In part ten we look at deploying and scheduling our spiders with Scrapyd, so that when we want to run a spider on a regular basis we don't have to kick it off manually; it can run programmatically daily, hourly, every ten minutes, or whatever we choose. We cover everything to do with that in parts ten, eleven and twelve, looking at the different options that are out there
in terms of free options, open source options and paid options, and the pros and cons of each. That brings us to the recap at the end, part 13, where we go through everything we've learned and talk about what's left to do if you want to take your scraping to the next level. I think that's everything I wanted to cover here, so we'll see you in part two.

In part two we're going to look at how to install Python, how to set up your Python virtual environment on Mac, Windows and Linux, and finally how to install Scrapy. So let's get started. First things first, how to install Python. It's fairly easy: go to python.org, open the Downloads section, and click to download Python 3.11 or whatever the latest version is when you're reading this. I'm on a Mac, so the site automatically detects that and proposes the macOS version; if you're on Windows, download the latest version for Windows. You can quickly check whether you already have Python by opening your terminal or PowerShell (I'm opening one here inside Visual Studio Code) and typing python --version. As you can see, Python 3.9 is installed for me, so I don't need to download anything. If you already have Python installed you can move on to the next section; if not, go ahead, download Python and install it.

The next thing is to install pip if it isn't already installed. Pip is the package manager for Python, which lets us download third-party packages for our project. Again, we check with pip --version, and I can see I have pip 22 installed. If you don't have it, we link to the install instructions in the article for this part: go to pip.pypa.io, open the Installation page, and it lists the supported install methods and the commands to run on Linux, macOS and Windows once Python is installed. It's very self-explanatory: copy the line, paste it into your terminal and hit enter. If I do it here, it just tells me the requirement is already satisfied, because pip is already installed.

The next piece is the virtual environment tooling. venv ships with Python 3, so if you have Python 3.3 or above it is already there. If you have a lower version of Python you may need to install it manually, and on Windows you may want to install virtualenv as well: just run pip install virtualenv and it will be installed for you. On Mac you don't need to do this, and on Ubuntu you more than likely won't need to either. So now we have Python installed, pip installed, and venv (or virtualenv) available.
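For reference, the checks and installs from this part look like this in the terminal (version numbers will obviously differ on your machine):

```
python --version         # e.g. Python 3.9.x - install from python.org if this fails
pip --version            # pip ships with recent Python versions; see pip.pypa.io if missing
pip install virtualenv   # only needed on Windows or with older Python versions
```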
The next thing we can do is actually create our virtual environment. Think of a virtual environment as a folder that sits on top of Python, where you can add third-party libraries and modules that are specific to the project you're currently working on. If you have multiple Python projects you can easily end up needing the same packages at different versions, and you don't want upgrading a third-party package for one project to break another project that relied on an older version. Using virtual environments means the third-party libraries you install are specific to each project.

So let's create one. First I make sure I'm in the correct folder; I've just made an empty part-two folder. Then I run python -m venv venv, naming the folder it creates venv. That creates the venv folder with everything it needs inside, so it's set up correctly. If you're on Windows, you would use the virtualenv command instead.

Now that the virtual environment exists, we want to activate it, so that any third-party package we install from here on goes into that venv folder. To do that, type source venv/bin/activate. You can tell it's activated because the folder name (venv) appears in brackets in your prompt. From now on, anything installed with pip goes into this folder and is specific to this project.

We can now install Scrapy: just pip install scrapy. You can also get this command from the Scrapy website itself; it installs the latest version of Scrapy, 2.7.1 at the time of recording. Hit enter and it downloads everything Scrapy needs to run; depending on your connection and your computer it can take a minute or two. To check it installed correctly, run scrapy on its own and it should print a list of available commands. If you see that output, Scrapy is installed correctly. You'll also notice a line saying Scrapy has detected there's no Scrapy project yet: it just says "no active project". Setting up our Scrapy project is the next step, in part three.
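Before moving on, here is the whole part-two setup in one place, as run on Mac or Linux (on Windows the equivalents are virtualenv venv and venv\Scripts\activate):

```
python -m venv venv          # create the virtual environment folder
source venv/bin/activate     # activate it - (venv) appears in the prompt
pip install scrapy           # install Scrapy into the environment
scrapy                       # lists the available commands if everything worked
```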
So in part three we're going to look at how to create a Scrapy project, then take an overview of the project files that get generated, and then go into detail on the different parts of a Scrapy project: spiders, items, item pipelines, middlewares and settings. Part three is a fairly theory-heavy part of the course, so if you already know a bit of Python it will probably be more interesting than if you don't; feel free to dip around and look at whichever parts are most relevant to you. There's also an accompanying article that may be a bit easier to digest.

Let's get going and create our project. I've got a folder here for part three; inside it I have the full project we'll walk through in a second, plus my virtual environment, which I've already activated. If you've followed on from part two you should already have your virtual environment activated and Scrapy installed; if not, hop back to part two and get that set up.

Now we can use the startproject command to create our new project: scrapy startproject followed by the project name. We're calling this one bookscraper, because we're going to be scraping a site full of books. Hit enter and it creates a new folder called bookscraper; if I run ls I can see it there, and inside it there's another bookscraper folder plus a scrapy.cfg file. Opening that up, we can see several files and folders. First there's the spiders folder, which has no spiders in it yet; we'll generate those in part four. Then there's items.py, middlewares.py, pipelines.py and settings.py. A basic Scrapy project contains these parts. You don't have to use items, create middlewares, or touch pipelines, but you will always have a spider, so think of items, middlewares and pipelines as optional. We will be using them, though, because once you're scraping anything more than a single page it becomes much easier to use items, pipelines and middlewares than to cram everything into the spider.

The next thing we'll look at quickly is a fully fleshed-out spider, just to give you an idea of what goes into these files: what items are, what middlewares mean, what goes in pipelines. Don't be scared if you don't understand this right now; we'll go through all of it in parts four to eight. In the spiders folder of this example project there's a simple spider called BookSpider. It's just a class: it has a name, a few functions, and it uses an item imported from the items.py file. Here's a quick overview of what it does. When you run the spider, it goes to start_requests, puts a URL into a variable, and returns a Scrapy request for that URL. Once the response comes back from the page with the HTML in it, the parse function gets called. The parse function gives us the response containing the HTML, so we can work with that HTML and extract the things we want, such as the title, category, description and price. Those pieces of data are put into our book item, which is then returned to us in the console, or into a file if you have feeds set up. That might all feel overwhelming right now, but don't worry: we'll go through everything here in detail and show you exactly how to build it. This is just to give you an idea of what's what.
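Very roughly, the example spider being described looks like the sketch below. The start URL and file layout are assumptions, and the selectors shown are the ones worked out later in the course, so treat this as illustrative rather than the exact file from the course repo:

```python
# bookscraper/spiders/bookspider.py - a sketch of the example spider described above
import scrapy
from bookscraper.items import BookItem   # defined in bookscraper/items.py


class BookSpider(scrapy.Spider):
    name = "bookspider"

    def start_requests(self):
        # kick things off with a single request; parse() handles the response
        url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # pull the fields we care about out of the page HTML
        book_item = BookItem()
        book_item["title"] = response.css(".product_main h1::text").get()
        book_item["category"] = response.xpath(
            "//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()"
        ).get()
        book_item["description"] = response.xpath(
            "//div[@id='product_description']/following-sibling::p/text()"
        ).get()
        book_item["price"] = response.css("p.price_color::text").get()
        yield book_item
```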
Now, the book item I mentioned links into the items.py file. In that file we simply describe how we want the item to be set up: our book should contain a title, category, description and price. Using this, we can fill the book item with the different pieces of data in our spider and return it, and we can also use it in our pipelines.

In the pipelines file there's a simple test pipeline which mimics how you would take the data returned by BookSpider, the book item with all its details, and save it into a database. Think of it this way: we extract the data, the next step is to push the data into the item, and then to put the item into a database. Pipelines are what happens once you've extracted the data and you're yielding it from your spider. Here, for example, process_item is fairly self-explanatory: it takes the item, with its title, category and description, and inserts it into our books database table. That's what goes into items and item pipelines. Again, this might be confusing at the moment; if you know a bit of Python hopefully it isn't too bad, but we'll go into it in much more detail later on.

Then we have our middlewares. Middlewares are where you get into the nitty-gritty of how you want the spider to operate. They give you control over lots of things: timing out requests and how long a request is allowed to run, what headers to send when you make a request, what user agents should be used, and behaviour such as multiple retries. As you can see, Scrapy ships with several defaults that you can either tweak, or you can create your own middlewares to add alongside them. Managing cookies and caches is also handled in the middlewares. There are two types: downloader middlewares and spider middlewares. Most of what we'll be doing goes into downloader middlewares, but spider middlewares can do things such as adding or removing requests or items, and handling exceptions that crop up if there's an error in your spider. All of these live in the middlewares.py file.

Last of all we have our settings. Settings is fairly self-explanatory: it's where you put all your settings. There are basic things like whether we obey the site's robots.txt file, that is, when the first request is made to a website, do we check robots.txt, and if it says not to scrape the site, do we respect that or not. The number of concurrent requests is set here too: when scraping a website, do we send one request at a time, or ten, or a hundred. Everything to do with how your spider and crawl operates is enabled or disabled in this settings.py file. Going back to the middlewares we just discussed: the spider middlewares and downloader middlewares are listed here, and this entry links directly to the BookScraper spider middleware in middlewares.py. So if you create a new middleware, you need to enable it in settings, and likewise for pipelines: if you create a new item pipeline, it has to be enabled here as well.
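To make that concrete, a trimmed-down settings.py looks something like the sketch below. The setting names are real Scrapy settings; the middleware and pipeline classes are the defaults startproject generates for a project called bookscraper (normally commented out until you enable them), and the values shown are illustrative:

```python
# bookscraper/settings.py (trimmed sketch)
BOT_NAME = "bookscraper"

SPIDER_MODULES = ["bookscraper.spiders"]
NEWSPIDER_MODULE = "bookscraper.spiders"

ROBOTSTXT_OBEY = True        # check robots.txt before crawling and respect it
CONCURRENT_REQUESTS = 16     # how many requests Scrapy fires at the same time

# Middlewares and pipelines only run once they are enabled here
SPIDER_MIDDLEWARES = {
    "bookscraper.middlewares.BookscraperSpiderMiddleware": 543,
}
DOWNLOADER_MIDDLEWARES = {
    "bookscraper.middlewares.BookscraperDownloaderMiddleware": 543,
}
ITEM_PIPELINES = {
    "bookscraper.pipelines.BookscraperPipeline": 300,
}
```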
OK, so I think we've gone through the basics of a full Scrapy project and what's contained in it: what's usually in a spider, items and item pipelines and how they process the data once it's been scraped from a page, and middlewares, plus how settings lets us turn everything on or off. That's everything we wanted to cover in this part. Again, don't be too overwhelmed; it gets a lot easier, trust me, so stick with it. In part four we'll create our first spider and extract some data from a web page.

In part four of our Scrapy beginners course we're going to look at how to create a Scrapy spider, how to use the Scrapy shell to find the CSS selectors we need, how to use those selectors in our spider to extract the data we want from the page, and finally how to get our spider to go through multiple pages and extract data from each of them. So let's get going.

I've got my terminal open and my virtual environment already activated, continuing on from part three; if you're just joining us here, make sure you have everything set up as we did there. I want to go all the way down into my spiders folder, which at the moment is empty apart from an __init__.py file. Inside the spiders folder I run scrapy genspider, then the name of my spider, which I'm calling bookspider, and then the URL of the website we're going to scrape, books.toscrape.com, a site that exists for people to practise scraping on. Run that command and Scrapy creates the spider; you'll see "created spider bookspider using template 'basic'" in the output. If we check, bookspider.py is now there, and opening it up in VS Code we can see the spider class it generated.

This is a very, very basic spider and we'll be adding a lot more to it, but let's go through what was generated. First, the name of our spider is bookspider, so when we want to kick it off we'll run scrapy crawl bookspider. The allowed_domains list contains books.toscrape.com. This matters because later on, when our spider is crawling through multiple links, allowed_domains stops it wandering off and scraping hundreds of different websites across the internet: pages link to other pages, and sometimes a website links to an outside site, and you don't want your spider trying to crawl the entire internet. Next we have start_urls; this is usually just the first URL the spider starts scraping, but you can list multiple URLs here and it will work through them one after the other. Then we have the parse function, which is called once the response comes back. We'll be filling this parse function with the code that extracts the data from the page itself.
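The generated file looks roughly like this (recent Scrapy versions may differ slightly in the exact class name or in whether the start URL gets an http or https prefix):

```python
# bookscraper/spiders/bookspider.py - as generated by
# `scrapy genspider bookspider books.toscrape.com`
import scrapy


class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        pass
```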
Now that we've gone through the basics of the generated spider, the next thing we'll do is use the Scrapy shell to find the CSS selectors we need to get the data from the page. For those of you who aren't familiar with CSS selectors: open your browser's developer tools by right-clicking the page in Chrome, Safari or Firefox and choosing Inspect (sometimes it's called Developer Tools). In the Elements tab you'll see the make-up of the page in HTML and CSS. Here, for example, there's an h3 tag containing an a tag, and that a tag is the link to the page for this book. We're going to look at how to pick out these tags so Scrapy knows which pieces of data to extract from the page.

Back in the terminal, to make the Scrapy shell a bit nicer to use, we first run pip install ipython; IPython is just a different shell that's a bit easier to read. To make Scrapy use it, open scrapy.cfg (not settings.py) and add shell = ipython as a separate line. With that done, we can run scrapy shell. The shell opens and lists the available objects and shortcuts, including fetch, which is the command we'll use: it fetches a URL and updates the local objects. So let's fetch books.toscrape.com. Paste in the URL, hit enter, and it goes off, fetches the page, and puts the resulting HTML into a response variable inside the shell, so we can access it and run commands against it. This lets us practise the code we'll later put into our spider.

Now we can use response.css to look for something specific on the page. Moving the mouse over the different tags in the developer tools, we can see that each article tag with the class product_pod contains one book, so we ask for article.product_pod (a class name needs a dot in front of it when referred to like this). That returns all the books on the page. If we just want the first one, we can add .get(), which returns the HTML for that first book. To put all the books into a variable so we can run other commands on them, we do books = response.css("article.product_pod"). Then len(books) gives us the length, which is 20, and if we go back to the page we can indeed count 20 books: four per row, five rows. So that's correct.
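For reference, the scrapy.cfg tweak and a typical shell session from this part look something like this (the [deploy] section is whatever startproject generated for you):

```
# scrapy.cfg
[settings]
default = bookscraper.settings
shell = ipython

[deploy]
project = bookscraper
```

And inside scrapy shell:

```python
fetch("https://books.toscrape.com/")
books = response.css("article.product_pod")
len(books)    # 20 - one selector per book on the page
books[0]      # the selector for the first book
```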
For the purposes of part four we're going to extract the name of the book, the price and the URL, so that we can go in and get further details later. Now that we have our books, we take the first book from the list and put it in a new variable: book = books[0]. Now I can use book.css. Going back to the page, I want the title of the book: I can see there's an h3 tag with an a tag inside it, and I want the text inside that a tag. So I do book.css("h3 a::text").get(), and that returns exactly what I was looking for, "A Light in the ...", which matches the title shown on the page.

Next I want the price. Inspecting the price, there's a product_price class with a price_color class underneath it, so book.css(".product_price .price_color::text").get() should give us the price. My first attempt failed because I'd typed an extra dot, but with that fixed it returns exactly the price we're looking for. Finally we want the URL. The URL is interesting because it's also part of that h3 a tag, but instead of the text inside it we want the href attribute, which contains part of the link to the page with more information about the book (if we open it in a new tab we can see the full product description and lots more details). So we still use h3 a, but instead of ::text we ask for the attribute: book.css("h3 a::attr(href)").get(), which gives us the href contained in that a tag.

So, using the Scrapy shell, we've worked out the CSS selectors to extract the title, the price and the URL for one book. Now we can add them to our parse function and loop through the list of books to get the details for all 20 books on the page. First I add in what we used to get all the books, books = response.css("article.product_pod"), the same line we used in the shell. Next we loop through them with for book in books, and then we yield — yield works like return here — a dictionary containing the name, the price and the URL, using the exact selectors we just worked out.

Now that we have that, we should be able to run our spider and see what happens. First exit the Scrapy shell by typing exit, then go up a level to the bookscraper folder, and run scrapy crawl bookspider, which is the name of our spider.
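After those changes, the spider looks roughly like this:

```python
# bookscraper/spiders/bookspider.py - first working version
import scrapy


class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        books = response.css("article.product_pod")
        for book in books:
            yield {
                "name": book.css("h3 a::text").get(),
                "price": book.css(".product_price .price_color::text").get(),
                "url": book.css("h3 a::attr(href)").get(),
            }
```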
If that goes according to plan, we should see an item_scraped_count of 20, matching the 20 books on the page, and you can see what was returned: a name, a price and a URL for each one. Scrolling up, all 20 books on the page had their data scraped and output to the terminal, so that worked exactly how we wanted.

Now, as you've seen, there are multiple pages, not just this single page of 20 books; there are a lot more books than that. So we're going to look at how to go to the next page if there is one, scrape all the books on that page, and keep looping through the pages until there are no more pages of books left. At the bottom of every page of books there's a next button; click it and the URL goes to catalogue/page-2.html, showing a new page of 20 different books, and it keeps going like that for page three and so on (there's a previous button to go back a page too). So we want Scrapy to take us on to page-3, page-4.html and so on, as long as there's another page to scrape.

Let's go back to the terminal, open the Scrapy shell again, run the fetch command on the site URL again, and try to get that link. Inspecting the next button, we can see it's inside an li tag with the class name next, and within that the link we want is the href attribute of an a tag. Let's try it: response.css("li.next a::attr(href)").get(). That gives us catalogue/page-2.html, which matches the next-page link (I was looking at page one, so the next page is page two). So now we know how to get the next page. Back in the spider, underneath our loop, we paste in that selector and assign it to a variable, next_page.

The next thing to handle is that when we reach the last page there will be no more next-page link, and that's how we'll know we've hit the end. We can check that by going to page 50.
If I type in page-50.html and scroll to the bottom, there's a previous button but no next button, so that's the end. That's what our test is going to be: an if statement saying that if the next page URL is not None, we know there's another page and we can keep going until there are no pages left. So we add if next_page is not None, and then next_page_url equals the full URL. We have to build the full URL because next_page only contains a relative URL, so we need the base URL plus catalogue/ plus whatever the next page is. Then the important part: yield response.follow(next_page_url, callback=self.parse). This tells Scrapy to go to the next page URL using response.follow, and the callback is the function that runs once the response comes back from that URL. Since the callback is self.parse, the same parse function runs again on the next page, and it keeps calling itself, page after page, until there are no more pages, at which point it stops.

Let's try that. Exit the Scrapy shell, run scrapy crawl bookspider again and see what we get. This time we get an item_scraped_count of 40, so it only got through a couple of pages, and that's nowhere near all 50, so there's a bug we need to get to the bottom of. Let's look at the next page URL, because if it's only finding a couple of pages, that's where it's going wrong. If we inspect the element again, on a later page the href is just page-50.html (or page-49, and so on), but on the initial page the href is catalogue/page-2.html. So sometimes the href has just page-2.html and sometimes it also has catalogue/ in it, and that's why it stopped early: the URLs we built were wrong whenever catalogue/ was missing.

So we modify our if statement to check whether catalogue/ is in the href. If it is, the next page URL is just what we currently have, the base URL plus the href; if it isn't, we add an else branch that inserts catalogue/ into the URL ourselves. This ensures the next page URL is always correct: if the href contains catalogue/ we don't add it, and if it doesn't, we do. Hopefully that fixes the bug, so let's run the crawl one more time.
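With the fix in place, the parse method ends up looking roughly like this:

```python
    # inside BookspiderSpider
    def parse(self, response):
        books = response.css("article.product_pod")
        for book in books:
            yield {
                "name": book.css("h3 a::text").get(),
                "price": book.css(".product_price .price_color::text").get(),
                "url": book.css("h3 a::attr(href)").get(),
            }

        # follow the "next" link until there isn't one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            if "catalogue/" in next_page:
                next_page_url = "https://books.toscrape.com/" + next_page
            else:
                next_page_url = "https://books.toscrape.com/catalogue/" + next_page
            yield response.follow(next_page_url, callback=self.parse)
```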
This run seems to be getting through a lot more pages, which is a good sign; give it another minute or two to finish and we can check the total item count and pages scraped at the end. This is the kind of process you go through when creating a spider: small bugs like this pop up, and you need to do a bit of detective work to find out why your spider is failing at certain points or can't extract certain pieces of data from the page. At the end we see a response_received_count of 51 and an item_scraped_count of 1000, and if we go back to the site there are indeed a thousand results, so it scraped all the books we were after. That's pretty much everything we wanted to cover in part four. In part five we're going to click into each book and extract more product data from the product page itself; right now we're doing the easy thing of going page by page and extracting just the name, URL and price, but in real-life scenarios you usually want a lot more data, which means clicking into the actual product and collecting more in-depth information.

So in part five we're going to look at how to crawl pages with our Scrapy spider, how to use CSS selectors and XPath to extract more complicated pieces of data, such as data from tables and breadcrumbs, and then how to save the data into file formats such as CSV and JSON. We're continuing on from part four, so if you need the code for that it's available to download and follow on from this point, or just carry on if you've already completed part four. In part four our spider went through and collected the details of the thousand books on books.toscrape.com. Now, instead of scraping just the URL, price and name, we're going to go into each book's page and take things such as the rating, the product description, the product type, the price excluding and including tax, and the category it's in, such as Poetry. We'll loop through every book on every page and get the specific data for each one.

Let's go back to our spider code. The first thing is to start going into each book page individually, which is very similar to following the next-page URL, so we can copy that code from the bottom and do it for every book in the list: instead of just yielding the data, we go into each book's URL. We paste that in over our yield, set the next-page section aside for now, and change where we get the URL from, because this time it's the URL of the individual book. Inspecting the element again, it's the h3 tag with an a tag inside, and we want its href, so book.css("h3 a::attr(href)").get(). Since this isn't the next page but the relative URL of the book, we call the variable relative_url, and from it we build the correct book URL. For the callback, instead of parse we'll use a new function, parse_book_page, which will parse each book page one by one, so further down we define def parse_book_page(self, response) and just put pass in it for now. The loop therefore goes: for each book, get the relative URL, build the full book URL, and yield a request to that URL; the response HTML that comes back is parsed by the parse_book_page function we just created.
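The parse method now looks roughly like this. The catalogue/ check on the book URL is an assumption that mirrors the next-page fix from part four, since the relative hrefs differ between the first page and later pages:

```python
    # inside BookspiderSpider (bookscraper/spiders/bookspider.py)
    def parse(self, response):
        books = response.css("article.product_pod")
        for book in books:
            relative_url = book.css("h3 a::attr(href)").get()
            if "catalogue/" in relative_url:
                book_url = "https://books.toscrape.com/" + relative_url
            else:
                book_url = "https://books.toscrape.com/catalogue/" + relative_url
            # go into the individual book page; parse_book_page handles the response
            yield response.follow(book_url, callback=self.parse_book_page)

        # the next-page logic from part four stays at the end
        # (in the video it is removed here and added back a little later)
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            if "catalogue/" in next_page:
                next_page_url = "https://books.toscrape.com/" + next_page
            else:
                next_page_url = "https://books.toscrape.com/catalogue/" + next_page
            yield response.follow(next_page_url, callback=self.parse)

    def parse_book_page(self, response):
        pass
```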
Now let's start fleshing out parse_book_page. First we open the Scrapy shell again, like we did in part four, and work out the CSS and XPath selectors for the different pieces of data we want from the book page itself. Click into one of the books to see what we want to extract, then go back to the terminal and run scrapy shell. When it opens, use fetch with the full URL of one of the book pages; in this case I've picked the very first book in the list. Paste the URL in, hit enter, and it fetches the HTML of that page and puts it in the response variable, so just like in part four we can try out what works and what doesn't.

Let's inspect the page. There's a product_description id with a p tag underneath it, and a product_page class that wraps the whole page; if we do response.css(".product_page") it just gives us back the entire page. Now let's get the title of the book. On this page it's inside the product_main class, in an h1 tag, so response.css(".product_main h1::text").get() returns "A Light in the Attic", which matches the title on the page. That's simple, just as we've done before.

Now for something a little more complex: the category in the breadcrumb at the top of the page, which is Poetry in this case. For things like this it can be easier to use XPath instead of CSS selectors. XPath is very similar, but the way we write the selector is a little different. I've got one pre-written, which I'll paste in, and it gives me Poetry. Let me explain how it got there. It goes to the ul HTML tag with the class breadcrumb — if we look at the top of the page, the breadcrumb is a ul with several li tags inside, each containing an a tag with an href, and the active class sits on the greyed-out entry at the end. The XPath goes to the li with the active class and then uses preceding-sibling to ask for the li tag immediately before the active one. So it goes to the active entry, steps back one to the preceding sibling, and takes the text of the a tag inside it.
That's what the preceding-sibling::li[1]/a/text() at the end of the expression is doing. XPath matters because not every element has a class name or an id on its HTML tag. In the case of the product description, which I showed you a second ago, there's no class name and no id on the p (paragraph) tag holding the description. So there we can say: go to the product_description id using XPath, then get the following sibling that is a p tag, and within that get the text. That's how XPath handles these corner cases where there's no simple class name or id on the tag you want.

So we know how to get the product description and the breadcrumb category, and we already have the price and the title. Next, extracting data from tables. Inspecting the element again, the product information is all contained in a table. The table has several rows, each a tr HTML tag, and each row contains a th and a td, all the way down. So we can say: get me all the rows in this table, and since the product type is always in the second row, we can always look at the text inside the td of the second row when we want the product type. First we get all the table rows and assign them to a variable in the shell: table_rows = response.css("table tr"). Checking the length gives us seven rows, and counting on the page there are indeed seven. Now we can do something as simple as table_rows[1].css("td::text").get(), which returns "Books". The numbering starts at zero, so the second row is index 1, and we take the td text in that row; that's how this line of code corresponds to that row of the table. Knowing that, we can get things like the price excluding tax with something very similar, just the next row down, so we can now pull everything we need out of this table.

The last thing to look at is how to get the stars. Inspecting the element, the rating element has several icon-star elements inside it and a class of star-rating Three, so the number of stars is written into the class name itself. We need to do something slightly different here: response.css("p.star-rating").attrib["class"] asks for the class attribute of that p tag, and it gives us back the star-rating class with the rating word in it.
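Collected together, the shell experiments from this section look roughly like this:

```python
# Inside `scrapy shell`, after fetch()-ing a single book page
response.css(".product_main h1::text").get()          # title, e.g. 'A Light in the Attic'

# category from the breadcrumb: the li just before the one marked "active"
response.xpath(
    "//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()"
).get()                                               # e.g. 'Poetry'

# description: the p tag that follows the product_description div
response.xpath(
    "//div[@id='product_description']/following-sibling::p/text()"
).get()

table_rows = response.css("table tr")                 # the product information table
table_rows[1].css("td::text").get()                   # second row -> product type, 'Books'
table_rows[2].css("td::text").get()                   # third row -> price excl. tax

response.css("p.star-rating").attrib["class"]         # e.g. 'star-rating Three'
```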
So, using the Scrapy shell, we've worked out how to get all the different pieces of data from the page. Let's fill that into our parse_book_page function so the book data actually gets scraped. Exit the Scrapy shell, and first grab the table rows: table_rows = response.css("table tr"), the same line we used in the shell, so we can work with the rows to fill in the rest of the details. Then remove the pass and add a yield with all the details inside it. We start with the URL, which is easy because the URL of the page is contained in the response object, so we use response.url. Then the title, which we can copy directly from what we had in the shell (don't forget the comma at the end of each line). Then the product type, the price excluding tax, the price including tax and the tax: these all come from the table, so we use table_rows[1] for the second row with td::text and .get() for the product type, and the same pattern for the prices and the tax, just incrementing the row index each time, since they sit one after the other in the table. We might as well add the availability and the number of reviews too, since that's just a continuation of the same thing. Then we add the stars using the star-rating class selector, the category using the breadcrumb XPath, and the description using the other XPath we built in the shell, making sure each expression sits on a single line with a comma after it. Finally, the only field missing is the price itself, which we get with response.css("p.price_color::text").get(): the price is near the top of the page, in a p tag with the class price_color. The one other thing missing is the next-page logic, which we deleted earlier by mistake, so we add that back in at the end; with that, everything looks correct.
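Put together, parse_book_page looks roughly like this (the field names follow the video's naming, e.g. num_reviews; treat minor naming details as the course's convention):

```python
    # inside BookspiderSpider
    def parse_book_page(self, response):
        table_rows = response.css("table tr")
        yield {
            "url": response.url,
            "title": response.css(".product_main h1::text").get(),
            "product_type": table_rows[1].css("td::text").get(),
            "price_excl_tax": table_rows[2].css("td::text").get(),
            "price_incl_tax": table_rows[3].css("td::text").get(),
            "tax": table_rows[4].css("td::text").get(),
            "availability": table_rows[5].css("td::text").get(),
            "num_reviews": table_rows[6].css("td::text").get(),
            "stars": response.css("p.star-rating").attrib["class"],
            "category": response.xpath(
                "//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()"
            ).get(),
            "description": response.xpath(
                "//div[@id='product_description']/following-sibling::p/text()"
            ).get(),
            "price": response.css("p.price_color::text").get(),
        }
```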
So, a quick recap of what the spider now does. It kicks off and goes to the start URL; the response that comes back the first time goes into the parse function, where we get all the books on the main page. For each book we take the relative URL, turn it into the full book URL, and go into that book page; the code then runs the callback, parse_book_page, collects all the details we specified, comes back out, and loops on to the next book on the page. It works through every book on each page, keeps collecting the data for each one, then goes to the next page, and once all the pages are done it finishes.

If we've done everything correctly we should now be able to run our spider and see whether it works. So let's do a scrapy crawl, but slightly differently this time: we're going to send the output to a file instead of only the terminal. We do this with the -o flag, and we'll call the file bookdata.csv. CSV just stands for comma-separated values, and CSV files can be opened in Excel or pulled into Google Sheets and applications like that. Run it and, hopefully with no issues, bookdata.csv appears; opening it, we can see loads of data, and the price, description and everything else we were looking for seems to be there. I'm going to stop it before it reaches the end, because it looks like it's working correctly and it was already on page 15. Let's run it one more time, except this time, instead of bookdata.csv, we'll output to JSON; I'll delete the CSV and push to JSON format instead. JSON can be a bit easier to read, and if you're doing further coding it can be easier to parse as well. Opening that up, all the data is nicely formatted: the title, the price including tax, the availability, the number of reviews, it's all there. So that's working nicely.
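For reference, the two runs from this part, executed from the project folder:

```
scrapy crawl bookspider -o bookdata.csv     # export the scraped items to a CSV file
scrapy crawl bookspider -o bookdata.json    # the same run, exporting to JSON instead
```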
Obviously it will take a minute or two to scrape all thousand books, but I think that's everything we wanted to go through in part five. In part six we're going to look at how to use items and item pipelines to better structure and clean our data before saving it into something like a database. It will put a bit more structure on our code and let us do things like converting prices from pounds to dollars before they get saved, or removing trailing whitespace; we'll go through lots of examples of cleaning up the data. See you in part six.

So, in part six of the Scrapy beginners course we're looking at Scrapy items and Scrapy pipelines. First we'll go through what Scrapy items are, then use items to structure our existing spider a bit better, then cover what Scrapy pipelines are and what they do, and finally use pipelines to clean our data. If you're continuing on from part five you should have everything already set up; if not, you can download the code from our repo and carry on from where we are now. I'm assuming you already have your bookscraper project set up with your spider, your environment activated, and Scrapy and Python installed and working.

OK, items. When you generate a Scrapy project it creates this items.py file, and that's where your items go. Items help us define what we want in a block of data that we're scraping and returning. For example, our bookspider currently has no specific item declared; we're not using anything defined in items.py, we just have a yield with all the different pieces of data we're extracting from the page. That works, but to clean things up and make them less ambiguous, the best approach is to declare a specific item in items.py. So let's do that now; I'm going to paste in the one I've already written, called BookItem (the BookscraperItem is just the default one, which you can leave there for the moment). BookItem simply lists everything we already return from our bookspider — URL, title, product type and all the rest — but declared explicitly as fields.

You might ask what the point of that is. One example: if I make a typo and write the reviews field with an extra letter, the data might silently fail to make it into my database or get processed further down the line, and I might not even notice. But if I'm using an item, Scrapy will throw an error saying that the misspelled num_reviews field does not exist, alerting me to the typo. That's one very good reason to use items and define the item up front. Now that the item class exists, let's start using it. First we import it at the top of our spider, and you can see it resolves directly to BookItem. Then, in parse_book_page, we create book_item = BookItem(), and instead of yielding a plain dictionary we set each field, book_item["url"] = response.url and so on all the way down, and yield book_item at the end.
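A sketch of the items.py definition described above (the field names follow the ones the spider yields; treat exact naming as the video's convention):

```python
# bookscraper/items.py
import scrapy


class BookItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    product_type = scrapy.Field()
    price_excl_tax = scrapy.Field()
    price_incl_tax = scrapy.Field()
    tax = scrapy.Field()
    availability = scrapy.Field()
    num_reviews = scrapy.Field()
    stars = scrapy.Field()
    category = scrapy.Field()
    description = scrapy.Field()
    price = scrapy.Field()
```

In the spider you then add from bookscraper.items import BookItem at the top, build book_item = BookItem(), assign book_item["url"] = response.url and so on for each field, and yield book_item instead of the dictionary.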
The next thing we want to do is look at our pipelines. In pipelines.py, Scrapy again defines a BookscraperPipeline when you create the project; it's just there to give you something to get started with. Using pipelines you can clean your data: for example, you can remove currency signs, convert the price from pounds to dollars, or convert strings to integers, which becomes very important if you're saving into a database. You can also do things like converting relative URLs to full URLs, and you can validate your data, say checking whether the price is actually a price or the book is sold out, in which case you might store a price of zero. You can also use pipelines to store the data: instead of having everything go into a file like we did in part five, we could use a pipeline to send the data directly into a database, which we will be doing in a future part of this series. So let's clean up our data a bit. What needs cleaning? Straight away, this encoded value here is not good for our data, so we need to sort that out. Another thing to sort out is the availability of the stock: you might say "In stock (19 available)" is fine, but if I need to run code on this data later it's not very useful, because I just want to know that there are 19 books; I don't want the extra text and brackets around it. So if I just want availability to be 19, I can use the pipeline to remove the "In stock" and "available" parts of the string and convert that 19 into an integer, so we'll do that too. I also think I saw in some places that the title or the descriptions had trailing whitespace, so that's something we can remove. Another thing would be the category: instead of "Thriller" with a capital letter we could change it to lowercase "thriller". This kind of standardization of data before it gets saved into a file or a database is important, especially when you start scraping at scale and doing larger projects. So we're going to go through a bunch of processing steps in process_item in our pipeline; we'll add everything in here and then the item will be returned. Let's start with removing the whitespace; I'll paste in the code I've already got and talk you through it. Straight away we get our item, which is passed into process_item, and we pass it into the ItemAdapter; as it says up at the top, that's useful for handling different item types with a single interface. With this adapter we can get all the field names, loop through them with a for loop, and, if the field is not the description, use the strip function to strip the whitespace from the string. So we're just getting each field name, stripping its value, and putting the result back where it was. Now let's quickly look at converting the product type and category to lowercase if there's an uppercase value, for example the Thriller or Poetry values. We specify the particular keys we're looking at; as I mentioned we'll do category, and you can also do product type, and we run the same kind of loop except applying the lower function to the value.
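A rough sketch of those first two cleaning steps, assuming each field currently holds a plain string (we'll hit a case further down where it's actually a tuple and needs indexing first):

    # pipelines.py
    from itemadapter import ItemAdapter

    class BookscraperPipeline:
        def process_item(self, item, spider):
            adapter = ItemAdapter(item)

            # strip whitespace from every string field except the description
            for field_name in adapter.field_names():
                if field_name != 'description':
                    value = adapter.get(field_name)
                    adapter[field_name] = value.strip()

            # lowercase the category and product type values
            for key in ('category', 'product_type'):
                value = adapter.get(key)
                adapter[key] = value.lower()

            # ...further cleaning steps go here, before returning the item...
            return item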
Now let's look at cleaning the price data, as I mentioned earlier, and as part of that make sure the price is saved as a float, which can be important since prices aren't always rounded to the nearest dollar, pound or euro. Here we loop through the different price keys, because we're saving several pieces of price data: price, price excluding tax, price including tax, and the tax itself. For each of these we replace the pound sign with nothing, and we could also do something like replacing a specific Unicode character with a value of our choosing. The other thing I wanted to do was change the availability to remove the extra text, so let me quickly add that in. To do that we use the split function on the bracket: if there's no bracket there we just set the availability to zero; if there is a bracket, we take the second piece of the array that's returned, split that again using the split function, and we then know the first item in this availability array is the availability number we saw, and the second item is the rest of the text. So that saves just the number of available books for us, and we put it back into our item. Let's look at two other ones quickly. First, converting the reviews to an integer: to make sure the number of reviews is an int, we do adapter.get, wrap the string in int(), and save it back into the num_reviews field. And last of all, the star rating: we want to turn that into an integer too. To do that we get the stars value, split the string using the split function again, take the second value in the array, convert it to lowercase, and then, depending on whether that value is zero, one, two, three, four or five, we save the stars value as 0, 1, 2, 3, 4 or 5.
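Sketching the remaining cleaning steps inside the same process_item, under the same plain-string assumption; the "star-rating Three" format and the word-to-number mapping are assumptions about the site's markup:

    # pipelines.py, continuing inside process_item

    # prices: strip the pound sign and convert to float
    price_keys = ('price', 'price_excl_tax', 'price_incl_tax', 'tax')
    for key in price_keys:
        value = adapter.get(key)
        adapter[key] = float(value.replace('£', ''))

    # availability: "In stock (19 available)" -> 19
    availability_string = adapter.get('availability')
    split_string_array = availability_string.split('(')
    if len(split_string_array) < 2:
        adapter['availability'] = 0
    else:
        availability_array = split_string_array[1].split(' ')
        adapter['availability'] = int(availability_array[0])

    # number of reviews: string -> int
    adapter['num_reviews'] = int(adapter.get('num_reviews'))

    # star rating: e.g. "star-rating Three" -> 3
    stars_string = adapter.get('stars')
    stars_text_value = stars_string.split(' ')[1].lower()
    stars_map = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}
    adapter['stars'] = stars_map.get(stars_text_value, 0)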
Pretty easy, nothing too complicated there, so that's everything I wanted to cover for pipelines. As you can see, there's a huge amount of data processing you can do in pipelines, and it's a good idea to look at your data: do one run like we did in part five, then look at the output and see what actually needs fixing, what looks okay and what doesn't. Sometimes you'll get a missing piece of data or blanks; it's a process of refinement, so the first time around you might only add two things to your item pipeline, then you run it again, notice something else is wrong, and add another piece to the pipeline. The next thing you want to do (we talked about this in part three) is, now that you've got a pipeline, go into your settings and make sure the pipeline is enabled. In settings we've got our spider middlewares, downloader middlewares, extensions, and, as you can see here, our item pipelines. This BookscraperPipeline entry should correspond to the name of our class in pipelines.py, and if I put that in you can see they're the same; that should work, because this entry is also generated by Scrapy when you generate the project, so it generally works as long as you uncomment the section.
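The settings entry being uncommented there looks roughly like this, assuming the project is named bookscraper as in earlier parts:

    # settings.py
    ITEM_PIPELINES = {
        "bookscraper.pipelines.BookscraperPipeline": 300,
    }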
If everything was done correctly, we should be able to run our spider and see the results with all the data processed just the way we want; if there are any errors they'll pop up and we can fix them and run again. So I'll make sure I'm in my project, run scrapy list to check everything's working, and then run scrapy crawl bookspider. Hopefully there are no issues... okay, straight away I can see an error being returned, so I'm going to stop my spider. It's a 'NoneType' object error, and if we scroll up and double-check, it says the spider must return a request, item, or None. So let's sort this out: if we go back to our bookspider.py file, you can see the problem is that I'm yielding bookitem when it should be book_item, so that should fix the issue. If I do a scrapy crawl again, this time I'll get it to output to another file we'll call cleandata.json, so it's the capital -O flag and cleandata.json, and hopefully there are no other errors. It does look like there's another error, because if I check cleandata.json there's nothing in it. Okay, I'll close it again, and you can see the error: processing the availability didn't give us anything, and it says pipelines.py, line 21, 'tuple' object has no attribute 'strip'. So we go to pipelines line 21, where we have our value.strip(), and it's saying a tuple object has no attribute strip. Let's print out the value of value, add a marker line above it so we can see where it is in the output, and try running it one more time; if we stop it again and scroll up, we can see we've got our stars value and the "In stock (19 available)" availability value, and it is indeed being returned as a tuple, with nothing in the second position. So obviously we need to reference the first value in the tuple: adding the square brackets with a zero returns just the string we're looking for, and then .strip() can act on that string. If we remove our print statements, save, and run the spider again... it looks like there are still some errors coming in. We can check our file, there's nothing in it yet, so I'll stop the spider, and we can see an error here: pipelines.py, line 21, in process_item. It's still complaining about the same line, but this time it's saying TypeError: 'NoneType' object is not subscriptable. I know this error, I've had it before: it comes up because we're getting all the field names from items.py, looping through them, and one of those field names is not being found in the item. If we look at our spider and compare all the fields there versus what we have in items.py, I think I've spotted it already: it's this UPC field (unique product code, I think it stands for); if you look, we don't have book_item['upc'] in the spider. So I'll add that in now and save it, and now the spider should correspond to the item definition and we should have no more errors. Let's run it again, and this time we should see our cleandata.json file filling up. Everything looks good: open up cleandata.json and we've got what looks like all the data we wanted, so we can stop the spider, we don't need it to collect all 1,000 records. You can double-check that everything went in correctly, either by checking the file or by scrolling up: did everything get processed the way you wanted? Did the price get processed correctly, is it now a number? Is the product type lowercase? Yes it is. The number of stars is now an integer. So it looks like everything that went through pipelines.py got processed correctly; we can scroll up and check the category and the availability as well, and everything worked out. So that's how we use pipelines and items; I hope that's given you a good idea of how you can use items and pipelines yourselves to clean the data you're scraping. In part seven, which we'll look at next, we'll see how we can use pipelines to save our data into databases, and also how to use feed exporters in a bit more detail. See you in part seven, guys. So, for part seven of our Scrapy beginners course we're going to look at all the different ways we can save data: we'll take the data we've scraped over the last few parts and see how we can save it to different file formats, and then eventually look at databases. First off we'll look at what commands we need to run on the command line to save to different file formats, then we'll look at how to do the same thing via the feed settings, which we can set in our settings file or in our main spider file, and once we've done that we'll move on to saving data directly into a database using pipelines. If you've done part six with us, you know all about pipelines by now, and we'll be using those pipelines in part seven to save the item data directly into the database. If you're just joining us now, you can download the code from our GitHub repo (we'll have links for that) and follow on from just this part seven.
To get going, I'm going to go into my bookscraper folder, make sure I'm in the right place, and then run scrapy crawl with the name of our spider, bookspider, then -O (capital O) and bookdata.csv. This outputs the data in CSV format, comma separated values, which can be opened in Excel, and as you can see the data is all there correctly. Okay, we can stop that now, and if we scroll to the bottom we can see we have 321 rows. If you want to append data onto a file instead, replace the capital -O with a lowercase -o; if we use the same name again, bookdata.csv, and hit enter, it starts appending the data rather than overwriting the file each time. If we close the file and open it back up, you can see we're already over 500 records. The file doesn't always update automatically, sometimes it takes a couple of seconds or you have to close and reopen it, and as you can see we're up to 700 records now. And if I run it once more with a capital -O it will overwrite the file; there you go, it's wiped the file and is filling it again. So that's the difference between overwriting and appending, and you've seen that just by changing the extension at the end of the file name you can specify the type of file you want to write. Here we'll do it again in JSON format, so a new file is created, bookdata.json, and as you can see it's in JSON format while the other one is comma separated values. Now let's look at how we can specify where the data will be saved in our settings file. If we open up our settings, what we can do is use the feeds: we add a FEEDS section, and here we're saying save the data into a file called booksdata.json, with the format set to JSON. So if I delete the two files we've just been using, save what I have in my settings, and rerun the spider, this time without the -O flag and the file name, you can see it has created booksdata.json because I specified it here, and it's all in the correct format. I'll stop my spider, and the next thing I want to do is show you how to specify the feed settings in your spider instead. To do that we can use custom_settings, which lets you override anything from your settings file directly in your spider: we just specify what we want to override, our FEEDS, and put our feed settings in there; if Scrapy sees feeds set here, it overrides whatever is in settings.py. It's an easy way to set certain settings per spider; they don't all have to live in settings.py. One important thing to note with feeds, whether we set them in settings or in custom_settings, is the overwrite option: like we did earlier, we set overwrite to True or False, because the default depends on where you're storing the data, so it's better to state explicitly whether we want to overwrite the file or not. We can leave that there and run it, and it'll overwrite our current file. Now that we have that, the next thing to look at is how to save our data into databases using our pipelines.
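To recap the command-line flags and the feed settings just described (the spider and file names are the ones used in this course):

    scrapy crawl bookspider -O bookdata.csv    # capital O: overwrite the file
    scrapy crawl bookspider -o bookdata.csv    # lowercase o: append to the file

    # settings.py
    FEEDS = {
        "booksdata.json": {"format": "json", "overwrite": True},
    }

    # or per spider, in bookspider.py
    class BookspiderSpider(scrapy.Spider):
        name = "bookspider"
        custom_settings = {
            "FEEDS": {
                "booksdata.json": {"format": "json", "overwrite": True},
            },
        }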
So I've gone ahead and already installed MySQL. MySQL is a very popular database which you can get by going to the mysql.com site and their download section and choosing your operating system; if you've got Windows, click download for Windows and install it, and they have downloads for many other operating systems too. Once you've downloaded and installed it, you can check it's installed correctly by running mysql --version; as you can see here I have version 8.0.32 for macOS 11, which is the latest, so that's installed for me. The next thing I'm going to do is connect into MySQL: I can just type mysql, and if it's a simple fresh install you should be able to hit enter and it brings you straight in. You know you're connected because you've got the mysql prompt, and then you can say show databases and it lists the databases. I've already gone ahead and created one called books, which you obviously won't have if you've just installed it, so you want to create a database: you just do create database books and it'll create it (I already have the database, so it tells me it exists). We need a database to actually save things into, and once it's created it'll be in your list; show databases gives you the list of available databases, and then we can exit out. You might have to connect into MySQL with a username and password or a different host if you've set those up: you could do --host localhost, -u for the user root and -p for the password, and it'll prompt you for the password, depending on your setup. If I had set a password it would have asked me for one; if you just type mysql you can usually get in when no password is set, and if you're using a different host like DigitalOcean or some other third-party provider, you can put the URL of where your database is hosted there instead. So we've got our database set up, and the next thing we want to do is install our MySQL connector, so that Python is able to connect to our database. I'm going to paste in the command for that, and you can have a look at my screen and type it in; it installs the mysql and mysql-connector-python packages using pip. Go ahead and run that, and now that it's installed I can start working on the pipeline. We can go directly under our existing BookscraperPipeline and create a new class which we'll call SaveToMySQLPipeline, and then we import our MySQL connector to help us connect. When this pipeline is initialized we're going to set up our connection and then set up our cursor, so I'll show you what that entails. We have this init function, which runs when the spider initializes the class, and we use mysql.connector.connect to set up the connection: we've got our host, our username, the password (if you have one you add it here), and then the database we just created, books. We save all that, and then we have the cursor, which is used to execute the commands; it's set up here and saved into self.cur so we can use it in other functions.
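After installing the connector packages (pip install mysql mysql-connector-python, as described above), a minimal sketch of that pipeline setup, assuming a local MySQL with the root user and no password:

    # pipelines.py
    import mysql.connector

    class SaveToMySQLPipeline:

        def __init__(self):
            # connection is made when the spider initializes this pipeline
            self.conn = mysql.connector.connect(
                host='localhost',
                user='root',
                password='',          # add your password here if you set one
                database='books',     # the database created above
            )
            # cursor used to execute SQL commands
            self.cur = self.conn.cursor()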
The next thing we want to do is add code so that a new table is created if there is no table to store the data. This is handy if you're running this over and over or you're testing: you might have gone in and dropped the table, and rather than having to remember whether the table exists or not, this makes sure it's there. It creates a table called books if it doesn't exist, and that table has the following columns: id, url, title, upc, product type, everything we've been scraping from the page, all the different data points. It sets up all those columns, including a primary key called id, and then all the data can be saved into the columns we want. So the table is created if it isn't already there, and we don't have to set it up manually in MySQL ourselves. Next we want the process_item function again; we already had one in our other pipeline, but we add it here too, and this is where our insert statement goes, inserting the data we have in our item. It's pretty simple: using the cursor we defined above, we execute an insert into books of the url, title, upc, product type and all the other pieces of data we've scraped. After the insert statement we have to call commit to make sure the insert is actually executed, and then we return the item, so that if we add another layer to our pipeline, the item is passed on and the next stage can continue. The only other thing to add is that we want the connection to the database to be closed once the spider is finished. To do that we add close_spider: this is a function that Scrapy looks for, and if close_spider is defined, it's executed when the spider is ready to close at the end. Inside close_spider we just close the cursor and close the connection, so they aren't kept open and using memory; if we're running this lots of times we don't want memory taken up by cursors and connections that aren't being used. Now that we have that, we need to go to our settings and enable the new pipeline. We copy the existing line and say: execute our new pipeline after the existing one, because we want the data to be cleaned first, and the second step is saving the data into our MySQL database. So we copy the class name, go to settings, paste it in, and the only thing we need to change is this number. I don't think I've talked about this number yet: it's just the order of precedence for the pipelines in ITEM_PIPELINES. The lower the number, the higher the priority, so it runs first; in this case number 300 is executed first and number 400 after it. It's an easy way to say run this pipeline first and that one second, and if you had multiple pipelines you can use these numbers however you like; it doesn't have to be 300 and 400, it can be any numbers you want, I've just picked those for now.
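Continuing the same class, a rough sketch of the table creation, insert and close steps, plus the settings entry with the two priority numbers; the column list is shortened and the exact column types are assumptions:

    # pipelines.py, continuing the SaveToMySQLPipeline class

        def create_table(self):
            # call this from __init__ after creating the cursor,
            # so the table exists before any inserts run
            self.cur.execute("""
                CREATE TABLE IF NOT EXISTS books (
                    id INT NOT NULL AUTO_INCREMENT,
                    url VARCHAR(255),
                    title TEXT,
                    upc VARCHAR(255),
                    product_type VARCHAR(255),
                    price DECIMAL(10,2),
                    availability INT,
                    num_reviews INT,
                    stars INT,
                    category VARCHAR(255),
                    description TEXT,
                    PRIMARY KEY (id)
                )
            """)

        def process_item(self, item, spider):
            self.cur.execute(
                "INSERT INTO books (url, title, upc, product_type, price, availability,"
                " num_reviews, stars, category, description)"
                " VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
                (
                    item['url'], item['title'], item['upc'], item['product_type'],
                    item['price'], item['availability'], item['num_reviews'],
                    item['stars'], item['category'], item['description'],
                ),
            )
            self.conn.commit()   # make sure the insert is actually executed
            return item          # pass the item on to any later pipeline

        def close_spider(self, spider):
            # called by Scrapy when the spider finishes
            self.cur.close()
            self.conn.close()

    # settings.py
    ITEM_PIPELINES = {
        "bookscraper.pipelines.BookscraperPipeline": 300,   # clean the data first
        "bookscraper.pipelines.SaveToMySQLPipeline": 400,   # then save it to MySQL
    }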
Okay, so now that we have that, we should be able to check our database and see whether the items got saved correctly, so let's do that now. As before, we do scrapy crawl bookspider and it kicks off; once several books have been scraped, let's stop the spider. Next we log back into our MySQL console, run show databases, and then use books, which lets us select from that database. Once we're using the books database we run show tables, and we can see that the table we asked for, books, was created; then we can do select * from books and see that there are 138 rows, and the data looks like it saved correctly: we've got the names of the books and all the other pieces of data, the description, the price, the tax, the availability, the category, it all looks like it's there. Then we can drop the table: dropping the table just removes it, drop table books, if we want to start again, because otherwise it's going to keep appending onto the database. We can see with show tables that there are no more tables in the database now, but our pipeline creates a new table anyway, so that's fine. So that's how we create a simple database, get it set up, and have a simple pipeline script that, once the data has been cleaned up by the first pipeline we wrote in part six, inserts it into a MySQL database, again using our pipelines. Obviously, if you're more familiar with Postgres or other types of databases, you can just modify the pipeline slightly; we have articles available that show you the exact code you need for a pipeline that inserts the data into a Postgres database too, so have a look at the articles we'll attach, and we'll also have the code repos there for you to download and play around with. So I think that's it for part seven. In part eight we're going to look at how we can use user agents and headers to get around issues with being blocked when we're trying to scrape different websites; we'll be looking at user agents and headers in detail, what they are and how to use them. So see you in part eight, guys, thanks for watching. Welcome to part eight of the Scrapy beginners course for freeCodeCamp. In part eight we're going to look at why we get blocked when we're scraping the web, what types of websites might block us, and then how to use user agents and headers to bypass cases where we're getting blocked while scraping. We'll start by going straight into what headers are. If you go to the site we've been scraping in the last few parts, books.toscrape.com, open inspect element on the page, go to the Network tab, and simply refresh the page with Doc or All selected, you will see what we want to see. Because this is just a simple website, the HTML is sent to us and we can see it returned in the preview; all the HTML is there, and that's what we end up scraping when we're using Scrapy. But if we look at the headers tab we can see everything that is sent when we request this page: the request URL, which is just the URL of the site we're trying to scrape, the method (are we trying to GET the page or are we POSTing data to it), and then things like the status code and so on.
Now, the important part for us is the request headers. This is everything that we send when we make a request to books.toscrape.com, and the most important piece for this part eight of the tutorial is the user agent. The user agent tells the server you're requesting the web page from all the important information about who you are. If I copy this string into useragentstring.com, a site that lets you paste in and analyze user agents, you can see everything that's contained in it when we make a request to a website: straight away it knows we're using Chrome, which version of Chrome, the rendering engine, your operating system (I'm using OS X on an Intel CPU), and all of this is sent automatically with every request you make. Now, this is fine when you're browsing the web or building your spiders and doing a bit of testing, but if you're doing any sort of large-scale scraping on commercial sites, you're more than likely going to start getting blocked. A lot of sites (think Amazon, Walmart, any big e-commerce site) will have some form of anti-bot that stops you from scraping. You might say, why do they want to stop me, I'm not doing anything bad, I just want to collect their data; they'll say, this is our data, we own the website, it's only for our customers, and so on. Obviously you need to look at the terms and conditions of the site you're scraping and judge for yourself whether it's legal or not. The rule of thumb is that if the data is publicly available and you don't have to log in and give your details, then it's more than likely okay to scrape it; if you have to log in first, then by logging in you may be agreeing to terms and conditions, and more than likely those terms and conditions say you're not allowed to scrape their data. That's up to you to decide on a case-by-case basis. For a lot of simpler sites, like the one we're scraping in this tutorial series, there are no anti-bots, so nothing will block us even if we keep the same user agent: even if it gets a thousand requests from this Chrome on my Mac and knows it's me, it's not going to block me while I scrape all the books off the website, but that's because this site exists for people to learn on. So that's roughly why we get blocked. The other things sites look for: the IP address of the machine you're using, which is a very simple way for websites to identify who's making the requests, since they see your IP address on every request; and they might set some kind of flag or counter in your cookies or your session, so they know it's you coming back every time. So it's mainly the IP address, the cookies or sessions, and then the headers in general, and as part of those, the user agent. The difference between headers and user agents is that the headers are everything you see here: they cover things like what the browser accepts back as a response (HTML, images, which languages, which encodings), and the user agent is just one subset of the overall request headers.
So for some sites that are not too complex, if we change the user agent each time we make a request, the website will think it's a different browser and a different computer looking for the data each time, and it'll let the request through. However, more complicated sites look at everything in the request headers, and they want everything to be different, or at least slightly different. For example, I have macOS here, so if macOS comes through every single time, and they match that with my Google Chrome version and the Chromium version, they might decide it all looks too similar: even though the user agent is changing every time, it looks suspicious, and they might flag and block my requests, or at least throw up a captcha page so that, unless you actively solve the captcha, the requests are blocked. So for the more complicated sites we need to change the entire set of request headers, not just the user agent. Those are the main things to look at: how to change our IP address to stop getting blocked, how to change our user agents, and how to change the entirety of the request headers for the more complicated sites. That's what we'll be doing in parts eight and nine: looking at how to bypass blocks by changing these different parts that get sent when we request a web page. Okay, now that we've gone through the theory of what's in a header and a user agent and what details websites look at when we make a request, let's go to our spider and implement different user agents on each request: how to get multiple user agents, insert them into our spider, use middlewares to do it, and then do the same for full request headers. So back to our spider: we're continuing on from part seven, and if you're just joining us for part eight, we have a link to download the code so you can hop in and start right here with us. The first thing I'm going to do is go to my settings and disable the pipeline that saves the data to MySQL, because I don't need it for this part eight. Next I'm going to open up my spider and set a user agent. The simplest way to do it is actually in the settings, where we can set a user agent directly; just like I showed you in the browser, it's a user agent string containing all the available information about the user requesting the data. Here you can see Mac OS X, you can see it's an iPad, etc., so this is just an example: if you wanted to send this one user agent with every single request, you could set it in the settings. Obviously that doesn't make much sense, because it never changes, so within 10 or 20 requests the website is going to say, hey, this is the same person every time making a lot of requests, they're probably web scraping, and it'll throw up a captcha or block us. So that's not sufficient; it's only useful if you want one specific user agent for every request. For now we'll remove it again and look at how we can create a list and rotate through it. So we go back to our spider and create a list of user agents; here's one I've pasted in from our article, and you can check out the article and paste it in yourself as well.
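Roughly, the two approaches just mentioned look like this; the user agent strings are illustrative placeholders rather than the exact ones from the article:

    # settings.py - one fixed user agent for every request
    USER_AGENT = "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"

    # bookspider.py - a small list to rotate through instead
    class BookspiderSpider(scrapy.Spider):
        name = "bookspider"
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
        ]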
We also have this available in the GitHub repo, so you don't have to type everything out yourself. So we've got a list; the next thing we want to do is add a user agent to every single request we make. To do that, we go to where we make the requests and specify that we want the user agent to be overridden: we go to where we have our callback and our book URL, and we add headers, saying override the user agent part of our headers using our user agent list. We just need to specify self.user_agent_list, and we'll import random so we can pick between the entries at random. What this does is say: add this user agent to our headers when we make the request, and pick a random user agent between index 0 and the length of the list, so it picks one of these at random and inserts it into our header. Obviously you need to add this everywhere that makes a response.follow, so we add it here as well, and I think those are the only two places. So we should be able to run that and it should send a different user agent every time. But as you might guess, this isn't really enough to spoof a large-scale website: if you're doing thousands of requests, they'll see there are only five different user agents and they'll block you. That brings us to how we can use a fake user agent API to give us a massive list of thousands of user agents and loop through it; but instead of having them all directly in our spider, we'll implement a middleware, add it to middlewares.py, and in that middleware we'll rotate through all the different fake user agents we get from the fake user agent API. So that's what we'll look at next: we'll create a middleware, request those fake user agents from the third-party service, get them returned to us, and pass them into our request headers. To get them, we can go back to the browser and go to scrapeops.io, where you can sign up for a free account, and then, using the API key you get, use the free headers generator API. This is what happens when I use their headers API: I specify that I want user agents, put in my API key, make a request, and it gives me back a result with a bunch of randomly generated user agents. Once you're logged in, you can go to their fake headers API section, which shows your API key; if you want to generate browser headers you use one URL, and if you want to generate user agents you use the other, and the response comes back just as I showed you, so I can put that in and it gives me the results back. You can specify the number of headers you want, and it sends back all the user agents or browser headers that we can then use in our middleware. Depending on the language you're using, they also show different examples; this is where I got the URL to use here, and if you're using Node or PHP or Ruby you can use the other tabs to see those examples, but we're using Python with Scrapy, so we have everything we need here.
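Before moving on to the middleware approach, here's a sketch of the inline version described above, where each response.follow gets a randomly chosen user agent (the callback name parse_book_page is just an assumed example):

    # bookspider.py
    import random

    yield response.follow(
        book_url,
        callback=self.parse_book_page,
        headers={
            "User-Agent": self.user_agent_list[
                random.randint(0, len(self.user_agent_list) - 1)
            ],
        },
    )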
Now that we have an API endpoint where we can get fake user agents and fake headers, we can go back to our middlewares file and start creating the middleware that will handle everything to do with the fake user agents. I'm going to scroll down to the bottom and start a new class, and I'm going to import a couple of things we'll need: urlencode, which will encode our URL parameters, randint, to pick a random integer so we can choose one entry from the list, and requests as well. I've created a new class called ScrapeOpsFakeUserAgentMiddleware (it can obviously be whatever you want), and then we set it up. Again we have our init function, which is run when the class is initialized, and in here we first read some settings: the ScrapeOps API key (the one you get for free from their site), the endpoint URL, whether we want the middleware enabled, and the number of results we want back. We set those in our settings: I set my API key (you use your own), set the endpoint to the user agents endpoint, set enabled to True, and set the number of requests to 50, then save everything. The part at the top, from_crawler, just makes sure we have access to our crawler settings when the class is initialized. As you can see, two more things happen when the class is initialized: get the user agents list and check whether it's enabled. So first I'll add the function that gets the user agents list. It looks like the following: we set the payload with the API key, then, if the number of results is not None, we set the number of results in the payload, and then we make a GET request to the API endpoint with our parameters URL-encoded using the urlencode function. That goes off to the ScrapeOps endpoint, fetches the user agents, and whatever comes back is put into the response; we use .json() to parse it into a JSON response, and the user agents list is saved into self.user_agents_list. Once we've got that, we create two more simple functions underneath: the first, get random user agent, is fairly self-explanatory, it picks one user agent from the returned list and returns it; and then we have a check to see whether the fake user agents feature is active or not. With all that in place, we put it into practice with process_request, which is one of the functions Scrapy looks for when you're using middlewares: when Scrapy sees process_request, it goes in here, gets a random user agent, and sets the User-Agent header of the request to that random user agent. I hope that makes sense; I think it's fairly self-explanatory: we're fetching a list of user agents, and in process_request we pick a random one and assign it to the request header.
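A condensed sketch of that middleware (the enabled check is omitted, and the exact setting names, endpoint URL and response key are assumptions to verify against the ScrapeOps docs or the course article):

    # middlewares.py
    from urllib.parse import urlencode
    from random import randint
    import requests

    class ScrapeOpsFakeUserAgentMiddleware:

        @classmethod
        def from_crawler(cls, crawler):
            # gives the middleware access to the project settings
            return cls(crawler.settings)

        def __init__(self, settings):
            self.scrapeops_api_key = settings.get('SCRAPEOPS_API_KEY')
            self.scrapeops_endpoint = settings.get(
                'SCRAPEOPS_FAKE_USER_AGENT_ENDPOINT',
                'https://headers.scrapeops.io/v1/user-agents')
            self.scrapeops_num_results = settings.get('SCRAPEOPS_NUM_RESULTS')
            self.user_agents_list = []
            self._get_user_agents_list()

        def _get_user_agents_list(self):
            # ask the ScrapeOps API for a batch of user agents
            payload = {'api_key': self.scrapeops_api_key}
            if self.scrapeops_num_results is not None:
                payload['num_results'] = self.scrapeops_num_results
            response = requests.get(self.scrapeops_endpoint + '?' + urlencode(payload))
            json_response = response.json()
            self.user_agents_list = json_response.get('result', [])

        def _get_random_user_agent(self):
            random_index = randint(0, len(self.user_agents_list) - 1)
            return self.user_agents_list[random_index]

        def process_request(self, request, spider):
            # called by Scrapy for every outgoing request
            request.headers['User-Agent'] = self._get_random_user_agent()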
So when we go off and ask the books.toscrape.com site for a book or the list of books, it sticks that random user agent into the User-Agent of the request. The only thing left is to enable our middleware; this is a step you mustn't forget, it's very easy to forget (I've forgotten it loads of times and then spent ages trying to debug things afterwards). You want to go to your downloader middlewares setting and add in your new middleware; again, as with the item pipelines, the lower number has higher priority. We don't need the generated downloader middleware, we just want the ScrapeOpsFakeUserAgentMiddleware we created, because the generated one is just boilerplate we don't need at the moment. Okay, we'll run our spider now, check that it's working, and then we'll do the same thing again, but instead of getting just a user agent back from ScrapeOps we'll be asking for a list of fake headers: we'll create a second middleware and use the whole fake header rather than just the user agent subsection. Just to make sure the headers are being attached correctly (because you might ask how you know for sure it's working), we can add a very simple print to process_request: I'm going to stick in two print statements, one saying a new header was attached and the other printing the new user agent, the random one we've just attached, so it prints to our console and we can see that a new user agent is attached every time a request is processed. The only other thing we have to do is go back and remove what we added earlier, the headers part in the spider, because it's now being done in the middleware, so we don't have to specify it for every request; I'll remove it here and here. As you can see, there are always multiple ways of doing it: manually adding things in the spider, setting them in the settings if you just want a single once-off value, or using a middleware if you want something more complex, so you've got three different ways of adding your user agents or headers. Now that we've removed that, we don't need the user_agent_list at the top either, so we can remove it, save, and run the spider again: scrapy crawl bookspider. If everything's gone to plan... I'll close the spider straight away and scroll up so we can have a look. It looks like all the data is coming back as before (we don't expect anything to change there, because we weren't going to be blocked by the books.toscrape.com site anyway), but if we scroll up we should see... there you go, here's where we can see all the new headers coming in. You can see the Chrome version is different each time, and sometimes it looks like it's using Edge as well, so there are multiple different user agents coming back and being attached in process_request. That all seems to be working just as we wanted; we made sure it was enabled in the settings, in the downloader middlewares, as well as having enabled set up here and the number of user agents coming back from the API set here. So everything is as it should be.
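The settings side of what was just described might look like this (again, the ScrapeOps key names are assumptions, and 400 is just an example priority):

    # settings.py
    SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
    SCRAPEOPS_FAKE_USER_AGENT_ENABLED = True
    SCRAPEOPS_NUM_RESULTS = 50

    DOWNLOADER_MIDDLEWARES = {
        "bookscraper.middlewares.ScrapeOpsFakeUserAgentMiddleware": 400,
    }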
One other thing to note is that ROBOTSTXT_OBEY is set to True; if we're starting to do more complex sites we would set this to False. Every site, or most sites, have a robots.txt file, which is usually one of the first things a crawler looks for. Scrapy does this automatically: every time it visits a site it first checks whether the site has a robots.txt file, and that file usually specifies things like whether the site is open to being scraped and, if so, which pages are allowed to be scraped and which are not. Now, obviously, any crawler that goes out and crawls websites doesn't have to obey robots.txt; it's a piece of code, it's going to do what you tell it, so it's up to you to decide whether you want your spider to obey it. A lot of big sites will say: if you're a Google spider you're allowed to crawl and scrape our data, and if you're not, don't scrape our data. So if they have that in their robots.txt and you have ROBOTSTXT_OBEY set to True, your spider will see it's not supposed to scrape the site and it will shut down. If you're having issues and you have this set to True, try setting it to False. Okay, now that we've gone through what robots.txt entails, let's create our next middleware. This time, instead of a middleware that just replaces the user agent, we're going to create a new middleware, in middlewares.py, that creates a whole new set of headers every time, using the data it gets back from the fake browser headers endpoint. I'll paste in the code and talk through it. At the top, we start again with a simple class named ScrapeOpsFakeBrowserHeaderAgentMiddleware (again, it can be whatever you want), we pull in the settings we need (this reads everything from the settings file), and then it kicks off get headers list, which calls out to the API endpoint: it does a GET request to the ScrapeOps endpoint, takes the response, converts it to JSON, and we've got a list of headers. Then in process_request, which runs for every request, we have a get random browser header function, which picks a random browser header from the list we just fetched. With that random browser header we can assign all the other headers, not just the user agent like in the middleware above; we can modify all these other parameters in the header. You don't have to modify all of them, you can pick certain ones; this is for you to play around with and decide what you need, because with scraping everything is case by case, every website is different, but we're giving you everything you need to experiment with, and you may find some headers need to be modified more than others. So we have our request.headers being updated, and apart from that it's just like the middleware above. Let's add the settings we need: some of them are already set, the browser headers endpoint has a default, we set fake browser headers enabled to True to make sure it runs, the num requests is the same as before and already set, and the API key is already set as well, so again you use your own API key for that.
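A sketch of the robots.txt toggle and the interesting part of this second middleware; the header field names are examples of what the browser-headers endpoint can return, not a definitive list:

    # settings.py
    ROBOTSTXT_OBEY = False   # whether to obey robots.txt; the generated default is True

    # middlewares.py - inside ScrapeOpsFakeBrowserHeaderAgentMiddleware
    def process_request(self, request, spider):
        browser_header = self._get_random_browser_header()
        # overwrite the whole set of request headers, not just the user agent
        request.headers['User-Agent'] = browser_header.get('user-agent')
        request.headers['Accept'] = browser_header.get('accept')
        request.headers['Accept-Language'] = browser_header.get('accept-language')
        request.headers['Sec-Ch-Ua'] = browser_header.get('sec-ch-ua')
        request.headers['Upgrade-Insecure-Requests'] = browser_header.get('upgrade-insecure-requests')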
Now we should be able to go to settings and make sure we have the middleware enabled, so I'm copying the class name, going to settings, down to the downloader middlewares, and where we had the fake user agent middleware I'm just going to overwrite it, so we have our fake browser header agent middleware instead. I'll save that and run the spider again, and it should work correctly; yes, there seem to be lots of books getting scraped, so it's working, and I can stop it. Just as we did before, to double-check that the headers are being set correctly, we can stick a simple print statement underneath that shows the headers being set to what we wanted. If we run the spider it should show multiple different headers, so I'm stopping it again and looking at the output once we get past the book data. Okay, here are some headers: new header attached, which is what we printed, and we can see things like accept and the user agent. So we have the user agent, Mozilla/5.0, we have accept with text/html, everything we wanted. I'm just checking whether two of them are different from each other, and yes, here for example you can see this user agent is using Chrome 103.0.5060.134, and up here it's using a different build of Chrome 103, so you can see it's changing on each request, which is exactly what we wanted to show you. So I think that concludes part eight. In part nine we'll go into how to use proxies to bypass the anti-bot blockers that websites have; instead of handling everything ourselves and trying to bypass the anti-bots by updating our own headers, we'll see that there are commercial services out there that do it for you, as well as free-to-use proxy lists, and we'll look at how to integrate those into your Scrapy project. See you in part nine, guys. So in part nine of the Scrapy beginners course we're going to look at everything to do with proxies: what proxies are, why we need them, and then the three most popular ways to integrate proxies into your projects. Let's get started. In part eight we looked at user agents and the headers we pass in when making requests to the website you want to scrape, and we discussed how, if you change your headers and your user agents, you can basically make it look as if multiple different people are accessing the site. Now, one thing we also mentioned in part eight is that the data transmitted with your request usually includes your IP address. This IP address is your unique identifier, and it's what's used to make sure the data comes back to your machine: every machine has an IP address, and that's how requests get to and from your machine. Think of it like your house having an address; your computer also needs an address, and that's your IP address. So if we change the user agents every time we send requests, that's fine, but if we're changing the user agents every time while still using the same IP address, then the site we're scraping is very likely to realise we're the same machine requesting their data every time, and very likely to block us straight away. That's why changing our IP address, as well as our user agent and headers, is very important.
Just the user agent and headers might work if the website you're scraping isn't very sophisticated, but if you're going after anything complex at all you will need to rotate your IP address, and that's where proxies come into play. Let's look at the first method we're going to cover: proxy lists. These are lists of IP addresses and ports belonging to different servers all over the world; there are lots of these machines available, and our requests will go via one of them before reaching the website we're trying to scrape, and the response comes back via that machine as well. Now, there are pros and cons to each of the three integration methods we're going to look at. The pro of proxy lists like these is that the lists you find online are largely free: for example, on freeproxylist.net you can select from a list of countries, select your protocols, and check the uptime, and there's another handy list at geonode.com/free-proxy-list where you also get the IP address, port, country, uptime, response time, and so on, with around 9,000 proxies online in 136 countries. The downside of these lists is that, because they're free, so many people use them that they're very likely to have poor response times and take a long time to actually route your traffic, or else they may already be blacklisted. Think about it: if someone has already used them to scrape millions of pages, maybe from the same site you're about to scrape, there's a very high likelihood that the website has already discovered that IP address and blocked it. So the pro is that it's free; the cons are that it can be slow, and there's a high likelihood that a free proxy has already been used too much and could be blocked. We're going to go ahead anyway and try a few of these IP addresses and ports from these two sites, and the way we'll integrate them into our project is with a GitHub project that integrates with Scrapy called scrapy-rotating-proxies. We'll have a link to it, but you can just run pip install scrapy-rotating-proxies in your terminal. I already have it installed, so it says requirement already satisfied for me, but you should see it install. We're continuing part nine on from part eight, so if you're looking for the code for where we're starting now, we'll have it in a GitHub repo that we link to, so you can continue on from here with us. Now that it's installed, we can add our proxy list. As you can probably guess, everything goes into our settings file, like everything else in our project. These are just dummy IP addresses, but this is the idea: you could have a hundred different IP addresses and ports in here, but we're just going to put in three or four for the purpose of showing you how it works. So let's go back to our free proxy list and take a couple of these; we want the IP address and the port. Obviously, depending on your use case, you might need a proxy from a specific country, or one with very good uptime or response time, and that's for you to filter using the search boxes on the site.
Okay, so I've got three of them there, and the next thing I want to do is enable the middleware. The scrapy-rotating-proxies project I've just installed ships a middleware, but to make sure it actually runs we need to add it to our downloader middlewares, and that's where we enable it. I've gone ahead and done that, I've added them in, and as you can see I've left the other two middlewares from part eight in here as well; they don't have to be here, I could remove them, but I might as well leave them for now, they're not going to do any harm, they're just adding a different request header. Let's save that. The other thing I wanted to quickly show you is that if you had all the proxies in a file, you could do something as simple as setting the rotating proxy list path to the path of your file; we don't have a file with a bunch of IPs and ports, but that's where you'd put it if that's what you wanted, so let's quickly remove that. Now we can run our spider and see the results: scrapy crawl and the name of the spider, making sure I'm in my project, so scrapy crawl bookspider. It goes ahead and runs, you can see the header being attached from part eight where we were adding the new header, and now this can take a good bit of time. As you can see, the rotating_proxies middleware has printed that it has 0 good proxies, 0 dead proxies and 3 unchecked proxies; that means it's first going to check whether it can actually send any requests through the proxies we've listed. This can take a while depending on the quality of the proxies; obviously I have no idea how good the ones in that free list are, because they change every day, new ones are added and old ones removed every hour, and as soon as they're added they're being used by hundreds if not thousands of other users. That's the good thing about this middleware: it checks whether each proxy can actually be used, and as you can see here it's retrying our books.toscrape URL with another proxy because one of them failed. So it's a process of waiting: it's moved one of the proxies into the dead pool and it still has two to check, so it's a matter of letting the middleware do its work. I'm going to leave it running for a few minutes and come back, and we'll see whether it actually managed to use any of those free proxies. Okay, I've come back a few minutes later and it still hasn't managed to get any of the three proxies in the list to work: two are dead now and one it's trying to reanimate, and it's not looking good. Obviously the ones I picked were probably already overused, or could be blocked by the site we're trying to scrape, so here you can see the major disadvantage of free proxy lists. There are lots of different places to get them, and depending on your source you may have much better luck, but it's really a process of trial and error, and while it's free, it can be painful to get up and running consistently and correctly.
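After running pip install scrapy-rotating-proxies, the settings described above look roughly like this; the IPs are placeholders, and the middleware priorities follow the scrapy-rotating-proxies README as far as I recall, so double-check against its docs:

    # settings.py
    ROTATING_PROXY_LIST = [
        "203.0.113.10:8080",   # placeholder IP:port values; use ones from a proxy list
        "203.0.113.11:3128",
        "203.0.113.12:80",
    ]
    # or load them from a file instead:
    # ROTATING_PROXY_LIST_PATH = "/path/to/proxies.txt"

    DOWNLOADER_MIDDLEWARES = {
        "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
        "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
        # the ScrapeOps middlewares from part 8 can stay in here as well
    }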
So we've had a look at how we can plug a list of IP addresses and ports into our rotating proxy list and use the middleware to rotate them in our Scrapy project. Another way to do this is with a proxy port. Here we use a service from a proxy provider: they give us a single proxy IP address and port, and they handle changing the underlying IP address on every request, so we don't have to compile a list of proxies ourselves. That's what we're going to look at next. We can still look after our own user agents and our own headers, but the provider deals with everything to do with rotating the proxy list and making sure it is good quality and available all the time, so we don't have to worry about that. There are lots of providers out there and we're just going to look at one of them, called Smartproxy; you can check them out at smartproxy.com. As they say, "effortlessly scrape the web data you need". They do things like bypassing captchas and IP bans, they have millions of proxies across many locations, and their plans include residential and datacenter proxies. We haven't talked about those yet. With residential proxies, your traffic is forwarded through residential IP addresses, the kind mainly used in people's homes. Think of it this way: if someone on that IP address is watching Netflix, browsing Facebook and using Google search, and one or two of your requests also go via that address, then the website you're trying to scrape, let's say Amazon, remembers seeing that IP address yesterday buying something, so it's much more likely to let the request through without any issue. Datacenter proxies, on the other hand, route your requests through IP addresses belonging to machines in a traditional data center with thousands of servers in a big room. You have access to a lot more IP addresses on the residential side, but datacenter proxies are much quicker, tend to have fewer data limits, and tend to be a bit cheaper as well. That's the difference between residential and datacenter proxies. Most proxy providers give you a week or two of free trial, or a certain amount of free credits to test their service, so go ahead, click Get Started and sign up for an account. Once you're logged in, go to the Residential tab, because we're going to be using residential proxies for this next part. You can either check out a pay-as-you-go plan, where you pay per gigabyte of data transferred, or go for a Regular or Enterprise plan. I already have a plan set up with them, so I'm going straight to the proxy setup, which is where we get the details we'll put into our spider. First we want to generate a username and password: put in any combination of letters and numbers for the password, click Create, and it creates the credentials, which you can then copy into the username and password
fields here. Then there's the proxy location: if it matters for your spider that you scrape from a certain country, for example an e-commerce site that only shows specific products to visitors from that country, then select the country from this list. It doesn't matter for us, so we can leave it at random. For the session type we want rotating, because we don't need a fixed session and every request can come from a different IP address, and for the output format we'll just pick HTTP. That gives us a string we can copy and use in our project: this is the endpoint we'll send our requests to, Smartproxy handles all the IP address rotation, and it sends us back the response from the website we're trying to scrape. Now that we have our endpoint from Smartproxy, the next thing is to go back to our project and disable the middlewares we were using, because they're no longer needed; Smartproxy will look after rotating our proxies and some of the ban detection as well, so we can disable the two of them. Then we go to our spider, find where we have response.follow, and add one more argument: meta, containing the proxy information, so meta={"proxy": ...} with our proxy endpoint pasted in. I'll copy the same thing into the second response.follow call below, and those should be the two main places I need for now. The other option, which I'll show you in a second, is to create a custom middleware that inserts the endpoint for us; we'll do that once we get this version running correctly. We can now do scrapy crawl bookspider, and hopefully there are no issues. I can see items coming through, and if I open my books data.json I can see the data there, so it looks like it's working correctly and it's all going via the Smartproxy endpoint. I can close down my spider, and the next thing is to create the custom middleware version. Adding the endpoint to the meta value like this is fine for a small project, but for a larger project it probably makes more sense to write a custom middleware, so I'm going to show you how to do that next. Scroll down to the bottom of the middlewares file, below the middlewares we already have, and create a new class called MyProxyMiddleware. It pulls in the crawler settings, reads the proxy user, proxy password, proxy endpoint and proxy port from those settings, builds the user credentials, puts those credentials into a Proxy-Authorization header for the request, and builds the proxy URL from the endpoint and port, which then goes into request.meta.
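Here is a sketch of both approaches described above: the quick per-request meta version first, then the custom middleware. The endpoint shown and the setting names PROXY_USER, PROXY_PASSWORD, PROXY_ENDPOINT and PROXY_PORT are placeholders for whatever your provider and your own settings.py actually use.

```python
# Quick version: pass the provider endpoint per request via meta in the spider
yield response.follow(
    next_page,
    callback=self.parse,
    meta={"proxy": "http://username:password@proxy.example.com:7000"},  # placeholder endpoint
)
```

```python
# middlewares.py -- custom middleware version, reading the details from settings.py
import base64


class MyProxyMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.user = settings.get("PROXY_USER")
        self.password = settings.get("PROXY_PASSWORD")
        self.endpoint = settings.get("PROXY_ENDPOINT")
        self.port = settings.get("PROXY_PORT")

    def process_request(self, request, spider):
        # build Basic auth credentials for the Proxy-Authorization header
        credentials = base64.b64encode(
            f"{self.user}:{self.password}".encode()
        ).decode()
        request.meta["proxy"] = f"http://{self.endpoint}:{self.port}"
        request.headers["Proxy-Authorization"] = "Basic " + credentials
```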
So let's go to our settings now and fill out the proxy user, password, endpoint and port. I just need to change the password; obviously your username and password will be whatever you created in your own Smartproxy dashboard, and I'm just copying my details across. That looks fine, so I can save it. The next thing is to make sure my middleware is enabled, so again I go to DOWNLOADER_MIDDLEWARES, add my new middleware in there and save it. Then we want to try running the spider again, but first we'll remove the meta proxy we added a second ago, so we can show that everything is going via our new middleware. Remove that, save, run the spider again, and the book details are coming through, so I'll close the spider down. Now we can look at the Smartproxy dashboard and see the traffic usage: we have requests coming through and we can see the usage by gigabyte, so it's working just as we wanted it to. Our requests go through the Smartproxy endpoint, Smartproxy looks after the IP address rotation and sends us back the response, and Scrapy takes the information out of the HTML so we have the data we need. I think that gives you a very good overview of how to use proxy port endpoints. There's just one last thing I want to show you, which is proxy API endpoints. This is if you want to go a step further and not have to deal with browser headers or user agents or anything like that, or maybe you're scraping something that requires a headless browser to run JavaScript for you. We can get that by using a proxy API: again it's a service with an endpoint, we send our request through that service, and the service makes sure the right features are enabled so that the request gets us the page data. It's also a paid service; you can sign up by going to scrapeops.io, clicking Get Free Account and registering, and you get a thousand free credits. Once you're logged in, go to the proxy aggregator page, then the request builder, and you'll have an API key you can use plus the proxy endpoint to use in your spider. Once you have your API key, we move back to our bookspider file and add a new function that sends the traffic to our new proxy provider first. This new function is called get_proxy_url and we pass a URL into it. It builds a payload containing our API key (paste in your own key from ScrapeOps, as I'm doing now) along with the URL, URL-encodes that payload using urlencode, which we import from urllib, builds the new proxy URL from the API endpoint plus the encoded payload, and returns it. So this function takes the URL of the site we want to scrape, encodes it along with our API key, and sends the request to the proxy API endpoint. Once that function is created, the next thing is to plug it in wherever we currently build our Scrapy requests.
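A sketch of that helper, assuming the ScrapeOps proxy endpoint is https://proxy.scrapeops.io/v1/; check the request builder in your own dashboard for the exact endpoint and use your own key.

```python
# bookspider.py -- helper that wraps a target URL in the proxy API endpoint
from urllib.parse import urlencode

API_KEY = "YOUR_SCRAPEOPS_API_KEY"  # placeholder, taken from your ScrapeOps dashboard


def get_proxy_url(url):
    payload = {"api_key": API_KEY, "url": url}
    # e.g. https://proxy.scrapeops.io/v1/?api_key=...&url=https%3A%2F%2Fbooks.toscrape.com%2F
    return "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
```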
So the URL we pass to scrapy.Request becomes get_proxy_url(url), and the same further down where we request the next page URL. The only other thing we need to add is a new function called start_requests, which goes in under our custom settings, and I'll explain what it does. Scrapy looks for this function when the spider starts up. It isn't required: without it, Scrapy just works off your start_urls list. But if it is there, Scrapy runs whatever you put inside it instead. What I'm telling it to do is this: when the spider starts, take the first string in start_urls, wrap it in get_proxy_url the same as we do further down, and set the callback to our parse function. If we didn't have this function, the very first URL would not be sent via our proxy provider endpoint, which means there's a chance that very first request would get blocked. With it, the first URL is properly encoded and sent through get_proxy_url, and the rest of the requests go through it as well, so everything goes via the proxy API endpoint. The request then comes back with the response, and the response gets parsed in parse_book_page just like before. So we should be able to run scrapy crawl bookspider again. Oh, one other thing: it did one request and stopped straight away, and that's a very easy mistake to make; allowed_domains does not contain proxy.scrapeops.io, so let's add that in and rerun it. There you go. If we close the spider I can show you that all the data is coming through: product type, title, description, it's all there, so it's working perfectly and going via our proxy API endpoint. The next thing I want to show you is how, instead of integrating this directly into our spider, we can use a proxy middleware created specially by ScrapeOps. We can just pip install it, which makes things a bit easier if you're adding it to an existing project and don't want to add this special get_proxy_url function everywhere. In that case we just pip install the scrapeops-scrapy-proxy-sdk module, then go to our settings and, like we always do, add a few more settings: our API key again, a setting saying the ScrapeOps proxy is enabled, and one line added to our downloader middlewares; I already have DOWNLOADER_MIDDLEWARES, so I'll add it there. I save that, add my API key from the dashboard into my settings, and then remove get_proxy_url from the places we were using it, because we don't need it anymore; everything should now run through the middleware instead.
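Two sketches for this part. The first is the start_requests override just described; the class and spider names follow the part 6 project as described in this course, so match them to your own, and get_proxy_url is the helper from the previous sketch. The second shows roughly what the settings for the SDK route look like; the middleware path is my assumption based on the package name, so copy the exact lines from the ScrapeOps docs rather than from here.

```python
# bookspider.py -- make sure the very first request also goes through the proxy API
import scrapy


class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com", "proxy.scrapeops.io"]
    start_urls = ["https://books.toscrape.com/"]

    def start_requests(self):
        # wrap the first start URL with get_proxy_url() (defined above in this file)
        yield scrapy.Request(
            url=get_proxy_url(self.start_urls[0]),
            callback=self.parse,
        )

    # ... parse() and parse_book_page() stay the same as in part 6
```

```python
# settings.py -- alternative route using the scrapeops-scrapy-proxy-sdk package
SCRAPEOPS_API_KEY = "YOUR_SCRAPEOPS_API_KEY"
SCRAPEOPS_PROXY_ENABLED = True  # assumed setting name; check the SDK docs

DOWNLOADER_MIDDLEWARES = {
    # assumed module path; double-check against the ScrapeOps proxy SDK docs
    "scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk": 725,
}

# optional extras passed through to the proxy, e.g. route all traffic via US IPs
SCRAPEOPS_PROXY_SETTINGS = {"country": "us"}
```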
Let's try running it one more time and see if it works going via the ScrapeOps proxy middleware. It looks like we have lots of requests going through, so if I cancel that and check the dashboard, we have 123 requests and they all went to our books.toscrape.com site, so it's working like it should. That makes it very easy to get started: all you need to do is that pip install of scrapeops-scrapy-proxy-sdk, add the two settings lines and the one line in your downloader middlewares, and it will send all your URLs via the ScrapeOps proxy endpoint. Now, obviously you can make your own custom downloader middleware as well, like we did for our Smartproxy example, but that's a bit more complex than it needs to be here, because there's already a middleware you can just pip install. We'll leave the example with our own downloader middleware in our article, so if you want to check out a longer version of how to implement a downloader middleware using the ScrapeOps proxy API endpoint, you can copy that into your code and play around with it. One other thing: if you want some more functionality from the ScrapeOps proxy endpoint, you can add, for example, SCRAPEOPS_PROXY_SETTINGS with country set to "us", as in the sketch above. That sends all the traffic via US IP addresses, which is useful if, say, you're scraping an e-commerce website that has to be loaded from the US so it shows US items only. They also have other options, such as having the page JavaScript-rendered, and many other parameters you can pass in so that certain things are switched on on the proxy provider's side; instead of you having to do all that work yourself, the provider takes care of it as long as you pass in the correct parameters, and each proxy provider has its own page listing the parameters it accepts. Okay, that's everything I wanted to cover in part 9. In parts 10, 11 and 12 we'll be looking at deployment, which basically means getting your spiders to run on a server in the cloud: how you deploy the code to the server, and then how you can schedule and run your spiders to scrape at certain times of the day or week, so you can collect data on a periodic basis without having everything running off your home machine. We'll go through the different options available, including open-source, free and paid options, have a look at the different UIs, go through the pros and cons of picking one service over another, and see how easy or complex each one is to use. That's what we'll be doing in the next three sections, so see you in part 10. So, in part 10 we're going to look at the different tools we can use to deploy and schedule our spiders online, and the tools we can use to monitor how well our jobs are doing, how much data is being scraped, and whether we're missing any pages or items when our spiders run. You might be asking yourself what this deployment and scheduling is. Deployment is basically us
putting the spider we've just created onto a server that's always online, so we don't have to keep our own laptop or computer running 24/7 at home. We put it onto a virtual machine somewhere in the cloud, and then we can schedule it to run at a certain point in time: once a week, once a day, once an hour, however often we want, so it collects the data for us. So deployment is the act of getting the spider onto the machine, scheduling is getting it to run at a certain time of day or week, and monitoring is seeing how well our scraping is doing: did the jobs actually complete, did the spider run correctly and for roughly the length of time we expected, did it get all the pages we thought it should get? Obviously if you see zero pages scraped, you know there's an issue. It's important to have some monitoring set up, because without it your spider could be running every day while missing huge amounts of data. So that's deployment, scheduling and monitoring. The first thing we'll do in part 10 is look at the different tools available: free tools, open-source tools and paid tools. The first is Scrapyd, which is free and open source; anyone can download it and contribute to it on GitHub. The pros are that it's free and open source, there are plenty of third-party libraries for it, and there are optional UIs from different providers. The downsides are that you ideally need your own server, because if you run it on your own computer or laptop you'd have to keep that machine online at all times to have something run every day at a set time, and that it doesn't actually have a scheduler. With some of the other tools we'll look at, such as ScrapeOps or Scrapy Cloud, you can set a scheduled job to run at a specific time every day, but with Scrapyd you'd have to use a cron job that hits the API endpoint to get Scrapyd to run your job at a specific time. So Scrapyd is good because it's free and open source, but there's a bit more configuration and it's a bit harder to install; we'll show you exactly how to install it if you want to. The second option is using ScrapeOps to deploy, schedule and monitor your jobs. The upsides are a good UI that's simple to use and understand, built-in monitoring for your jobs and spiders, and easy scheduling; the downside is that you need your own server as well, like with Scrapyd. The third option, Scrapy Cloud, is a paid service with a freemium tier you can check out. It's easy to set up: you download their CLI tool, use it to deploy your spider into their Scrapy Cloud service, and once it's deployed there you can quickly and easily run it, and you don't need your own server set up with another third-party provider. So those are the three main options we're going to look at: Scrapyd, ScrapeOps and Scrapy Cloud. Let's first have a look at Scrapyd, and then we'll look at two different UI dashboards we can install, so we don't have to control everything using its API endpoints, because while that can be
useful for some people, most people want to interact with their spiders, run them and schedule them using a nice front-end UI. Okay, so Scrapyd is available to download, and as I said we need a server set up first to install Scrapyd on, so we're going to create one with DigitalOcean. DigitalOcean is a server provider that lets you quickly set up virtual machines and install everything you need on them. You can also use any other VM provider, such as Vultr, which is another good provider with very cheap servers, or AWS. So go off and create your own account with Vultr, DigitalOcean or AWS and create your virtual machine; I'm going to do that right now with DigitalOcean, and if you want to use it too you can just follow the steps I'm using. Log in, go up to Create, click Droplets, select the region (it usually works best when you pick one close to you), select Ubuntu as the operating system, version 22.10, and pick the cheapest virtual machine they have: click Basic, choose Regular for the SSD type, and there's a four-dollar-a-month server there. Once you've selected the server you want, you can either add an SSH key or a password to log in, but that's not too important here, because you can also log in via their console, which is accessed through the browser. That's how we're going to do everything now: just use the browser to log in to the machine and install everything we need, which keeps it very simple and easy to use. Scroll to the bottom, click Create Droplet (droplet is just their term for a virtual machine), give it a minute or two to finish creating, and then we can access the console. Now that the droplet has been created, click the Console button, which opens a new window where we can access the console and type in all the instructions to get everything installed correctly. We're logged in to the virtual machine, so we can start running commands. First things first, run sudo apt update, which updates the package lists on the machine so that everything we install will be the most up-to-date version; give that a second to run. Next we want to install pip for Python, so we can pip install all the Python packages we need: that's sudo apt install python3-pip. We'll have all these commands available for you to copy and paste from our article, so you don't have to pause the video at every step. It will ask whether you want to continue and how much space will be used, and we just say yes; sometimes it will also ask to restart certain services, which we can say yes to as well. While that installs, a quick word on the project we're going to use: it's the part 6 code. You may or may not have done that part; if you haven't, you can just git clone our project by typing git clone followed by the repository URL. Okay, it's asking us to restart the services, and now we'll go ahead and git clone our project. As I said,
it's just git clone followed by the freeCodeCamp part 6 repository. Now that it's cloned, we cd into the project and install a virtual environment tool with pip install virtualenv. With virtualenv installed we can create our own venv folder, where all the Python packages will be installed, using virtualenv venv; you can see the folder has been created, so now we just need to activate it with source venv/bin/activate. It's activated, and now we can install the project requirements. The requirements.txt file contains a list of everything we need to get this project running, so we just do pip install -r requirements.txt and it installs all the packages needed to run the project correctly. Give it a minute or two to install everything, and then we should be able to run our Scrapy spider. Next we cd into our bookscraper folder and check that scrapy list works; it ran correctly, so now we can run scrapy crawl bookspider. You should see Scrapy starting up and all our pages being scraped, with the data being extracted from each page just like we were doing in part 6, which is perfect; we don't have to wait for that run to finish. The next thing to look at is how we can install Scrapyd, which is just pip install scrapyd, and then we can run it. To do that the command is just scrapyd, but I've added a bit extra onto it so that all the output that usually goes to the screen goes into a scrapyd logs text file instead, and so that the command runs in the background. Now let's check that it's actually up and running; to do that we use curl to hit the daemonstatus.json endpoint, which tells us whether Scrapyd is running correctly. When we run that command it says the status is OK, with zero jobs pending, zero running and zero finished, which makes sense because we've only just started Scrapyd and haven't run any spiders yet. So Scrapyd is set up and we can hit the endpoint using curl. The next thing we want to do is package up our spider and deploy it to Scrapyd, because if we don't, Scrapyd won't have access to our project and won't be able to run the spider. For that we install the Scrapyd client, again using pip: pip install scrapyd-client. Then we need to edit our scrapy.cfg file, which should be here, using vim or vi, and all we need to do is uncomment the line that deploys to Scrapyd running on localhost port 6800. It's handy that it's already there; we just go in and uncomment it, then save by pressing escape, typing :wq! and hitting enter. Now that it's saved we can run scrapyd-deploy default. Default is just the name I've picked for the project, because Scrapyd works with the concept of projects and needs a project name, and as you can see it deployed OK and now has one spider available. So we can now go ahead and run our spider, again using curl.
The curl command hits localhost on port 6800 at the schedule.json endpoint, passing the project name of default and the spider name of bookspider. Because we've deployed the project, this should run, and it comes back with a status of OK and a job ID to show the job has been started. That doesn't mean it finished running, so if there are ever issues you sometimes need to do further investigation. So we've gone through how to use Scrapyd and scrapyd-deploy to package up and deploy our spider to Scrapyd, and how to use curl against the schedule endpoint to schedule a run. If you wanted to run this yourself, you could just set up a cron job that runs this command every day at whatever time you want. Obviously we want to make this easier for people to use, so now we're going to look at the two dashboards we can install: first ScrapydWeb, then the ScrapeOps Scrapyd integration. ScrapydWeb is a third-party open-source application, so let's go ahead and install it now. I'll go back up to the top-level directory we were in and pip install scrapydweb. It may need specific versions of certain packages to be installed; when I was making this video I found that four packages needed to be pinned to specific version numbers, or the installation wouldn't work correctly on the version of Ubuntu we're using. It's easy enough: we just pip install the specific versions we need of flask-sqlalchemy, SQLAlchemy, Flask and, finally, Werkzeug. Once those are all installed we can check whether ScrapydWeb runs correctly by typing scrapydweb. It gives us an error the first time, but if we just rerun scrapydweb it starts correctly; I think it just needed to create its settings file on the first run. You know it's running correctly when it stays up and shows the URLs where you can access it. Copy the URL it gives us, which is the IP address of our server (you can also get that from the DigitalOcean dashboard) followed by 5000, the port ScrapydWeb runs on, and paste it into a browser, and you should see the ScrapydWeb dashboard showing up correctly. We can see the job we ran earlier manually via the command line: the default project, the bookspider spider and the job ID that was returned to us. It doesn't have the pages and items yet, because that needs the logparser module to be installed, so we're going to install logparser so we can see pages, items and more statistics, and we're also going to add a username and password for some basic authentication, because right now anyone who finds this endpoint can start running jobs on my server.
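One aside before we lock the dashboard down: the Scrapyd JSON API we've been hitting with curl can also be driven from Python, which is handy if you'd rather script the status checks and scheduling yourself, for example from a cron-driven script. A sketch, assuming Scrapyd is listening on localhost:6800 and the project was deployed as default (it uses the requests library, so pip install requests first):

```python
# schedule_job.py -- drive the Scrapyd JSON API from Python instead of curl
import requests

SCRAPYD_URL = "http://localhost:6800"

# equivalent of: curl http://localhost:6800/daemonstatus.json
status = requests.get(f"{SCRAPYD_URL}/daemonstatus.json").json()
print(status)  # e.g. {"status": "ok", "pending": 0, "running": 0, "finished": 0, ...}

# equivalent of: curl http://localhost:6800/schedule.json -d project=default -d spider=bookspider
job = requests.post(
    f"{SCRAPYD_URL}/schedule.json",
    data={"project": "default", "spider": "bookspider"},
).json()
print(job)  # contains a job id if the job was started successfully
```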
Nobody wants strangers running jobs on a server they're paying for, and I'm pretty sure you guys don't want anyone messing around with your dashboard either, so let's set that up now. Start by running pip install logparser; I'll shut ScrapydWeb down for a second while that installs. Once it's installed, I want to edit the ScrapydWeb settings file, again using vi. First, enable authentication: as you can see it's currently set to false, so I set it to true, and then I set a username of test and a password of 12345678. Obviously set a better username and password than that for your own projects and servers, please. The next thing is the Scrapyd servers list: just comment out the line for port 6801, because we don't have a server running there; we only have our Scrapyd running on localhost port 6800. Once that's done, the next step is to add in our logs directory and enable the log parser by setting it to true. For us the directory is /root, then the name of the project, then the folder containing the spiders, bookscraper, which has a logs folder inside it; that's where the log parser is going to read the logs from. Obviously if you have a different project name and spider folder, just make sure the path is correct; there will always be a folder in there with logs, so find where your logs folder is and paste in that directory. The last thing is to set the Scrapyd server, which is just the default you can see above, 127.0.0.1 running on port 6800. Now that we have all that, I save the file and we should be able to run ScrapydWeb again. This time, like I did with Scrapyd, I'll send the logs into a separate log file so they don't come up on the screen, and run it in the background. We can check that everything is running correctly using the sudo ss command, and we can see that Scrapyd is running on localhost port 6800 and ScrapydWeb is running on localhost port 5000.
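For reference, the edits described above land in the settings file ScrapydWeb generates on its first run (a scrapydweb_settings_vXX.py file). The option names below are from memory of that file, so treat them as assumptions and match them against the comments in your own copy; the log directory is the one used for this project, so point it at wherever your own Scrapyd logs actually live.

```python
# scrapydweb_settings_vXX.py -- the edits described above
ENABLE_AUTH = True
USERNAME = "test"
PASSWORD = "12345678"        # use something stronger on a real server

SCRAPYD_SERVERS = [
    "127.0.0.1:6800",        # our local Scrapyd; the example 6801 entry is removed
]

LOCAL_SCRAPYD_LOGS_DIR = "/root/freeCodeCamp-part6/bookscraper/logs"  # wherever your Scrapyd logs folder is
ENABLE_LOGPARSER = True      # let logparser populate the pages/items stats
```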
So both of them are running, we can see the ports they're running on, and we have the process IDs, which we can use to kill a process if we want to stop one of them running. For example, I'll stop Scrapyd and show you: just type kill followed by the Scrapyd process ID from that list, press enter, and you can see Scrapyd is no longer running. That's how you can kill them when they're running in the background as a process like that. I'll just start Scrapyd up again, and if we check, we can see Scrapyd running and ScrapydWeb running, perfect. Okay, because we killed and restarted Scrapyd, we need to re-deploy our project using scrapyd-deploy; otherwise, if we go back to the ScrapydWeb dashboard, we won't be able to see our spider or run it, so we need to package up and redeploy the project again. We do that with scrapyd-deploy, picking the project name of default again, and that packages up our spider and adds it to Scrapyd once more. Now we can go back to our endpoint, and it asks us to sign in; if I try to sign in, it says it needs the username and password that I set in the config file, so I add those in and sign in. If I go to Run Spider we can see the default project, the latest version and bookspider. If you need to set any specific settings and arguments for your project you can do it there, and if you want it to run at a specific day of the week, hour or minute, you can set that here too; I'm not going to set it for a specific time in the future, I want it to run right now. The next thing is to click Check Command, which pastes in a default command, and as you can see it's just doing the curl to the schedule endpoint that we did earlier; that's all ScrapydWeb is doing, running this command. Everything is correctly set, the project, the version, the spider, so I click Run Spider and that kicks off our job. It should be running now, and we should soon see the statistics coming back for the number of pages and items, so let's give it a minute or two. As you can see it has finished and it took 24 seconds to run, and you can see some other stats and pieces like that. We're still missing the pages and items, though, and the banner is still up saying we need to install logparser, so I think I might actually have put in the incorrect path to where the logs are stored. Let's fix that and then I can show you how the pages and items show up. I've discovered that I do need to change the path in the ScrapydWeb settings, so I open them up again; as I said, depending on your project you just need to find where the logs folder is kept. For us it should just be /root, then the project folder, then logs, so I save that and restart ScrapydWeb, in case it doesn't pick up the change by itself. Run the sudo ss -tunlp command again, kill ScrapydWeb using the kill command, and run scrapydweb again. Now you can see the log parser ran eight seconds ago and was last updated at this time.
So if we re-run our spider, going back to Run Spider, selecting the default project, the latest version and bookspider, putting in the command and clicking Run Spider, it runs again, and in the meantime it has gone through and parsed the logs from the run a couple of minutes ago, so the pages and items have actually been populated here: it scraped 1,051 pages and there were 1,000 items. We can see more stats by clicking the Stats button: warnings and errors (there's one warning), and the latest item that was scraped, with the URL, the book title, the price, the tax and all the other fields we selected in parts 5 and 6. So that's the basics of ScrapydWeb and how you would install the ScrapydWeb dashboard to work with Scrapyd. You can also see the full logs, so if you need to find a specific error or dig further into the logs, they're all available there, with the warnings picked out, and if there were any errors it would show those as well. It's very handy, it's free and open source, and you can install it yourself; as you've seen, there's a bit of configuring in the settings and a bit more knowledge required around how to deploy the project and run the spider. There are some help pages too: for example, if you forget how to deploy, there's a help section with the commands you need to run and the steps to follow, and you can point it at your project directory and have ScrapydWeb auto-package it for you as well. The next dashboard we'll look at is the ScrapeOps integration with Scrapyd, so that's the two different dashboards for Scrapyd: ScrapydWeb and the ScrapeOps dashboard. For this one you need to go off and create a ScrapeOps account at scrapeops.io. If you've been using it for any of the earlier parts of this course you can use your existing API key, and if you're just joining us for this section you can register for free and get your own API key. You just need the API key and then you follow the monitoring setup steps: click Monitoring, select Scrapy, and run pip install scrapeops-scrapy. With the ScrapeOps SDK installed, we just need to add our API key to the Scrapy project settings, so I copy that line, go into my folder, edit my settings.py file and add the API key, and then check what else I need to install: I also need to add the extension and the downloader middlewares, so I go down to my DOWNLOADER_MIDDLEWARES and add those in, and put the extension entry under the existing extensions. So we've got the downloader middlewares, the extension and the API key; I can save that, and I think that's everything needed to install the monitoring, so all the stats will show up in the dashboard. Now we want to set up the actual scheduling side of things, and for that we go to the Servers & Deployments section and add a new Scrapyd server.
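Before moving on, here is roughly what those monitoring additions to settings.py look like. The extension and middleware paths below are from the ScrapeOps docs as I remember them, so treat them as assumptions and copy the exact lines from the monitoring setup page in your own ScrapeOps dashboard.

```python
# settings.py -- ScrapeOps monitoring SDK (installed with: pip install scrapeops-scrapy)
SCRAPEOPS_API_KEY = "YOUR_SCRAPEOPS_API_KEY"   # placeholder, use your own key

EXTENSIONS = {
    "scrapeops_scrapy.extension.ScrapeOpsMonitor": 500,
}

DOWNLOADER_MIDDLEWARES = {
    "scrapeops_scrapy.middleware.retry.RetryMiddleware": 550,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,  # hand retries over to the ScrapeOps middleware
}
```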
To add the server we just need a name, so I'll use test (obviously you can name your server whatever you want), and the server IP address, which we get from the DigitalOcean dashboard: copy the IPv4 address, paste it in and save the details. It then says you should SSH into your server; we're already in via the console, and it says to run the given command in your terminal, so we copy that command, go back to our console and paste it in. It installs everything it needs and might ask to restart some services again, which we say yes to. Once it's finished we can go back and check our servers list, and we can see our server name is there and it's connected; if we check the server it says it's set up correctly and we can schedule jobs on the scheduler page. So I click that, click Schedule Job, and I have my server name and my bookspider spider. I'm going to run the spider now; I could also have it run every month or every day, or select a specific month or time, and we'll do that in a second, but for now I just want to run it now, and I don't have any settings or arguments to add, so I submit the job and the job is scheduled. In a few seconds it should show up in the jobs list, and in the meantime let's schedule a job to run, say, once a week: every Monday at 7am. Note that's in the UTC time zone, because crons are usually run in UTC, so make sure it corresponds to your own time zone correctly. So we're saying every Monday at 7am, please run the bookspider spider on the test server, and when we submit that it shows up in our list: it'll run at 7am, only on Mondays. That's very useful if you need your spider to run every day, every hour or whatever, and you can view, edit, enable and disable the schedules here. Now, if we go back to our dashboard we don't have any data coming through yet; there might have been an issue with the run, so let's explore that now. When there's an issue like this, where you can't see any data coming into the dashboard, the best thing to do is to run the scrapy list or scrapy crawl command manually on the server. So I go back to the console, back into the project, and from inside the bookscraper folder I run scrapy list first, because if there's an issue with Scrapy or with the settings it will show up here. And we do have an error: Scrapy is saying there's an IndentationError in the downloader middlewares, so that's probably what's causing the issue. Let's edit that: open up the settings again, check the DOWNLOADER_MIDDLEWARES, and I think there's just a stray space here, and possibly on this line too, so I'll remove the indentation there and do the same for the extensions, and hopefully there are no more issues. Save that, run scrapy list again, and this time it works and bookspider is returned, perfect. So we go back to our ScrapeOps dashboard and try to run it again: the server and spider we need are selected, we submit the job, and this time, hopefully within a couple of seconds, if we go back to our servers page and then our jobs page, we
can see it's now running; it kicked off at 6:23. I'll give it a couple of seconds to run, and then we should have the stats available for this job: we can see one job run on Monday and it's in the middle of running now, and shortly we should see things like the runtime, the pages, the missed pages, the items and all the other stats coming through. Okay, on our jobs page we can now see the status has changed to finished, it took 25 seconds, and we can see the number of pages, items, coverage and everything else there. We can click into the job and see all the pages scraped, the runtime and everything in more detail. We also have things like the coverage of the item fields that were scraped, so we can see whether anything was missed: fields like the number of stars and the price are all at 100 percent, but one or two descriptions were missed for some reason, so this is useful for seeing which fields were missed and which came through correctly. You can also see the amount of bandwidth that was used, and that there was one warning as well. When you have multiple runs of the same spider, say you're running it daily, you can compare the stats every day, and if on one day you suddenly see a major divergence in the stats, you know there must be an issue with your spider and you need to go in and investigate further, so that's very useful. We also have the status codes: if there are 500s coming back, or 404s where the page isn't found and maybe the links are broken, you can use those status codes to diagnose other issues as well. So that's how you would use the ScrapeOps dashboard to integrate with Scrapyd running on your server; you've got two different dashboards you can use with Scrapyd, ScrapydWeb or the ScrapeOps dashboard and integration, and if you have any other questions about that, stick them in the comments. In the next section, instead of using Scrapyd integrated with ScrapeOps, we'll be looking at using the complete ScrapeOps integration to integrate directly with your Scrapy project. Rather than having Scrapyd as a kind of middle layer that gets integrated with your project, with other things then integrating with it and hitting its API endpoints, this goes directly and integrates with your server and your project. That's what we'll look at in part 11, and then in part 12 we'll be looking at using Scrapy Cloud. So that's it for now, and see you in part 11.
So in part 11 of the Scrapy beginners course we're going to be looking at using ScrapeOps to manage, deploy and monitor our spiders, and we'll jump straight into it. First things first, we need to set up a virtual machine. You can do it with AWS if you already have an account with them, or with DigitalOcean; most of these companies have free credits, so you can sign up, use the free credits, try it out and go from there. I already have a DigitalOcean account and I'm going to be using that, so you can follow along with me, or if you already have an AWS or Azure account, you can join back in from the step where we log in to the virtual machine. I'm just going to quickly set up a server here, going for the cheapest one they have at four dollars a month, and I think that's all I need to select, so I can click Create Droplet. Now that the droplet has been created I have the dashboard available, which shows things like the IP address and a few different graphs, and I'm going to open the console. That opens a new window, SSHes onto the machine, and then we can run commands on our virtual machine. While that's getting set up, if you haven't already, set up an account with ScrapeOps; you can get a free account, and we'll be using it now to integrate with our server. So go ahead and create an account (I have one already set up), and once it's ready go to Servers & Deployments on the site, click Add Server, name the server freecodecamp, and put in the IP address of the server, which we copy from the DigitalOcean dashboard; paste it in and save the details. Now it says that to provision the server we need to run this script on it, so we copy the details, go to our console for the server and paste in the script. That goes off, runs the script and provisions the server, and we can see it running through the steps: installing the dependencies it needs, creating a new user and the authorised keys, and installing the required libraries. It can take a minute or two, so we let it run through. Once provisioning has completed it brings us into the server dashboard, where we have options like clone repository, add spider, delete the server, edit the details, schedule jobs and the SSH keys for the server. We're going to go to Clone Repository: we give it the details of a repository, and it clones the spider directly onto our server, so we don't need to do it manually ourselves; we can do it all through this UI. For that, we first go to the freeCodeCamp part 6 project that we've been using in the part 10 video, and the next thing you want to do is fork your own copy. Obviously if you have your own spider and you're using that, that's fine, but if you're following along with me, the best thing to do is fork your own copy: click the Fork button, follow the steps, and it copies the part 6 repo over to your own account, so, like I have here, you'll have it under your own name. The next step in here will be to add
the deploy key. The deploy key enables the server to run commands like git clone and pull the repo directly from our GitHub onto the machine, so we just need to add it: in the GitHub repo go to Settings, then Deploy keys, then Add deploy key. I'm going to call it freecodecamp VM and paste in my deploy key, which you get from the ScrapeOps UI; copy it, add it in here, remove any spaces, and we don't need to allow write access for the moment because we're only going to be pulling, so we can add the key. Once the key is added we go back to the main ScrapeOps UI. The next thing is to grab the repository URL from the repo's main page (this is my own copy of the part 6 project), set the branch name, which is main as you can see, and then the language, Python, and the framework, Scrapy, which are all correct. Below that is the install script that will run when we clone the repo: it goes onto our virtual machine, git clones the repository, then goes into the repo, installs a Python virtual environment, activates it, and installs the modules listed in requirements.txt, which is everything the project needs to run, and then it makes sure Scrapy is installed. While we're at it, we'll also add the ScrapeOps monitoring module to the script, so it runs pip install scrapeops-scrapy as well, which installs the monitoring Python module for us. Once that's all in, we click Clone Repo and it goes through the steps: it clones the repo, then runs the install script, which can take two or three minutes, and then it finds the Scrapy spiders by running the scrapy list command. So I'll give it a minute or two, and then we should see our repo in the table here and our book spider in the spiders table on the right. As you can see, the install script ran correctly and it was able to find our spiders as well; our spider automatically came in here, and here is our cloned repository. If you click in you can see there's a deploy script here as well, so if you need to deploy updates to your code, you update your own repository and then, for the code to actually go onto the server, you just come in here and click Deploy, and it pulls the latest from your repository. That's how you update the code on your VM. Okay, so we have a repository and we have our spider, so let me quickly show you that you can run the spider just by clicking the Run Now button. It selects the server, the repository and the spider (you could have multiple spiders in your repository), and we just click Submit Job to run it straight away. The job has started, and if you want to check the logs straight away you just come here and click View Logs: you can see it's gone ahead and it's running the spider correctly, and the title, product type, price and everything else is coming through. That's how simple it is to run the spider. The last step now is to activate the monitoring for our spider, so that instead of just looking at a bunch of logs in a log file like that, we can have
everything displayed in the dashboard like we had in part 10. To do that, open up the docs and go to the monitoring section, then the Python Scrapy SDK integration. We've already done the pip install of scrapeops-scrapy as part of the install script when we cloned our repository, so we don't need to do that again, but we do need to add our API key and the extension and middleware entries, and it tells us to add these to the settings.py file in our Scrapy project. So let's do that now: I open up my repository, go to bookscraper and then the settings.py file, and I can edit it directly on GitHub. I need to add the three different sections: first the API key, so I copy that line, paste it in and grab my actual key from my ScrapeOps settings; then the extension; and last of all the downloader middlewares. The generic middlewares and extensions are currently commented out in this file, so I'll just add the new entries underneath. Obviously, if you're using your own spider and it already has DOWNLOADER_MIDDLEWARES uncommented, you just add the two lines to your existing list, but because mine is commented out I'm pasting in the whole block together. So we've got the extensions, the downloader middlewares and the API key; we commit the changes, and now we can deploy the code via our dashboard. Once that's completed we go back to our server, into our freecodecamp server, into the cloned repository, and click Deploy here. The latest code has been deployed, and we can check the log to see that the deployment worked: it updated the bookscraper settings.py file, one file changed with nine insertions, so that's perfect. Now that that's in, we should be able to run our book spider again: submit the job, check the logs and see that it's kicked off, and in our jobs list we can see one running; this is the one we ran for part 10 and this is the one running now. Once it's finished running, in about 20 more seconds, we should see the pages, items, coverage and everything else fill in as well, and we can also see in our dashboard, under Tuesday, this job that is running now. Let me also quickly show you how to schedule it, in case you're just joining us for part 11. If you want to schedule a recurring job on this server, you just click Recurring and then select what you want, for example every day in March at midnight, and submit the job. Then in our scheduled jobs we have bookspider, which will run at 12 every day, only in March, and there it is. If you need to edit that, or you want to disable it, delete it, clone it or view the logs, you can do all of that from the Scheduler tab. If we go back to our dashboard we can now see that the job has completed, the pages scraped are there, the items are there, and everything looks like it ran correctly. We can compare the two days: yesterday this many pages were scraped and these status codes came back, and today these came back, so it's useful if you need to compare the same job run over multiple days; you can quickly see whether the runtime varied or the number of pages
Great, so now that that's in, we should be able to run our book spider again and submit the job. If we check the logs we can see it's kicked off, and if we go to our jobs list we can see there's one running: this is the one we ran for part 10, and this is the one running now. Once it finishes, in about 20 more seconds, we should see the pages, items, coverage and everything else fill in, and we can also see in our dashboard, under Tuesday, the job that is running now.

We'll also quickly show you how to schedule it, in case you're just joining us for part 11. If you want to schedule a job to run on this server on a recurring basis, you just click Recurring and then select what you want, say every day in March at midnight. We submit the job, and then under scheduled jobs we can see the book spider, which will run at 12 every day, only in March, and there it is. If you need to edit that, or you want to disable it, you can go to the Scheduler tab, where you have the ability to disable it, delete it, clone it, view the logs, whatever you need to do.

If we go back to our dashboard, we can now see that the job is completed, the pages scraped are there, the items are there, and everything looks like it ran correctly. We can compare the two days: yesterday this many pages were scraped and these status codes came back, today these status codes came back. That's useful if you need to compare the same job run over multiple days; you can quickly see whether the runtime varied, the number of pages varied, or the items or coverage changed, all on one page at a glance, which makes it very useful. And if we need to drill down into an individual job, we can just click in and delete the job data or do anything else we need to do there. So that brings us to the end of this section. If you have any questions, let us know, and I hope you now have an idea of how to quickly get set up with a virtual machine and hook it up to ScrapeOps. A reminder that everything we've used with ScrapeOps here is free to use: there are no limitations on the number of servers you can hook up or the number of jobs you can run. So that's the end of part 11.

In part 12 we're going to look at everything to do with Scrapy Cloud. Scrapy Cloud was made by the developers of Scrapy, and it's a way to deploy, run and schedule your spiders in the cloud. The great thing about Scrapy Cloud is that you don't need your own third-party server, so you don't need a server with DigitalOcean, Vultr or AWS; you can just deploy directly onto Scrapy Cloud and run it there. The only downside is that scheduling your jobs is paid: you can run your jobs for free, but to schedule jobs on Scrapy Cloud you have to sign up for a monthly subscription. We'll show you how everything works, and then between Scrapy Cloud, ScrapeOps and Scrapyd you'll have had a full overview of all the different ways you can deploy, schedule and run your spiders in the cloud, and you can decide which way you want to go: a completely open-source route using just Scrapyd and ScrapydWeb, a free route using ScrapeOps, or a paid solution with Scrapy Cloud. So you'll have the full array of options covered by the time we finish part 12.
Okay, so let's quickly look at Scrapy Cloud. Scrapy Cloud is made by Zyte, the creators of Scrapy, and it's scalable cloud hosting for your Scrapy spiders; that's pretty much what it is: host and monitor your Scrapy spiders in the cloud. As we said, it's very reliable and easy to scale (on-demand scaling, as they say), and they have lots of other integrations as well. What you need to do first is create an account with them. Once you have an account, you can go into the dashboard and access Scrapy Cloud here on the side. I'm just going to start a new project, call it freecodecamp, and click Start. It then shows the instructions for installing the command-line tool so we can easily deploy our spider to Scrapy Cloud.

First things first, we're going to use the part 6 code example again, the code from part six of this course, and git clone it. I've got an empty folder here open in VS Code, so I'm going to git clone the freeCodeCamp part 6 repo and then quickly set up a virtual environment. If you're on Windows or Linux, follow the steps covered in part two of this course to make sure you're installing the correct virtual environment for your operating system. Once the virtual environment is set up, we can activate it with source bin/activate, and now it's activated. Now we can follow the instructions here; I'm just going to copy and paste them in directly: pip install shub, then shub login, and then it's just a matter of putting in my API key. Okay, so that's installed. I ran shub login and it says I'm already logged in, so I'll log out to make sure you can see the full process: shub login, paste in my API key from here (obviously you'll be putting in your own API key, so please don't use mine), and then we can just do shub deploy, and it should deploy if I'm in the correct project. I'm inside my project, and it expects a scrapy.cfg file to be there, so you need to make sure you're in the correct folder. If everything is correct, you can see "deploying to Scrapy Cloud project" and the project ID there; in the meantime you can see it builds the project and then uploads it to the site here. Here you can see it's been deployed successfully, and if it works correctly this should all be very similar to what you see.

Now that it's deployed to the cloud, we can go to our jobs dashboard and run our spider. We have our book spider available in the dropdown now, and we can leave everything else the same; you've got priorities there if you want certain jobs to run ahead of others. We can just click Run, and that kicks off the book spider and starts scraping the books.toscrape.com site. You can see it's running away there, and when it's completed it will appear in the completed table, with the items, the requests and the errors all available here as well; they'll start populating in a second. As you can see, three requests so far, and the logs are there. If we wanted to run it every day, once a week, or periodically whenever we want, we just go to Periodic Jobs, click Add Periodic Job, select the spider, and then select, say, every Monday at 6 a.m. or 7 a.m., and we can save that.
As you can see here, it says the periodic jobs below will not run because I haven't created a paid subscription, so you'd need to click Subscribe Now and sign up for the paid version, and then this job would run every Monday at 7 a.m. If you just want to try it out, you can schedule the jobs manually by clicking Run here yourself. My spider is still running away, so we'll let it complete and then have a look at the items, the requests and some of the stats that are available. As you can see, the job completed with the thousand items, the requests (that is, the number of pages that were scraped) and the logs. We can click into the requests and see the thousand requests if we want, with all the specific URLs that were scraped, the statuses and so on. The items are a bit more interesting, because that's the actual data we scraped: we've got everything set up nicely there, prices, taxes, titles, URLs, descriptions and so on, so you can quickly check whether the information came through correctly. If you need to look at the logs or the stats, those are also available there. As you can see, it's very polished, very simple to use, and it also auto-scales when you need to scale things. So I think that's the ins and outs of Scrapy Cloud; you should now have a very good idea of the different options you have when it comes to deploying, scheduling and running your spiders and seeing all the stats in various dashboards.

To go through the three options: the first option is Scrapyd, which runs via an API endpoint that you can hit to schedule and deploy things; with Scrapyd you have two UIs you can use, ScrapydWeb or the ScrapeOps dashboard, and we've shown you how to install both, so that's the free, open-source route. The second option was using ScrapeOps for everything, with the ScrapeOps integration connecting directly to a server from a provider such as DigitalOcean, AWS or Vultr, so you need to set up a VM there first. The final option was using Scrapy Cloud for everything and deploying directly to it, but it's paid if you want to schedule periodic jobs to run daily or weekly. So those are the three main options, and I'll leave it up to you to decide what works best for you. That brings us to the end of part 12, and in part 13 we'll do a quick recap of the entire course. See you in part 13.

So guys, we've come to the end of our Scrapy beginners course. This is the last part, so we'll do a quick wrap-up and then I'll give you a small overview of some extra skills you might find useful if you want to continue on and get better at scraping with Scrapy. We've built an end-to-end Scrapy project that scrapes all the books from books.toscrape.com, cleans the data, and stores the extracted data in different file formats and different places, such as a database. We then looked at optimizing our headers and user agents so that we could get past any anti-bot software websites have in place, and at how to use proxies and the different proxy provider options out there if we want something a little less hands-on. Finally, we looked at how you can deploy your spider to a server in the cloud, how you can schedule it to run periodically, and then how you
can view the results of your running spiders. But obviously that isn't everything; we've only gone through the basics, and there are still a lot of edge cases we haven't covered. The number one thing is probably scraping dynamic websites. A lot of websites out there are rendered in the browser, meaning they use a front-end framework that only displays the page once all the data has been received by the browser. In that case, if you request a URL, what you get back might not contain the data you're looking for, because it hasn't had a chance to render inside a browser. In those cases I'd recommend looking at things like Scrapy Puppeteer or Scrapy Selenium, which combine Scrapy with a headless browser integration to actually render the website; that way you can render the page and get the data you need. The other option is to find the API endpoint, because most of these front-end-rendered sites have API endpoints, and you can get the data there. I'll give you one example of what it looks like to see data coming back from an API endpoint.

If you want to work through a few different challenges, the creators of Scrapy have put together a great practice site: we've been using their books.toscrape.com, but they also have a bunch of other examples on quotes.toscrape.com, where you can, for example, practise against infinite scrolling (how would you get around a page with infinite scrolling, like Instagram or Facebook?) or against a JavaScript-rendered site. All of these different pages are available for you to practise on. One example is the infinite-scroll page: as we scroll down, if you go to your network tab you can see data being requested every time we scroll, so more pages of data are requested and come back, but instead of HTML it comes back as JSON. That JSON contains the quotes, which the front-end framework then uses to populate the page. So this is an example of where you can query an API endpoint directly instead of scraping the HTML, asking the API to give you the data back directly, which makes your life even easier (there's a spider sketch below that queries this endpoint). That's one example of what I was talking about if you come into contact with a front-end-rendered page, and I'd really recommend working your way through the different challenges available on the toscrape.com site.

Another very important thing is getting past a login page, which we didn't cover in this course but which a lot of websites have, so that's also something I'd really recommend you go off and explore. The last major thing, I think, is looking at how you can scrape at scale. Imagine you're scraping millions of pages a day: there are different approaches, such as using scrapy-redis to store all the URLs you want to scrape in one central Redis queue, and then having multiple different servers pulling URLs off that queue, so all the URLs can be scraped by multiple worker machines at the same time, all still using Scrapy (a minimal scrapy-redis sketch also follows below). That's something I'd highly recommend you look at if you're interested in scraping at scale.
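As a rough illustration of the API-endpoint approach, here is a minimal Scrapy spider sketch that queries the JSON endpoint behind the quotes.toscrape.com infinite-scroll page. The endpoint URL and the field names (quotes, has_next, page, author.name) are assumptions based on what the browser's network tab shows, so confirm them yourself before relying on them.

    import json
    import scrapy

    class QuotesApiSpider(scrapy.Spider):
        name = "quotes_api"
        # Assumed endpoint seen in the network tab on quotes.toscrape.com/scroll;
        # verify the exact URL and parameters yourself.
        start_urls = ["http://quotes.toscrape.com/api/quotes?page=1"]

        def parse(self, response):
            data = json.loads(response.text)
            # Yield one item per quote, reading the JSON fields directly
            # instead of parsing rendered HTML with CSS selectors.
            for quote in data.get("quotes", []):
                yield {
                    "text": quote.get("text"),
                    "author": quote.get("author", {}).get("name"),
                    "tags": quote.get("tags"),
                }
            # Follow pagination using the flags the API itself returns.
            if data.get("has_next"):
                next_page = data.get("page", 1) + 1
                yield scrapy.Request(
                    f"http://quotes.toscrape.com/api/quotes?page={next_page}",
                    callback=self.parse,
                )

You would run this like any other spider, for example scrapy crawl quotes_api -O quotes.json, and get clean structured data back without touching the rendered HTML at all.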
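And for the scaling point, here is a minimal sketch of what a distributed spider can look like with the scrapy-redis package installed. The class and setting names come from the scrapy-redis documentation as I recall it, and the Redis URL, the redis_key name and the CSS selectors are example assumptions, so check them against the current docs and the target site before using this.

    # settings.py additions for scrapy-redis (sketch):
    # SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # REDIS_URL = "redis://localhost:6379"  # example address, point this at your Redis server

    from scrapy_redis.spiders import RedisSpider

    class DistributedBookSpider(RedisSpider):
        name = "distributed_books"
        # Every worker machine runs this same spider and pops start URLs
        # from this shared Redis list instead of using start_urls.
        redis_key = "distributed_books:start_urls"

        def parse(self, response):
            # The same parsing logic as a normal Scrapy spider goes here.
            for article in response.css("article.product_pod"):
                yield {
                    "title": article.css("h3 a::attr(title)").get(),
                    "price": article.css(".price_color::text").get(),
                }

You would then push URLs onto the queue from anywhere, for example redis-cli lpush distributed_books:start_urls http://books.toscrape.com, and each worker running scrapy crawl distributed_books picks URLs off that shared queue.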
All of this material is also available as articles and videos, so if you want more in-depth guides I'd recommend checking those out: we have guides on scraping behind logins, separate articles on using Scrapy Puppeteer, Scrapy Selenium and so on, and on using scrapy-redis if you want a distributed worker architecture. All of that is up there for you if you want to continue your journey and learn more about Scrapy and the more challenging parts of using it to scrape the web. So I think that brings us to the end of our course. I'd like to thank you for following along, and if you have any questions just reach out, put a comment on the video, and we'll do our best to get back to you. Thanks, guys.
Info
Channel: freeCodeCamp.org
Views: 113,994
Id: mBoX_JCKZTE
Length: 277min 9sec (16629 seconds)
Published: Thu Apr 27 2023