Python Scrapy Tutorial - 14 - Pipelines in Web Scraping

Video Statistics and Information

Captions
Welcome back, you beautiful Python developers. Before we go on to learn about storing the scraped data inside our database, we have to learn about pipelines. I'm talking about this file on the left-hand side: pipelines.py.

The flow of our scraped data looks like this: first it gets scraped by a spider, then it is stored inside temporary containers called items, and from there you can store it inside a JSON file. But what if you want to send this data to a database like SQLite or MySQL? We have to add one more step to the flow: after storing the data inside item containers, we send it to the pipeline, where the process_item method is automatically called and the item variable contains our scraped data. All of the code inside pipelines.py has been automatically generated for us by Scrapy, but we still need to activate the pipeline inside our settings.py file.

So let's open up settings.py and search for the word "pipeline". You can see that ITEM_PIPELINES is there, commented out, so let's uncomment those three lines. If you are using PyCharm, you can select all three lines and press Ctrl+/ to toggle comments on multiple lines at the same time.

Now that the pipeline has been activated, one more thing about the number next to the pipeline name: the lower the number, the higher the priority that pipeline is given. This pipeline class was created for us automatically by Scrapy, but if we create additional pipelines, we have to add each of them to ITEM_PIPELINES and give each one a number depending on its priority. If you want a pipeline to execute first, give it a lower number than the others. We will just leave ours at 300.

Next, let me go through the flow between our spider and the pipeline. Whenever we scrape data (in our case the title, author and tag) we yield the items, and every time we yield, the items go to pipelines.py, now that we have activated the pipeline in settings.py, and into the process_item method. The items yielded on every iteration of the for loop are sent to pipelines.py and contained inside the item variable; each field is actually a list, not a single value.

Let's try this out by printing a value. In process_item I'll print the word "pipeline" followed by the title, writing item['title'], which should print the quote we are scraping. Go to our quotes spider, open the terminal, move inside the project folder, and run the crawler the usual way: scrapy crawl quotes. The crawler runs, but if we scroll up we see an error: "TypeError: must be str, not list". We made a very basic mistake. Back in pipelines.py, let's add [0] after item['title']: without the [0] we get the whole list, but item['title'][0] gives us the string. Let's crawl again and see if it works this time; hopefully it will. Scrolling up... it's working now. Along with the quotes, authors and tags that we were already printing by yielding the items, we have also printed out the line that says "pipeline" followed by the quote we wanted.

That's pretty much it for this video. In the next video we are going to learn how to finally send this data from our pipelines.py file to our database, specifically an SQLite database. I'll see you over there.
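The pipeline described in the video can be sketched as below. This is a minimal sketch, not the video's exact code: the class name TutorialPipeline and the 'title' field name are assumptions based on the tutorial's quotes project, so adjust them to match your own project.

```python
# pipelines.py -- minimal sketch of the pipeline discussed above.
class TutorialPipeline:
    def process_item(self, item, spider):
        # Each field on the item is a list (CSS/XPath extraction
        # returns all matches), so "pipeline: " + item['title']
        # raises "TypeError: must be str, not list".
        # Indexing with [0] gives the quote text itself.
        print("pipeline: " + item['title'][0])
        # Returning the item passes it on to the next pipeline, if any.
        return item
```

Note that process_item must return the item (or raise DropItem) so that lower-priority pipelines still receive it.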
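The activation step in settings.py looks roughly like this once the generated block is uncommented. The module path 'tutorial.pipelines' is an assumption about the project name; Scrapy generates the correct path for your project automatically.

```python
# settings.py -- activating the item pipeline, as described above.
# Scrapy generates this dict commented out; uncomment it to enable
# the pipeline. If you add more pipelines, the one with the lower
# number runs first (values are conventionally 0-1000).
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}
```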
Info
Channel: buildwithpython
Views: 33,803
Rating: 4.8709679 out of 5
Keywords: web scraping, scrapy, python, scraping, python web scraping tutorial, beginner, python projects, web scrapping, web crawler, tutorial
Id: VMVFB1VKpto
Length: 4min 39sec (279 seconds)
Published: Mon Jan 21 2019