Scrapy to Database - SQLite3 Pipeline with Python

Captions
This is a Scrapy project I did recently that saves the name, SKU, and price of some products from a website. We want to store this in an SQLite database, and to do that we need to use pipelines. So I'm going to come to my project folder and open up pipelines.py, and we can see that we have this single pipeline here, which is the one we're going to use to save to the database.

The first thing we want to do is import sqlite3. Then we want to change a few things within our class to allow us to connect to the database, create our table, and execute commands using the cursor. To do that we need to create an initialization method, def __init__(self), within our class. This is what gets executed when the class is instantiated, and it's where we can create the database, connect to it, and create our cursor. So I'm going to do self.con = sqlite3.connect() with the name of our database; let's call this one mtiles.db. If you haven't used SQLite before, I have a video on that which you can check out, but basically this line is going to create and/or connect to the database. Next is the cursor: self.cur = self.con.cursor(). The cursor is what we use to execute commands against our database, and we're using self because we're within this class.

Now we want to be able to create a table, so I'm going to define a new function, create_table. Within it we use the cursor, self.cur.execute(), to run the SQL command CREATE TABLE IF NOT EXISTS. What this does is try to create the table if it doesn't already exist; if it does exist, it passes over it. I'm going to call the table products, open my parentheses, and come onto a new line, because this is where we define the columns for the table. The first field is the SKU; over in our items we have the SKU, the name, and the price. The SKU is a number, so I'll give it the type REAL and make it our primary key. We're working on the assumption that no two items will have the same SKU, so we can use it as the primary key, which means we can import data without duplicating it. The next column is name, which is a TEXT field, and the last one is price, which is again a number, so REAL again. That's the table we're going to create, so I'll add a call to self.create_table() in __init__. Now, when this class is instantiated, we create or connect to our database, open our cursor, and create the table, but only if it doesn't already exist.

The next thing we want to do is process the item, in the process_item function I've kept down here. This takes the actual item that the spider scrapes (you can see we have these here by default), and it's where we want to execute adding it to the database. Again we use the cursor: self.cur.execute() with the SQL command INSERT OR IGNORE. This is important, because what we're saying is: insert this information into the database, or ignore it if that primary key, which we decided was the SKU, already exists. This stops us adding multiple instances of the same product to our database when we run the spider, which is quite a cool trick: you could run this spider on a store multiple times and only add new products. We have three bits of data, so I'm going to put three question marks there as placeholders for the values we're inserting, then a comma, and then we tell it which bits of data we're putting in. Because we're using an item within Scrapy, and I have this item here, we can reference it back here, so I'll pass in the item's SKU, name, and price. Those three pieces of information are what we're putting into the database. Once that's done, we do self.con.commit(), because we want to commit those changes to the database, and then we return the item back out. So now we have our pipeline: it creates the database and connects to it, we have our execute to create the table if it doesn't exist, and we have our insert command here, which I notice I've left the INTO and VALUES parts out of; there we go, we need to say that products is the table and which values we want to put in, otherwise it won't know what we're doing.

The final thing we need to do is come to our project config file, settings.py. If you looked in the pipelines file, it says don't forget to add your pipeline to the settings. Ours is called ScrapyTilesPipeline; if you had multiple ones you'd probably give it a better name. If we come down past the middlewares we find ITEM_PIPELINES, so we uncomment it, and now our pipeline is enabled. When you have multiple pipelines, the lower the number, the earlier it runs. You can use pipelines for all sorts of things, like checking scraped data, but the only one we're using is the database pipeline, so leaving it at 300 is absolutely fine.

Now if we run the scraper, we should create the table, scrape the data, and add it to the table in the database. Let's scroll up to the scraped count, somewhere here: 24, okay, and we can see it's all working. If we go to our project folder, we have our database here, and I find the easiest way to look into an SQLite database is to use DB Browser for SQLite. I'll open my database, and we can see the products table with the columns we defined, and if I go to Browse Data we have all the information for those 24 products that we've scraped. We've actually left the SKU with a dot at the end, so we'd probably want to tidy that up. If I run this again, we'll see INSERT OR IGNORE working: we scrape all the data again, but it isn't inserted into the database because it already exists there. So there we go: hit refresh, and it's still the same data.

That's going to be the easiest way to save your Scrapy information into a database using SQLite. Don't forget to create the table, and don't forget con.commit(), otherwise none of your data will make it into the database. It takes a little bit of work to get your items right, but make sure that you do, and the data will go in nice and neatly. If you need some more information on how to work with SQLite and Python, go ahead and check out this video here.
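Putting the steps described above together, a minimal version of the pipeline might look like the following sketch. The class name ScrapyTilesPipeline, the file name mtiles.db, and items exposing sku, name, and price keys are assumptions based on the walkthrough, not the video's exact code:

```python
import sqlite3


class ScrapyTilesPipeline:
    def __init__(self):
        # Creates mtiles.db if it doesn't exist, otherwise connects to it,
        # then opens a cursor and makes sure the table is there.
        self.con = sqlite3.connect("mtiles.db")
        self.cur = self.con.cursor()
        self.create_table()

    def create_table(self):
        # Safe to run on every crawl: IF NOT EXISTS skips creation
        # when the products table already exists.
        self.cur.execute(
            """
            CREATE TABLE IF NOT EXISTS products (
                sku REAL PRIMARY KEY,
                name TEXT,
                price REAL
            )
            """
        )

    def process_item(self, item, spider):
        # INSERT OR IGNORE drops any row whose primary key (sku) is
        # already in the table, so re-running the spider on the same
        # store only adds products it hasn't seen before.
        self.cur.execute(
            "INSERT OR IGNORE INTO products (sku, name, price) VALUES (?, ?, ?)",
            (item["sku"], item["name"], item["price"]),
        )
        self.con.commit()  # without this, nothing is written to disk
        return item
```

To enable it, uncomment the ITEM_PIPELINES dict in settings.py, e.g. ITEM_PIPELINES = {"scrapy_tiles.pipelines.ScrapyTilesPipeline": 300} (the exact dotted path depends on your project name; lower numbers run earlier).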
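The trailing dot left on the scraped SKU could be tidied up in the pipeline before inserting. A hypothetical helper (the exact raw format of the scraped SKU is an assumption):

```python
def clean_sku(raw_sku: str) -> float:
    # Strip the stray trailing dot left over from scraping,
    # then convert to a number so it matches the REAL column.
    # e.g. "12345." -> 12345.0
    return float(raw_sku.rstrip("."))


print(clean_sku("12345."))  # → 12345.0
```

You would call this on item["sku"] inside process_item before the INSERT, so the cleaned value becomes the primary key.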
Info
Channel: John Watson Rooney
Views: 1,688
Rating: 4.9629631 out of 5
Keywords: scrapy tutorial, scrapy to db, scrape to database, scrapy sqlite, scrapy database, web scraping with python, python scrapy tutorial
Id: cw5QtDxwTIQ
Length: 8min 2sec (482 seconds)
Published: Sun Sep 05 2021