Download Images using Scrapy and Python

Video Statistics and Information

Captions
[Music] So in today's video we will be talking about how we can download images using Scrapy. This is our sample website for today: the Eiffel Tower page from Wikipedia, which has a lot of images. We'll start by creating a Scrapy project. This tutorial assumes that you know what a Scrapy project looks like, but even if you don't, you can follow along. So I'm going to create a project, and I'm calling this project wiki. I'm going to do exactly what it is suggesting: you can start your project with cd wiki and then scrapy genspider, and I'm going to call it paris, and then x, it doesn't matter. And I'm going to open this complete folder in Visual Studio Code.

Now that the spider is open here, I'm going to remove allowed_domains; it reduces complexity and is not needed. And I have copy-pasted this URL into start_urls. The next thing we need is images. I'm not going to spend time in this video on how I created the selector, because this is going to be a slightly longer video. So let's call it raw_image_urls, and I'm going to use a CSS selector, .image img. So this is my selector, and in the case of images, the URL, the source of the image, is inside the src attribute, and then I'm calling getall.

There is one pointer here: most of the time the image URLs that you get will be relative URLs, so you will need to write a loop to convert them into proper URLs. So let's create a variable clean_image_urls, and this is going to be a list. Remember that we need all the URLs in a list, even if it is just one URL. So for now let's loop through all the images: for img_url in raw_image_urls, and the function that I am going to call is response.urljoin. What this function does is take whatever URL you provide as a parameter and convert it into an absolute URL relative to the current address. If the URL you pass in is already absolute, it won't change anything. So whatever is returned by this, I am going to add to our clean URLs list. So this is where we have all the URLs now, as a list. Again, this is very important: even if you have one URL, this should be a list.

Now let's look at our structure. If you already have items defined, then you need to add two fields, images and image_urls. Those are the two fields that you need to add. But if you have not worked with items, it's fine; you can forget about them and directly yield a dictionary. So let me show you how you can yield that dictionary. It's very simple: you need to return a dictionary where the key name is image_urls. Note that it has to be exactly like this. It cannot be images_url, it cannot be images, it cannot be image_url, nothing else. The dictionary key has to be image_urls, and as the value you pass this clean_image_urls. I need to move it inside this parse method, and there we go. That's all we need to do in this spider.

The rest we will be doing inside settings. In settings we need to add two settings: number one, the pipeline, which is built in and downloads the images, and number two, the folder for the images, wherever you want to download them. Of course, this pipeline and all these things you'll forget unless you are practicing on a very routine basis. So what I recommend is that you do a quick Google search for "scrapy image pipeline" and go to the result, and you have the page titled "Downloading and processing files and images". The process of downloading files and images is very similar. Here you can see the images pipeline, and this is what you can simply copy, because it is built in and no change is required. So just copy it here. Then you need to provide a folder. How do we provide a folder? The setting name has to be like this: IMAGES_STORE. Note that everywhere it is plural, images store. So let's copy this and paste it here. It says "path to valid directory". That just means that the directory name should be valid; it cannot be like this. It doesn't mean that you have to create the directory or that it has to be an absolute path. Let me call it local folder, and you will see that local folder is created here, or wherever I am running this project, and it will download all the files.

So that's all we need to do. Again, a quick review: in the spider you need to yield a list which contains full, complete URLs, and the key has to be image_urls, and in the settings you need to provide these two settings. That's all you need to do.

Okay, so we can see that the full URLs were created, there was some checksum, and let's see what we have. Let me open the files. So here is wiki, and you can see local folder, the name that we provided in the settings. It is created, and inside it there is another folder, full, which is created by default, and you have all the images here. You can see that all the images are downloaded, but the file names are not meaningful. That is because Scrapy saves each file under a checksum. Why does it do that? Because you will usually run a spider not for one single page but for a whole lot of pages, and even on one page there can be two files with the same name but different paths. So what Scrapy does by default is take the complete URL, create a checksum, and use that checksum as the file name. If you want to know how to change this, drop a comment and I will create another video where I talk about how you can do that.

So if you want to take care of these strange file names, we need to stop using this built-in pipeline and create our own pipeline. Where are all these pipelines written? They are here in pipelines.py. So let's do something very simple: let's take this path, go to pipelines, and we can delete this, we don't need all that. Let's paste it here just to make things easier. Done. So what I'm going to do is from scrapy.pipelines.images import ImagesPipeline. Now we have the pipeline available, and we are going to write a simple class. What do you want to call it? Custom pipeline; let's call it CustomImagePipeline, and the base class for this one should be ImagesPipeline. It's very simple, and there will be only one method.

We'll go to ImagesPipeline, so I'm just going to press Ctrl and click. ImagesPipeline derives from FilesPipeline, so I am going to look at the definition of FilesPipeline, scroll down to the bottom, and look at the last function, file_path. This is the function which creates the file path. So I am going to copy it, come to our pipeline, and paste it here. This is the function that we are going to change to get our file names corrected. Reading through it, by the way, if you notice, this is where we have full, so that's why the full folder was created. I'm going to delete everything from this function. We have the request available here, so the URL is in request.url. So let's take request.url and split it. This scenario is very simple: if you split it using a forward slash, the file name is the last part. And how do you get the last part from this list? Index minus one. So this is what is going to give us the file name which is contained in the URL of the image. Let's return it.

Now we need to enable this pipeline. So let's come to settings, and instead of this readily available class we need to put our custom pipeline, but it should be the fully qualified name, so it will be wiki.pipelines.CustomImagePipeline. So let's open the terminal and run the crawler, scrapy crawl paris, and let's see what happens. We can see that the checksum is calculated and the status is downloaded. So let's go and look at this folder. Okay, so we have local folder, and now I can see that there is no folder called full, and the file names are whatever was there in the URL. So that's all for this video, see you in the next one. [Music]
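The two settings mentioned in the captions could be written as in this fragment of settings.py. The priority value 1 and the spelling local_folder are assumptions; the video only says the folder was called "local folder".

```python
# settings.py (fragment)

# Enable Scrapy's built-in images pipeline.
# The number is the pipeline's order; 1 is an assumed value.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Folder where downloaded images are stored. A relative path is fine;
# Scrapy creates the folder if it does not exist.
IMAGES_STORE = "local_folder"  # assumed spelling of the folder name
```

The built-in images pipeline depends on the Pillow library, so it needs to be installed for the downloads to work.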
Info
Channel: codeRECODE with Upendra
Views: 5,284
Rating: 4.5135136 out of 5
Keywords: Python Scrapy, Python Web Scraping, Scrapy Spider, browser scraping, code/recode, data scraping, download images with scrapy, python scraping, python scrapy tutorial, python web scraping, python web scraping tutorial, scrape, scrape web pages, scraping, scrapy, scrapy for beginners, scrapy pipeline, scrapy python, scrapy tutorial, screen scraping, web crawler, web scraping, web scraping python, web scrapping, web spiders, webscraping, website scraping
Id: jMMErSuEJ2g
Length: 12min 57sec (777 seconds)
Published: Fri Jul 03 2020