Scrapy: Downloading Files Using Scrapy [PART 1]

Video Statistics and Information

Captions
Hello everyone. The aim of this section is to show you how you can download any kind of file using Scrapy, and for this demo we'll be downloading MP3 files from this website. At the time of recording this video, Scrapy provides two pre-built media pipelines: one is called the Files Pipeline, which I'm going to explore in a moment, and the other is called the Images Pipeline, which can be used to download images from various websites. They both share the same idea, so in this video we'll explore the Files Pipeline, and the Images Pipeline I'm going to leave to you as a challenge. So bear with me.

All right, now back to the website. First, let's open the Chrome developer tools so we can inspect the HTML markup. We have a table which contains a table body, which is where all the content lives, and inside it we have a list of table rows. The first row holds the headers: Name, Last modified, Size, and Description. The second one contains the horizontal line, the third one is a link to go back to the parent directory, and lastly the fourth table row is the first song. If we expand it, we can see that it has five table data cells, and within the second one we have the link to the song itself. So the idea is to write an XPath expression that selects this second table data cell from all the rows below and including this fourth one.

Let's open the search box with Ctrl+F. There we go. We'll start by selecting the second table data cell from the fourth row, and then I'm going to show you how to select all the remaining rows below it. Type a double slash, tr, and 4 inside square brackets, //tr[4] (or you can use the position() function like I showed you before), and like this we've selected the fourth row. To get the second table data cell from it, we append /td[2], and that's it. Now, to select all the cells below and including this one, we use the following:: axis, typed just after the double slash, and that's it. This XPath expression returns 15 nodes, but as you can see we only have 14 songs. That's happening because the last row, which is the image, is selected too, and we don't want that, so we need to figure out how to exclude it. If we take a look at the last row, we can see that it contains an anchor node whose href attribute is set to f.jpg, whereas the one before it contains an anchor node whose href attribute ends with the .mp3 extension. So, right after the second table data cell, we access the a node, and inside two square brackets we call the contains() function: the first argument is where to search, in our case the href attribute, and the second argument is the value to look for, mp3. Like this we get only 14 nodes, which is fantastic. Another way to do this is to exclude all the links that have the jpg extension: if we change the value to jpg and wrap contains() in the not() function, we again get 14 nodes. I'm going to stick with this one.

All right, now back to VS Code. I've already created a project called demo_downloader, and I have one spider called mp3_download, so pause the video, create a project, and create a spider like this one. Welcome back. First, let's create an item class. Open items.py, and inside the DemoDownloaderItem class let's create two fields: the first one must be called file_urls and equals scrapy.Field(), and the second one is called files. Naming the two fields like this is compulsory, it's a must, because by default the Files Pipeline will grab the URLs from this file_urls field, download each one of them, and then store the results in the files field. Now, back to our spider class. Within the parse method, let's iterate through all the links: for link in response.xpath(...), and I'm going to paste in the XPath expression. There we go. Now, for each link we will get an anchor node.
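The captions garble the exact expressions, but the selection logic described above can be reproduced with any XPath 1.0 engine. The sketch below uses lxml against a made-up miniature of the directory listing (two songs plus the image row; the real page's markup is assumed, not quoted). `position() >= 4` is used here as an equivalent spelling of "the fourth row and everything below it", with a `following::` axis variant shown alongside:

```python
# A miniature of the directory-listing markup (assumed, not the real page):
# header row, <hr> row, parent-directory row, two songs, one image.
from lxml import html

PAGE = """
<table>
  <tr><th>Name</th><th>Last modified</th><th>Size</th><th>Description</th></tr>
  <tr><td colspan="5"><hr/></td></tr>
  <tr><td></td><td><a href="/">Parent Directory</a></td><td></td><td></td><td></td></tr>
  <tr><td></td><td><a href="song1.mp3">song1.mp3</a></td><td></td><td></td><td></td></tr>
  <tr><td></td><td><a href="song2.mp3">song2.mp3</a></td><td></td><td></td><td></td></tr>
  <tr><td></td><td><a href="f.jpg">f.jpg</a></td><td></td><td></td><td></td></tr>
</table>
"""
tree = html.fromstring(PAGE)

# Second cell of the fourth row only (the first song):
first_song = tree.xpath("//tr[4]/td[2]/a/@href")

# Every row from the fourth one down; this also picks up the image row:
all_cells = tree.xpath("//tr[position() >= 4]/td[2]/a/@href")

# The same node set via the following:: axis (everything after the third row):
following_cells = tree.xpath("//tr[3]/following::tr/td[2]/a/@href")

# Keep only the mp3 links, or equivalently exclude the jpg one:
mp3s = tree.xpath("//tr[position() >= 4]/td[2]/a[contains(@href, 'mp3')]/@href")
no_jpg = tree.xpath("//tr[position() >= 4]/td[2]/a[not(contains(@href, 'jpg'))]/@href")

print(first_song)  # ['song1.mp3']
print(mp3s)        # ['song1.mp3', 'song2.mp3']
```

On this six-row sample the unfiltered expressions return three cells and the filtered ones return two, mirroring the 15-versus-14 counts seen on the real page.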
Right now, the anchor node's href holds a relative URL. Let me show you: if we take a look at this one, we can see that the href attribute value contains a relative URL, and of course it's relative to this page. Scrapy can't tell by itself what to do with relative URLs; we need to tell it explicitly to build a full absolute URL. To see what we need, copy the page URL, right-click on the anchor node, choose Edit as HTML, and paste the website URL in front of the relative one: this is exactly the URL we need Scrapy to request. So, inside the for loop, let's create a variable called relative_url, equal to response.xpath('.//@href').extract_first(). Next, create another one called absolute_url, equal to response.urljoin(relative_url), and like this we've built an absolute URL using Scrapy.

The next step is to take this absolute URL and assign it to the file_urls field. Let's import the ItemLoader class and the DemoDownloaderItem class: from scrapy.loader import ItemLoader, and from demo_downloader.items import DemoDownloaderItem. Now, within the for loop, let's instantiate the loader: loader = ItemLoader(item=DemoDownloaderItem(), selector=link). Just after the absolute_url variable, let's call loader.add_value(); the field name is file_urls and the value is absolute_url. Just below it, let's yield loader.load_item().

More importantly, we need to activate the Files Pipeline. In the settings.py file, let's uncomment the ITEM_PIPELINES dict and change the default entry to scrapy.pipelines.files.FilesPipeline, with the priority set to 1. Next, I want to increase the download timeout, which by default is set to three minutes: DOWNLOAD_TIMEOUT = 1200, that is, 1200 seconds, which is 20 minutes. You only need to increase it if you have a slow internet connection like mine.
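Scrapy's response.urljoin(url) is a thin wrapper around the standard library's urljoin, with response.url as the base, so the joining behavior can be shown with plain urllib. The URLs below are made up for illustration:

```python
from urllib.parse import urljoin

# Scrapy's response.urljoin(url) delegates to this same stdlib function,
# using the page's own URL as the base. Example URLs are hypothetical.
base = "https://example.com/music/"   # stand-in for the page URL
relative_url = "song1.mp3"            # what the href attribute holds

absolute_url = urljoin(base, relative_url)
print(absolute_url)  # https://example.com/music/song1.mp3

# Root-relative and already-absolute hrefs are handled correctly too:
print(urljoin(base, "/covers/f.jpg"))                  # https://example.com/covers/f.jpg
print(urljoin(base, "https://cdn.example.com/a.mp3"))  # https://cdn.example.com/a.mp3
```

This is why calling urljoin on an href that is already absolute is harmless: the base is simply ignored.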
Finally, we need to set the path where the downloaded files will be stored: FILES_PATH, which we'll point at a folder under the project root directory. Now let's save all the files: click on the File menu and then Save All. All right, let's launch the spider with scrapy crawl followed by the spider's name, and press Enter. Hmm, it's not working; let's see why. Back in VS Code: for link in response.xpath..., the loader... ah, instead of the response object we should be using the link selector object here, so let's change response to link. Ctrl+S to save the file. Back at the command prompt, let's relaunch the spider and see if it works. Hmm, it didn't work this time either, and I'm not quite sure why, so let's go back to VS Code. Everything seems to be correct here, so let's look at the settings.py file: scrapy.pipelines.files.FilesPipeline, DOWNLOAD_TIMEOUT, FILES... ah, we shouldn't use FILES_PATH; we have to use FILES_STORE, like this. Ctrl+S to save the file. Back at the command prompt, I'm going to clear everything and then relaunch the spider. I'm going to let it download two or three songs, then stop it.

All right everyone, once the spider finishes downloading the files, go back to VS Code and open up the project explorer. You will see a new folder called full, which contains all the MP3 files; in my case I have only two. Now, these MP3 files don't have their original file names, so what we're going to do next is override the default Files Pipeline behavior to store the files under their original names.
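Putting the whole section together, the finished project plausibly looks like the sketch below. This is a hedged reconstruction from garbled captions, not the video's verbatim code: the names demo_downloader, DemoDownloaderItem, and Mp3DownloadSpider are inferred from the narration, the start URL is not recoverable and left as a placeholder, and the XPath is the position()-based equivalent of the expression built earlier. It spans three files of a standard Scrapy project, so it is not runnable as a single script:

```python
# items.py: the two field names are mandatory. FilesPipeline reads
# download URLs from file_urls and records the results in files.
import scrapy

class DemoDownloaderItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()


# spiders/mp3_download.py
import scrapy
from scrapy.loader import ItemLoader
from demo_downloader.items import DemoDownloaderItem

class Mp3DownloadSpider(scrapy.Spider):
    name = "mp3_download"
    start_urls = ["https://..."]  # the demo site's URL (not recoverable from the captions)

    def parse(self, response):
        for link in response.xpath(
            "//tr[position() >= 4]/td[2]/a[not(contains(@href, 'jpg'))]"
        ):
            loader = ItemLoader(item=DemoDownloaderItem(), selector=link)
            # Use the link selector, not response (this was the first bug).
            relative_url = link.xpath(".//@href").extract_first()
            absolute_url = response.urljoin(relative_url)
            loader.add_value("file_urls", absolute_url)
            yield loader.load_item()


# settings.py
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
DOWNLOAD_TIMEOUT = 1200    # default is 180 s; raise it only on a slow connection
FILES_STORE = "downloads"  # FILES_STORE, not FILES_PATH (this was the second bug)
```

With FILES_STORE set, the pipeline saves downloads into a subfolder named full under that directory, using a SHA-1 hash of each URL as the filename rather than the original name, which is exactly the behavior the next part overrides.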
Info
Channel: Human Code
Views: 12,679
Rating: 4.9436622 out of 5
Keywords: Scrapy, Web Scraping, Python web scraping, Scrapy Files Pipeline, Scrapy storing files with original name, Media Pipelines Scrapy, Python, Scrapy Images Pipeline, web scraping course
Id: CpvkCzd2O6A
Length: 10min 6sec (606 seconds)
Published: Mon Dec 17 2018