Scrapy: Extending the Files Pipeline [PART 2]

Captions

All right, so what we're going to do now is let Scrapy save the downloaded MP3 files by their original name. First, let's create a new field in the items.py file called file_name, equal to scrapy.Field(). This one will hold the file name. Now, if we take a look at the HTML markup, we can see that the a node contains the file name plus the extension. So what we can do is grab the full file name and remove the extension from it, because the Files Pipeline is going to be responsible for setting the extension, not the parse method.

Okay, so within the parse method, let's call loader.add_xpath: the field name is file_name, and the XPath expression will be .//text(), like this. Next, in the items.py file, we can use input and output processors to remove the extension, like we learned before. So let's import the TakeFirst class and the MapCompose class: from scrapy.loader.processors import TakeFirst, MapCompose. Now, inside the Field class, let's set input_processor equal to MapCompose, and I'm going to pass the name of the function that will be responsible for removing the extension, so let's call it remove_extension. Then let's set output_processor equal to TakeFirst().

Now let's create the remove_extension function: def remove_extension, which takes a value as an argument. There we go. We know that the value will look like this: filename.extension, right? In Python we can use a utility function called splitext that splits the file name from the extension automatically. To do that, let's import the os module, so import os. Then let's return os.path.splitext, and we pass the value as an argument. This will return a tuple that contains, at the first index, which is 0, the file name, and at the second index, which is 1, the extension. We want only the file name, so we set the index to 0.

Now, more importantly, we need to modify the default behavior of the Files Pipeline. So inside the pipelines.py file, let's import the Files Pipeline: from scrapy.pipelines.files import FilesPipeline. Now I'm going to show you a trick: if you press Ctrl and then click on FilesPipeline, we can see its class implementation. Okay, so I'm going to bring this here and close the file explorer. There we go. The Files Pipeline has two methods that we need to override. The first one is a method called get_media_requests, which is responsible for sending requests to all the file URLs to download them, and the second one is called file_path, which is responsible for saving the file. So I'm going to press Ctrl+F and type get_media_requests. Here it is. Let's copy it and paste it into our pipeline. Good. Now let me bring this here, and let's copy the file_path method too. Let's paste it here, and then let's close the Files Pipeline.

Okay, now bear with me. If you notice, in the get_media_requests method we have access to the item object, right? So what we can do is grab the current item's file name and pass it to the file_path method. In Scrapy, when we need to pass a piece of information from one method to another, we use the request meta. So within the Request class we can do the following: meta equals two curly braces; let's set the key to filename, for example, and we want to get the value from item.get, with the field, which is file_name. Beautiful.

Now, in contrast, in the file_path method we have access to the request object, right? Which means that from this request object we can catch the information that was sent by this Request class. But before this, let's take a look at the return statement of the file_path method. We have return full, which is the directory name; media_guid, which is the big number; and then media_ext, which is the file extension. So let's delete all of this except the media extension and the return statement. Now what we need to do is set url equal to request.url because, as you see, the url is passed here as an argument to the splitext method. Then we have to replace this media_guid with the file name sent by the request meta. To do this, we call request.meta, two square brackets, and then the key name, which is filename. So what you should remember from all of this is that in Scrapy, when we want to pass a piece of information from one method to another, we can do that by using the request meta. Okay, now before I forget, let's import the Request class: from scrapy import Request.

Now, one last thing we need to do is add our demo downloader pipeline to the settings.py file. So let's copy the class name and open the settings.py file. Let's change the value to demo_downloader.pipelines. and paste the class name. Now let's open the File menu and click on Save All, and before I forget, let's delete the full folder.

I can see that we have another mistake in the pipelines.py file, so let's open it. It's an indentation problem; let's fix it. Ctrl+S to save the file. There we go. I think we didn't import the os module, so import os, Ctrl+S to save the file. Great. Now back to the command prompt, let's launch the spider: scrapy crawl downloader, press Enter. It didn't download the songs, so let's go back to the pipelines.py file. Let's see. Hmm, okay, that's because we need to inherit from FilesPipeline and not from object. Okay, so FilesPipeline. Let's take out the process_item method; I think we don't really need it here. Ctrl+S to save the file. Now let's go back to the command prompt. I'm going to clear everything and then relaunch the spider. Okay, I think it's working now, so I'm going to let it download two or three songs and then stop it.

All right, now back to VS Code, let's expand the full folder, and there we go: we now have all the files saved by their original name, which is fantastic. Now, the Images Pipeline works the same as the Files Pipeline, and I'm going to add an article just after this video explaining the difference between them, and I'm going to add a challenge for you, so make sure to do it. See you there.
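
The extension-stripping helper described in the captions can be sketched on its own, using only the standard library. This is a minimal sketch, not the code shown on screen: the function name remove_extension follows the captions, while the sample file name "song.mp3" is an assumption made for illustration. In items.py, the captions wire this function into the field through MapCompose(remove_extension) as the input processor, with TakeFirst() as the output processor.

```python
import os


def remove_extension(value):
    # os.path.splitext returns a (root, ext) tuple, e.g. ("song", ".mp3").
    # Index 0 is the file name without its extension.
    return os.path.splitext(value)[0]


# "song.mp3" is a made-up sample value, not taken from the video.
print(remove_extension("song.mp3"))  # song
```

Only the last dot-separated suffix is treated as the extension, so a name such as "my.song.mp3" would keep its inner dot and become "my.song".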

Info

Channel: Human Code

Views: 6,362

Rating: 5 out of 5

Keywords: Scrapy, Web Scraping, Python web scraping, Scrapy Files Pipeline, Scrapy storing files with original filename, Media pipelines scrapy, Python, Scrapy Images Pipeline, web scraping course

Id: 4VvYOWiQch4

Length: 8min 19sec (499 seconds)

Published: Mon Dec 17 2018