Download All Images and Other Data - New Easier Method Using Python Scrapy

Captions
All right, hello and welcome to another live stream. In this live stream we are going to talk about image downloads. Image downloads are interesting because Scrapy already ships with a pipeline for this, so we can use that pipeline to download the images, but the file names it creates are not really useful. So today we are going to see how to download images from a website, how to rename them using the product names, and how to get a clean output. Those are the things we are going to cover, so let's get started.

The site I have chosen is built on Shopify and does not look heavily customised; there are only 132 products, so it will be a good site to practise on. There is one more reason I chose it: if we scroll down, you can see that the site uses infinite scroll for its pagination. There is one interesting thing about that, and it will probably apply to a lot of other sites as well.

Let's start by writing the basic scraper. We are going to visit the listing page, which will be our start URL, and collect the URLs of the products. On a product page we will grab a few details; for learning purposes, just the product name and the price should be sufficient for today. We will also collect the image URL, because that is what we will be downloading. I am going to show you some tricks about how we do that. Hello Hersh, and hello gcu-1.

Let's start with the analysis. I have pressed F12 to open the developer tools and zoomed in quite a lot, so this should be readable. Whenever you analyse a dynamic site like this, and if you have seen my other videos you know how to monitor the network requests, you can scroll down and watch more and more products being loaded dynamically, and you could go and examine that API. But there is one more thing to try: disable JavaScript and see how the site looks. Press Ctrl+Shift+P (P as in Paris) to open the command palette and start typing "disable"; the "Disable JavaScript" option shows up. With JavaScript disabled, reload the page. What you see now is the page that Scrapy actually gets. The listing is still there, but notice that the infinite-scroll pagination is gone. That is how useful this trick is: if a site has a noscript fallback, a mechanism to render the content even when there is no JavaScript, we no longer have to deal with dynamic content at all.
We can just click the next button instead. If we inspect it, the next button is simply an anchor tag. How do we locate it? Inside the anchor there is an svg (just the arrow graphic) and then a span, so we can look for the span that says "Next page" and then go one level up. Going up a level is exactly where CSS selectors stop being useful: you can find the span, but the actual URL of the next page lives on the parent anchor, and to traverse up the DOM you have to use XPath, because CSS selectors cannot do that. This is where XPath is more powerful, though I still recommend starting with CSS selectors because they are easier.

With that in mind, let's look at one of the product pages. Remember that JavaScript is still disabled. If you recall how this page looked with JavaScript enabled, the main image had a zoom handle; let me re-enable JavaScript and reload to show you (it is taking a while to reload). Once it loads you can see the image carries a zoom widget, and locating that particular image would be a bit more challenging. With JavaScript disabled there is no such problem.

Now we are ready for the command line, so let's open the terminal and create the Scrapy project. Usually when I show you Scrapy spiders I show standalone spiders and start directly with the genspider command, but because we have to use item pipelines this time, we are going to start with a project. If you just type scrapy on its own you will see all the available commands; genspider is the one we usually use, but today we use startproject. So: scrapy startproject. I already have a folder called "download images", so I am calling this one dl_images.

One thing you must remember: do not use "image" or "images" anywhere in your own names when you are working on image downloads. Those names clash with what the pipeline uses internally, so putting them in your class names or variable names will create unexpected problems that take a long time to track down.

Now we have a project. Let me show you what it looks like: there is a dl_images directory, inside that another dl_images package, then a spiders package that is empty apart from the initializer module, an items.py that gives you an idea of how items should be structured, a middlewares.py with some default middleware skeletons, a pipelines.py, and a settings.py with some predefined settings plus a lot more that are commented out. If you are not familiar with Scrapy projects, it is a good idea to go and have a look at the kind of code that gets generated.
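Coming back to the parent-step idea for a second, here is a tiny, self-contained demonstration of it. The markup and the "Next page" span text are stand-ins for the real page, so treat them as assumptions.

```python
from scrapy.selector import Selector

# Toy markup standing in for the shop's pagination control (assumed structure).
html = '<a href="/collections/all?page=2"><svg></svg><span>Next page</span></a>'

sel = Selector(text=html)
# Find the span by its text, then step up to the parent <a> with "..",
# a move that CSS selectors cannot express.
next_href = sel.xpath('//span[contains(text(), "Next page")]/../@href').get()
print(next_href)  # /collections/all?page=2
```

The same ".." step shows up again in the pagination code at the end of the session.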
By the way, I try to make these live sessions detailed; I will probably also post a concise 5 to 10 minute video on the same concept. So let's make it interactive: if you have questions, just go ahead and ask.

Now we are inside the project directory, so this is where we run scrapy genspider. Again, don't use "images"; let's call the spider products. The fourth parameter is supposed to be the start URL, and I am just putting an x there for now. Let's open the project in VS Code (it is already open) and the products.py file has been generated. We don't need to look at the details yet, so I will hide the sidebar: Ctrl+B is the shortcut to hide it, Ctrl+Shift+E brings back the explorer with all the files, and Ctrl+Shift+D brings up the debug panel. These are useful shortcuts and you will remember them as you go along.

Now let's work with the site. I will take the raw URL of the first listing page and copy it; that is going to be our start URL. The allowed_domains attribute is not very useful here, so I am removing it. allowed_domains is very important for crawl spiders, but for a regular spider running a controlled loop it is fine to drop it.

Next we need a selector that picks up all eight products; basically we want their anchor tags. Each listing entry is a product card, so I will copy that class name and press Ctrl+F in the Elements panel to test a selector. I am going to work entirely with XPath today: start with a double slash and an asterisk to match any element, and remember that the div has some other classes applied as well, so we cannot simply write an exact @class match; that only works when the attribute value is exactly that one class. Instead we use the contains() function. (I did a whole live session on XPath; if you have doubts about XPath, go and watch that video.) The selector now gives us eight results, so it is a good XPath. Inside it, add /a, which holds the links to all the products.

Now, inside parse, instead of typing out the entire code, maybe I can paste in the parts I believe you are already comfortable with and focus only on the images part. How does that sound? Let me know in the chat. I have already written the code for this project, so I will copy in the pieces that are not related to the images themselves; it saves time and keeps the focus on the useful part. What I am doing here is using the same XPath I just showed you, with @href added at the end because we want the value of the href attribute.
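At this point the spider looks roughly like this. It is a sketch: the start URL is a placeholder and the "product-card" class is an assumption from the DevTools session, so adjust both to the real page.

```python
import scrapy


class ProductsSpider(scrapy.Spider):
    """Rough shape of products.py after editing the generated skeleton:
    start URL pasted in (placeholder here) and allowed_domains removed."""

    name = "products"
    start_urls = ["https://example-store.myshopify.com/collections/all"]

    def parse(self, response):
        # contains() rather than an exact @class match, because the card
        # element carries more classes than just "product-card".
        links = response.xpath(
            '//*[contains(@class, "product-card")]/a/@href'
        ).getall()
        self.logger.info("found %d product links", len(links))
```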
Now we have two options for following those links. If we use response.follow, we do not have to convert them into absolute URLs ourselves, so I have skipped the response.urljoin step and pointed the callback at self.parse_product (there is a small sketch of both options at the end of this segment). I have not written the next-page logic yet; we will do that later. Always remember: whenever you work on a spider, start with a smaller set. We will write our code against just the first eight products, and only at the end, once we know everything works, will we add the pagination.

This parse_product function needs some explanation, but first, gcu-1 asked how I examine a site, so let me give you a few pointers. Always look at robots.txt: any site, slash robots.txt. I am opening facebook.com/robots.txt, and you will see something interesting. Whenever you scrape and want to make sure you are not breaking the terms and conditions laid out by the site, check this file. In Facebook's case it says that collection of data through automated means is prohibited unless you have written express permission, so it is very clearly telling you that you are not allowed to scrape Facebook. For specific user agents (remember that we send a user agent, and we are supposed to identify ourselves honestly) there are separate rules: Applebot, Baidu, Bing and so on each have their own entries. If I go to eBay, it is the same story; the use of robots is basically not allowed. Now look at Amazon: there is no such blanket notice, so scraping is broadly tolerated, but there are specific sections, for example "Disallow: /gp/wishlist", so you are not allowed to scrape wishlist items. robots.txt is really aimed at automated crawlers; the search engines respect it, and Amazon has put its disallow rules on specific paths rather than at the root.

That is your first barrier, but you should also read the terms and conditions. Every site has them, and you should check whether they are okay with scraping. In practice, as long as a site is not explicitly denying it, you should be good to go, and even beyond that there are plenty of other considerations, which is why scraping is usually considered a grey area.

Let's come back; I lost track a little. We are on the product page now, and from this page we have to extract some information and return it.
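Back to the code for a moment. Here is the sketch of the two equivalent ways of queuing the product pages mentioned above, written as a drop-in for the parse() method from the previous sketch; parse_product comes in the next segment.

```python
    def parse(self, response):
        for href in response.xpath(
            '//*[contains(@class, "product-card")]/a/@href'
        ).getall():
            # Option 1: response.follow() accepts the relative href as-is.
            yield response.follow(href, callback=self.parse_product)
            # Option 2 (equivalent, kept commented out): build the absolute
            # URL yourself before creating the Request.
            # yield scrapy.Request(response.urljoin(href),
            #                      callback=self.parse_product)
```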
By the way, if I open the settings.py file that was generated, you will see that ROBOTSTXT_OBEY is set to True by default. This applies whenever you work inside a Scrapy project: the first request always goes to robots.txt, and if a path is disallowed there, your parser will not go to it. You can of course set it to False, and you can also see the user agent with which you identify yourself by default; any of these settings can be overridden.

Now, back to parse_product. Typically you would return the product title, the price, maybe some other fields, plus the images, and you would export all of that to CSV or save it to a database; you will be doing something with the data. But to download the images, we first have to get the URL of the image and then export it, so let me show you and it will become clearer.

I am clicking Inspect on the product image. Notice that inside a noscript tag we have the image, and in fact I see an id here. Whenever you see an id, use it; an id is usually the safest hook whether you are writing a CSS selector or an XPath. So we can just copy the XPath; it is very simple and there is not much to worry about, except one thing: notice that the src starts with a double slash. A protocol-relative URL like that works in the browser over both http and https, but it is not a valid absolute URL on its own, and this is where you have to use response.urljoin.

So image_url is going to be response.xpath(...) with /@src at the end, because the URL is in the src attribute. And notice something: even though I am extracting only one image, I am using getall(). This is really important: it must be a list of URLs. Even if you are scraping only one image URL from a page, it has to be a list. And remember we still need to call response.urljoin on each entry; if you collect more than one image per page, you will typically do that in a loop or a comprehension.

I am taking a shortcut here and pasting in some code, because I want to focus on the important concepts. It is the same XPath I just showed you, and we call getall() on it.
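A sketch of what that pasted parse_product looks like. The title and price XPaths are approximations rather than the exact selectors (the GitHub repository linked at the end has the real ones), so treat them as assumptions.

```python
    def parse_product(self, response):
        # The image sits inside a <noscript> fallback; its src comes back
        # protocol-relative ("//cdn..."), so urljoin() makes it absolute.
        srcs = response.xpath("//noscript//img/@src").getall()
        yield {
            "title": response.xpath("normalize-space(//h1)").get(),
            "price": response.xpath(
                'normalize-space(//span[contains(@class, "price")])'
            ).get(),
            # Must be named exactly "image_urls" and must be a list of
            # absolute URLs, even when there is only one image.
            "image_urls": [response.urljoin(src) for src in srcs],
            "link": response.url,  # kept only for debugging
        }
```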
To answer that question from the chat: if a site is using Cloudflare to stop you from scraping it, then you should not be scraping it. I know that is not the answer you were expecting, but it is the truth; if a site is explicitly saying "please don't scrape", then I don't. That is the real-world answer. The educational answer is that, last time I checked, there was no reliable way to bypass Cloudflare anyway, so personally I just avoid those sites.

By the way, here is another tip if you are applying to freelancing jobs on Upwork. I always ask for the site that needs to be scraped and do my analysis on it before submitting my bid, or if I have already submitted, I state very clearly that my quote is a placeholder and that I will confirm only after I have analysed the site. If the site is already mentioned in the job posting, I analyse it first and only then apply. If it is protected by Cloudflare, I simply do not apply, or I do not accept the job. I do work outside Upwork as well, but if that kind of protection is in place I don't take the job; there are plenty of jobs available. It takes a little time at the start, but you will soon realise that the scraping itself is just one part of a project, and there are a lot of other places where you have to connect the dots.

Recently I was talking to a client, and I have probably mentioned this in other videos, but I will say it again: here is how you get clients. Very simple: go and create a profile on Upwork, and when you do, put in as many details as you can. Do not just create a placeholder resume. Even a friend of mine with about 15 years of experience had his profile rejected as "not a good fit for our platform", simply because he created a dummy profile without much detail. Put in as much detail as possible, get the profile approved, and then start searching for web scraping jobs. Look at the postings; not everyone mentions the website directly in the posting, but try to find the jobs that do, take that site, and try to solve it. That is your practice, and it is a great source of real-life scenarios. Even I do it sometimes: I search Upwork for web scraping, find the jobs that contain the actual URL, try to solve them, and if a job is interesting, pays well and passes all my criteria, I apply.

And when you apply, please send a 100% personalised application. Do not race to be the first to apply; most of my successful Upwork applications were the ones where I applied with samples. Write a scraper, it does not take much time, run it on at least the first page or two, put the results in a spreadsheet, and send that along with a note that it is a sample based on the basic code, and
then make it very, very personalised. Go and check the posting: Upwork shows the client's location, so do a little research even if you do not know the place. With Americans specifically, the weather is a decent opener; you can always start the conversation with "I noticed you are in this city, how is the weather", and then talk about their job. Do not talk about yourself. Do not say "I have 15 years of experience", or five years, or five months. Even when I post on Upwork as a client, I do not care about a person's qualifications or years of experience; I want a solution to my problem. So talk only about the problem the client has posted, show that you have some idea of how to solve it, and attach a sample of the data, even a partial one. Also look at the reviews: Upwork does not show the client's name directly, but freelancers often mention the person's name when they leave a review, so you can address the client by name. These small things increase your chances of success. I think that was enough gyan, as we Indians, specifically Hindi-speaking Indians, call it: guruji has dispensed a lot of gyan about Upwork. And no, I do not have a Telegram group; it is impossible for me to provide one-to-one sessions or answers for now. I have a lot of things in the pipeline, and for my paid courses there will probably be some kind of group, but it is not possible to reply to everyone. Anyway, let's come back to the problem at hand; I was just taking a moment to read all the comments and make sure I am not missing anything.

So, what we are yielding here is a simple dictionary; that is one way of returning the data. We return the title, and I have wrapped it in normalize-space() so that newline characters, tab characters and extra spaces are removed. The price also contains spaces, so it gets the same treatment. I have already posted this code on GitHub, so do not worry about the exact XPaths. What really matters is this field: it has to be image_urls. You cannot add or remove an s, you cannot rename it; it will simply not work otherwise. And it has to be a list of full, absolute URLs, not relative ones. That is the most important part.

If you prefer to work with items, open items.py. Right now there is nothing in it, but you can use the items file as well. In that case you will have to create two fields, image_urls and images; see the sketch below.
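A minimal items.py sketch for that case. The two pipeline fields are the required part; the other field names are just examples.

```python
import scrapy


class ProductItem(scrapy.Item):
    """Item definition if you prefer Items over plain dicts. Only
    image_urls and images are required by the images pipeline and must
    keep exactly these names; everything else is up to you."""

    title = scrapy.Field()
    price = scrapy.Field()
    link = scrapy.Field()
    image_urls = scrapy.Field()  # you fill this with a list of absolute URLs
    images = scrapy.Field()      # the pipeline fills this in after downloading
```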
This is the part you need to remember: when you work with items, both of those are declared as scrapy.Field(), and the two fields image_urls and images have to be there exactly as they are, with no changes. Everything else you can name whatever you want.

So that is our products.py, and remember we do not have pagination in place yet, so for now we are scraping only the one listing page. Let's run the spider and look at the output. scrapy list gives you the list of all the spiders in the project, and inside a project you use scrapy crawl, not runspider, so: scrapy crawl products. There is of course a lot of output, so let's set the log level to WARNING (you can write WARNING or just WARN) and send the results to products.csv, then go and look directly at that file. Ctrl+Shift+E opens the explorer panel and Ctrl+B hides it again. In the CSV we have title, price, image_urls and link. The image_urls column holds this one huge URL; if there were more than one they would all be listed there, but right now we only have one. The link field I kept purely for debugging; typically you would not export it. What you need to make sure of is that your output contains the image_urls field exactly as it is. That is the first step, though you probably do not want raw image URLs in this file unless you are doing something very specific; we will clean that up later.

DWD, yes: ideally speaking, if robots.txt specifically denies it, the site is explicitly telling you not to come and scrape it; that part is very clear. As for legal issues, those depend on your local country's laws and can get a little tricky; basically you need to know what you are doing. Maybe robots.txt blocks scraping, but the person you are working for has an account and has specifically been granted access. Think of it this way, and let me take a practical example since I happen to be looking at a monitor. Say I am a monitor distributor and I stock monitors from BenQ, LG, Samsung, all these brands, and there is a website that maintains my inventory. Now think about this from the client's perspective (let me show my face for this one). The scenario I am describing is a real-world one; I am only changing a few details. This client is a supplier of monitors, and every time he has to check inventory he has to log into four different sites: the BenQ site, the LG site, the Samsung site, and so on. He checks availability for all these monitors, downloads the files to his local machine, and then maintains his own system of Excel sheets
or his own online system, whatever it is. Now, technically speaking, LG and the other sites that hold this distributor information do not allow web scraping; formally you are not allowed to scrape them. But this person is using his own user ID and password to log into the site. He has the right to access it, whether he does so with a Python script or with a browser. So from the grey area you actually move into the white area, where it is fine: the site in general does not allow scraping, but for that particular client it is okay to scrape it. These situations are always shades of grey, and that is why web scraping as a whole is called a grey area. What I typically recommend is to start with web scraping, master it, and then move on to things where you actually incorporate the data, whether that is the data science route, the web development route, or the desktop development route. Web scraping is the first and easiest step, and of course there are some jobs where scraping is the only thing you will be doing.

As for the question about Splash versus Selenium: it depends on the machine. Some environments do not allow Docker, and on certain old machines Docker is a nightmare, whereas Selenium will generally work fine. On the other hand, if a project is already built on microservices and already uses Docker for lots of things, then using Splash is not a problem. And typically only one page is dynamic, the home or listing page where you gather all the links, while the rest of the pages are not, so you only need rendering for that one page, and in that case Splash is fine. It also depends on your proxy provider and what it supports, if you are using proxies with Scrapy. So there is no definitive answer to any of this.

You are welcome, Carlos, and I hope my videos are helpful to a lot of people. I actually started making these videos before the pandemic, but I was not very regular; I released some videos but not routinely. This pandemic has taught us that there is a shortage of jobs, and at the same time a lot of new jobs have appeared, and a lot of people are coming into coding who are not from a development background. I have been writing code since about 1995 and worked as a software developer for many, many years, so picking up Python was easy for me; the only habit I had to unlearn was putting semicolons everywhere. Not everyone is that fortunate, though. And on the question about concurrent requests, I start with the default; let me complete what I was saying and then I will come back to that.
What I typically recommend is this: web scraping is very easy for beginners. You can start with Python easily and with scraping easily, compared to many other technologies. If you want to go into web development or machine learning you have to learn a lot of things first; with scraping you do not, and with time and practice you keep learning more. On concurrency: I start with the default of 16 concurrent requests and adjust depending on the website. Amazon has huge servers and can handle a lot of requests, but if it is a small website I will not increase the concurrency, because I would simply be bombarding the site. The one rule of web scraping you must follow is: do not harm the website. Take care of the site, and keep your own machine's capabilities in mind as well. Let's give the discussion a break here, otherwise we will not finish the topic. Oh, and about Gmail: do not try to log into Gmail and scrape it, that is not going to work out. Think outside the box and use the IMAP or SMTP interfaces instead.

All right, we have collected the image links in image_urls; the next step is the images pipeline. Of course nobody remembers the exact incantation, so just search for "scrapy images pipeline" and look at the documentation. Even I look at the docs, very often; not for everything, but for a lot of things. One warning first: apart from Scrapy you have to install Pillow as well, because that is what Scrapy uses when it downloads images, so pip3 install pillow. If it is missing you only get a warning mentioning Pillow (or PIL); in previous versions this used to be a hard error, but now it just warns, and in a long log you will probably miss it and wonder why nothing is working. Welcome, TheFallenGaming. So install Pillow; that is the word of warning.

Next, you need to enable the images pipeline: copy the ITEM_PIPELINES snippet from the docs and put it in your settings. In VS Code you can press Ctrl+Shift+E for the explorer and open settings.py, or press Ctrl+P, type "settings" and jump straight to the file. The structure of ITEM_PIPELINES is also shown in settings.py as a commented-out block, and the ready-made ImagesPipeline ships with Scrapy. The second setting you need, also listed in the documentation, is the storage location: IMAGES_STORE, not FILES_STORE. Downloading files and downloading images work almost exactly the same way; the only differences are FilesPipeline instead of ImagesPipeline and FILES_STORE instead of IMAGES_STORE.
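In settings.py those two additions look roughly like this. The folder name is a placeholder, and remember that Pillow has to be installed or the pipeline silently disables itself with just a warning.

```python
# settings.py -- the two settings the stock images pipeline needs.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
IMAGES_STORE = "downloaded_images"  # created automatically if it does not exist

# For generic file downloads the pattern is the same, only with
# "scrapy.pipelines.files.FilesPipeline" and FILES_STORE instead.
```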
So if you know how to download images, you also know how to download files: zip files, exe files, anything, through the files pipeline. About the storage directory: if the path you give does not exist, it will be created for you. Those two settings are all the pipeline requires; as long as your spider yields image_urls, that is it. So let's run it. We do not need to write to products.csv this time, so I am dropping that.

And now I get "WARNING: disabled ImagesPipeline: ImagesPipeline requires installing Pillow". We did install Pillow, right? Let me open a new tab and check at the user level: pip uninstall pillow says it is not installed there, and pip list inside the venv shows Pillow 8.4 already installed. I do not want to spend much time on this error, so I will deactivate the virtual environment, run the same command again, still nothing, and just create a fresh virtual environment here (this is my shortcut for that, by the way; you can create shortcuts like this). pip install pillow, and I will install wheel as well. I am on an M1 Mac and Anaconda is not stable on M1, which is why I am not using Anaconda. This will take some time, so meanwhile let's try running the spider directly from VS Code.

There are multiple ways to run it from VS Code; let me show you one. Press F5, or go to Run and then Start Debugging. It asks which kind of debug configuration you want: choose "Python module", and the module we want to run is scrapy. I am creating a launch.json for this: the module is scrapy, and I want to pass two arguments. The first argument is crawl, and the second is the name of the spider, for which VS Code has a handy variable, fileBasenameNoExtension. So the configuration effectively runs "scrapy crawl" followed by the name of the currently open file. With launch.json created, I can start debugging, and I immediately get an error on crawl, because we are not inside the project directory: the current working directory has to be the workspace folder plus dl_images, so I set the cwd accordingly. Also note that the spider file has to be the active editor tab for that variable to resolve to the spider's name. Now it runs. There is a stream of INFO lines; I do not see the pipeline activated in the log, which it should have been, but the images directory has been created, so the earlier failure was just an environment issue. Let me bring the folder up;
it is full screen, so let me exit full screen. These are the images that have been downloaded. They are all there, but the problem is the file names: each one is generated from a hash of the URL. Let me show you the logic behind how those names are created and how we can turn them into something useful. First I am deleting the directory and its contents, and Ctrl+B to hide everything; Command+B, actually. My keyboard is a total mess right now because I also work on Windows, so I have swapped Control and Command.

Harsh, will it download only one image? It depends. Whatever you send in image_urls is a list, so it depends on how you extract the images: if you put more than one URL in that list, it will download all of them, and that is the main reason the field is structured as a list. It will become clearer as we personalise the pipeline. We are yielding a dictionary directly here; if you work with items.py and yield items, you have to add that second field, images. Any time this gets confusing, go through the documentation. Do not waste your time on Stack Overflow or anywhere else; that one page has all the information you need, including how to push the images to Amazon S3 if you work in that kind of environment (I have already made a video on that, and the instructions are on the same page). Interesting idea, DWD; a flowchart like that would be good to have. Every site is unique, definitely, so just focus on what you are learning; the more you learn, the more sites you will be able to handle.

Now let's look at the code. You can follow the same steps: copy the import from the documentation and go to the pipelines file (Ctrl+P, type "pipelines", and there it is). I am deleting everything in it and creating my own custom pipeline that derives from ImagesPipeline, importing it from scrapy.pipelines.images. With that imported, you can hold Ctrl (or Command) and click on ImagesPipeline to jump into its source code and see how the file path is created.
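For reference, this is roughly what the stock method does; it is paraphrased from memory of the Scrapy 2.5 source walked through in the video, so treat the exact details as approximate.

```python
import hashlib

from scrapy.utils.python import to_bytes


def file_path(self, request, response=None, info=None, *, item=None):
    # Stock behaviour, approximately: hash the image URL with SHA-1 so every
    # distinct URL maps to a distinct, stable (but unreadable) file name.
    image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
    return f"full/{image_guid}.jpg"
```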
Scroll down to the file_path method. This is the big change: I am on Scrapy 2.5 now, and in the last video I made on this topic I was probably on 2.1 or so. In recent versions file_path also receives the item, and that is going to be very useful. So we take this file_path method, come back to the pipelines file, and write a new pipeline. We can call the class anything we like; we just have to make sure it does not collide with anything and that it derives from ImagesPipeline. Then we paste the method in and format it properly. hashlib would also have to be imported if we kept that code, but we are not going to use that line, so I am not importing it.

Look at what the stock code does with request.url: it takes the URL of that particular image, converts it to bytes, and uses the hashlib library to produce a SHA-1 hex digest. If nothing else in that code makes sense, the one thing to understand is this: it creates a SHA-1 hash of request.url, and that hash is guaranteed to be different if even one character differs or changes position. A URL is unique; the U officially stands for "uniform", but practically speaking you can treat it as unique. So the SHA-1 of the URL gives a unique, stable image file name, and that is where those unreadable names come from.

So we delete that part. Notice that inside file_path we have access to the request, so anything that was passed or sent in the request we can read here; we have access to the response; and most importantly we have the item. That is the new addition, and it makes things much easier. What does item contain? Whatever you yield from parse_product. So if I write item["title"], there it is: the exact product title, and that is what we will use to build the file name. There is one thing to keep in mind, though: the product title can contain characters that are not valid in file names, and we want to convert it into something safe. You could write a regular expression that keeps only letters, numbers and underscores, depending on what you like, but this is Python and the open-source world, so let's simply use a very useful library called python-slugify. In case you do not know what a slug is: a slug is what you typically see in URLs. For example, if I go to my own site and open the blog section, when I
write a blog post in WordPress, I just type a title like "Read CSV Excel Scrapy", and it creates the URL read-csv-excel. That part is the slug, and python-slugify is the library that helps create it. So just pip install python-slugify; I am not sure whether I had already installed it, and no, it was not installed, so now we have it. And always make it a habit to look at the documentation. There is that phrase RTFM; I am sure you all know what it means, basically "read the manual", and the manual in this case is the official documentation. Why do I like slugify? The last check-in was three months ago, so it is fairly well maintained, and the build is passing on Travis (if you are into automated builds you know what Travis is), so it is an up-to-date library and good to have. It supports a lot of options, but we do not really need all of that. All we have to do is from slugify import slugify and then call slugify() on the title; that is the easiest way of ensuring the file names are valid. There is one more useful parameter, max_length, in lowercase, so we can set it to something like 200, and then even if the product name is very long, the file name will still be valid. That is going to be my file name.

Right now we are working with a single image per product. If you handle multiple images per page, remember that the item you receive in file_path is still the one item you yielded, and you handle one image at a time, so in that case append some random characters, or keep the first few characters of the URL hash, to make sure every file name stays unique. And remember the Pillow library: Scrapy uses Pillow to convert the images, so even if the source is in some other format it is saved as .jpg. The returned path starts with "full" because Scrapy can also resize images and generate thumbnails; that is why, at the bottom of that source file, next to file_path there is also a thumb_path, which you can override in the same way (you only need to change one small thing, but I am not going into that; have a look at thumb_path yourself). I think everything is set now, and this pipeline is ready to be used.
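Pulling the pieces together, a sketch of the custom pipeline described above. The "title" key matches the dictionary yielded earlier, and max_length=200 is the value mentioned in the stream.

```python
# pipelines.py
from scrapy.pipelines.images import ImagesPipeline
from slugify import slugify  # provided by the python-slugify package


class CustomImagesPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # item is whatever parse_product yielded, so the product title is
        # right here; slugify turns it into a safe file name.
        name = slugify(item["title"], max_length=200)
        # Assumes one image per product; for several, append a few
        # characters of the URL hash (or a counter) to keep names unique.
        return f"full/{name}.jpg"
```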
Now, in ITEM_PIPELINES, instead of the stock ImagesPipeline we point to our custom pipeline, and we have to use its fully qualified name. The editor can help you with the path, but just make sure you know where the pipeline lives: it is in pipelines.py inside the dl_images package, not inside spiders, so do not get confused. The entry becomes dl_images.pipelines.CustomImagesPipeline, and the priority stays 1 for image downloads. (If you have other questions, leave them in the comments and I will take them all before I wind up.)

There is one last thing I want to show before running. If you export the items to a CSV file, a database, or whatever your flow is, the image URLs will be exported along with everything else. If you do not want that, go to the settings file and add one more setting. Let me recall the exact name before I make a goof-up: it is FEED_EXPORT_FIELDS. I always mix up "feed" and "feeds"; if you are not sure, google it, and make sure the first result you read is the official documentation. This setting controls what gets included in your feed, whether that is CSV, JSON, or some custom storage. In the CSV we want just title and price, so we set it to a list with exactly those two names, matching the case of the fields we yield. Even though we are yielding four fields, the feed will contain only these two, and in this order; if you want price first and then title, write them in that order.

Everything is ready, so let's run it without debugging. There is a lot of output, but the only thing that matters is the item scraped count, and it is 8. Now look at the folder. I do not know how to zoom the file listing, but you can probably still see that the image file names now make sense: all the illegal characters have been stripped, and we have all the images.
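As a recap, the relevant settings at this point look roughly like this; the storage folder name is a placeholder.

```python
# settings.py -- point ITEM_PIPELINES at the custom class by its fully
# qualified name, and trim the feed so the image URLs stay out of the CSV.
ITEM_PIPELINES = {
    "dl_images.pipelines.CustomImagesPipeline": 1,
}
IMAGES_STORE = "downloaded_images"
FEED_EXPORT_FIELDS = ["title", "price"]
```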
The only thing remaining is the pagination logic, and it should be very simple, because we have the next button: we look for it, and if it is found we go to that page. Back on the site, go to the catalog page and there is the button. This part I am going to copy-paste; give me a moment to find the file, since I have already written it. The pagination is handled in the parse method. The next link is the one I talked about earlier in the video: look for the span that contains "Next page", then go one level up, and that going up one level cannot be done with CSS, which is why you must use XPath here. The ".." step is just like cd .. on the command line: go to the parent and take its href. Then we use response.follow, which is why we do not have to convert it into an absolute URL; otherwise we could have yielded a scrapy.Request built from response.urljoin and the next link. And because we just want to call the parse method itself, there is nothing else to do: parse is the default callback, so I have not provided one. We could have, for readability, but I left it out just to remind you that parse is the default.

That concludes today's session. I have already checked in the code; I am pasting the GitHub link in the chat and I will put it in the video description as well, so you can go and look at the complete code there. I have been thinking about starting a group, and Facebook Groups is probably the easiest option since most people have Facebook. I personally do not use Facebook much, but I do have it, so in the next chat I will drop a message with the group name; let's create a group and help each other out. Discord: yes, we could do Discord as well, and I have been thinking about that too, but maintaining all these groups very quickly gets out of hand for a single person, so either I put some resources behind it or we will work something out.

If there are no other questions, it is probably a good time to end the session. I want to keep Mondays for the live sessions; yesterday there was something at my personal end, so I could not go live, which is why we did it today, and I am going to post a brief, edited, concise version of this same tutorial as a regular video. If you have other ideas for further content, keep them coming. There was one interesting request about Selenium plus Scrapy, so I will probably take that up; bots I am not too keen on right now, so let's keep focusing on Scrapy, because I like Scrapy too much. That is it for today. See you next Monday, same time, and I will try to post more content for you.
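For completeness, here is a sketch of the pagination just described, merged into the parse() method from earlier; the "Next page" span text is an assumption about the site's markup.

```python
    def parse(self, response):
        # Product links, as before.
        for href in response.xpath(
            '//*[contains(@class, "product-card")]/a/@href'
        ).getall():
            yield response.follow(href, callback=self.parse_product)

        # Pagination: find the span, step up to its parent <a> with "..",
        # and follow it. No callback given, because parse() is the default.
        next_href = response.xpath(
            '//span[contains(text(), "Next page")]/../@href'
        ).get()
        if next_href:
            yield response.follow(next_href)
```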
Info
Channel: codeRECODE with Upendra
Views: 500
Rating: 5 out of 5
Keywords: download images with scrapy, python web scraping tutorial, python scrapy tutorial, scrapy for beginners, Python Web Scraping, web scraping python, how to scrape data, scrapy javascript, python scraping, scrapy tutorial, screen scraping, data scraping, Python Scrapy, Scrapy Spider, CSS Selector, webscraping, scrapy, web scraping, web scraping with python, scrapy python, web scrapping, web scraping using python, scrapy download images tutorial
Id: 2BsvriLQuOs
Length: 81min 15sec (4875 seconds)
Published: Tue Aug 17 2021