Image crawler in Python - web scraping

Video Statistics and Information

Captions
Hey there everyone, Hitesh here, back again with another video, and in this video we are going to have some fun with Python, web crawlers and images. We want to design a simple crawler that can go onto a web page, crawl some images, and download those images into a given folder. In a couple of my previous videos I have mentioned how we can crawl some web links, so consider this video an upgrade — the next step of what we can do with those links and how we can download some of the images. We are going to discuss everything line by line; everything is going to be followable and you will be able to understand every single line of code. We need to understand a couple of slightly advanced Python concepts, and then we are going to design this crawler — it's going to be super easy to do, a really easy process.

In order to do so, first and foremost we need some images to work with. This is my link, pexels.com/@hiteshchoudhary — my own page where I distribute a whole lot of images, because I love to take photos, and yes, I have 22 million views so far on just this one website, but I have others as well. That reminds me of the sponsor of this video, ProxyCrawl. ProxyCrawl is a website through which you can crawl, and they also provide an API for crawling websites. They have an amazing client list including Shopify, Nike, Oracle and Samsung, and we are going to talk later about why people actually need these kinds of services — that's coming up later on. Thanks to them for sponsoring this video so that I am able to create such videos for my audience.

Okay, so go on to this website — specifically we're going to work with this one link, pexels.com/@hiteshchoudhary. By the way, if you want to use any of these photos, they are totally free, without any credits or anything; feel free to use any of them. This is the link we are going to crawl, so make sure you keep an eye on it. Apart from that, you can see it loads a whole bunch of photos, and I would like to get the link of each of these photos. So I'm going to right click, click on Inspect, and I can see that whenever there is an image link here, it always starts with images.pexels.com/photos/ and then comes a number. This is a pattern you will notice when you click on every single photo — this is the pattern the website follows to load all these images. On this entire web page there are so many links, and every image link starts with https://images.pexels.com/photos/ followed by some other things, so this is the expression I need to care about to get all these links.

So goal number one is to get all the image links shown on the web page, and step number two is to download all those images into a certain folder. For downloading we have a variety of options — there are third-party libraries as well — but we can actually do it with a plain file-handling approach too. So let's go ahead and have some fun here. I'm going to move this onto my other screen, which we don't need right now. One thing we definitely do need is a little bit of discussion first, so I'm going to open up my terminal and clean it up — I was practicing a few things before making this video so that I can teach you in a proper manner.
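For reference, here is a tiny check of the image-URL pattern identified above. The prefix is taken from the transcript; the sample URL below is a made-up placeholder, not a real photo link:

```python
import re

# Every image link on the page appears to start with this prefix (per the video);
# the rest of the URL (photo id, filename, query string) varies per photo.
IMG_PREFIX = re.compile(r"^https://images\.pexels\.com/photos/")

sample = "https://images.pexels.com/photos/1234567/pexels-photo-1234567.jpeg?w=500"  # hypothetical URL
print(bool(IMG_PREFIX.match(sample)))  # True
```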
So one thing we need to understand here is enumeration. enumerate is something in Python which is less discussed in the usual courses because it's more of an intermediate-to-advanced topic, but it's simply a way of iterating over something. First and foremost I'm going to quit this so it clears all of my existing session, and I'm going to fire up my Python 3 terminal — you don't need to do this, you just need to watch, because we'll be doing it one more time. Let's say we have a simple my_list, just a list containing a few strings — things that I do daily in my life, which definitely includes code; the gym is not a daily thing, but I still try to hit it at least three to four times a week; definitely eat; and probably one more, which is making videos — I make a lot of videos. So this is my basic list, and to loop through it we have a variety of options. I can simply say for l in my_list and then print(l) — this is the basic way we iterate over any list, no big deal, we have seen it so many times. But with enumerate we get more power when iterating over anything. For example, let me do this again: instead of just for l in my_list, we are going to use enumerate, a built-in we can introduce here, and then you provide your iterable, my_list. When I do for l in enumerate(my_list) and print, you will notice something different — not drastically different, but different: we are also getting the indexes 0, 1, 2 and 3. That is what enumerate does. Of course, in place of the single l I can unpack the index and print it separately as well, but this is the thing you should know, because it's what we will be using in a minute. Okay, that's all cleared up — we are ready to go.

Now, in order to do this crawling we are going to need a couple of libraries. The first and foremost, my favourite one, is bs4, which is Beautiful Soup 4 — we're going to import from there. Once we have that, we also need one more module, which is requests; I'm going to import requests and call it simply rq. BS4 is a library which helps you take an entire web page as text and parse it as HTML or any other format you want; requests helps you send the request to the web page. Okay, that's clear, and we are also going to need one more module, which is os. os is the operating system module, which helps you create a directory — I don't want to dump my images all over the place, I want to store them in a particular folder, and I need to write files to my disk; that's why I am using the os module.

So now it's time to send out a request. I'm going to call the variable simply r2, for request-to-web — you can call it r1, r2, Superman, whatever you like, it's just a variable. This rq that we have imported has a lot of options you can see here — get, post, session, request, a whole lot of them. We don't need most of them; we just need a GET request, and it requires you to pass a string.
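A quick sketch of the enumerate demo described above. The exact strings in the list are my guess at the items mentioned in passing:

```python
my_list = ["code", "gym", "eat", "make videos"]  # assumed example items

# The basic way to iterate over a list
for l in my_list:
    print(l)

# enumerate also hands you the index alongside each item
for l in enumerate(my_list):
    print(l)  # prints tuples like (0, 'code'), (1, 'gym'), ...

# Or unpack the index and the item separately
for index, item in enumerate(my_list):
    print(index, item)
```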
So what we're going to do is bring up my screen, copy that link, and paste it here — this is the link I want to get stuff from. The requests library, this rq, is going to send a request to the web page and we'll get it back; pretty easy stuff. Now we are going to use a soup variable — again, you can call it anything, but we usually like to call it soup; since I'm calling the request r2, I'm calling this one soup2, but you can call it soup as well, no big deal. Now we're going to use Beautiful Soup to take this entire web page — we usually get the whole page back as text — and parse it as HTML, because only once it is parsed as HTML can we extract all these image links. So we say: Beautiful Soup, here is r2, the request we just made, which is going to be text, and we want to parse it into HTML. This comes straight from the Beautiful Soup documentation: I can use the "html.parser", which is going to parse everything into an HTML structure. Okay, that's done.

Now I'm going to create a simple empty list called links. Into this links list I'm going to collect all the image links in a moment. How can we do that? We can simply create a variable — I'm going to call it x; really, that's not a good name for a variable, but this is just an example. Now take your soup, whatever you have called it — I'm calling it soup2 — and we are going to select. What do we really want to select from this entire web page? I want to select all the image links, and all of those image links have some kind of source, so I'm going to say src and match it against the pattern I just saw. The link I showed you a few moments ago started with https://images.pexels.com/photos — yes, of course each image link was much longer, but I want to select every link on the entire web page that looks like this one. Okay, now that my selector is ready, I can use it to populate my links list. I'm going to say for img — you can call it anything, it's just a variable name, just like we saw with for l in my_list — so we simply say for img in x; x will be populated. Then we take the links list, use its append method, and append every image's src that we have got. This is a plain, simple pattern; I have discussed how we get these src attributes and so on in a couple of my previous videos, so you can check them out as well — this is basic stuff. Now, the moment of truth: I want to print all the links I was able to append into links, so let's go ahead and do a classic print.
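Here is a minimal sketch of step one as described so far. The variable names (rq, r2, soup2, x, links) follow the transcript; the profile URL is approximate, and the CSS attribute selector is my reconstruction of the "match links that start with the photos prefix" idea — the video may have used a different selector or a regex with find_all:

```python
from bs4 import BeautifulSoup
import requests as rq

# Step 1: fetch the page and collect every image URL that matches the pattern.
r2 = rq.get("https://www.pexels.com/@hiteshchoudhary")   # approximate page URL
soup2 = BeautifulSoup(r2.text, "html.parser")            # parse the raw text as HTML

links = []

# Keep only <img> tags whose src starts with the Pexels photo prefix
x = soup2.select('img[src^="https://images.pexels.com/photos/"]')

for img in x:
    links.append(img["src"])

# The "classic print" to verify step one
for l in links:
    print(l)
```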
To do so, we simply say for l in links and print(l). Okay, if things have gone well so far we will be able to print out some links; otherwise we'll debug it — again, debugging is a part of programming. So we grab this and run it with python3 — if you are on Windows you just say python, because I hope you have only Python 3 installed — and we run this file, which is crawler.py. It takes a second, but we are able to grab all these image links; a whole lot of them come up, so there we go, pretty good and pretty nice. One thing you will notice is that they have a width of 500, so these are the basic thumbnails; to grab higher-resolution images we would have to make this code more complex than what we have here, but this is decent. What I'm going to do now is comment out this print, because I don't want to see the links again — things are working fine. So step number one is done: I was able to grab all the links from the web page and store them.

What I want to do next is take those links and save each photo into a folder — step number two, so let's move on. First and foremost, we are going to use the os module, which has a function, mkdir, to make a directory. I'm going to create a directory called hitesh_photos — you can call it anything; it just creates a folder for you, it's that easy in Python. I'm also going to declare a variable i that starts at 1; you'll need this variable in a minute, because I'll use the index to name the saved photos — you'll understand that in a second. First we set up a for loop: for index and image_link in enumerate — remember, I told you about enumerate just a moment ago — and we enumerate over this links list. Once we are enumerating over it, we put in a condition, because I don't want to download all the images — if you want to download all of them that's fine too, but I want to grab fewer, so I'm going to say if i is less than or equal to 10, because I only want ten images. Then I say that image_data, the variable I've just created, comes from req.get, and we get from this image_link — remember, our list is storing all the image links. Now, on the request you have access to .content, which grabs all the content stored at that link, and then you have to say explicitly what you want to do with it. What I want to do is with open: I'm going to open the folder I have already created, hitesh_photos — it will look for that folder, open it, and I'll put a slash so that we go inside it — and then I simply concatenate a string onto it. This string is going to be responsible for naming the saved images; surely you could use other modules to generate a random name for each image, but since I do have access to the index, I can use it.
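For reference, a minimal sketch of the download loop being assembled here and finished in the next part of the video (the wb+ mode, the .jpg naming and the else/break are explained just below). It continues from the step-one sketch above and reuses its links list; the folder name hitesh_photos follows the transcript:

```python
import os
import requests as rq

# `links` is the list of image URLs built in step one above.
os.mkdir("hitesh_photos")   # folder that will hold the downloaded images

i = 1
for index, image_link in enumerate(links):
    if i <= 10:                                   # only grab the first ten images
        image_data = rq.get(image_link).content   # raw bytes stored at that link
        # name files 1.jpg, 2.jpg, ... (index starts at 0, hence the +1)
        with open("hitesh_photos/" + str(index + 1) + ".jpg", "wb+") as f:
            f.write(image_data)
        i += 1
    else:
        f.close()   # as in the video; the with-block has already closed the file
        break
```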
I'm going to add +1, because the index starts from 0 and I don't want to call my image 0.jpg — I want to call it 1.jpg. Once that's done, I concatenate it further with ".jpg", otherwise there would be no extension; I'm forcing it to have a jpg extension. Of course, to write the file I have to specify the write mode, so I'm going to say "wb+" — write binary — so that it writes the bytes, and I'll say as f to get a simple file handle. Once I'm inside this block, I simply say f.write — this is plain file handling — and it writes image_data. There we go, really simple. Once I'm outside of this, I can shrink it a little so you can see it on one line, or just close it — yeah, this is better. So this is what we have: we are storing the data. Since we are inside this if conditional, if the image was successfully written to my disk then I want to update the i variable, because that's what keeps me inside this loop via the condition, so all I have to do is say i += 1 — there are a lot of ways to update this variable, I'm not going to dwell on that. What happens in the else part? The else part is the easiest one: all I have to do is f.close() to close the file handle and then simply write a break. So simple.

Okay, now let's go ahead and try it. And again, I just made a mistake — it's not req, it's rq, because I imported it as rq. Okay, it looks pretty simple; if I haven't made any typos — I think I made a typo here... nope — this is all good. Again, things can get messed up and we may need to tweak and debug a little; that's all okay. Now let's open the terminal, run this file again, and see if we are able to grab something. And there we go: it creates a new folder called hitesh_photos, and we were able to grab 1, 2 and 3. If I click on them — yes, of course, these are my photos, I have taken them personally, and yep, they are beautiful photos. So there we go: a really simple Python crawler. Surely it could crawl more photos, but restricting it to ten is actually better.

Now this brings us to another exciting thing. I was able to grab these ten photos, but what about when you have something like CAPTCHAs as a hurdle when you want to crawl, or maybe somebody is blocking your crawling based on your IP because you are sending too many automated requests? In these conditions, let's bring up this guy — in these conditions there are websites like ProxyCrawl, and there are hundreds of others as well, but this one happens to be our sponsor, so I'm going to talk about them. They provide code which helps you run these automated tasks through their APIs; they manage things like avoiding CAPTCHAs, they manage whitelisting your IP addresses, and a whole bunch of other things. A lot of clients use them, including Spotify, Nike, Samsung and Oracle — they have a good client list. What's more interesting: when you sign up and enter the coupon code HITESH — which is my name, all in caps — you get additional requests so that you can try out their product. You don't need to buy it right away; you can simply try it out with my coupon code.
Just go ahead — there's a link in the description section; you can click on it and try it out. They give you the API documentation as well as a token, so that you can fire off some requests yourself. One thing I would like to mention: they do provide some sample code as well, but I'm not really super happy with the code and documentation they are giving — I think it can be improved significantly, at least on the Python side; on the Node.js side it's okay, pretty decent, but on the Python side it's not really good and they could have given more examples. They do show more examples of how you can do crawling with user agents, page waits and so on, including crawling websites like Instagram, Facebook and Amazon. And now is the shopping season — a lot of big sales are coming up — so you could write a simple script that compares, say, a laptop across Amazon, Flipkart and other websites as well; pretty easy, we do that all the time in our boot camps too. So there we go: go ahead, check them out, the link is in the description section, and thank you so much for sponsoring this channel, because we want to create more videos and your sponsorship helps us deliver more awesome-quality videos. That's it for this video. In case you liked and enjoyed it, go ahead and hit that subscribe button, and I'll surely catch you in the next video. [Music]
Info
Channel: Hitesh Choudhary
Views: 31,723
Rating: 4.8686132 out of 5
Keywords: Programming, LearnCodeOnline, python, image crawler, web scraping, python scraping, machine learning, AI, python tutorials
Id: 7hpQQ36kKtI
Length: 19min 58sec (1198 seconds)
Published: Fri Sep 27 2019