How to Scrape and Download ALL images from a webpage with Python

Video Statistics and Information

Captions
Hi everyone, and welcome. John here, and in today's video I'm going to show you how you can create your own image downloader using Python. We're going to be using Python requests and Beautiful Soup, and we are going to be finding all the image tags and then saving all of the images that it finds to our computer. So let's get started.

The first thing we want to do is import requests, and from bs4 we're going to import BeautifulSoup. I'm also going to import the os module, because that's going to let us create folders and change directories, which we're going to need to do. Now, I've got those installed; the os one is in the standard Python library. If you need to pip install requests or Beautiful Soup, go ahead and do that.

So this is the website we're going to be getting the images from. Everyone knows this website, it's Airbnb. I've never been to Ljubljana before, but I'm sure it's really nice. What I'm going to do is try and download the images that it lets us from these listings. What I'm not going to do is go into each and every individual listing to get all the images; I'm just going to get the first one that it gives us.

So what we want to do to start with is inspect element, so we can start to see how it looks. If I make that bigger so we can see, if we hover over the first image here, there is an image here: img class blah blah blah and all this. But the most important thing to us is that it's actually inside this img tag. Now, images in HTML will normally be inside these img tags, so we can actually just use find_all with Beautiful Soup to get them all and start collecting the links that we want to then download. So now I can see that is in there, I'm just going to double check the page source, it's always useful to do, and I'm going to just copy some part of the text, let's just copy, and we'll search for "under free parking" just so we can see that it's there. And it looks like it is available, so we know that
we can get to it. So I'm going to copy this URL, it's quite a long one, and I'm just going to put it in here. We're going to say url is equal to this, and just move that up and out of the way.

The first part is to actually reach out to the server with requests and then get that information back. As always, I like to do r is equal to requests.get, and then we give it our url, which we have specified here, see, these two right here. The next thing we want to do is create our soup, so we can do soup is equal to BeautifulSoup, and then we want r dot, we can do text in this case, and we'll do html.parser, so Beautiful Soup is just using the HTML parser in this case. Let's move that up one. Now I'm just going to check that this is working, like I always do: I'm just going to say print soup.title.text and run that, and hopefully, if we get something back that is right, which we do, we know that this is all going to work. Let's clear that off and delete that, we don't want that.

What we do want is to find all of the image tags. They're all like this in the HTML, which means we can simply do images is equal to soup.find_all, because we want it to return a list of every single one that it can find on the page, and we want to do img, like this. What I'm going to do now is just print out images, and hopefully we get back a load of information. There we go, we do. We can see that we actually got a list, and it's got all of this, and we can actually see that the links are here inside it. But that's no good, we've just got the elements there. What we'll do is a for loop, so we'll do for image in images, so each one of those elements that we just saw inside the images list that we created here. I'm going to print image, and then after that I'm going to do src in the square brackets with the quotation marks, because if I come back here we can see the actual link to the image that I hover over on the right hand side is under
this src, the source equals, and we can access the information that's just in this little tag here, which is where the image URL is. So let's do that and run it, and hopefully, scroll down, and we've got a nice long list of image links. If I just click on one... that didn't work. If I go to Chrome and copy and paste it in, we can see that is the image returned. That's not quite the images that I was hoping for from this, but you know, it's there and it works.

So the next thing you want to do is save the image, but first I'm going to try and give it a better name than just the file name. I'm going to go back over to our source code and have a look, and quite often you get these alt tags here, which basically is sort of the name for the image. We can actually access that the same way that we did the src tag, so we can use what is in the alt tag. Almost all websites will have an alt tag for their images, it's quite important for SEO, so they will be there. Let's close that down. So then let's get rid of our print statement here and, let's call this one link, because that was the image link. Above that I'm going to put name, and I'm just going to say image and then the alt tag, like that. So now if I print name and link, we should get that information out as well. Okay, we can see it's all here. The first one is obviously something else at the top of the page: it doesn't have an alt tag and it seems to be just a GIF file. We're just going to ignore that for now, and that will be fine, but the rest of them are all there and working.

To save the images we can do with open, so we're going to be opening a file, writing to it, and then saving it, and we need to give it a file name. This is why we've gone ahead and got the name from the image here, so we can call our file this name. We need to give it an extension, so I'm just going to do plus, and then I'm going to give it a .jpg for an
image extension. It doesn't matter if the original file isn't a .jpg file or if it's .jpeg; go ahead and try and save it as a JPEG first, that's usually your best option. A lot of web images are JPEGs anyway, so that's a good start. Then we want to do wb, because we want to write to it but we want the bytes, the actual raw bits of the information that are in there, so that's why we need wb, and then as f and a colon. Under here we want to actually send out a request to the individual links, so that we can get the information for them from the server, so we're going to want to do another request. I've got r is equal to requests.get up here, so I'm actually just going to do im for image, and then requests.get, and then link. Then we want to do f.write with im, which is our response for the image link, and we want the .content. The content is going to be the bytes content, so we can save that using our write with our bytes file, and that saves it to the disk.

So I'm going to run this now, and we'll see that it's going to go out and download all those images and save them into the current directory that we're working in. We've got no output... so I've got an error here, and that's because I've tried to write a name that is not an acceptable file name. So the best thing to do is to go ahead and add a replace, and I'm going to replace all of the blank spaces with a dash. Hopefully what that will do is fill in all the blanks that are actually causing us issues, and save that with the new file name. So that's looking like it's failed, right, so that didn't work. Let's go ahead and replace the, I think it's probably the forward slashes, and replace them with nothing. Let's try that. Okay, there we go. When I read this error the first time, I didn't take into account that it was the forward slashes that were causing
the problem; I was just looking at the extra dots. After we replaced that it worked fine. If I go ahead and open the folder, we can see we've actually got all these images here. If I just open reveal in explorer, we can see that we've got them all and they're all saved, all the images there, and they've all got their appropriate names as we saved them. The duplicate ones are from where we ran it the first time. I'm showing that bigger for you guys. So there we go, that's worked, that's great.

There are a few things we can do to improve this, although this is the basic frame of what will work. What I'd like to do is turn this into a function that we can then use for different websites, add a little bit of error handling in as well, and also create a new folder, so that we can say, hey, save all of the images from this page into this folder. Okay, so I'm going to just collapse some of this down now and create our function. So def, defining our function, and I'm just going to call this one imagedown, and then inside this function we're going to give it two things: we're going to have url and we're going to have folder. When I say folder, I'm going to create a new folder with the name that we give it. So we need to indent this now.

To create a folder in Python it's really simple: we use the os module that we've imported, and we would just do os.mkdir, make directory. But we need to do a little bit more than that first. We need to get the current working directory first, and then we need to create one inside that, because if we just did this it probably wouldn't be in the right place. We want it to be in this folder, but a new directory. So what I'm going to do is make a directory, but we want to join the current working directory and the folder name that we give it, so I'm going to say os.path
.join, and there we're going to join the two together. When we do os.path.join it will automatically put in the slashes in the correct places for us, and we're going to join the two of os.getcwd, I think it is, get current working directory, and folder. That looks a little bit long and maybe a little bit convoluted, but the main part is that we're creating a directory, and the directory we're creating joins together the current directory we're in and the new folder name we give it. It's just all on one line, but it should be quite straightforward. What I'm going to do is try first, and then I'm just going to do some really basic error handling. You shouldn't really do except pass, but for this case I think it's fine because we know what this is doing. So I'm going to try creating the directory, and if it fails, instead of kicking us out, our program is just going to move on.

So then we can do our r is equal to requests.get, and we can find all the image tags, and then we can get the alt and the source for each one, and then we can write them all to file. But what we haven't done is actually change into our directory, so I'm going to do that. Underneath that I'm going to do os.chdir for change directory, and I'm just going to paste this back in, the join, because this has now created this directory. I'm going to put that right in there, because that's just going to go ahead and change into the directory that we created. Now we've done that, I'm just going to add in a quick print statement down at the bottom, just so we can see it working, not like that, print, and I'm going to say writing and then we'll give it name.

Okay, so what we've done is turn our little basic script into a function that we can reuse. We're going to give it a url and then a folder name, so I'm going to
comment this url out here. Let's find another place. Where else do you want to go? Let's go to Bratislava, why not, and select some random dates that we might be looking at going. Cool, great, so we've got a new link. Let's copy that, and underneath here we're going to do imagedown for our function, and if you remember, we have to give it the url, and this is hidden by me, there we go, and then we're going to give it the folder name, which I'm just going to call bratislava, why not. I'm going to save that, move back over here, and then run that, and we get writing. See, we still get that blank one at the top, but I think that's okay, we kind of understand what that is. We could write some code to handle that if we wanted to, but I don't think we need to. Let's go to our file browser, and we can see we've got a new folder created here with all the images in, and if I reveal in explorer we should have all those images right there.

So that was nice and easy. I'll put this code on my GitHub; you guys can go ahead and take it and maybe change it a little bit, make it work for you. It's pretty simple. The only sort of complicated bits that you may or may not have seen are the os module, where changing directories and creating new folders just keeps it all tidy, and you have to do a little bit of replace on the string of the name if you're using the alt tag. You don't have to use the alt tag; you can call it whatever you like. You could do a loop and say the first image you find is called image one, and then all the way down just keep adding onto it, if you like. I just thought it was a nicer way, to have the actual alt name of the image in there. It just makes it a bit better to know where you're at and know what it is that you've actually got the image for, but you could call it whatever you like.

So that'll do it for this one, guys. Thank you very much for watching. Don't forget to like,
comment, and subscribe, and I will see you in the next one. Thank you, bye.
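
Pulled together, the steps described in the captions might look roughly like the sketch below. It is a reconstruction from the spoken walkthrough, not the code from the speaker's GitHub, and it makes a few labeled assumptions: `safe_name` is a hypothetical helper wrapping the two `replace` calls mentioned in the video; `os.makedirs(..., exist_ok=True)` is swapped in for the `try` / `os.mkdir` / `except: pass` pattern; files are written to a full path instead of calling `os.chdir`; and `urljoin`, the `http` check, and the timeouts are additions so that relative links and the alt-less GIF placeholder at the top of the page do not stop the loop.

```python
import os
from urllib.parse import urljoin


def safe_name(alt, fallback):
    """Hypothetical helper: make alt text usable as a file name.

    Mirrors the two fixes from the video: spaces become dashes and
    forward slashes are dropped. Other characters may still be invalid
    on some systems. An empty result falls back to `fallback`.
    """
    name = alt.strip().replace(" ", "-").replace("/", "")
    return name or fallback


def imagedown(url, folder):
    """Save every <img> found at `url` into `folder` under the current directory."""
    # Third-party imports are kept inside the function so safe_name()
    # can be tried without installing anything.
    import requests
    from bs4 import BeautifulSoup

    # Assumption: makedirs(exist_ok=True) replaces try/mkdir/except-pass.
    target = os.path.join(os.getcwd(), folder)
    os.makedirs(target, exist_ok=True)

    r = requests.get(url, timeout=30)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")

    for index, image in enumerate(soup.find_all("img"), start=1):
        link = image.get("src")
        if not link:
            continue  # <img> without a src attribute
        link = urljoin(url, link)  # resolve relative links against the page URL
        if not link.startswith(("http://", "https://")):
            continue  # e.g. inline data: URIs, which requests cannot fetch
        name = safe_name(image.get("alt", ""), f"image-{index}")
        # As in the video, every file is saved with a .jpg extension.
        with open(os.path.join(target, name + ".jpg"), "wb") as f:
            f.write(requests.get(link, timeout=30).content)
        print("Writing:", name)
```

A call would then look like `imagedown("https://example.com/some-listing-page", "bratislava")`, where the URL is a placeholder for whatever page is being scraped. Falling back to `image-1`, `image-2`, and so on when the alt text is empty is the numbering idea mentioned near the end of the video.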
Info
Channel: John Watson Rooney
Views: 24,572
Rating: 4.9375 out of 5
Keywords: image scraper python 3, python image downloader, web scrape images, bulk image downloader, image downloader python, web image downloader, website image downloader, python download all images from url, python download image url requests, image scraper, web scraping image, scraping images python beautifulsoup, image crawler, web scraping, python scraping, python tutorial, python programming, beautifulsoup web scraping, python scraping tutorial, coding tutorial, beautifulsoup4
Id: stIxEKR7o-c
Length: 15min 0sec (900 seconds)
Published: Wed Oct 28 2020