Scrape Dynamically loaded websites with python | python webscraping technique 2020 | python project

Video Statistics and Information

Reddit Comments

How is using the API considered scraping?

👍 1 · u/PewPaw-Grams · Sep 20 2020 · replies
Captions
What's up everyone, my name is Adil and I'm a computer science student. In this video I'm going to share a technique that I always use to scrape dynamically loaded websites, a technique I prefer to try before reaching for Selenium. This is just going to be me showing you what I do with Python; maybe you'll learn something from it, and if I'm doing something wrong you can point it out so I can improve. That way we're helping each other. So without wasting any more time, let's get right into the video.

Right now I'm on a website called unsplash.com, and the goal is to scrape images from it. Unsplash provides free images that you can use in your websites, your videos, your applications, whatever you like. If we examine the site a little, you'll notice that as I scroll down, more and more images get added to the page. That means this is a dynamically loaded website: in the background, JavaScript sends requests to a backend server, fetches data, and updates the page in real time. My internet is slow, so the images take a while to load here, but on a fast connection they load quickly.

If you have a little experience with web scraping in Python, you'll know that we can't scrape a dynamically loaded website directly with requests and Beautiful Soup. We could use something like Selenium, but I don't like a browser window popping up on my screen every time I try to scrape something, so Selenium is always my last option. Here is the technique I try first on these kinds of websites.

Go to the developer tools (right click the page and select Inspect Element) and open the Network tab. Check the XHR filter and keep it checked, so you only see the background requests the page's JavaScript makes. Now, if the site has a search box, search for something. Let's search for "dog": as I search, you can see requests being sent to the backend server, and one of them looks very interesting. One tip for this process: always look for the requests that return JSON. We have three JSON requests here, and one of them stands out, because if you've worked with APIs you'll recognize the shape of its URL: it has a search query parameter and a per_page parameter. It looks exactly like an API request to the backend server.

If we double-click that request, it opens in a new tab and we get a huge amount of JSON data. Inside it there's a "results" key, and each result has some "urls". Opening one of those URLs shows an image of a dog, and opening another shows the same image at a higher resolution. The keys raw, full, regular, small and thumb are the available resolutions. So what we've done is find our way into the backend API of this website, and now we have access to the exact data the site itself uses to load all these images. And if we have access to the data, extracting the images is easy. We don't need Beautiful Soup at all; the requests library alone can do the whole job.
I'm also going to show you some tricks you can use to avoid getting blocked by the Unsplash servers. But first, you might be wondering why we don't just use the API this website officially provides. It does provide one, and you can use it, but there are restrictions, something like 50 requests per day. If you're scraping some images for a fun project and don't want any restrictions on you, the technique in this video is the way to go.

So now we'll start creating a script to scrape these images. I'll go to VS Code, but you can use any text editor you like. I've created a file named unsplash.py. Back in the browser, let's examine that request URL a little more. Our search term "dog" is stored in the query parameter; there's also an xp parameter set to none, per_page set to 20, and page set to 2. If we change the query parameter to "cat" and open one of the URLs in the response, we get an image of a cat, so the parameters work exactly as you'd expect. The other thing we need to check is which HTTP method the request uses: it's a GET. And if we change the page parameter to zero we still get a response, so the page numbering starts from zero. Whatever we want to search for, we just change the query parameter; per_page means images per page, so each request gets us 20 pictures, and I believe the maximum this API allows is 50.
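Here is that exploration as a minimal sketch in code. The endpoint path below is an assumption based on what the site exposed at recording time; undocumented endpoints like this can change, so confirm the exact URL in your own Network tab.

    import requests

    # Endpoint spotted in the DevTools Network tab (XHR filter).
    # Verify the path yourself; internal endpoints can change at any time.
    BASE_URL = "https://unsplash.com/napi/search/photos"

    # requests builds the query string (?query=cat&per_page=20&page=0) for us.
    params = {"query": "cat", "per_page": 20, "page": 0}  # pages start at 0

    r = requests.get(BASE_URL, params=params)
    print(r.status_code)  # 200 means the backend answered us directly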
We'll look at that later. For now, let's copy this URL, define a variable url, and paste the URL into it. Next we import the requests module and send a request: r = requests.get(url), and then print r.status_code. If I save and run the code, we get a status code of 200, which means OK: we successfully made a GET request to this URL and got a response back.

Now I can define one more variable, data, and set it equal to r.json(), which parses all the JSON data. If I print data, yes, we're getting all the JSON. Next I'll run a for loop. If we look at the response again, all the pictures live inside "results", so we need to go into results and loop through them: for item in data["results"]. Inside the loop we need a name for each image. If we open one of the results, you can see it has an "id", and this id is unique for every picture on this API, so we can use it as our image name: name = item["id"]. After that we get the URL of the image: url = item["urls"]["thumb"]. What I've done here is go inside "urls", where we have several options; these are the resolutions, with "raw" the highest and "thumb" the lowest. Whichever resolution you want to scrape, put that key inside the square brackets and you'll get the image in that resolution.

The only thing left is to save the image: with open(name + ".jpg", "wb") as f, then f.write(requests.get(url).content). If we save the file and run the code, we get a KeyError on "thumbs". Looking at the response again, the key is actually "thumb", not "thumbs". Run it again and you can see images being downloaded into the folder. If I reveal them in the file explorer, there they are: we've downloaded these images with our script, so the code works.
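Assembled, that first throwaway version looks roughly like this; the endpoint URL is still the one taken from DevTools, so treat it as an assumption.

    import requests

    # Quick-and-dirty version: fetch one page of search results and save
    # every image at "thumb" resolution into the current directory.
    url = "https://unsplash.com/napi/search/photos?query=cat&per_page=20&page=0"

    r = requests.get(url)
    data = r.json()

    for item in data["results"]:
        name = item["id"]                # unique per picture, safe as a file name
        img_url = item["urls"]["thumb"]  # raw / full / regular / small / thumb
        with open(name + ".jpg", "wb") as f:
            f.write(requests.get(img_url).content)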
But I don't recommend you code a scraper like this. Let me delete all these images first. Yes, it works fine, but if you want to create scalable programs, maintain your code, and add new features later on, this is the worst way to go. So I'll get rid of all this code and create a class instead. I'll name the class Unsplash and define a constructor inside it: def __init__(self, ...). If you don't know what a constructor is, it's a method that is called automatically whenever we instantiate an object of a class. Our constructor will take a few arguments: the first is search_term and the second is per_page. Then we set some attributes on our class: self.search_term = search_term, and self.per_page = per_page.

The next thing we do is create a method that builds our URL: def set_url(self). Inside it we format the URL so that the hard-coded "cat" search query gets replaced by whatever we pass in as the search term, and per_page changes according to the per_page attribute of our class. We grab the URL from before, paste it down here, strip out the hard-coded values, and turn it into an f-string: instead of the hard-coded query we put {self.search_term} in curly braces, then {self.per_page}, and we also change page to {self.page}. Of course we need to define the self.page attribute on our class, with a default value of zero, since the numbering starts at zero. Then we return the f-string. What we've done is format our URL so that whatever we want to search for is inserted into the query parameter.

After that we make the request. I'll create a method make_request(self); inside it we first define url = self.set_url(), which gives us the URL, and then we call requests.request(), passing in the request method, which is "get" in our case, and the url, and return the result. So make_request returns the response we get from the request. Then we need to get the data: def get_data(self), where we set one more attribute on our class, self.data, equal to the result of make_request() with .json() on the end to get the JSON data.

So far we have created a method that sets our URL, one more that makes the request, and one that gets the data.
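Here's a sketch of the class up to this point. It already includes two fixes the video only makes later: set_url actually returns its f-string, and get_data calls make_request through self. The endpoint path is, as before, the one observed in DevTools.

    import requests

    class Unsplash:
        def __init__(self, search_term, per_page):
            self.search_term = search_term
            self.per_page = per_page
            self.page = 0  # the endpoint's page numbering starts at 0

        def set_url(self):
            # Build the request URL from the class attributes; "napi" is
            # the undocumented path observed in DevTools and may change.
            return (f"https://unsplash.com/napi/search/photos"
                    f"?query={self.search_term}"
                    f"&per_page={self.per_page}&page={self.page}")

        def make_request(self):
            url = self.set_url()
            return requests.request("GET", url)

        def get_data(self):
            # Store the parsed JSON on the instance for later methods.
            self.data = self.make_request().json()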
Now we can create a method that dynamically generates file paths for us: def save_path(self, name), which takes self and one argument, the image name. Inside it we define a variable download_dir and set it equal to "unsplash". What I'm trying to do here is this: wherever this file is located, I want to create a directory named unsplash and save all the images inside that folder. To do that, I first need to check whether that directory already exists in my current working directory: if not os.path.exists(download_dir), then os.mkdir(download_dir). In other words, if the unsplash directory doesn't exist in the current working directory, go ahead and make it. We also need to import os at the top.

After that we can define the file path itself. We start by getting the full path to our current working directory with os.getcwd(), and I'll wrap it inside os.path.realpath(); what that function does is dereference any symbolic links, on the operating systems that support them. So that line gives us the absolute path to our current working directory. Then we join the unsplash directory onto that path, and the image name onto that, using os.path.join(), which joins all the paths we provide to it as arguments: the first argument is our absolute path, the second is download_dir, and then the name. We wrap the whole thing in an f-string and add the ".jpg" extension at the end. So now we have a method that generates file paths for saving our images.

Next we need a method that downloads an image: def download(self, url, name), taking self, the URL of the image we want to download, and the name we want to save it under. Inside it we first define file_path = self.save_path(name), then we say with open(file_path, "wb") as f, and then f.write(requests.request("GET", url).content). So inside this download method we first build the file path using the save_path method, open a file at that path in write-bytes mode, send a request to the image URL, get the content of the image, and write that content into our file. This method downloads one image.
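Those two helpers might look like this, as a continuation of the class sketched above.

    import os
    import requests

    class Unsplash:
        # ... __init__, set_url, make_request and get_data as sketched above ...

        def save_path(self, name):
            # Create the "unsplash" folder in the working directory if needed.
            download_dir = "unsplash"
            if not os.path.exists(download_dir):
                os.mkdir(download_dir)
            # Absolute path to the current working directory, symlinks resolved.
            cwd = os.path.realpath(os.getcwd())
            return f"{os.path.join(cwd, download_dir, name)}.jpg"

        def download(self, url, name):
            file_path = self.save_path(name)
            with open(file_path, "wb") as f:
                # Fetch the image bytes and write them straight to the file.
                f.write(requests.request("GET", url).content)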
Now we define a method named scraper; this is where we do all the scraping. It takes self as always, and pages as an argument. Inside it we run a for loop: for page in range(0, pages + 1). As we saw before, we first need to go inside "results" before we can scrape the images. So inside the loop the first thing we do is make a request, self.make_request(), and then call self.get_data(). After calling get_data we have data to work with, so we can say for item in self.data["results"] and scrape the images one by one. Like before, we set name = item["id"] and url = item["urls"]["thumb"], using "thumb" like in the first example.

But hard-coding that isn't the right way to go, because as we've seen, there are five different resolutions for each image. So we define one more argument in our constructor: quality. Let's keep its default at "thumb" just for testing purposes, and let's also give per_page a default of 20. Now we can pass any of those resolution keys into the constructor and get images at that quality: instead of "thumb" we say item["urls"][self.quality], and we create the attribute in the constructor, self.quality = quality. Then all we need inside the loop is self.download(url, name), passing in the url and the name.

So now we can test our code. First we instantiate the class: scraper = Unsplash("cars"), with the search term as an argument, and then scraper.scraper(1) to loop through just one page. Before running, there are two things we forgot: get_data needs to actually call self.make_request(), and set_url needs to return its f-string. We fix those now, and we're all set. Before running the code, note that my current working directory has no folder named unsplash, so our code should first create that directory and then save all the images inside it. If I save the file and run it, a directory named unsplash is created, and if we reveal it in the file explorer, images of cars are being downloaded into it one after another. It's working, so we can close this and break the program with Ctrl+C.

We're now successful in scraping images, but we need to add a few more things so we don't get blocked by the servers. The first thing we can add is headers. I'll define another attribute on my class, self.headers. You can fetch real headers by going back to the request in DevTools and clicking on the Headers tab; down at the bottom you'll see the request headers, the headers sent whenever a request goes to this URL. I'll copy them and create a dictionary, keeping only a few of them and deleting the rest. After creating the headers dictionary, we come down to the make_request method and add one more parameter, headers=self.headers, and we do the same for the request inside download.

We can also use proxies: the requests.request method has one more argument, proxies, which takes a dictionary of proxies, just like we added the headers. If you buy some good proxies and use them cleverly inside your scrapers, there's much less chance of getting blocked by the servers.
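Here's a sketch of the scraper method with the headers wired in. The header value is a stand-in for whatever your own DevTools session shows. Two small liberties, noted plainly: I use range(pages) so that one page means one request (the video's range(0, pages + 1) runs an extra iteration), and I let get_data do the requesting so each page is only fetched once.

    import requests

    class Unsplash:
        def __init__(self, search_term, per_page=20, quality="thumb"):
            self.search_term = search_term
            self.per_page = per_page
            self.quality = quality  # raw / full / regular / small / thumb
            self.page = 0
            # Stand-in header; copy the real request headers from DevTools.
            self.headers = {"User-Agent": "Mozilla/5.0"}

        # ... set_url, get_data, save_path and download as sketched above,
        # with headers=self.headers added to both requests.request calls ...

        def make_request(self):
            return requests.request("GET", self.set_url(),
                                    headers=self.headers)

        def scraper(self, pages):
            for _ in range(pages):  # one iteration per requested page
                self.get_data()     # fetches and parses one page of JSON
                for item in self.data["results"]:
                    self.download(item["urls"][self.quality], item["id"])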
There are a few more things we can do about not getting blocked, but this video is getting too long, so maybe we'll cover those in the next one. One more thing, though: where we loop through the pages, we need to add one more line. After we're done looping through one page, we have to update the self.page value by one, so I'll say self.page += 1; otherwise every request would fetch the same page. We're also going to add the line if __name__ == "__main__": and indent our test code underneath it. The fully assembled script is sketched below, after the outro.

There might still be some bugs here and there; we can fix them in the next video, and we can refactor this code a little more there as well. We could also add a few more features, like a method that returns a Pillow image object that you can use directly in image processing, or a method that returns a tkinter PhotoImage object that you can use directly in your tkinter applications. We can add lots of features here.

So yeah, this video is getting too long; I'll do the rest in the next one. If you liked this video and want me to create another, let me know down in the comment section and I'll definitely create the second part. If you have any questions, comment below. Give this video a thumbs up and please subscribe to my channel, it helps a lot. I'll see you in the next video. Till then, peace out.
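For reference, here is the whole script assembled as one sketch, under the same assumptions noted throughout (the "napi" endpoint path and the stand-in header value).

    import os
    import requests

    class Unsplash:
        def __init__(self, search_term, per_page=20, quality="thumb"):
            self.search_term = search_term
            self.per_page = per_page
            self.quality = quality  # raw / full / regular / small / thumb
            self.page = 0           # the endpoint's pages start at 0
            # Stand-in header; copy real request headers from DevTools.
            self.headers = {"User-Agent": "Mozilla/5.0"}

        def set_url(self):
            # Undocumented path observed in DevTools; verify before use.
            return (f"https://unsplash.com/napi/search/photos"
                    f"?query={self.search_term}"
                    f"&per_page={self.per_page}&page={self.page}")

        def make_request(self):
            return requests.request("GET", self.set_url(),
                                    headers=self.headers)

        def get_data(self):
            self.data = self.make_request().json()

        def save_path(self, name):
            download_dir = "unsplash"
            if not os.path.exists(download_dir):
                os.mkdir(download_dir)
            cwd = os.path.realpath(os.getcwd())
            return f"{os.path.join(cwd, download_dir, name)}.jpg"

        def download(self, url, name):
            with open(self.save_path(name), "wb") as f:
                f.write(requests.request("GET", url,
                                         headers=self.headers).content)

        def scraper(self, pages):
            for _ in range(pages):
                self.get_data()
                for item in self.data["results"]:
                    self.download(item["urls"][self.quality], item["id"])
                self.page += 1  # advance so the next request is a new page

    if __name__ == "__main__":
        scraper = Unsplash("cars")
        scraper.scraper(1)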
Info
Channel: Code Bear
Views: 17,499
Rating: 4.927928 out of 5
Keywords: scrape dynamic websites with python, scrape dynamic websites with python without selenium, scrape dynamically Javascript loaded websites with python, python web scrapping, scrape images with python, scrape unsplash with python, python webscrapping techniques, web scraping, web scraping python, python webscraping 2020, python requests module, python requests, python project, python project for beginners
Id: 8Uxxu0-dAKQ
Length: 26min 30sec (1590 seconds)
Published: Fri Sep 18 2020