Always Check for the Hidden API when Web Scraping

Video Statistics and Information

Captions
If you're scraping a site, when your code looks like this and then you see this, you might think that you need to use Selenium to click that button, when in fact what you need to do is look past the pretty pictures and the CSS and the HTML and see what's actually happening behind all of that. Here's the code for getting all the products and all the data, even some information that isn't actually available in the HTML itself. If you follow along with me for the next seven or eight minutes, I'll show you exactly what I did to get here, what I used, and where I looked.

So the first thing that we're going to do is come back to the website and open up the Inspect Element tool. This is normally where you'd go to have a look at the HTML, but what we're going to do is head over to the Network tab, click on XHR, and reload the page. What that's going to do is show us all of the requests to the server here, and we're going to see if there's any useful information that pops up. If you're new to this sort of request, this is possibly the first place you really want to come and have a look. If you're not new to this, you might think, "I know about this, but I can't see the actual products here."

So let's click on a few. There's no product information there; this is just some random JavaScript stuff going on here. Let's make this a bit bigger so we can see. No product information. This looks promising. Nope, no good though.

So here's a nice trick: scroll to the bottom, where that Load More button was. We're going to click on that and it's going to fire off some new requests, and some of these are going to be the ones that we're interested in. Let's check this one out at the bottom. There we go. What does this look like? This looks like it's got a page size, it's got current page, product information. Excellent. So this is basically all the information that is being taken by the website and run through
the JavaScript and turned into what you see on the left-hand side. What we can do, because we don't want any of these actual pretty pictures or any of this stuff, we just want the raw information, is mimic this request in our code to get this exact data out.

Now, there's a few different ways of doing this. You will need some kind of API program like Postman. I use Insomnia. They're both free; it doesn't matter which. What you want to do, when you find the response here, is go here and go Copy, Copy as cURL. I'm clicking on the Windows one; it doesn't matter. I'm going to come over to my Insomnia. We have our new thingy up here, our environment, and I'm going to hit New Request, and we're going to call this one sg huts. Then in the GET request, because we saw it was a GET request, I'm just going to paste it and hit Send. Now, if everything works as we hoped it will, that is exactly what we saw in our browser.

But what can we do here? Well, because we're using our API tool, it split everything out for us nicely, so we can look at the query and see all these nice options we have that we can easily change. Now the page size, this one I'm thinking is quite interesting. What I like to do when I see a page size is smash it straight up and see what happens. So I'm going to hit 100 and we're going to see what comes back. You might get an error, but in this case what we get is that our response now says page size 100, and we can see, if we collapse the whole products section, there's a hundred products here.

Now this means a couple of things to me. The first is fewer requests to the server to get the information that I want. It also makes our lives a bit easier, because we know we can change more and more different things within our request. It tells us how many total products we have, and we have all the product information here that we saw before, so you can see all this information here.

So what can we do to move through pages? Well, if you see up the top
here, it says current page is equal to two, and that's because that's the request I copied. If we come down, you can see current page here. Let's just change that to one, run that again, and we've got page one.

So now what we want to do is transfer this into something in our code, so we can automate going through all the pages and getting the information that we want. Fortunately, Insomnia and Postman do this for us. You can come over to the request, hit the down button here, click Generate Code, change it to Python and Requests, and there you go: that's all the information that you need to run in your code editor to get this information out.

Now, there's a lot of headers and stuff here, and we can see we've got an empty payload; that's okay. Generally speaking, I tend to just leave all the headers in for now, although if you wanted to, you could experiment here and start removing the ones you don't need in order to customize your request. I'm going to copy this to the clipboard, come back to our code editor, and paste it all in.

Now, the headers here have everything, including our user agent and our cookie. The cookie could be important; it might be or it might not be. You can try getting rid of it in your request if you want to, but because for me this is just getting this information out once, this time I'm going to leave it in. So I'm going to collapse the headers, because I'm happy with the way they look.

Now here's our query string. This is all on one line. I know it's not very tidy, and you'd definitely want to tidy this up, but for the sake of this example I'm just going to scroll right across until I see that current page is equal to one. You see it there. I'm going to go ahead and put my f-string here, change this to our x, and hit save, and we're going to come all the way back here and we're
going to tidy some of this up. There is no payload, so we can remove that. We're going to put x equal to one up here, just for demonstration purposes. We're going to hit run and see what we get back. Hopefully we get back all the information we just looked at. It all just flicked by; I'm guessing that is exactly it. That's good.

So what can we do from here? Well, we know that there were around a thousand products. What you could do is make a request here, check out the response, grab the number of products, and then work out how many pages you need. We can see over here: total products, 1013. So you could say 1013, at 100 per page, how many pages do we need? We're going to need 11. You could do that if you were trying to make this repeatable, but in this case I'm just going to make a loop that goes through x from 1 to 11 and gets all the information.

So let's do that here. Let's indent this, and then here we're going to make our for x in range, and we'll do 1 to 12, because that will be from 1 inclusive up to, but not including, 12. Let's get these headers collapsed again; they're taking up an awful lot of screen space. There we go.

Now, to deal with the response, we don't want the text response, we want the JSON response. And response is a long word, so let's get rid of that: r.json(). There we go. Let's put that inside our loop and run that, and if we get some info flicking by, we know that our loop is working. We're just checking for any errors. That seems good to me. Fine.

Now we just need to do something with this data. The easiest thing to do is to take it and put it into a pandas DataFrame, because we can normalize all the JSON, flatten it out nicely, and get something quick and easy that we can export to a CSV file or whatever output we need. So what we're going to do is import pandas as pd, save that, and now we just need to figure out where all the actual JSON product information is that we
want. The easiest way to do that is to come back to our API client. Just smush this over a bit. If you're trying to work out how to get your data properly out of your JSON response, you can click up here and save it, or you can copy the whole lot and paste it into a VS Code file so you can look through it and examine it, maybe write some code to get through it that way. But I don't need to do that in this one.

I know where it is: we have one up here that opens, then we have this, so we need this key; then we have a products key underneath, then we have another products key, and then we have a product list, and that is the product list here that has 100 items. So I'm going from here, down here, here, and here. So I'm going to copy this one, then two lots of products and a product.

Okay, so let's do data = r.json(), and now let's print out data, the first key, then we want the next key, which was products, and you can see how I'm just chaining these together as I go down the tree: products, and then product. It needs the quote marks: product. What I'm going to do, instead of printing that all out, is just print the length of that, because that is a list. So I'm going to run this and we should just get numbers each time: 100, 100, 100, et cetera. There we go, that's all the products. There was 99 on that page for some reason. One, two, three, four: page four only had 99. That's interesting. It's starting to slow down a bit; maybe we need to time our requests a bit better. So I'm going to stop that, but we know that that's working.

So what do we want to do with this information? We want to loop through each and every product in here and add it to a new product list. So up here I'm going to make a results list, and I'm going to say here for p, for product, in, and this is where we just saw all of our product lists. So we can go in here and do res.append(p), and at the end of this let's print the length of res, just to check that this is working. Let's make
the page numbers less, so we do one to three, so we can just double-check that we get plenty of results in our results list. 200 results, two pages. Seems good to me.

Now what we can do is take this results list and create a DataFrame. So df (remember to call your DataFrame something better than df if this is in some kind of code that you're not just running through like I am), and we want to do pd.json_normalize, then we need to give it our res, and then we can just do df.to_csv, and let's just call this one first results.csv.

Now, before running this, let's increase the pages. Let's just do five, so that's four pages, so we should have 400 results. Let's let that run. Maybe you would want to put some kind of print statement in so you can see what's going on, but it's finished, and here's our first results. There was my test file. We can see that we have this information here, so if I open it up in Excel, we'll have a better idea of what we've actually got.

So here's our results file. We can see we have our index; you may or may not want that. There's a color, the number of colors you can have here, and then there's also the name of the product, the list price, all the information that was in that JSON format, all the way along to the URL, model name, and some other stuff. Now, we can see in here that we actually have a list of dictionaries depicting all the other colors, et cetera. So that's got all that information, and this is what I was showing you when I said that it wasn't in the HTML to start with. I haven't flattened this out, but you could quite easily write something that would flatten this all out for you. This is just a rough demonstration of how to get the information, not necessarily how to deal with it all, but it's not too difficult to flatten this all out.

So if we pop back to our code now: what we started with was this, and then we ran into a problem where we needed to load more and go through the pages, and we ended up with this, which is basically
getting us all of the information, super quick, super easy, and straight to a CSV file. If you've enjoyed this, you should check this video out, because it's got more information on how to web scrape like this.
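
The code itself is only shown on screen in the video, so the following is a minimal sketch of the steps described above rather than the author's actual script. The endpoint URL, the query parameter names (currentPage, pageSize), and the first key of the JSON path are placeholders, since the transcript never spells them out; only the "products, products, product" part of the path and the 1013 / 100 figures come from the video. The fetch function is passed in as an argument so the paging logic can be tried without touching a real site.

```python
import math
import time

# ASSUMPTIONS: the URL and the first key below are hypothetical placeholders.
API_URL = "https://example.com/api/products"
KEY_PATH = ("topLevelKey", "products", "products", "product")


def pages_needed(total_products, page_size):
    """Number of pages required to cover every product."""
    return math.ceil(total_products / page_size)


def fetch_page(page, page_size=100):
    """Request one page of JSON from the (placeholder) endpoint."""
    import requests  # third-party; imported lazily so the helpers work without it

    # Hypothetical parameter names; copy the real ones from the browser request.
    params = {"currentPage": page, "pageSize": page_size}
    # The video keeps every copied header (cookie included); this is a stand-in.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(API_URL, params=params, headers=headers, timeout=30)
    response.raise_for_status()
    return response.json()


def extract_products(data):
    """Walk down the nested keys to the list of product dictionaries."""
    node = data
    for key in KEY_PATH:
        node = node[key]
    return node


def scrape_all(fetch, total_products, page_size=100, delay=0.0):
    """Collect the products from every page into one flat list."""
    results = []
    for page in range(1, pages_needed(total_products, page_size) + 1):
        results.extend(extract_products(fetch(page, page_size)))
        time.sleep(delay)  # pause between requests to go easy on the server
    return results


def save_csv(results, path):
    """Flatten the product dictionaries and write them out as CSV."""
    import pandas as pd  # third-party; imported lazily for the same reason

    pd.json_normalize(results).to_csv(path)
```

With real values filled in, a run would look something like save_csv(scrape_all(fetch_page, total_products=1013, delay=1.0), "first results.csv"). Computing the page count from the total, instead of hard-coding range(1, 12), is the "repeatable" variant mentioned in the video, and the delay is one way to address the slowdown noticed around page four.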
Info
Channel: John Watson Rooney
Views: 10,559
Keywords: python web scraping, web scraping with python, learn python, modern web scraping, api endpoints, hidden api, web scrapping
Id: DqtlR0y0suo
Length: 11min 49sec (709 seconds)
Published: Sun Aug 01 2021