This script I threw together saves me hours.

So I'm going to show you a tool that I wrote for myself. It loads up a page using Selenium Wire and checks all of the network requests and responses that the page makes, so we can easily find the JSON data sitting in that backend API. Instead of repeatedly loading the page and poking around to see what's going on, we give the tool a URL, it loads it up, and we get a nice list of request URLs, plus the responses saved to a file for us to interrogate and work out what we're doing. I like building tools like this; they make your life so much easier, and hopefully you'll like this one too.

We're going to be using Selenium Wire, which is an extension to Selenium, so you'll need to make sure you pip install it. Then from seleniumwire we import webdriver, and from seleniumwire.utils we import decode. I'm going to import decode as decode_sw, because we'll be using Python's normal decode as well, and I'm also importing json, which we'll need later on.

What Selenium Wire does is load up the page and record all of the network activity that the website generates, so we can see it and intercept it. Let's create a few functions. The first one I'll call show_request_urls: it's going to return the URLs that the site has made requests to externally, which is where we can easily find the API. It takes the driver, which I'll cover in just a second, and a target URL so it knows what to load. Inside, driver.get(target_url) goes to the page, and I create a blank list of URLs so we can add to it. From here we interrogate the requests using driver.requests.
Looping with for request in driver.requests, we append each one to our list as a dictionary: urls.append({"url": request.url}). This is the first part of Selenium Wire we're using: the driver's requests attribute gives us access to the requests, and also to the responses, which we'll handle in a separate function. Then we just return the urls from this function.

Next, a new function, main, which is where we run everything. Here we initialize our webdriver: driver = webdriver.Firefox(). You can use whichever browser you like that's installed; I like Firefox, massive Firefox fanboy. We also need to pass in some Selenium Wire options as a dictionary: when the responses come back they're bytes, and we want to make sure no extra encoding is applied, so we set seleniumwire_options={"disable_encoding": True}.

Now that we have this driver, we can use it within show_request_urls to actually open the web browser and load the page. I'll set a target_url, using this website here as a good example. Then urls = show_request_urls(driver, target_url), passing in the driver we created and the target URL. Let's run through those URLs and print them out, for url in urls: print(url), and make sure the script actually runs by guarding with if __name__ == "__main__": main().
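Assembled, the first function looks roughly like this. It's a sketch: the function itself is duck-typed, and the driver it expects is the selenium-wire one created as shown in the docstring.

```python
def show_request_urls(driver, target_url):
    """Load target_url and collect every network request URL the page makes.

    `driver` is assumed to be a selenium-wire webdriver, e.g.:
        from seleniumwire import webdriver
        driver = webdriver.Firefox(seleniumwire_options={"disable_encoding": True})
    """
    driver.get(target_url)           # navigate to the page
    urls = []
    for request in driver.requests:  # selenium-wire records all requests here
        urls.append({"url": request.url})
    return urls
```

Keeping each entry as a small dictionary leaves room to record more fields per request later (status code, method, and so on).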
Save that; great, let's give this a go and see where we get. I'll hit run, and hopefully it loads up the browser, you'll see it happen on the right-hand side, goes to the target URL I put in, and gives us back a whole load of URLs that the page is making requests to. The browser didn't close, because I haven't added that in yet, but you can see we now have all of these URLs: every network request that happened when that page loaded went to one or another of them. This is really interesting, and we can look through it; you'll find some entries more interesting than others. The ones you're probably going to like most are ones like this, where you can see the full URL for the API search along with a product identifier. That's really what you're looking for, and it gives you a good idea of how you can actually get the data from this website. A pretty handy way of looking at it, I think.

What I'm going to do now is add driver.close(), because we want to make sure the browser closes when we're done. Another thing I like to do, since we're looking at URLs, is keep a list of keywords: perhaps "products", or maybe "api" is the better option, since we want to know if there's an API coming back; sometimes the API URL will have something like "v1" in it. The keywords you use depend on your knowledge of the target site, what you've decided you want to do, and general knowledge overall; I tend to just use "api". Then we check the URLs against them: for keyword in keywords: if keyword in url: print(url), remembering that each entry is a dictionary, so the check has to look inside it.
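The keyword check can be sketched as a small helper; the name filter_urls is my own (in the video this loop lives inline in main):

```python
def filter_urls(urls, keywords=("api",)):
    """Return only the request URLs whose string contains one of the keywords.

    `urls` is the list of {"url": ...} dicts from show_request_urls.
    """
    matches = []
    for entry in urls:
        for keyword in keywords:
            if keyword in entry["url"]:   # check the URL value, not the dict itself
                matches.append(entry["url"])
                break                     # avoid duplicates when several keywords match
    return matches
```

Usage is then just `for url in filter_urls(urls): print(url)`.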
Of course, I need to reference the dictionary key here, because otherwise it doesn't know where to look: we'd be searching the dictionary rather than the URL string, and we want to look for the keyword in the url value. That should give us the list now. Okay, there we go, that's a bit better: a more condensed list of URLs that have the "api" keyword in them. That's a pretty good start and gives us a good idea of what's going on.

But we can do more, because we can then interrogate the actual API response, which is obviously going to be JSON, so we have a good opportunity to grab the data we might want right there and then. I'm going to create a new function called show_response, again taking the driver and the target URL, and doing the same driver.get(target_url) at the top. Our responses start as a blank list, and now we look at how we handle the encoding. Again we loop, for request in driver.requests, because we need the response belonging to each request, and we'll need a try/except. This is a bit messy, and I'm not really sure what the best way to handle it is, so if you know a better way, stick it down in the comments below so we can all benefit. Inside the try, data = decode_sw(request.response.body, request.response.headers.get("Content-Encoding", "identity")): we decode the response body according to the content encoding declared in the headers, falling back to identity, which is all from the Selenium Wire documentation. Then response = json.loads(...), because we want this to be JSON information; if it's not JSON data, we're not interested.
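Put together, show_response comes out roughly like this. One caveat: the decoder parameter is my addition so the sketch can be exercised without a browser; by default it falls back to selenium-wire's decode, which is what the video uses.

```python
import json

def show_response(driver, target_url, decoder=None):
    """Load target_url and return every response body that parses as JSON."""
    if decoder is None:
        # selenium-wire's decode() undoes gzip/brotli content-encoding
        from seleniumwire.utils import decode as decoder
    driver.get(target_url)
    responses = []
    for request in driver.requests:
        try:
            data = decoder(
                request.response.body,
                request.response.headers.get("Content-Encoding", "identity"),
            )
            responses.append(json.loads(data.decode("utf-8")))
        except Exception:
            pass  # not JSON, or the request had no response: discard it
    return responses
```

The broad except is deliberately crude, as the video admits; it swallows both non-JSON bodies and requests that never received a response.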
We just discard everything else. Inside json.loads we pass data.decode("utf-8"), and this is why I said at the beginning that we import Selenium Wire's decode as decode_sw: here we're using Python's own decode, and we want UTF-8, which gives us the actual information we want. If that works inside our try block, I do responses.append(response); if it doesn't, I do the thing you probably don't want to do and straight up ignore the error, because I don't care about those responses. Then we return responses. So now we have a nice neat list of only the responses from the backend to the frontend that are JSON-decodable; like I said, everything else gets discarded.

Back in main, I'll say responses = show_response(driver, target_url). You'll notice I'm actually loading the page twice here, and that's intentional: my idea going forward is to add argparse, or maybe even go the full route of Click, so you can choose whether you want to see just the URLs, just the responses, or both, which is why I've got them separated for the moment. I don't see loading the page twice being a massive issue. Underneath, after we get the URLs, we save the responses into a JSON file, because there are potentially going to be a lot of them with a lot of data, so it's definitely worth saving: with open("data.json", "w") as f, then json.dumps our responses into the file there. Let's give ourselves a little space.
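The save step, sketched as a helper (the function name is mine); note it uses json.dump, the file-writing variant, rather than json.dumps, which returns a string:

```python
import json

def save_responses(responses, path="data.json"):
    """Write the captured JSON responses to a file.

    json.dump writes to a file object; json.dumps returns a string --
    an easy pair to mix up.
    """
    with open(path, "w") as f:
        json.dump(responses, f)
```
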
Now, if we go back to the top: we have Selenium Wire, which we're using; our first function, which gets the URLs being requested; and then our responses. When we open the page, we get all of that information back, nice and neat, to see and interrogate, as opposed to loading it up in your browser and having a long look around the network tab to see what's going on. This doesn't entirely replace that, but it's a good start, and I think it can be improved and built upon too.

So let's run: we should get our data.json file out, plus the printout of requested URLs containing the keyword we chose, in this case "api". You'll see the page load twice, as explained earlier; I'm okay with that for the moment. And we've made a mistake: this needs to be driver.requests, not driver.request, otherwise we get the error we just saw, because that attribute doesn't exist. This should work now. One more small error this time: json.dump writes to a file, json.dumps returns a string. Third time lucky, maybe.

Okay, that finished, and we do have a data.json file, so let's open it up. I think I can format the document, there we go. Now we can see all of the JSON information that came back. We have this "items" key, which could be interesting for us to look at and find out more about; there's a product URL, all sorts of information. We could scan through this and see what information is available using this method to scrape data, and of course this is my preferred method when we can do it. Hopefully the tool I've just shown you will help you know whether you can use this method or whether you need to take a different approach.

Hopefully you've enjoyed this video and got some value from it. I have a Patreon, which I'll link down below; there's a free tier, so check that out. A like and subscribe really helps me out too. Cheers, see you in the next one.
Info
Channel: John Watson Rooney
Views: 17,753
Id: HVWNRMEH9Ro
Length: 13min 38sec (818 seconds)
Published: Wed Aug 16 2023