Request Headers for Web Scraping

Captions
In this video we're going to be talking about HTTP headers: what they are, and how we would want to use custom ones when we're web scraping.

HTTP stands for Hypertext Transfer Protocol, and it's designed to allow web browsers and web servers to talk to each other and transfer data, usually HTML or maybe JSON, for displaying the content of a website. The client, which is us, initiates the request to the server and waits for the response. Within both the request and the response we have these headers: additional text that provides information about the request, to help each party work out how best to serve and deal with that data. The request headers are the ones we're most interested in as web scrapers, as we want our programs to seem as human-like as possible. We're going to cover the four most useful headers and what values you might want to send along with them when you're web scraping.

So what do these headers look like? Well, each request is categorized into a few different request methods. The first and most common is the GET method, which is what we use when we use Python to send a request to the server to get data. The opposite of this is POST, where we send data to the server, and we do this quite often too, when we're logging in or something similar. We can form both of these requests in Python using the requests library, and this is one that I use a lot in all of my videos; you've probably seen it or are aware of it.

After the method we then have our headers, and these come as a set of key and value pairs, like a dictionary in Python. So here's a typical set of headers from a request from my web browser to google.com. These are the important ones to note.

Accept: this is the content types that we can understand. This could be text, images or video, but in this case the */* is saying any, with no preference.

The next one is Accept-Encoding, which is all to do with the compression algorithms that we can accept for the response. In this case it includes br, as we can see there, which is the Brotli algorithm. This is a very common one that's been in use since about 2016 and is supported by all modern web browsers and programs.

This one is Cookie: the cookie data that we send. This will be cookie information that we've stored, which the server would have sent to us originally, if we were identifying ourselves or keeping some kind of persistent data. We are going to cover cookies a bit later on, so we'll come back to that in more detail.

The next one is DNT, which is Do Not Track. This is whether we request privacy, or more personalized content, like ads and stuff like that, which might be based on your browsing history. This can also be read from JavaScript, and it's sometimes a setting in your browser, on Firefox for example, to ask sites not to track you.

The next one is the Referer, and this is where we've come from. This is the URL that sent us here, and it's generally just used for analytics, but it can be quite important to put in there to show where you came from.

And the User-Agent, the last one, and this is the most important one. It identifies the browser or app, the operating system, and the versions of whatever has sent the request. I've actually done a video on this in its entirety, which I'll link somewhere down below, but this video will also contain the information you need to know about user agents.

So if we compare what we've just seen with the default headers sent by Python and requests, we can see that not only are we missing quite a few, but also how easy it could be for a web server to block us just by looking at the User-Agent. Shown here, we're clearly identifying ourselves as python-requests, and the server can even see the version number that we're using. The second set of headers I've got here is from the requests-html library. They are the same; however, we are sending a real browser user agent. This is just a thing that the requests-html library does for us, and it's quite helpful in that respect.

So that's all well and good, but we can change all of these headers and send custom ones as well, and I'll show you how to do that now. We're going to jump into the code.

To show you how we can actually change the headers, what they look like, and how we do that with Python, I've got some code here. We are importing requests, and I'm pointing my URL at the httpbin website. This is a website that is basically designed for showing you what your requests and responses look like from the server's side. I'm using the /headers path, which means that the response from the server is actually going to be the headers that we sent it, so we can quickly see what headers we are sending and what changes when we change them in our code.

To show you what I mean, I'm going to do r = requests.get, because we're using a GET method, on the URL, and then I'm going to print out r.text. As I said, because we're getting our request headers back as part of the response, it'll show us what we sent. We can see right away, as I showed you earlier, that the User-Agent here is python-requests/2.22, which is the version, and we are barely sending any other headers. These are the standard headers that we would send from requests, and it can be quite easy for websites or servers to simply block these when we send the request from within our code, as opposed to all of the headers I showed you earlier that came from my actual browser. So we're going to want to add a few of these and change them, specifically the User-Agent.

This one here: I believe the X- prefix used to denote more custom headers. I think that's been deprecated now, but it's still about, as you can see. I think this one is something to do with the AWS hosting, so we're going to ignore it for now.

To set up our custom headers we want to create a new dictionary here. I'm going to call it headers, and we want to define our keys and our values. The first one I'm going to do is User-Agent, because that is the most important one to change. So we type User-Agent in our braces, put the colon, and then our string will go in here. I'm just going to copy that across, as I've got it saved over here, and paste it in. As you can see, this is a Windows machine running Mozilla Firefox, with the version there, and Gecko is the browser engine that Firefox uses, if you weren't aware of that before.

To get this dictionary sent with our request, we come into our r = requests.get, and after the URL we put headers equal to our headers dictionary. Now if I run this, we'll see down here that the User-Agent we specified is now being sent as the request header to the server. In most cases, if you're having authentication issues or something similar, it's definitely worth trying that one out straight away, but it is worth trying to replicate as much of a real user as possible when you're web scraping.

So I'm going to go ahead and add a few more in. We're going to add in the Accept-Language one, just so we look a bit more real, and again I'm just going to copy this string from over here so I don't have to type it out or remember what it is, because I never remember; I always copy and paste. We're saying en-GB. Then we're going to do the Referer, which is basically where we are saying we came from, and in this case I'm going to say that we came from google.com. Then I'm also going to put in DNT, for Do Not Track, and I'm going to set that to 1.
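
The steps just described can be sketched roughly as below. This is a minimal illustration rather than the exact code from the video: the header values (including the Firefox user agent string) are assumed examples, the httpbin.org URL is assumed from the spoken description, and preview_headers is a hypothetical helper. It uses Session.prepare_request so the final headers can be inspected without touching the network.

```python
import requests

# Assumed endpoint: httpbin echoes back the request headers it receives.
URL = "https://httpbin.org/headers"

# Illustrative values only; copy the real ones from your own browser's dev tools.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0",
    "Accept-Language": "en-GB,en;q=0.5",
    "Referer": "https://www.google.com/",  # the header name really is spelled "Referer"
    "DNT": "1",
}


def preview_headers(url, custom_headers):
    """Hypothetical helper: return the headers requests would send, without sending."""
    with requests.Session() as session:
        # prepare_request merges the session's default headers with our custom ones.
        prepared = session.prepare_request(
            requests.Request("GET", url, headers=custom_headers)
        )
    return dict(prepared.headers)


sent = preview_headers(URL, headers)
for name, value in sent.items():
    print(f"{name}: {value}")

# To actually send the request, as in the video:
# r = requests.get(URL, headers=headers)
# print(r.text)
```

The custom User-Agent replaces the default python-requests one, while defaults that were not overridden, such as Accept-Encoding, are still sent alongside it.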
So if we run that now, we should see all of the headers that we sent pop up. And here we go: we have the Accept one in already, here's the language we put in, here's our DNT, our Referer, which is google.com, and our User-Agent.

You can also send completely custom headers, and some websites will expect this: they will only accept really specific things, maybe to try and block you. If you're having some issues, just go to Inspect Element and see what your browser is actually sending. But we could just say something like, let's put John, and then "sub to me please". I've seen other YouTubers do this, so let's give it a go. And there we go: we can see that we have our custom "sub to me please" header in there.

I want to make a note on cookies and sessions. Cookies were designed to make some data persistent when we're using websites: things like what's in your shopping cart, or what you've been doing on the site. The one we're going to be most interested in, though, is the authentication side, which lets the website know that we are staying logged in over multiple requests. Along with being useful for keeping us logged in, cookies are also quite crucial when we're scraping data from an API endpoint. I've got some examples of that on my channel too, which I'll link to.

These cookies do have an expiration date, though, so you'll need to check your request and find out when it expires, because quite often that's when your request to that API endpoint will stop working. We can use a session object as well, to keep hold of and keep track of all of our cookies for that session, and both requests and requests-html will do that for you. This could be quite useful if you're trying to maintain your session on the server while sending multiple requests from Python.

So that's going to do it for this one, guys. I hope you have found it useful and gained some value from this video. If you feel like you have, please hit that like button and let me know down below. Thank you very much, guys, and I will see you in the next one real soon. Cheers, goodbye.
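
The note on cookies and sessions above can be illustrated with a small offline sketch. The cookie name and value here are hypothetical; on a real site the server would set them through its response after you log in, and the session would store them for you. The httpbin.org URL and the user agent string are likewise assumptions for illustration.

```python
import requests

session = requests.Session()

# Headers set on the session are sent with every request made through it.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0",
})

# Hypothetical cookie, added by hand so the example runs without a real login.
session.cookies.set("sessionid", "abc123", domain="httpbin.org", path="/")

# Build the request the session would send, without touching the network.
prepared = session.prepare_request(
    requests.Request("GET", "https://httpbin.org/cookies")
)
cookie_header = prepared.headers.get("Cookie", "")
print("Cookie header:", cookie_header)

# Each stored cookie carries its expiry as a Unix timestamp, or None if it
# only lasts for the current session.
for cookie in session.cookies:
    print(cookie.name, "expires:", cookie.expires)
```

Checking the expires value is one way to find out when a request that depends on a stored cookie is likely to stop working.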
Info
Channel: John Watson Rooney
Views: 6,214
Rating: 5 out of 5
Keywords: request headers explained, http request headers explained, request headers python, http headers tutorial, custom http headers, understanding http headers, headers for web scraping, web scraping with python, how to send custom headers, user agent python, user agent spoofing, user agent browser, user-agent header, fake user agent python, python requests user agent
Id: Oz902cJcCUg
Length: 10min 2sec (602 seconds)
Published: Wed Feb 03 2021