Scrapy POST Requests

Captions
All right, so this is one of those spontaneous live streams. I did not announce it in advance, but I thought covering POST requests would be a good idea, so I'm not sure how many people are actually going to watch it live. Let's see.

There are a few things I want to talk about regarding POST requests. This is one of those areas that feels strange to newcomers, people who are just entering the world of Scrapy or web scraping in general. Let's start with a basic example on our favorite site, quotes.toscrape.com, one of those sites we are going to use a lot. This always happens in live videos, you end up saying wrong and silly words. Anyway, hello Ram, good to see you.

What we are going to do is understand what happens when we have to create a POST request: how these requests are created and how they are sent. We'll look at a couple of examples: this site, then one of the ASP.NET pages I have in mind, and I will improvise; there is one more site I want to show you.

In most cases you will see a POST request when some data is being sent to the server. This happens when you have a username and password and you are trying to log in, or when you have to search for something. And then there are sites like the third example, which actually use POST requests for pagination: it's not a GET request, it's sending a POST request in the background. So you have to deal with POST requests all the time.

Let's start with the basics on the first site, and let me bring up the terminal. I like to work with Windows Terminal; this is probably the best thing that has happened on Windows in a very, very long time. Let's go to the site and examine what is happening. I've pressed F12 to open the developer tools, and now we go to the Network tab (zoomed in a little too much). On this site you can enter any username and any password and it will work; it doesn't matter, this is only for learning and practicing. I'm going to click Login and see what happens.

As you can see, this is a POST request. Let's make it bigger and examine it. This was a request to this page, it was of type POST, and the status code is 302. Status codes 301 and 302 mean that you are being redirected to a different URL, and whenever there is a redirect, Scrapy takes care of it automatically, so you don't really have to worry about the 300s. The 200s and 300s are all fine; if it is a 400 or 500, then we actually need to do something, because those mean something is wrong. The 100s are typically for informational purposes, but practically I don't see anyone using them.

Let's collapse the response headers section, I'm not really interested in it right now, and expand the request headers. These are the standard things passed by the browser: you can see a cookie was sent, and there were a few other things. This is the important one: the content type is application/x-www-form-urlencoded. This content type is typically used whenever there is a form submit. Then we have the user agent and all the standard things, and here is where we are sending the user data, the data that we are submitting. The username and password are standard, nothing special, but there is one special token that we need to handle.

So we'll start with this csrf_token. This is actually a security measure: CSRF stands for cross-site request forgery. Let's not go into what CSRF is in detail; what matters right now is that we need to get this value, because it will change every time. Let's log out and log in again, clear everything, put in anything... and this time it has changed. Or is it the same? Honestly, I don't remember. What we have to do is locate where this csrf_token is coming from. There are multiple ways; the simplest in this example is to click Log Out, look at the source code of the login page, and simply search for it. I'm going to zoom in, Ctrl+F, c-s-r-f... there it is (I started looking too early). On the page where the form is displayed, if we look at the source code and search for csrf_token, it's right there, and this time the value has definitely changed. So this you have to handle dynamically: capture it and send it.

Let's start with reading this particular field. We can see that it is an input, and its name is unique, so let's use the name. There are multiple ways to find it, using CSS or XPath, and we'll look at both. Before that, I want to open the Scrapy shell and point it to this URL, so I'll paste it in. (cls, my Command Prompt shortcut, doesn't work here; pressing Ctrl+L clears the screen.) Now we are inside the Scrapy shell. With response.css() I'll use getall() just to make sure we are getting exactly one result, and I'm simply going to copy the name attribute and paste it in. Because this is an attribute, it needs to be placed in square brackets, and there we have it, very simple.

Now let's do something similar using XPath. Let me write the skeleton first and show you how easily it can be done. This is an input, so copy the same attribute again (it's still on the clipboard anyway): double slash and star for any element, then paste it inside square brackets. There is one change: we have to prefix it with @, because in XPath all attributes need the @ prefix. There we have it.

Now we have to get the value. With XPath we can just add /@value, and this time we can remove getall() because we know there is exactly one result. We can store this in a variable and then construct the form data that we are going to send. With CSS we have to use the ::attr() function; this is not standard CSS, it is Scrapy-specific, and the attribute we want is value. There we have it. So those are both ways, CSS and XPath, to get this value.

You have probably seen that part a lot, so now I can go on and talk about requests. There are two things we can import from scrapy: Request and FormRequest. We can use either of them to create and send a POST request, depending on the scenario. Let's talk about this example, which is obviously a very straightforward form submit, and start with FormRequest. Let's create a FormRequest object. There is no auto-suggest here for parameters, only for available functions, so we can call help() on FormRequest. We immediately see that it accepts mostly unnamed parameters, so that on its own is not very useful.
FormRequest also has one useful classmethod called from_response. This one is interesting because it takes a response object directly, and whenever you are submitting a form you already have a response object. You can provide a form name, form id, or form number; you need those only if there is more than one form on the page and you want to submit a specific one. Then you have formdata, which has to be a dictionary (we'll see how to create that dictionary very quickly), plus dont_click, formxpath, and a few other parameters. I'm going to press q, I don't want to read through all the documentation, and Ctrl+L to clear.

Let's write FormRequest.from_response(). The first argument is response, but I need to wait, because first let's copy the data we want to send. F12... okay, it took a moment to respond. I'm just going to put something in and click Login, because what I want to look at is the structure of the data being sent. So here is the form data: the csrf_token (I'm ignoring that for a moment), and all this information we have to convert into a dictionary and send across. I'm just going to press Copy and paste it in here.

From that copied information we can create a dictionary; let's call it data. Every dictionary has keys and values. Don't worry about the hard-coding, I'm going to change it: for this particular site the username can be anything, it doesn't matter, but if you are working with a specific site with a specific username and password, you would put those in. So this is the user and this is the password. I made a typo by missing a comma here; and I like that Python allows a trailing comma as well, unlike many other languages.

So this is the dictionary, but the csrf_token is not yet dynamic. Let's scroll up to the selector for the csrf_token; this is the one, except we don't want getall(), we want get(). I'm going to copy that and assign it to data['csrf_token'], so now it has the correct value. Actually, instead of doing it in multiple places like this, what you can do is put the selector expression directly in the dictionary literal, so whenever you create your form data you construct it in one step.

Once you have the form data, let's come back to FormRequest.from_response(): we have the response object, and the formdata argument is the data dictionary we created. Let's store the result in a variable r and execute it to see what happens. If we were inside a spider we would yield it; here in the shell we can just use fetch(). You can already see that this redirects with a 302 to this particular page, and now we are on that page.

How do we know for sure that we have actually logged in? For that we have to look at how the page behaves. I'm going to reload this page and quickly log in, and here you can see this "Logout" text. This "Logout" link is only visible when we have successfully logged in, so let's look for it. I've right-clicked and chosen Inspect, and we can see it is an anchor tag containing "Logout". It's very simple if you remember my previous videos, and actually you don't have to remember them: you can just call response.xpath() and look for the anchor tag (or any element) with a condition that its text should be the string we copied. Is it there? Yes, it is. If it is there, that means we are logged in.

All right, that was the first, basic example, which you have probably seen in a lot of places. Now let's move on to the second one: what if we try to do the same thing using not FormRequest but plain Request? Shall we try? Let's do it. Request needs a URL, and first of all we need to log out, so I'm going back to the login page and executing the same selector.

The good thing about the Scrapy shell is that when you have IPython installed, it always uses IPython. The standard Python console is very basic, I don't have a better word for it, so what you need is IPython. You should have it in all your virtual environments; unless I'm creating one very specifically for a client, I will install IPython. Even if I have to maintain a requirements.txt file, I will install from requirements.txt and then install IPython, because you get the In/Out history. One more useful thing I can show you right now: I just have to press the up arrow, and I still have the values from the previous session.

See, now if we look with response.xpath for "Logout", we don't get anything, which means we are not logged in. So now we are going to create a Request. This is going to change a lot: first of all, we need the specific URL where this request is being submitted, and we need to get that from the Network tab. Are we logged in right now? Yes, so we need to log out and log in again. Logout, login, logout, login... I've been doing this a lot.
Anyway, notice the URL here in the POST request. In this particular scenario the URL is actually the same as the page, but it may also be different, so let's copy it and put the URL in. We don't need the response object now, and Request doesn't take formdata; what it takes is body. If you supply the form data... okay, we don't have the data created in this session, so we need to create it again. Right now I'm expecting errors, and yes, we have errors. We can simply press up a few times, because we already did this: from scrapy import Request, which I'll type out, and the data dictionary I'll take as-is, just to save some typing.

So now if I execute, we have an error, and if you look at it, it's very simple; it says: to_bytes must receive a str or bytes object, got dict. Even if this message is not entirely clear, one thing is certain: it received a dictionary, and it is not expecting a dictionary, it is expecting str or bytes. Our data is a dictionary, so how do we convert it from a dictionary to a string? This is where json comes into the picture. import json, and json has one useful function called dumps, the "s" as in string. What kind of input does it take? Any Python object. We have a dictionary, so we pass in the dictionary and get a string back; if we check the type, it's a str.

So let's go back, and here we call json.dumps around the data (not response.dumps, that was a slip), and we need one more closing bracket. Let's look at r: it is still a GET request, so when we create the Request object there is one more thing we need to provide, which is the method, and it is going to be "POST". Now r is a request with method POST.

Let's call fetch() and see. It crawled, but it did not log in, because we did not see a 302. If we look at the response with view(response), it opens in the default browser (mine is currently set to Firefox, just give me a second), and you can see the message: while logging in, please provide your username. So the server did not receive the username. We know we do have a username and password in the body, but the form still did not go through.

What are the other things happening here? Remember the other thing: headers. Let me try that; this is actually the first time I'm doing it this way, but let's take these request headers, specifically the most important one, Content-Type. Let me take just that one, and if it doesn't work we'll take all of them. I'll create a dictionary where the key is Content-Type and the value is application/x-www-form-urlencoded, and pass headers=headers. Still no. Let me quickly check what is happening on my preview; the display looks fine, anyway, I'm going to wait. I'll execute this again because the csrf_token has probably changed, create the request again, and... nope. All right, let's not waste time on this; I'll analyze what is happening sometime later.

So now we have a few options, and I've already crossed my expected time; I was hoping to finish within half an hour. We have three examples here. This next one is an ASP.NET page; if you click on Login... this is just a local website for testing that I created, standard stuff, nothing fancy, and it works very similarly. Let me try to recall the username for it: admin@test.com, and I think I copied the whole thing as the password too.
Let's see whether it works or not... it does not, so let's examine this POST. Okay, there is some problem with my Visual Studio, which is suddenly eating up a lot of RAM, so I'm going to exit it and will probably look at this particular page sometime later. Yeah, I'm getting an error from YouTube too. This is one of those weird things I never liked about Visual Studio, and that's the full Visual Studio, by the way; the one we normally use is Visual Studio Code. The application I was running was from the full version, which is 2 to 3 GB bare minimum, so it takes up a lot of resources. I think I should wait a moment until this warning is dismissed.

Someone in chat says "I get that sometimes too"; I did not actually see the context, what do you get sometimes? There is usually a delay of a few seconds between when I speak and when it reaches you, so let me know in the comments what exactly you meant. Ah, the memory errors, okay, got it. Yeah, I have 16 GB in this laptop, but Visual Studio will take up all the RAM you throw at it, even 64 GB. It's fine when you have a lot of RAM and a very good SSD, but otherwise it's painful.

So let's come to this third site; I'll press F12 and carefully close the other things. Again, this was something I found posted as a job on one of the freelancing sites, and the scraping looks very simple at first glance: it's a standard table, nothing dynamic about it on the first page. Right now we are on page three, but it doesn't really matter, so there's no point closing and reloading. If you look at the structure, this is just a table, and you can get this data using pandas or any other way you want. The first page is not going to be difficult: you can use Scrapy, quickly run a for loop over the rows, and scrape all this data.

The problem is when you go to the next page. I've opened the Network tab, so let's click Next and see what happens. You will see multiple things here. Number one, the page URL and the request URL are different: yes, the domain is the same, but the actual POST is going to this /views/ajax path. This is where FormRequest.from_response is not going to work, because the form is not submitting to the same page. Secondly, you can see that the Accept header is sending application/json, and what you get back in the response is JSON; if we go to the Preview tab you can see it is a JSON response. So this is altogether a different ball game.

Unfortunately I have something to attend to, so I need to wind up this stream for today, but let's do another stream tomorrow. Let me know in the comments if you would like to see this particular site handled in tomorrow's stream, and whether there are other areas you want to understand, specifically around FormRequest; if there are certain sites you want me to open up and look at, I will be happy to do that. I'll see you tomorrow, we will examine this site and how to handle it, and I'm sure it's going to be a lot of fun. That's all for today. I hope to see your comments with other sites, whatever you want scraped. Have a nice day, bye!
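When a pagination endpoint like that /views/ajax one returns JSON, the spider callback parses response.text with the json module instead of running selectors on it. The payload shape below is purely illustrative; the real response will have its own site-specific structure:

```python
import json

# Stand-in for response.text from a JSON pagination endpoint; the real
# /views/ajax payload is structured differently.
response_text = '{"page": 2, "rows": [{"name": "Row A"}, {"name": "Row B"}]}'

payload = json.loads(response_text)
for row in payload["rows"]:
    print(row["name"])  # Row A, then Row B
```

Recent Scrapy versions (2.2+) also expose response.json() as a shortcut for exactly this.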
Info
Channel: codeRECODE with Upendra
Views: 1,046
Rating: 5 out of 5
Keywords: python web scraping tutorial, Python Web Scraping, selectors in scrapy, web scraping python, how to scrape data, browser scraping, scrape web pages, website scraping, python scraping, screen scraping, data scraping, Python Scrapy, web scrapping, CSS Selector, web scraping, web crawler, web spiders, webscraping, scrape, scraping, pandas web scraping table, pandas tutorial, web scraping with python, python projects for intermediate, python tutorial, python webscraping
Id: Cu2q7tr4Bqg
Length: 36min 24sec (2184 seconds)
Published: Wed Mar 10 2021