Scrapy POST Requests & ASPX Pages

Captions
Welcome to another session. Today's session is again about POST requests. For those of you who were live yesterday, I did a session on POST requests but was not able to complete it, and for those of you watching from the beginning, I am going to cover some new things and try to make this more understandable. Another change, as you can see, is that today I am not on camera; it is just going to be a screencast.

So let's start with POST requests. What is POST and what is GET? We must understand this difference before we can actually get into POST and GET requests. Let's open any site, say the Scrapy documentation. When I press Enter, what is the browser actually doing? It assumes that the protocol is HTTP (here it is actually HTTPS, but the protocol is still the Hypertext Transfer Protocol), and it sends a GET request to the server. The server returns a response, which is the HTML, and that HTML contains references to a lot of other files: JS files, CSS files, and of course images. All the loading of those extra files happens after the page is received; the browser analyzes the response, realizes it needs more files, and sends GET requests for all of them. So by default, whenever you open any site in the browser, it sends a GET request.

Now what about a form submit? Here you can enter some text, for example a search. What happens when I press Enter this time: will it be a GET request or a POST request? That depends on how the page is written, so let's look at the code. We are looking at the form tag, and here you will see that the method is "get", so when you press Enter on this particular form, it is still going to be a GET request. Understanding GET requests is actually easier. I have typed "test" in the search box, the developer tools are open with the Network tab showing, and now I press Enter. We saw in the markup that it should be a GET request, and notice the first request: the browser loads a lot of other dependencies, but the first one is the GET request, exactly as the form's markup specified.

Whenever there is a GET or POST request, some data is sent by the client, that is, the browser. In this case the main thing the browser sent was the word "test", because that is what we searched for, and it is sent right here in the query string. In other words, every piece of information that we send is there in the URL itself. That is a GET request: all the information being sent is added to the URL. A POST request is different. Think of a GET request as a postcard: a postcard is open, just a piece of paper, so anyone can look at the information written on it. A POST request is like an envelope: all the data sent in a POST request is contained within the request body.
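To make the postcard/envelope distinction concrete, here is a small stdlib-only sketch (my own illustration, not from the session; the search URL is invented). The same key/value data ends up in the URL for GET but in the request body for POST:

```python
from urllib.parse import urlencode
from urllib.request import Request

data = {"q": "test"}

# GET: the data is appended to the URL as a query string (the "postcard").
get_url = "https://quotes.toscrape.com/search?" + urlencode(data)

# POST: the same data travels in the request body (the "envelope").
# Passing `data=` makes urllib switch the method to POST automatically.
post_req = Request(
    "https://quotes.toscrape.com/search",
    data=urlencode(data).encode(),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(get_url)                # query string is visible in the URL
print(post_req.get_method())  # POST
print(post_req.data)          # body bytes, not part of the URL
```

Nothing is sent over the network here; the point is only where the data lives in each kind of request.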
A request has a request header and a request body. I am trying to keep things very simple here; in reality they are much more complicated. If you are working as a web developer or appearing for an interview, you will definitely need to understand the difference between GET and POST, and if you are scraping information, then again you need a good understanding of how websites work. You do not need to understand .NET or Java or anything like that, but you do need to understand HTML, CSS, and how this overall information exchange happens. (Thank you, Kenjo; I hope it continues to help you.) There are other things going on too; for example, when we requested the page, the browser did some DNS resolving and connected to an IP address, but let's not go into all that.

So let's focus on GET versus POST. In GET you have everything in the URL; in POST it is as if everything is covered in a box and you are sending that box. The box which contains the information is called the form, and we are going to look at how we can create this form data and mimic what the browser is doing. That is all the theoretical background I wanted to give you about GET and POST, except for a few more things: these are not the only HTTP methods. There is DELETE, which is used for deleting resources; GET is for getting information, which is what the browser does; PATCH I see very rarely. Mostly you will see either GET or POST. If you are working with APIs and sending files you will use PUT, and if you are deleting resources it will be DELETE. GET and POST are the basics you need to understand; knowing PUT and DELETE is an additional benefit which you will not really need for the kind of work we are doing here, but it is still good to know.

By the way, you will find httpbin.org in a lot of documentation because it is very easy to use. For example, if I open httpbin.org/get, it simply returns a response describing what you sent; similarly there is /post and a lot of other endpoints, so this website is going to be very useful.

Yesterday we worked on quotes.toscrape.com, and we will be working with it again; yesterday I made a goof-up which I need to correct. Now let's talk about Scrapy. In Scrapy you can create a POST request in two ways, so let me open the terminal and start Scrapy shell directly. (I am using OBS to broadcast this and I need to keep an eye on my audio and video; so far things look okay, but if something seems odd, let me know in the comments.) In Scrapy you have `from scrapy import Request`; this is the main class we work with in Scrapy all the time. And there is also `FormRequest`. We can use either of these to create a POST request: we can use `Request` and change the method to POST, or we can use `FormRequest`. Let's see both of them one by one, and let's see how we can actually log in at quotes.toscrape.com/login. This is a test site created by the same folks who created Scrapy itself, so it is a good site whenever you want to practice something with Scrapy; you will see it coming up again and again, and it is actually meant for learning.

So which one do you want to start with, FormRequest or Request? Let's start with FormRequest; it is actually easier for this kind of scenario. First we need to understand what is happening behind the scenes whenever we send a POST request. This site will take any username, so let's call it "user", and the password is "pwd" (yes, my password is pretty weak, I know). Note that the developer tools and the Network tab are open. Let me refresh first, because I have been playing around with this site a lot; I will talk about that error later. Now I enter the username and password, press Enter, and focus on the first request. The first request goes to the login page and it is a POST request, so now we know we have to send a POST request; that is the main purpose of today's video. And here we can see that 302 is the status code returned by the server. 100, 200, 300, 400, 500: these are the five levels of HTTP status codes a server can return. 1xx is informational and I hardly ever see it; 200 means success, and that is what we are always looking for; anything in the 300 range is a redirect, which can be 301 or 302. Now, this is one reason why I like Scrapy: Scrapy will take care of 301 and 302 for you.
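As a quick aside (my addition, not from the session), Python's standard library ships these status codes in `http.HTTPStatus`, which is handy for checking what a code means without looking it up:

```python
from http import HTTPStatus

# Look up the standard reason phrases for the codes discussed above.
print(HTTPStatus(200).phrase)  # OK
print(HTTPStatus(302).phrase)  # Found
print(HTTPStatus(404).phrase)  # Not Found

# A redirect is anything in the 3xx range.
def is_redirect(code: int) -> bool:
    return 300 <= code < 400

print(is_redirect(302))  # True
```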
You will not even notice in the code whether the response was a 302 or a 200; Scrapy takes care of all the redirection. The server returned status 302, that 302 said "you need to go to this page", and the browser automatically sent the second request; the subsequent requests were sent because they were referenced in that page. So we sent this POST request and the server answered 302, which is Found. Let's click on the request and see what information was sent (maybe some other time I will figure out how to explore this even further). This is the request URL, the same one we saw earlier. Sometimes a website will send its POST request to a third-party website on a completely different domain, and you need to watch out for that, but here these are the only things we need to take care of. The response is what the server returned, so let's set that aside and look at the request headers, which are important. These are the standard request headers, and the most important one here is the Content-Type. Whenever there is a form submit, the content type is going to be application/x-www-form-urlencoded in most cases, though not always. Whenever we mimic this POST request we have to make sure we pass this content type, otherwise it is going to break. Scrapy will make things easier for us, as we will see, but if we do it the raw way using Request, we have to send it explicitly. The User-Agent and the other headers are all optional, at least for this particular site; there may be sites where you want to send everything. As for cookies, this is another reason why I like Scrapy: Scrapy handles cookies for you. If you want to turn them off, you explicitly add one line in the settings, COOKIES_ENABLED = False; otherwise cookie handling is done by Scrapy and you don't have to worry about it at all.

Now let's move on; let me collapse everything and focus on the form data. In the form data you will see the username that I entered and the password that I entered, but there is also a token: this is the CSRF token, for Cross-Site Request Forgery. I am not going to go into detail, but this is one of the security features where certain tokens are passed whenever there is a form request. These can be very simple tokens like this one, or there may be advanced techniques involving CAPTCHAs and so on, but that is out of scope right now. Right now our task is very simple: we want to capture all the form data and send it. In the case of quotes.toscrape.com it is actually very simple; we just look at the source code. Let me refresh the login page first; yes, I was looking at the wrong page because I had already logged in. I did the same silly thing yesterday, live, and again today. The token will be found on the page where the login form is displayed, so press Ctrl+U to view the source and search for "csrf", and there we have it. This is the value we have to send when we create the form. And how do we get this value? We have to start Scrapy from this page: whenever we parse this page, we will look for this particular token, store it, and then send the form POST request. So I am going to copy the URL, go to the shell, press Ctrl+L to clear everything, and fetch the page. This site is straightforward; it will not ask for any user agent or anything, which makes it very simple.

Now let's go back to the page. What we need to find is an input of type hidden, and its name is unique, so we can use XPath or CSS, it doesn't matter. Always start with getall() to make sure you are getting exactly one match, and because we are selecting by an attribute, the attribute has to be surrounded by square brackets. We can see that our selector is working correctly and we have exactly one match, so let's switch to get(). We need to extract the value attribute, and the way you extract an attribute with CSS selectors is the double-colon attr() syntax, and there we have it. This is the value we will be getting dynamically. What else were we sending? We had the username and password; the "invalid CSRF" message in my copy of the form does not matter right now. This site just needs a username and a password, so we are going to create a form. This is just a variable, a dictionary, so we can call it whatever we want, and it will have a csrf_token key with the extracted value. (By the way, a side note: please watch the video I published on monitoring Keychron products. If you want any of their keyboards, today is a good time: copy the script, run it, and when the K2 keyboards go on sale, I think tomorrow, you will get an email alert whenever they are available.) Then I add the username key, which can be anything for this site, and the password, which again can be anything for this particular site, and close the dictionary. Verify that everything looks okay; now we have the form data and we are ready to actually send our request. So I am going to clear everything.
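In the shell the token comes from a selector like `response.css("input[name=csrf_token]::attr(value)").get()`. Outside of Scrapy, the same extraction can be sketched with only the standard library; the markup, token value, and class name below are made up for illustration:

```python
from html.parser import HTMLParser

# A trimmed, invented version of the login page markup.
SAMPLE = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="AbCdEf123456">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

class InputFinder(HTMLParser):
    """Collect the value of every <input> element, keyed by its name."""
    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            d = dict(attrs)
            if "name" in d:
                self.values[d["name"]] = d.get("value") or ""

parser = InputFinder()
parser.feed(SAMPLE)

# Build the form dict exactly as in the shell session above.
form = {
    "csrf_token": parser.values["csrf_token"],
    "username": "user",
    "password": "pwd",
}
print(form["csrf_token"])  # AbCdEf123456
```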
Remember that we are going to start with FormRequest; form requests are very easy, which is why I am starting with them, and then I will show you how to do the same thing with Request. FormRequest has a method called from_response (one annoying thing about the IPython shell I am using is that tab completion gives a lot of extra information we have to ignore). So we create a new FormRequest and call it r. It accepts multiple arguments, but let's start with the bare minimum. It needs the response: remember that we already have a response object available, because we are in Scrapy shell and we already sent a request to the login page, which is also where we extracted the CSRF token. And in the formdata argument we just pass the form dictionary we created. Then fetch(r): this fetch method takes care of sending the request. Now, what is happening here? Debug: redirecting 302 to this page. When will it redirect? When it has successfully logged in. If the login failed, it gives you some kind of error message with a 200, like the one we got earlier; a 302 means we have successfully logged in, and wherever it got redirected, that page is a 200. At this point, if you want to look at the response, you can call view(response), which opens it in a browser, and as you can see we are logged in, because the "Logout" text is visible, and that text is visible only when you have successfully logged in.

So now we can write one quick selector. Press Ctrl+Shift+I (this browser is Firefox, by the way, in case it looks a little different from usual). All we have to do is look for an anchor tag which contains this text. Go to the shell and use response.css or xpath, whatever is comfortable for you: we are looking for an a tag which contains (single colon) "Logout", and we have something. So we can check for the presence of this anchor tag: if there is an a tag containing "Logout", that means we were successful. Using this we can create our spider, and let me quickly show you one I have written earlier; give me a moment to clean up certain things before I show it to you. Let me name the file form_request.py. Here is how it looks when we put everything together. This is the spider: we started from the login page and created a form the way we just did (there is a header in this first one, but that is not the one I mean; this is the one). We created the form, we do not need the header here, and we created a FormRequest, passing in the response and the form. When we are writing the spider we also have to provide a callback, otherwise how will we know whether we were able to log in or not? In the callback I am looking for the "Logout" text. Here I have used XPath; in the shell I just showed you how to write the same thing in CSS. Note the single colon in :contains versus the double colon in ::text; this typo is very common. If you are extracting text you put a double colon and then call get(); :contains takes a single colon. We are doing the same thing here: if the element is present we are logged in, if not, we are not. So that was the FormRequest.

Now let's talk about how we can do exactly the same thing using plain Scrapy Request. (Thank you, Mario, nice to hear that.) We already have the form object and we will need it, so I am not going to change that part. Let's look at the help for Request. In the __init__ you can notice certain things. Number one, the first parameter is the URL, so even if you don't write url=..., the first positional argument in the constructor is going to be the URL. The callback is None by default, and notice that the method defaults to GET, because most of the time you will be working with GET; the change we are going to make is to send POST instead. Then there are the headers: remember the specific header I showed you in the browser, Content-Type. This header is important because it is what tells the server that what we are sending is a web form, so we will have to provide it. Then body is where we send the actual form data, the username and password. Remember the analogy, even if you are joining late: a GET request is like a postcard, where everything is plainly written, where it is going and what the message is, and everyone can read it; GET requests carry everything in the URL itself as query strings, key=value pairs joined with ampersands. POST requests are like a closed box or an envelope.
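The "is the Logout link present?" check used in the spider's callback can be sketched as a tiny standalone helper. This is stdlib only and the function name is my own; in the actual spider you would use `response.xpath` or `response.css` with :contains, as described above:

```python
import re

def is_logged_in(html: str) -> bool:
    """Return True if the page contains an anchor whose text mentions Logout.

    A rough stand-in for the a:contains('Logout') selector from the session;
    a real spider would use Scrapy's selectors rather than a regex.
    """
    return re.search(r"<a\b[^>]*>[^<]*Logout[^<]*</a>", html, re.IGNORECASE) is not None

print(is_logged_in('<p>Welcome! <a href="/logout">Logout</a></p>'))  # True
print(is_logged_in('<p><a href="/login">Login</a></p>'))             # False
```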
With the envelope, the outside just has the address of the person: you just have the URL, and the content is inside the box. That content, the username and password, is what we pass in body. Cookies, meta, and all those other things we do not have to worry about right now. I press q to quit the help, and let's start creating our Request object. What is the URL? Let's copy it from the page. Now remember the difference we talked about between FormRequest and Request: with FormRequest you can create the request directly, or you can call FormRequest.from_response, which is what we did earlier. The from_response method is used only when you are submitting to the same response; if you are sending to some other URL, the only change is that you create the FormRequest directly. But right now we are working with plain Request, so the URL is the first parameter. After that we need the headers. We could pass all of them, but the most important one is the Content-Type, so copy the whole thing. The second argument I am passing is headers, which is a dictionary in key-value format: the header name is the key and the header value is the value. The next thing we need to pass is body. This is where we pass the content we want to send: the username, the password, and that CSRF token. But here is the problem: we have created a dictionary, and if we try to pass the dictionary it is not going to work. Store it in a variable and we get the error "must be str or bytes-like object, got dict"; the dict it is talking about is our form. When we were using FormRequest we could use a dictionary directly, because FormRequest can handle dictionaries, but Request cannot, so we need to convert the dictionary into a specific format. What format? Go back to the GET request and remember that in GET everything goes as key-value pairs in the URL itself. We have to convert this dictionary into key-value pairs in exactly that URL format, and for that we take the help of a library, urllib. urllib is part of Python, so we don't have to install anything. I import urllib, and if we call the urllib.parse.urlencode method and pass in the form, look at the output: the CSRF token is very long and runs up to this point, then there is an ampersand marking the next key-value pair, username=..., then another ampersand, then password=.... This now looks very much like what you send in a GET request, and this is what we have to pass. So let's go back to the same Request call, paste this in as the body, and check the type of r. Is it a POST request? Not yet, because I missed one thing: we need to supply the method, and the method has to be "POST". (Is anything unclear so far? If you have questions about anything we have covered, please go ahead and ask in the chat.) If we look at the request now, it is a POST request going to this URL, and all the details are hidden inside that box.

Now, there is one problem I see here with response.url: I think we are already logged in. We will have to log out. Remember that we created a selector for checking this, and yes, we are already logged in, so let's log out and do a fresh fetch. See, now we do not have that Logout link, which means we are completely logged out. So let's start from the beginning, recreating the form; this is another good thing I like about IPython over the regular console. The CSRF token must have changed, so we execute the selector again and get a new form with the new token, and now we can create the request. Just to show you: if we look for Logout at this point, we don't have it, which means we are not logged in. Let's send the request, and if everything is okay... yes, we can see a 302 redirect, and after the 302 we have a 200, and if we look for Logout, we have it. If the Logout check had returned nothing, we would still not be logged in; now we are, which is why we can see Logout. This check is specific to this site; if you are working with any other site, you have to see what that site does. Anyway, I already covered this part in yesterday's video, minus some goof-ups, and now I think you are clear about logging in, logging out, and sending FormRequest versus Request. Knowledge of this is important, because in certain situations you will need it.

Now let's move on to the practical part. I always love very practical examples, and I like this particular one because it is a .NET site. .NET started back in 2001 and, by the way, I have a .NET background: for a very long time I was a .NET Web Forms developer, so I have worked on this kind of thing.
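The dict-to-body conversion described above is just `urllib.parse.urlencode`. Here is a minimal sketch (the token value is invented for illustration):

```python
from urllib.parse import urlencode

# A form dict like the one built in the shell; the token is made up.
form = {
    "csrf_token": "AbCdEf123456",
    "username": "user",
    "password": "pwd",
}

# Request(body=...) needs a string or bytes, not a dict, so encode it
# into the ampersand-separated key=value format used in query strings.
body = urlencode(form)
print(body)  # csrf_token=AbCdEf123456&username=user&password=pwd
```

In the shell session this string is then passed as `body=` to a Scrapy `Request` along with `method="POST"` and the Content-Type header.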
This particular task is in fact very fresh; I saw it yesterday or today on one of the freelancing sites, so the task is new, but the technology is thankfully old, which is exactly what I wanted to show you. The task was that there is a list of values which need to be searched for in the title. So let's search for "organic" and wait... "organic" was too broad a term, so it returned a lot of results, and there is pagination and all that. The task was to get these codes and the titles, and it was not just "organic"; it was a whole list in an Excel sheet. Let's look at the basics. The first request we see is already a POST request; let me zoom in. This is the URL where the POST request is being sent, and the status is 200, which means whatever we need is on this page. If we look at the response, or even easier, at the preview, we can see that all the data is here, so it is not dynamic or JavaScript-rendered or anything like that; we can get the data directly. That is an important first step. Now, the headers are standard, but the content type is different here: this is multipart/form-data, with a boundary and all that, whereas in the earlier example it was application/x-www-form-urlencoded. The cookie is a big cookie, but we don't have to worry about cookies because Scrapy takes care of them, and the rest are the usual headers, nothing to worry about.

Now this is where the .NET part begins; let me collapse everything. .NET Web Forms pages have these specific hidden values: __EVENTTARGET (that is a double underscore, by the way, if it is not visible), __EVENTARGUMENT, and __VIEWSTATE. This was the good part and the bad part about .NET: a huge chunk of view state, plus __VIEWSTATEGENERATOR and __VIEWSTATEENCRYPTED. You will not see all of these every time, but these are the standard things you see everywhere; and whenever I see "dnn" I know the site is using DotNetNuke, which is not important here, but interesting. Then there is one more thing, the __RequestVerificationToken. What we need to do is get all these values dynamically, and how do we do that? The first thing you will notice is that even though we searched, we are still on the same page, and in .NET, __EVENTTARGET, __EVENTARGUMENT, and __VIEWSTATE will be there on every page. If I copy "VIEWSTATE" and press Ctrl+U to look at the source code, we have it right here; on every page there will be this huge chunk of text. Those three are the usual .NET things. What else do we have? (We were on the wrong page; let me close this and focus on the right one.) We searched for "organic", and it is right here; nothing complicated. Now, this double-underscore dnn variable is something we need to figure out: where is it coming from, or how is it being generated? In this case it is right here in the page, with autocomplete off, and its value goes as-is. There is nothing complicated about this page; once you know how to do it, everything is easy. And this __RequestVerificationToken, let's look for it in the source code as well, and we have it. So all we have to do is create the proper selectors and we will have everything sorted out. There are two ways to do this: I can type in every command and show you, or I can type just a few of them; let's see how it goes. Anyway, let's start with this.
go to the shell ctrl l to clear and the first thing that we will do is we'll directly try it without headers if it is not too concerned about headers we should have 200 yes we have 200 and if we look at the response by in fact let's look at response.txt and if we scroll up we can see that this is a big chunk of text now let's go back to the source code here's the source code and let's create a selector for a view state let's start with view state because that is always the most irritating part and if if this is broken your form will not be submitted so let's look for view state so it's actually very easy so the id and name both are view states so you can use any of them so let's start with usually id is unique but let's start with name because in the earlier example in quotes there we used name so let's use name here also doesn't matter response dot xpath and i used css earlier now what we are looking for we are looking for any element where in square brackets the name is view state and because we are looking for attribute now i if you are not comfortable with xpath let me give a quick explanation double slash means a search everywhere or search everywhere in the entire tree if you just put slash it will only search the root note and if you put double slash it will look at all the elements star means it doesn't matter what is the name of the element if we want to be specific we can write that look only for input or we can just say star and in the square brackets we put what kind of filtering we want and all the attributes in x path are started with at the rate so this at the rate name equal to view state so this is what we are looking for so let's close the bracket close the single quote and let's see and yes we have something great so we got view state and what is the real value view state so this is in this value oh i zoomed in too much anyway so the value that we want is in the value attribute and how do we access the attribute in xpath at the rate and what we 
are looking for is, on this particular element, backslash... sorry, forward slash, @value. Okay, let me call get(), and this screen is going to be filled... yeah. So this is the complete view state that we have. Now, I am too lazy to type too much, so what I usually do is take this expression and assign it, so I have my view state stored in a variable. Similarly, see, I'm going to save some typing: we have to find __EVENTTARGET, so I'll change this VIEWSTATE to EVENTTARGET, and I'm going to change the variable name to event_target as well. See what I'm doing here: exactly the same selector, I'm looking for the element where the name is __EVENTTARGET, and I'm getting the value. Let's check if we have something. No, but __EVENTTARGET was blank anyway, right? Yeah, there is no value here. Then __EVENTARGUMENT... In fact, if you are feeling too lazy like me, you can create standard functions; maybe I can even update that scraper-helper library that I have written, because if it is a .NET page it's always going to be the same, so we could probably get all those values in one go. So what else did we have? Let's come back here: __EVENTARGUMENT, __VIEWSTATE... or, in fact, can I be a little lazy and copy-paste some of the code that I wrote earlier? This is the code that I wrote earlier, nothing new about it, it's exactly what I have shown you, and if I press Enter you can see that we have everything: view state, event target, event argument, view state encrypted, the request verification token, and all the XPaths are exactly the same. I'm looking for this __EVENTARGUMENT, okay, and where is my value? Okay, so let me update it, we need to get the value out of it. What I'm going to do now is show you the complete spider that I've written, and along the way maybe we will learn some things about Visual Studio Code as well.
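The selector built above is `response.xpath('//*[@name="__VIEWSTATE"]/@value').get()`. As a rough stand-alone illustration of the same two steps (find by @name, then read the value attribute), here is a standard-library sketch; ElementTree's XPath subset cannot end a path in `/@value`, so the attribute is read separately, and the markup and the value are invented:

```python
import xml.etree.ElementTree as ET

# Made-up fragment standing in for the real page source.
html = (
    "<form>"
    "<input type='hidden' name='__VIEWSTATE' value='fakeViewState123'/>"
    "</form>"
)

root = ET.fromstring(html)
# //*[@name='__VIEWSTATE'] : any element whose name attribute matches
element = root.find(".//*[@name='__VIEWSTATE']")
# the /@value part: pull the value attribute off that element
viewstate = element.attrib["value"]
print(viewstate)
```

In the Scrapy shell the one-liner at the top does both steps at once and returns the string directly.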
So what I'm going to do here is, I just noticed that I did not get the value, and we need the value, right? I have to add /@value. So I'm going to press and hold the Alt key; this is exactly the place where I want to paste it, and I'm keeping the Alt key pressed, I have not let it go... this cursor is at the wrong place, so I'm going to click again and put it here, and here, and here. Now I have multiple cursors, and when I paste, the @value is there, but there is one additional colon, so I press Delete and all the lines are updated at once. Now, I'm a little skeptical, so I want to print everything to make sure it is correct. I'm going to press Alt+Shift and create a cursor on a new line, then Ctrl+Shift+Right Arrow, you can see I have it selected, or I can type print, Right Arrow, close the bracket, then copy everything, paste it here and print it. Whatever was None was supposed to be None, so that's not a problem, and here I need to undo again. So we have all the variables, and now I'm creating the form in exactly the same manner that we did earlier. All these variables are created, and the names are whatever was here; I've taken the names from here. In fact, I think I skipped this one, there was no point, so I skipped some of them. And this is our search phrase, "apples" is here. The scroll top, this 4029, was set when I scrolled, and right now it is blank, so I'm deleting it, it's not really required. Then this header: I mentioned that we will need it because this one is different. Maybe FormRequest will take care of it, but I'm not sure it will, so let's go and check the source code. If headers is not there, then it is going to set the content
type as this, because this is the most common one, and this is not what we have. That's why it's important that we send this Content-Type, and this I got from here, from the request headers: whatever was here, I just copy-pasted it and created a dictionary. Now, I already showed you how we create a FormRequest: the first argument is the URL, then we have the form data, and here we can supply a dictionary if we are working with FormRequest. The headers argument is again supposed to be a dictionary, and then we have a callback, which is this parse_table method. What is this parse_table method doing? Nothing complicated: it simply loops through all the rows and prints out the values from the first td and the second td, separated with this. It just means that if this method is successful, this will be printed. So let's copy the path and go to the command prompt, and let me open a new one. This time, I'm thinking whether to show you a few more things or keep it very simple. Let's keep it very simple, in fact let's leave it at simple. All right, I'm going to run scrapy runspider; remember that we are working with a standalone spider, so this is runspider, and just press Enter. If everything is okay, we should have all that list. We don't have that list, we have an error here, I must have goofed up somewhere. So where did I goof up? urlencode, parse_table... Now we want to debug the spider, so it is probably a good thing that this error occurred, because there are two ways now. Ah, we got a NoneType, so when we printed, there was something which was None, right? Yeah, here we have None. So we need to check these None values, why we have them, and we need to either handle them or remove them. This is the view state, it should not be None.
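For reference, FormRequest essentially url-encodes the formdata dictionary into the POST body for you. A minimal sketch of that payload construction, with placeholder values and a hypothetical txtSearch field name (the real field names come from the page source):

```python
from urllib.parse import urlencode

# Placeholder values; in the spider these come from the XPath selectors above.
formdata = {
    "__EVENTTARGET": "",
    "__EVENTARGUMENT": "",
    "__VIEWSTATE": "fakeViewState",
    "__VIEWSTATEGENERATOR": "fakeGenerator",
    "txtSearch": "apple",          # hypothetical name of the search box field
}
headers = {
    # the Content-Type copied from the browser's request headers
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
}

body = urlencode(formdata)         # what FormRequest sends as the POST body
print(body)

# In the spider itself this becomes, roughly:
#   yield scrapy.FormRequest(url, formdata=formdata,
#                            headers=headers, callback=self.parse_table)
```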
If it is None, we can probably fix it. Let me copy everything from this page, Ctrl+L, yeah; when you paste it, there will be a first line which is not indented while the others are indented. Now let's look at the form. See, these __EVENTTARGET and __EVENTARGUMENT are None, None, so instead of None let's make them blank. The problem areas are these two, event target and event argument, so: event_target if event_target else blank, and the same for event_argument. Now let's run the spider once again and see if we have the error. We did not have the error, but this POST did not go through successfully, so I messed up something. Let me undo everything; I am actually not paying attention to whatever I did, I'm just running the previous version of the code. Now we have an error, and it's the same error, okay. So that change was required anyway. There is one more way to debug, and this is probably a good opportunity to talk about one more very useful tool: Postman. I used to use Postman a lot when I was working with APIs as a developer. It is typically used with APIs; whenever you are working with APIs, you use this tool. And what are APIs? Nothing but a fancy name for special server requests which return data in JSON format; that is probably one way you can define it. So what we can do is, in the developer toolbox, right-click the request you are sending and do a copy. Don't use the cmd cURL, that always creates problems; copy as cURL (bash), okay. Then go to Postman and click on the Import button. By the way, I was already playing around with Postman, so it looks a little different; for you it will look a little different too, because in the recent version they have made some changes, but eventually you probably need to create a new workspace, you have to work inside a workspace, and then you click on the Import button and go to Raw text
and paste everything, click Continue, then Import, and we have it here. It's supposed to be a POST request, let's see, this is the one, it's a POST request. Right-click, Copy as cURL (bash), Import, Raw text, Continue, Import. It should be POST, okay. In raw, let's look at the request, not the request headers; let's look at View source, form data. Okay, form data, copy everything, paste it here, and let's see if we get the response that we need. I'm doing a goof-up somewhere; actually, that's why it is usually better to pre-record, or rather to create the script in advance. So, FormRequest, aspx, this is the one. All right, let's do some debugging. First of all, we need to get the values right, here: @value. Now we should have values. And let's put one more thing here... or is there anything else that I'm missing? Let's clear everything and do a fresh search. Let's search for apple, organic apple, whatever. In the waterfall you can actually see how much time each request is taking, and this is the POST request which actually contains everything, so now we are trying to mimic this. We need to have __EVENTARGUMENT, __EVENTTARGET, __VIEWSTATE, and then we need a __VIEWSTATEGENERATOR, which we do not have here; that is again something important, without it the request will obviously break, so this code is actually not complete. __VIEWSTATEENCRYPTED, this we have. This apple we have. This one, do we have it or not? I don't think we have it, so I'll add it. Scroll, the __dnnVariable we have, and the request verification token, without which it will definitely break, we have it here. The header we have. Let's run this and see if we still have errors. The POST request... still we have a TypeError, okay, because we have two things blank here. So this I'm going to write as: view state if view state is not None, else blank, and I'm going to do this with everything that could be None, although for __VIEWSTATEENCRYPTED None is not really possible, and for the view state it is again not possible.
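The guard being typed above boils down to this pattern: any field whose selector returned None must go into the form as an empty string, because the request body cannot contain None values. A tiny sketch:

```python
# What .get() returned for the two empty hidden fields.
event_target = None
event_argument = None

form = {
    # the long form used in the video ...
    "__EVENTTARGET": event_target if event_target else "",
    "__EVENTARGUMENT": event_argument if event_argument else "",
}
# ... and the shorter spelling of the same thing:
form["__EVENTARGUMENT"] = event_argument or ""

print(form)
```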
Okay, so let's debug. How do we debug? Let's import a couple of things from scrapy. In fact, I wrote a run_spider method inside scraper-helper; I'm just going to copy and paste it here, and this spider name is aspx_spider. Okay, let me remove this timing thing, and settings, we don't need settings, so this we can skip. Either way... let's remove everything except the crawl and keep something very, very simple. So what I've done here is I've just imported CrawlerProcess, created a new instance, and called the crawl method. I'm going to put a breakpoint here, and a breakpoint here as well, just F9, and then I'm using the Visual Studio Code debugger, so just press F5; yes, treat it as a standard Python file. Let's see what we have. No, it just stopped without errors, because we did not even call this. Let's call this method, F5, and now we have stopped on the breakpoint. I'm going to press F10, and it stopped. What happened, did it even come here? This is so small that it is hard to read; close everything, start once more, we do face these kinds of problems. So we are on the parse method, and we jumped directly to parse_table; we did not go inside the parse method. Why did that happen? Okay, it was just compiling. Am I missing something? Okay, I did not have process.start(). Now we are going to keep an eye on the left side, and... okay, so why is view state a Selector? We did not call the get method. See, we did not call the get method, it was such a stupid thing to do; what we are getting here is a Selector object. All right, let's stop it. And it's showing me view state generator two times, view state generator, view state encrypted... okay, looks okay, so hopefully it works now. ReactorNotRestartable... the reason is that it's starting from two places. I'm going to comment this out, or we can write that if name is main, then only run this. So this Reactor
NotRestartable: if you see this error anywhere, that means there is something wrong and you are initiating the spider two times, or you are running two spiders; something is wrong in how the spider is started. Okay, so there is no error, but it did not run, so let's see. Uncomment... interesting, I was hoping for this to run smoothly today. Let's see, Ctrl+Shift+D, so that I can look at the local variables, and unfortunately this panel doesn't get any bigger. Collapse all these things, I don't want globals. All right, we will look at the form, what we have in it, that should give us some idea. So this form, what do we have here? There are many ways to look at the values; I like to use the debug console. This is what we have inside form: view state, event target, event argument, looks okay so far, and these are okay too. These errors are not related to this file. We have the header, the FormRequest, the form data, the headers... is there anything else that I'm missing? Okay, it is coming to parse_table, see, it is coming to parse_table, so that means I must have done some goof-up here. We have rows... no, we do not have rows, which means I did not create this selector properly. So let me change this selector. The request is going through; this selector I just copy-pasted, and there are some numbers in the id, so probably that is causing the problem. So, contains: instead of a full match on the id I am going to do a partial match here with contains, and close it, okay. And now I'm being very confident and running it blindly without debugging. Okay, ReactorNotRestartable again. In fact, this is how it should be: if __name__ equals "__main__", then only run this; then you will not have two instances of the spider running and you will not have this problem. All right, let's see. In fact, I remember someone asking me to show from scratch how I deal with all these problems. I was feeling confident, but my confidence failed.
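The fix for ReactorNotRestartable is the guard described above: start the crawl exactly once, and only when the file is run directly. A sketch with the Scrapy calls shown as comments, since the spider class here is just a stub:

```python
class AspxSpider:          # stand-in for the real scrapy.Spider subclass
    name = "aspx_spider"

def run_spider():
    # In the real script this body is:
    #   from scrapy.crawler import CrawlerProcess
    #   process = CrawlerProcess()
    #   process.crawl(AspxSpider)
    #   process.start()    # starts the Twisted reactor, once only
    return "started"

if __name__ == "__main__":
    # Without this guard, re-importing or re-executing the module would try
    # to start a second reactor, and Twisted raises ReactorNotRestartable.
    print(run_spider())
```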
What I'm going to do now is some more debugging, and if you want to exit the stream, that's okay. This time, not here... yeah, here, I'm going to use inspect_response to create the correct selector. This inspect_response is actually part of the Scrapy shell: from scrapy.shell import inspect_response, and that is what I'm using here. It takes the response as the first parameter and self as the second parameter; don't be confused, the second parameter is the instance of this particular class, which is self. Now when I run this spider, it will stop at that particular point and I will be able to look at the source code. So now I can look at the response and check what selector I have created and what the output is. Actually, we do not have any output here, we do not have a table, and that means there were probably some headers missing. In fact, let me take all the headers, the request headers, I'll take everything from here. All right, let's come back to the code, and in headers I'm going to call the scraper-helper function that converts the copied headers into a dictionary, and now the code is going to be tough to read because I'm going to paste in a lot of information, like that. What this is going to do is collect all the headers, and we are going to send all of them without thinking about which one is important and which one is not, and we'll see what happens. Exit and run again, and we'll see what the response is. I'm doing something stupid, because I actually created this spider earlier and verified that it was working as expected, and I'm just rewriting the code and now it is not working; that means I'm doing something stupid somewhere. Oh no, we do not have the result, we are doing something wrong somewhere else. I'm going to undo this headers change. In fact, what is the point of stretching this live stream?
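To recap the two debugging aids used above as a sketch (the table id below is made up; the real one just has auto-generated numbers in it):

```python
# inspect_response drops you into a Scrapy shell at the exact point the
# spider reached. In the spider it looks like:
#   from scrapy.shell import inspect_response
#   def parse_table(self, response):
#       inspect_response(response, self)   # opens a shell with this response
#
# Inside that shell you can try the partial-id selector that replaced the
# full match, e.g.:
#   response.xpath('//table[contains(@id, "grdResults")]//tr')
#
# contains() itself is just substring matching on the attribute value:
generated_id = "dnn_ctr123_View_grdResults_456"   # hypothetical ASP.NET id
print("grdResults" in generated_id)               # what contains(@id, ...) tests
```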
So what I'm going to do is end this stream, update this code, and find where I must have made a typo. The overall concept remains the same, there is nothing new beyond this. There is one more thing that I want to do before I end the stream: I've taken that window to the other monitor so that I can easily copy-paste all the form data, and I just want to make sure that I'm sending everything. So, __EVENTTARGET, I'm going to compare everything and make sure that I'm not making any typo. Event target is there, event argument is there, view state is definitely there, generator... I just want to make sure, maybe I'm missing something. Yeah, see, this __VIEWSTATEGENERATOR is very important; I had extracted it, but I did not add it to the form. By the way, what did I press? Alt+Shift and Down Arrow. So there we have it; I think this is definitely one possible cause of the error. Then __VIEWSTATEENCRYPTED, all right, I'm just going to quickly check all the values. This one I did not pass because it was blank, so let's send it anyway; and for apple, a comma is missing, so let's put the comma, and it looks like we have everything. Now let's format... oopsie, what just happened? All right, let's exit, clear, and run again, and see what we have this time. So this is the POST request and we have 200. Let's look at the response... still we do not have it, there is something stupid that I'm doing. Anyway, I'm going to end this stream now; I'll make all the fixes which are required and I'll post the link. That's all for now. I'll probably create a community post where you can give me your ideas for the next live stream topic. I'll try to do one more live stream tomorrow around the same time; I still need to think about the topic, but yes, we'll do something fun. That's all for today, see you tomorrow, have a great day.
Info
Channel: codeRECODE with Upendra
Views: 1,307
Rating: 4.909091 out of 5
Keywords: python web scraping tutorial, Python Web Scraping, selectors in scrapy, web scraping python, how to scrape data, browser scraping, scrape web pages, website scraping, python scraping, screen scraping, data scraping, Python Scrapy, web scrapping, CSS Selector, web scraping, web crawler, web spiders, webscraping, scrape, scraping, pandas web scraping table, pandas tutorial, web scraping with python, python projects for intermediate, python tutorial, python webscraping
Id: ATpE8T-nfY4
Length: 87min 4sec (5224 seconds)
Published: Thu Mar 11 2021