Extracting Info from Cookies - Dynamic Site with Python Scrapy

Video Statistics and Information

Captions
One of my students is scraping a site, this site, and he ran into a problem. I looked at it, found it very interesting, and thought it would make a video that does it justice.

So this is the page. You click on Search; you can enter a criteria or just leave it blank. This has to be replicated in Scrapy. Let's see how the traffic is flowing, how all the things are being passed. We need to press F12 and click on Search again. Now, a lot of things will not be replicated, because we had already opened the site before. If you really want to understand what is happening, you have to use an incognito tab. So I'm going to open an incognito tab and press F12. We need to make sure that Preserve log is checked, because we will be going through multiple pages, and that Disable cache is also checked. Let's open the site now; we are recording each and every network request.

OK, the page has loaded. I'm on XHR, so let me go to All. You can see that there are a lot of things going on here. Let's click on Browse; you can see that it's loading. The real data will be in one of the API calls, most probably under Fetch/XHR, so that is the first place I'm going to look, and I usually look at the largest response by size. It will not always have the data, but in this case let's have a look. Yes, this looks like it contains all the data that is required.

So let's go to Headers and see what we need to send in the headers. These are the request headers. The cookie will be handled by Scrapy, so we do not have to worry about that. What are the other things? There's this token, jwt. This JWT is the problem: how do we find it, and how do we send it? So the first thing that I usually do is
I'll copy this partially. Sometimes websites take two or three hidden fields and combine them while sending the request, so instead of looking for the entire text, I just take some logical part of it. Here I see a period, so I'm copying only this part. The first thing that I usually do is go to the page source. So I'm simply opening the source and enabling line wrap, and we can see that there is nothing much here, but still let's do a find for this text. Not there.

Now, instead of wasting a lot of time, I'll show you exactly how I found the source of this. We want to search through all the requests from beginning to end and find the first occurrence of this particular text. Because we recorded all the traffic right from the start, we have all the requests here. Make sure that you click on the Network tab and press Ctrl+F or Command+F; don't search anywhere else. This section on the left side should open. This is where you paste in the text and press Enter to search. It will show you all the requests where this text appears.

There are a lot of places. Whenever I click on one of these results, the matching request is highlighted. Unfortunately the results are not in any specific order that I can figure out, at least not exactly in waterfall order, so we need to find the request that was the first one. I've already spent some time on it, so let me show you where it came the first time. Basically you just have to go through all of them, and some of them are collapsed, so you'll have to expand them. We can see that this jwt was passed in the cookie here. So let's go through all of these, and I know that it's this one, the request to the login, and
if you scroll up, we can see that this is pretty early. So this is the first request where this text appeared, and it appeared in the Set-Cookie of the response. Here we can see that this was in the response headers, and the jwt was received in Set-Cookie. So whenever this URL was called, we received this token in the cookie. What we have to do now is replicate the same thing in Scrapy.

There is one more thing to note here: the status code was 302. Whenever there is a 302, it means that this response is going to redirect to some other URL. Let me show you exactly what I did. First of all, I right-click, Copy, and Copy as cURL. If you're on Windows, use Copy as cURL (bash). Now I'm going to open Postman. Postman is a very useful tool for API testing, and right now we are going to play at the API level, which is why this tool is useful.

This is some old request which is opening up. So I click here on Collections, click on Import, go to Raw text, and paste in this curl request. This curl request contains everything: the URL and all the headers, including the cookie. If we are making this as the first request, we will not have a cookie, so that is something we need to check, but for the time being we'll continue as it is. Let's close this, click on Send, and see what we have in the response. We can see that in the response we have two cookies, and one of them is jwt. So now we know that if we want to get this jwt, we have to send a request to this URL. Now, if this is going to be the first request, we will not have a cookie. So come to the headers here and uncheck this Cookie header.
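As a rough illustration of what that login response carries, here is a minimal sketch using only Python's standard library. It assumes the token arrives as a cookie named jwt in a Set-Cookie response header, as seen in the browser. The header values and the token are made-up placeholders, not taken from the site in the video, and find_cookie is a hypothetical helper name.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header values, shaped like what a login URL
# might return together with a 302 redirect. The token is a placeholder.
set_cookie_headers = [
    "JSESSIONID=abc123; Path=/; HttpOnly",
    "jwt=aaaa.bbbb.cccc; Path=/; Secure",
]


def find_cookie(headers, name):
    """Return the value of the named cookie from Set-Cookie headers, or None."""
    for header in headers:
        jar = SimpleCookie()
        jar.load(header)  # parses "name=value; attr=...; flag"
        if name in jar:
            return jar[name].value
    return None


token = find_cookie(set_cookie_headers, "jwt")
print(token)
```

Scrapy's built-in cookie handling does this bookkeeping on its own, so a sketch like this is only useful for understanding or checking what the server sent.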
This is going to be similar to what happens if we don't send a cookie. Let's send the request again. This time also we are getting the jwt in the cookie, which means the cookie is not going to be a problem. So this can easily be our first request, which will get us the jwt.

The next step: I'm going to click on this code button. Python - Requests is already selected, so what I have now is code that makes this request. Let's generate a Scrapy spider, paste in this code, and start from there. Let's open the terminal. I'm just going to the desktop, and let's create a spider: scrapy genspider, let's call it drs, and the fourth parameter as dummy. Let's open this drs.py file. I'm going to copy certain things from here. This URL I can copy. allowed_domains is not required. In start_urls I'll have this URL.

You know what, I will not send any headers at all, and let's see what the response is. I want to see what I get in the parse method. How do I do that? For that, from scrapy.shell I'm going to import inspect_response. In the parse method I'm going to call this method. The response is the first parameter, and the spider, which is an instance of this class and so is always going to be self, is the second parameter. That's all I'm going to do. What will happen is this URL will be called, the default Scrapy headers will be sent, and let's see what we get in the response.

Before I move on: because we know that we are working with cookies, I am going to enable this in custom_settings. Cookies are enabled by default, so this COOKIES_ENABLED = True is not required, but I'm showing you just so that you know that this
option is there. The other option is COOKIES_DEBUG. By default this is set to False, so we are going to set it to True. The result is that every time a cookie is received or sent, it will be printed on the console. This will give us a good idea of what is happening with the cookies.

So let's go to the terminal and run scrapy runspider drs.py. There is going to be a lot of text; I'll have to make things a little bit smaller. Now let's scroll up all the way. This is where we started the spider. I will show you everything; whenever you are going deeper into Scrapy you should be reading everything, so that you always have an idea of what is happening.

So Scrapy started. The versions of lxml and all the other libraries being used are here, for example Python 3.9.7. Then a Twisted reactor is being used, and this is what actually makes Scrapy very fast. Now it is saying that it is overriding settings. What are the settings being overridden? COOKIES_DEBUG = True. You can see that COOKIES_ENABLED is not shown here, because that is not an override; it is True by default. And there is SPIDER_LOADER_WARN_ONLY, which is additional.

Then the enabled extensions, such as telnet, which is for debugging, and the middlewares. All these middlewares are being enabled. If you are working with custom middlewares or pipelines, this area will be useful. So these are the extensions and these are the middlewares; you'll have to deal with middlewares and extensions only if you are doing some customization. Here we can see that a cookies middleware is there. It is built in, so we do not have to worry about it, but if there is some custom handling of cookies, then we'll have to
write this cookies middleware. Anyway, let's move on.

Let me show you what is happening here. I need to make it a little bit smaller; unless you are watching this video on a bigger screen you will have a bit of a problem reading it, but still I will try. Here we can see this message that a cookie was received from the 302 response. And what is this cookie? It contains jwt. So we got the cookie, which is good news. But this was a 302; 301 and 302 are redirects. So now it is redirecting to another URL, and it resulted in this one. The good news is that the cookie was sent: you can see that this one was sending the cookie. Earlier we were receiving the cookie, and here we are sending the cookie, and this was a 200. So this is where our spider is right now.

If you look at response.headers, these are the headers which were received in the response. This was a special case; typically, if it was not a 302 when we received the cookie, you would see that one of the entries here is about cookies. But it's not there, because the cookie was received one request earlier. So we need to check the request. If you look at request.headers, we can see that the cookie is here. This request.headers probably came in one of the newer versions; in older versions we had to use response.request.headers (request, singular, not plural). Here we can also see the user agent which was actually sent, just as extra information.

Now we can extract the cookie. How do we extract it? Note that there is a b here; b is for bytes, so this is not a string.
request.headers is a dictionary, and the key we want is Cookie. We can index it directly, but that may throw an exception, so it's better to call the get method and pass in Cookie. There we have it, we got the cookie. This cookie is a byte string, because we have a b here, but we can still use the string methods. So we can split based on this separator. Right now, if I call split with a plain string, it will not work, because this is bytes and we are passing a str: "a bytes-like object is required, not 'str'". One way is to just put a b in front of the separator, and it will work. Now we have a list, and from the list we can take the last item. Then we can decode it; the default decode will work, or you can specify the encoding. This is how you get your cookie. And of course we can then strip it to remove the empty space.

So now you know how to get the complete cookie. Copy this, go to your Visual Studio, and now you know that here you will have this JWT string. You still need to take care of removing this part, but you can simply do that with .replace, replacing it with a blank. And there you have it.

I shared this because it was a specific scenario, but it gives you some idea about how cookies work and how you can use this Network find window. I hope you found this useful. All the best.
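The extraction steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the cookie name jwt, the sample header value, and the helper name extract_jwt are made up, a plain dict stands in for request.headers, and the cookies are assumed to be separated by semicolons as in a standard Cookie header. Instead of taking the last list item and calling replace, this variant looks the cookie up by name, so it does not depend on the order of the cookies.

```python
def extract_jwt(cookie_header, name="jwt"):
    """Pull one cookie's value out of a raw Cookie request header (bytes)."""
    if cookie_header is None:
        return None
    # The header is bytes, so the separator must be bytes as well;
    # cookie_header.split(";") would raise TypeError.
    for part in cookie_header.split(b";"):
        text = part.decode().strip()  # bytes -> str, drop surrounding spaces
        key, sep, value = text.partition("=")
        if sep and key == name:
            return value
    return None


# Hypothetical stand-in for request.headers; the values are placeholders.
headers = {"Cookie": b"JSESSIONID=abc123; jwt=aaaa.bbbb.cccc"}

# .get() returns None instead of raising if the header is missing.
token = extract_jwt(headers.get("Cookie"))
print(token)
```

Inside a spider's parse method, the same helper could presumably be fed response.request.headers.get("Cookie"), which is the lookup shown in the video.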
Info
Channel: codeRECODE with Upendra
Views: 200
Rating: 5 out of 5
Keywords: python web scraping tutorial, python scrapy tutorial, Python Web Scraping, selectors in scrapy, web scraping python, how to scrape data, browser scraping, scrape web pages, website scraping, python scraping, scrapy tutorial, screen scraping, data scraping, Python Scrapy, Scrapy Spider, web scrapping, CSS Selector, scrapy shell, web scraping, web crawler, webscraping, scrape, scraping, scrapy, python web scraping, cookies scrapy, web scraping with python
Id: Tk5fPldIow0
Length: 18min 1sec (1081 seconds)
Published: Mon Oct 04 2021