Multisite Scraping - Passing Data from One Site to the Next

Captions
So today what we are going to do is scrape some information from one site, take that information to the next site, and create one single output. For example, let's take this book: we'll collect the name, then go to the next site, search for that book, and collect the information from there. Let's get started.

I do not want to focus a lot on the scraping part, so I'm going to copy-paste some of the code here. So: scrapy genspider multi_site, and the fourth parameter is going to be the domain, which also fills in the start URL, so I'm going to skip that and just put an x. Let's open this up in Visual Studio Code. As you can see, allowed_domains contains x and start_urls is http://x/. This always creates a problem, so there is absolutely no point in giving the fourth argument anything realistic; you have to come in and edit it anyway.

Let's start with Books to Scrape, so I'm going to paste that URL as our start URL. I'm going to comment out allowed_domains for now; I'll show you later how it can be useful. If we open this page, I've already created one XPath, so I'll paste it in and show you what it is doing. Press Ctrl+F in the developer tools: this XPath simply looks for all h3 elements and, inside each h3, an a tag, and it selects all of them. You can press Enter to move forward through the matches or Shift+Enter to move backward, so this way you can quickly verify that your selector is working correctly.

In the parse method I'm going to do something simple. I'm getting all the links and storing them in a variable, then running a loop over those links. I'm calling response.urljoin because the links we get will be relative, and we want to convert them into absolute URLs. This is the shortcut and it works very well. Please do not ever use plus; do not use string concatenation or any kind of slicing. Just use response.urljoin: it will take care of all the scenarios, and I have not seen it fail. Then we send the result to the parse_book method, which is going to run for each book.

This is actually a very simple site, not a dynamic one, so we can quickly look at the markup and see which XPath or CSS selector is required. The title is an h1, and if I press Ctrl+F and type //h1, you can see there is exactly one h1, so it's not complicated at all. Let's take the price as well. The price is not that simple, because if I just write .price_color (now I'm creating a CSS selector), you can see there are more matches than we hoped for: seven elements are selected. So we need to verify and add one more layer. If we scroll up, there is this product_main class; that will work. So what we can do is look for product_main and, inside that, look for price_color.
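Here is a minimal sketch of the spider as described up to this point. The class and variable names are my own choice; the selectors are the ones shown in the video:

```python
import scrapy


class MultiSiteSpider(scrapy.Spider):
    name = "multi_site"
    # allowed_domains is commented out for now, as in the video
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        # //h3/a selects the link inside every book's h3 heading
        links = response.xpath("//h3/a/@href").getall()
        for link in links:
            # urljoin turns each relative link into an absolute URL
            yield scrapy.Request(response.urljoin(link), callback=self.parse_book)

    def parse_book(self, response):
        # exactly one h1 on the product page holds the title
        book_name = response.xpath("//h1/text()").get()
        # scope .price_color with .product_main to avoid the other matches
        price = response.css(".product_main .price_color::text").get()
```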
So note that there is a space in that selector: the space selects all descendants, so children, grandchildren, and so on.

Answering some questions from the chat: Shivraj, Twitter I am not planning to cover, because they have an API and we are supposed to use that API. Senior Coder asks how to add data to a database; okay, I'll make something using Scrapy and Selenium. These are actually different topics, so I'll probably cover them in different videos. Databases I think I've already covered in one of the videos; anyway, I'll double-check, and if there is something interesting we can put results in a database. We'll probably start with SQLite, which is built in, and then we can move on to the fancier databases. On Selenium, I've created some videos in isolation, not including Scrapy, but there is a way to combine Scrapy and Selenium, so we can probably do that. But guys, thank you so much for your suggestions; the ideas are very important.

Anyway, coming back to the CSS selector (a CSS selector video is another one that is lined up): here I'm looking for the product_main class and then for all price_color elements. If I just put a space, it will look for all price_color elements, whether they are direct children or nested multiple levels deep; it will take care of it. If you put a greater-than sign here instead, it will only look for direct children.
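A quick illustration of the difference between the two combinators, as it might look in the Scrapy shell (on this particular page the price happens to be a direct child of product_main, so both forms match here):

```python
# descendant combinator (space): matches price_color at any nesting depth
response.css(".product_main .price_color::text").get()

# child combinator (>): matches only direct children of .product_main
response.css(".product_main > .price_color::text").get()
```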
More from the chat: Twitter I'm not going to cover; it's better not to talk about Twitter and Facebook and all those social media sites. They have very strict policies against scraping, and that's an area I don't enter. A rule of thumb: if a website already has a publicly accessible API, then it means scraping is not allowed. Databases, yes, I will cover; of course, if something changes in the future, we don't know. Have I covered Splash in detail? There is not much to talk about with Splash. By the way, I'm going to launch a new course very soon. If you're already registered in my free course, you will see a mail; I'm going to make some changes to the course, and I'll send out the details by mail. If you are not on the mailing list, just join the free course and you will be on it. And it's all up to you; I will continue to create quality videos on YouTube, so if you don't want to take any course, that's fine. The only difference between the course and YouTube is that on YouTube I just pick random topics and cover a few things, whereas the course has a flow and you can go back and forth and see what is there. I never force anyone to purchase any course, and I never will.

All right, coming back to our selectors: we have the selector for the price and for the book name, so now I'm going to copy some code here and create this parse_book method. Now we have the book name and the price, and once we have this information we are ready to send it to our next method, which is going to call LibriVox. Let's look at the first link on LibriVox and copy the link address. I'm going to come here, and at the top, just to keep things simple, let's write something like libri_url and paste in this URL. Now let's look for the part that actually changes: if you look carefully, you will see that whatever book you search for goes into the q parameter, so q equals the book name. What we can do is create a Scrapy request using this URL and just change the name, so I'm going to delete the book name from the pasted URL so that I can build the URL dynamically. (And yes, we'll talk about proxies and Splash; I love all these ideas.)

Now, say we want to add "The Secret Garden" to this string. There are two ways. Number one is to use plus, but this is not a good way. The better way is to use curly braces: we can create the URL as libri_url.format(book_name). This is actually safer; if you have an integer, it will be converted into a string, there are a lot of things it can handle, and it holds up if someone is trying to mess with your data. So this is the recommended practice for building the string safely.

Okay, so now we have the new URL and we have the book name; how do we create the request? The first part is simple: yield scrapy.Request, and the URL is going to be this one. Now there is one more thing we want to make sure of: whether there are specific headers. Remember that when we were looking at this request, it was an XHR, and what we got in return was JSON. That is not the normal scenario, which means there are certain headers being sent. So let's look at the headers. Cookies are okay, we don't have to worry about those. The user agent is one thing, and this one is important: X-Requested-With: XMLHttpRequest. Only if this header is sent will the server know that this is an AJAX request (an XHR request and an AJAX request are the same thing) and return the JSON response. So this header is important; let's set it.

What parameters can Request accept? It accepts the url; it can accept headers, which is a dictionary; it can accept a callback. You can see all the options in the tooltip. It can receive method, which is set to GET by default, so we don't have to touch that. Headers default to None, so we will supply them. Also take note of this keyword, cb_kwargs; this is something we are going to use. So let's put in the headers (I'm doing some copy-pasting here, but so far it should not be complicated), and there will of course be one callback method: callback is going to be self.parse_audiobook (the LibriVox site is for audiobooks, so let's call it parse_audiobook). Then let's create this parse_audiobook method with self and response, and for now I'm just going to say pass.
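A sketch of the request as described. The LibriVox query string is long, so everything except the q parameter is elided to a placeholder here:

```python
# only the q parameter changes between searches; the rest of the
# query string is elided with "..." in this sketch
LIBRI_URL = "https://librivox.org/advanced_search?...&q={}"

def parse_book(self, response):
    book_name = response.xpath("//h1/text()").get()
    price = response.css(".product_main .price_color::text").get()
    yield scrapy.Request(
        LIBRI_URL.format(book_name),
        # this header tells the server it is an XHR/AJAX request,
        # so it responds with JSON instead of a full page
        headers={"X-Requested-With": "XMLHttpRequest"},
        callback=self.parse_audiobook,
    )

def parse_audiobook(self, response):
    pass
```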
Now at this point we can take two directions: either we look at this method and see how the response is available, or we focus on the most important topic for today, which is passing the information to the next method. Here you have two options. One is the traditional option, which is no longer recommended, but if you look at any old code, that is what you will see. So let me show you the older approach first.

The older approach is to use meta. You just create a dictionary with all the information you want to pass: you want to pass the book name, so here is your book name; you want to pass the price, so you just create a key and pass in the actual value. This variable can be named anything; it doesn't matter. What you need to make sure of is that when you create the request, you send meta=meta_values, or whatever name you have chosen. So what we have done is send the information from this method through the meta keyword.

And how do you read it in the next method? No change to the signature is required; this meta is available as a dictionary in all callback methods. Meta is used for a lot of things, but here we are using it to carry our values, so it's response.meta (did I spell it right? yes), and since it's a dictionary we can read values with .get, for example response.meta.get with our book name.

If you want to look at what is in there, let's make use of inspect_response: we have to pass in the response and then self, in that order. This inspect_response function has to be imported, and it lives inside scrapy.shell, so: from scrapy.shell import inspect_response. The response should be available there now. Even if I have made a typo, this is one thing I don't really like about Visual Studio Code: the errors reported by the linter take time to show up. PyCharm is superior that way; PyCharm gives you errors very quickly.

Anyway, let's run the spider: scrapy runspider multi_site.py, and let's see what we get. What it is doing right now is going through all the books, and for every book it has created the request. We have a response with status 200, so let's look at response.meta. It has some information: you can see the book name and price that we sent, but meta has some other information as well; this is additional information that Scrapy keeps inside meta. If we use response.meta.get("book_name"), or the square-bracket convention, we get our value either way, since it's just a dictionary.

But as you can see, meta contains more than what we sent, which means meta is used internally for a lot of things. What if we had a custom variable named download_timeout? We would end up overwriting it, or it would overwrite ours; I don't know which one would actually win, but the lesson is that meta is used internally for a lot of things, so it is probably not the best place for our data.
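For reference, the older meta approach sketched in code; the key names are the ones used above:

```python
from scrapy.shell import inspect_response

def parse_book(self, response):
    meta_values = {
        "book_name": response.xpath("//h1/text()").get(),
        "price": response.css(".product_main .price_color::text").get(),
    }
    yield scrapy.Request(
        LIBRI_URL.format(meta_values["book_name"]),
        headers={"X-Requested-With": "XMLHttpRequest"},
        callback=self.parse_audiobook,
        meta=meta_values,  # older approach: attach data to the request
    )

def parse_audiobook(self, response):
    # our keys arrive alongside Scrapy's internal ones
    # (download_timeout, depth, ...), which is why collisions can happen
    book_name = response.meta.get("book_name")
    inspect_response(response, self)  # drop into a shell to explore
```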
In one of the newer versions of Scrapy (I don't remember exactly which version it was, 1.8 or after), something new was introduced: cb_kwargs, the callback keyword arguments. It's the same thing; instead of meta, you send your dictionary to cb_kwargs. But what changes? Let me show you by executing it.

First, let's limit this loop; if I just let it run, it will keep going for a long time. If I'm being too fast: I just used slicing on the list of links so that I get only two results, and in fact, let's do only one to start with. We will come back to the inspect_response call and examine things there. Now let's see what happens if I run this spider: scrapy runspider multi_site.py. You saw that we have an error here, a TypeError, and if we scroll up, what is it? parse_audiobook() got an unexpected keyword argument 'book_name'. Our parsing code did not even execute; the error happens as soon as Scrapy tries to call the callback.

Why is this happening? To understand it, we can have a quick look at the Spider source code. I'm going to hold Ctrl and click through to the Spider class, and what I'm looking for is the parse method: it has self, response, and then **kwargs. But we wrote parse_audiobook with only self and response; we have not declared those additional keyword arguments. What this basically means is that the parse method, or any callback method, can take multiple keyword arguments. (I hope you are already comfortable with the idea of callbacks: methods in Scrapy do not execute one by one; they are called when the response is available, not in sequence, and that is actually the power of Scrapy.) Whenever we use cb_kwargs, every key of that dictionary has to be provided in the callback's signature. The signature itself changes, because we are sending additional keyword arguments. Since we are sending book_name and price, the callback needs book_name and price parameters; if we changed the key name to book, the parameter would have to change to book as well.

Now we can simply print the book name and price. Back in the terminal: we stop at inspect_response, but we did a print just before that, and "The Secret Garden" and 15.08 are printed. There is no point executing this spider further for now, so I'm just going to exit. What is important right now is the concept: if you have to pass multiple values, you can write them in this kind of structure and send them as book and price and so on.

And for those of you who know how to use items, if you're working with a proper Scrapy structure: instead of separate values, you can send the complete item through cb_kwargs, for example as item=item. You can send the entire dictionary, or the entire object of whatever you are working with, in one go, and then the item is available in the callback.
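The same hand-off with cb_kwargs, sketched out; note the changed callback signature, plus a comment showing how a whole item could be passed in one go:

```python
def parse_book(self, response):
    book_name = response.xpath("//h1/text()").get()
    price = response.css(".product_main .price_color::text").get()
    yield scrapy.Request(
        LIBRI_URL.format(book_name),
        headers={"X-Requested-With": "XMLHttpRequest"},
        callback=self.parse_audiobook,
        # each key must appear as a keyword argument in the callback
        cb_kwargs={"book_name": book_name, "price": price},
        # alternatively, pass one object holding everything:
        # cb_kwargs={"item": {"book_name": book_name, "price": price}},
    )

# the signature must name every cb_kwargs key, otherwise Scrapy raises
# TypeError: got an unexpected keyword argument
def parse_audiobook(self, response, book_name, price):
    print(book_name, price)
```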
If you pass the whole item that way, then in the callback you can do item.get("book_name"), the same way we were doing with meta, because the item is just a dictionary. So it depends on how complex your project is: sometimes there is a lot of data that needs to be passed from one callback method to another. In this case we had only two values, so we created keyword arguments; if you have a lot of data to pass, you can create a dictionary or a specific item class and send the whole object in one go. Both ways are valid; choose depending on the complexity.

Keep meta reserved for, well, reserved things. For example, if you want to use proxies (say Scraper API; I've already created one video on Scraper API), you just need to pass a meta value with the proxy name, and the proxy middleware built into Scrapy will take care of it. I'm probably going off track, but the most important thing to understand here is: don't use meta for sending your custom values; use cb_kwargs.

What I'm going to do now is copy-paste a little bit of code, because that's the point of this session. We know that what we will get in the response is a JSON object, so we can simply call response.json(). Earlier we had to use response.text, import the json module, and convert it into a JSON object ourselves, but in a newer version (Scrapy 2.3, if I remember correctly) they introduced .json() built into the response: if the response is JSON, it will directly convert the data into a JSON object. By the way, a JSON object is nothing special; it will be either a dictionary or a list, depending on what is inside the data. So now it is not a string.

If we go back to the LibriVox site, we can see that the key which contains the actual data is "results" (I'm skipping the error handling for now). So data["results"] will contain the actual markup, but this markup is a string. Here comes another twist: how do we select from this markup? Now, this is where I say that learning Beautiful Soup is useless unless you have to modify other people's code; absolutely useless, because while in Beautiful Soup you can create selectors from strings, you can do that using Scrapy as well. In case you are not getting my point, let me show you what I mean.

Let's put inspect_response back, with response and then self, open the terminal, clear everything, and run the spider again; give it a second. The screen went blank for a moment, but now we are back. If I look at response.json() (not "jones" or "jason"), what we have is a dictionary, and here I can simply look up "results", and we have all the results. We can spend some time looking at this, but basically we can see that it is partial HTML containing li elements: all the list items that are displayed on the site. This is what we have available.

What we can do is very simple, actually: we can import the Selector class, with a capital S, as in from scrapy.selector import Selector. We can create an instance of this Selector, and it can take the text directly.
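What that looks like in code; a minimal sketch, assuming the JSON shape described above:

```python
from scrapy.selector import Selector

def parse_audiobook(self, response, book_name, price):
    data = response.json()          # a dict (or list) parsed from the body
    results_html = data["results"]  # a string of partial HTML (<li> items)
    s = Selector(text=results_html) # now .css() and .xpath() are available
    items = s.xpath("//li")         # select the list items to fine-tune later
```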
So if we open our terminal: if we look at the type of data["results"], it's a string, and what we are doing here is converting this string into Scrapy's Selector object. We can save it in any variable name, and then we have s.css and s.xpath and all of those methods available.

What I typically do is save this output, so let's do the same thing on the console, and I'll show you my way of making sure I'm creating the correct selectors. Here data is not defined yet, so: data = response.json(). Now if we look at s, it's a Selector, and if we look at s.get(), you can see that it has surrounded the content in html and body tags. That doesn't matter; we already have li class="catalog-result" and all of that in there. So we can use s.xpath to look for all the li items, and just like that, all the li items are returned; then we can fine-tune.

If you want a proper structured look, you can take this data, which contains the actual data (if you look at the type of data, it's a dictionary), import json, and call json.dumps (dump-s for string), which converts the dictionary into a string. Then we can open a temporary file: with open("libri.html", "w") as f (any file name, it doesn't matter), and what do we want to write? f.write(json.dumps(data)). If everything is okay, we now have this libri.html, and we can open it with code libri.html, format it, and look at the exact structure. Right now that gives a very zoomed-out kind of view, because I took the whole thing. What you can do instead is be very specific: from data, select just the part which contains the HTML, which is "results", and write that. Close the file and open it again, and now we have very clean HTML. Once we look at this HTML, we can create our selectors.

We should also look at what happens when results are not found. That is simple to test: instead of "secret garden", let's search for something we know will return no results. See: "No results found". Clear everything and run it again, and let's look at the response. Okay, actually I need to trigger it differently: when I refreshed, it sent a full request, and what I want is just the XMLHttpRequest, the asynchronous one. Now we can see that the status is "success", but inside "results" we don't have HTML; what we have is "No results found".
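As a recap of this inspection workflow, a small sketch; the file name and the no-results marker are the ones from the walkthrough above:

```python
import json

data = response.json()

# first attempt: dump the whole JSON payload as a string
with open("libri.html", "w") as f:
    f.write(json.dumps(data))

# better: write only the HTML fragment under "results",
# which opens as clean, formattable HTML in the editor
with open("libri.html", "w") as f:
    f.write(data["results"])

# for a query with no matches, the fragment is not HTML at all,
# just a plain message
if "No results found" in data["results"]:
    print("no audiobooks for this title")
```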
Now I'm going to copy-paste the code which I wrote earlier and walk you through it. In this code you will see that I've done a lot of error handling, for every possible kind of scenario. Here I have the parse_audiobook function. I've taken the data, and the first thing I'm checking is whether the status key contains the word "success". If it is not there, that means there is an error, and I directly yield the book and price with the audiobook availability set to "error". If it is a success, there are a few more checks, but let's not go into those; let's come directly to the case where results are available. Here I'm creating a Selector, and I've created one dictionary where I have the book and the price, which I got from the previous method, and in availability I have the value "available". I've also created an empty list, and I'm filling this list using the response we got from this particular page: these are my catalog results, and from each one I take the book and the download button span, which is actually the file size. All of that information is available here.

Let's do a quick run, and let me show you in what ways you can use this information. There are many, many different ways to store and output it; it completely depends on what you're looking for. I just want to make sure I'm not making any errors, and then I'm going to send this output to a JSON file: clear everything, run again, and send the output to a file; let's call it multi_site.json. Why am I showing you JSON? Because this is a good case for it: there is an indefinite number of audiobooks available for each book.

Let's see what is inside that JSON and format it. This is my output; I just came up with a hypothetical scenario. This particular book, The Secret Garden, is available for 15.08, audiobooks are available, and these are the details of all those audiobooks. The "meta" field here is actually the tags on the LibriVox site, this is the download size, and these are all the options. For example, this book doesn't have any audiobook, this one doesn't either, and this one has a few available audiobooks with some options, not all of them. You can see that our code is completely dynamic, and it is able to get whatever information we want it to show.

So that's all I wanted to show you today: the importance of cb_kwargs, the most important point, and how you can pass values using it; and the other thing I showed you is how you can create Scrapy selectors from text and extract information that way. That's all for today, guys. I will come up with some interesting video next time; till then, have a nice day. Bye!
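To recap the session in code, here is a hedged sketch of what the final callback might look like. The status check, the "No results found" marker, and the selectors inside the results fragment are assumptions reconstructed from the walkthrough above, not the exact code from the video:

```python
from scrapy.selector import Selector


def parse_audiobook(self, response, book_name, price):
    data = response.json()
    item = {"book": book_name, "price": price}

    # assumed error signal, per the walkthrough: the status key
    # should contain the word "success"
    if "success" not in str(data.get("status", "")):
        item["availability"] = "error"
        yield item
        return

    results = data["results"]
    if "No results found" in results:
        item["availability"] = "not available"
        yield item
        return

    item["availability"] = "available"
    audiobooks = []
    for li in Selector(text=results).xpath(
        '//li[contains(@class, "catalog-result")]'
    ):
        audiobooks.append({
            # hypothetical selectors for the title and the download
            # size (the "download button span" mentioned above)
            "title": li.xpath(".//h3//a/text()").get(),
            "size": li.xpath(
                './/span[contains(@class, "download")]//text()'
            ).get(),
        })
    item["audiobooks"] = audiobooks
    yield item
```

Running the spider with scrapy runspider multi_site.py -o multi_site.json would collect the combined output in a JSON file, as shown in the video.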
Info
Channel: codeRECODE with Upendra
Views: 532
Rating: 5 out of 5
Id: i7nZBtDnYAo
Length: 43min 22sec (2602 seconds)
Published: Wed Mar 24 2021