Web Scraping from Bad HTML: Python Scrapy and XPATH Magic

Captions
Welcome to another practical example of web scraping. Today we are going to scrape a site where the data is not very well structured, so there are different kinds of challenges that you will face. Let's look at our example. This is the site we are going to scrape today — in fact, it's the only site we are going to scrape — and it's a pretty simple website. There is a listing — a long list with no pagination — and whenever you click on any listing, you go to a new page. What you have to do is get the name, the address, the phone number, the email, and the website; this is the data that person was looking for, so that's what we are going to extract. The problem is that when we right-click and Inspect, you can see that all of this text is inside one p tag, separated by some br tags, and there are newline characters in all those pieces. So today we are going to write a script to handle that — let's try to do it.

I always like to start with the shell, because when you work in the Scrapy shell you are not overloading the website. This is essential, because the first and only rule of web scraping is: do not harm the website. So let's open the shell. By the way, whenever I say "shell", I like to work in Windows Terminal, so let me fire that up. I've got it here; I'm going full screen. Let me clear everything and go to the temp directory — I have some shortcuts here, and I'm going to talk about them in a later video, but today let's stick to the website we are going to scrape.

All right, so this is the website, and the first thing we are going to do is take this first page, where all the items are listed, and open the shell. There are two ways to launch it. We can write scrapy shell and then pass in the URL in double quotes — they're not essential, but we can do it that way.
The other way is to open the Scrapy shell without any argument, and once everything is loaded, use the fetch command. In fetch we can pass the URL directly, or we can create a scrapy.Request object and pass that in. For now everything looks okay, so let's look at this site and the structure of the listing. Go to Inspect, and we can see that the id of this div is "mylist". That makes things very easy: inside the div there are simple, straightforward anchor tags, so this is a very simple listing and it should just work.

I'm going to clear everything and write a simple selector, starting with CSS. Remember, that was an id, and whenever you see an id you start your selector with a hash. You can choose whether to write the tag name before the hash or skip it — it doesn't matter. Then we look for all the anchor tags inside. Press Enter, and we have a huge list. Let's clear everything, and this time pull out the href attribute. The syntax reads: find the div with this id, find all the anchor tags inside it, and from those anchor tags take the attribute href. Call the getall method, and we have a huge list — we got all the link addresses.

The problem is that these are not absolute URLs, so we'll have to convert them. We can simply call response.urljoin. Let me take only the first one — in fact, instead of getall let's use get, so we get only the first result — and surround it with response.urljoin. Whenever we do this, we get an absolute URL, and the good thing is that if it is already an absolute URL, it will not mess it up.
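Scrapy's response.urljoin essentially delegates to the standard library's urljoin, so its behavior can be sketched in plain Python (the URLs below are hypothetical stand-ins for the listing site):

```python
from urllib.parse import urljoin

# Hypothetical current-page URL, standing in for response.url.
base = "https://www.example.com/companies/"

# A relative href gets resolved against the current page...
print(urljoin(base, "13-honeys-pte-ltd"))
# -> https://www.example.com/companies/13-honeys-pte-ltd

# ...and an already-absolute URL passes through untouched.
print(urljoin(base, "https://other.example.org/page"))
# -> https://other.example.org/page
```

This is why it is safer than hand-rolled string concatenation: both relative and absolute inputs come out correct.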
It's a very safe function — much better than doing string manipulation, adding and removing pieces yourself. Now I'm going to take the same expression and pass it to the fetch method. See, effectively we are executing all the commands from the shell, and there are two benefits: the first is that we are not overloading the server, because we are sending just one request at a time; the second is that we are writing all the code right here, and later it will be very easy to copy-paste it into a file.

Sometimes I get this question: why not use Jupyter notebooks? With Scrapy you will use a standalone spider in some cases, but mostly you will be working with projects, and when you work with projects you have multiple files to manage. In that case a Jupyter notebook is not going to help; you will need an IDE. You can use VS Code, Sublime, or PyCharm — whatever you like, it doesn't matter, as long as you are comfortable with it.

So now this page has been fetched. Let's confirm the URL: if we call response.url, we get the exact page this response points to. We can copy it and paste it into the browser — we are on this "13 Honeys" (Singapore Private Limited, I guess) page. (I just took a moment to check whether the stream is okay — hello, Ram! The stream looks okay; if at any point you cannot hear me or see the screen, just let me know.) Anyway, coming back: the second thing I wanted to show is that instead of response.url, you can call view(response). What you see then is exactly what Scrapy got. It opened on my other monitor, but if you look at the URL, you can see it points somewhere under C:\Users\…\AppData\Local\Temp. Basically, Scrapy created a temporary file in my temp folder and is showing exactly the same page. What you see there is exactly what Scrapy is looking at — excluding images and JavaScript, which don't count.

All right, now let's build the selectors. Right-click and Inspect (Ctrl+Shift+I). We are on Firefox; I'm going to switch to Chrome — it doesn't actually matter, but most people use Chrome. On a side note: if you ever see that a response is empty, always cross-check in Firefox, because I have seen multiple instances where Chrome's developer tools show there was no response while Firefox shows there is a small response — maybe just "OK" or something like that. (Hello, Rashidul!)

So let's come back here: right-click, Inspect. We can see that the name is inside this strong tag, and the class is h6. The first thing we can do is press Ctrl+F right here — but note where the focus is: if I click on the page and press Ctrl+F, it opens the browser's find window, which searches the page itself, and that's not what we want. Click on the Elements panel first, then press Ctrl+F, and the search box opens inside the developer tools. Now let's search for h6 directly. We get two results: the first is the opening tag and the second is the closing one. Here we can use CSS or XPath — let me show you both ways. I'll clear everything and start with CSS: h6 with the ::text pseudo-element, and note that I call getall just to be sure whether there is one result or more than one. We get one, which means there is only one h6. Now let me show you how it works with XPath — rather, let me write it from scratch.
So: response.xpath, and in XPath the syntax is //h6. Whenever you write //h6, it looks for all the h6 elements. Call getall, and we have exactly one element, because there is no other h6. In the CSS selector the syntax was the double colon (::text); here we use the text() function instead. You can use either of these two methods to get the text contained inside this h6. (Good morning from Brazil! Excellent — I'm in India, so it's good evening. Good evening to you!)

Let's come back to the page. We have created the selector for the company name — that one was easy, not complicated. The real problem is this one: the address, which is inside a p tag and contains a lot of things. If we start with the same approach — I'm switching to XPath now — and look for all the p tags, how many do we have? A lot. So just looking for p tags is not going to work; we need to refine it further. How? Let's look at the parent of this element. We can target a specific class, so let me copy the entire class attribute — the whole thing. Now let's use it to write the XPath. We start with a double slash, and we have two choices: we know this is a div, so we can write div and put the class in square brackets — remember that in XPath, when we refer to an attribute we have to use the @ symbol — or, if we want to match any element, we can use a star. We know it is a div, so let's start with div and see how many we get: just one. Perfect — we've got the div.
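The attribute-predicate idea can be sketched outside Scrapy with Python's stdlib ElementTree, which supports a small XPath subset — including [@class="…"] predicates and, like full XPath, 1-based positional predicates. The markup and class name below are hypothetical stand-ins for the real page:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the listing detail page.
html = """<body>
  <div class="col-md-9 content">
    <p>213 Henderson Road
#04-09 Henderson Industrial Park</p>
    <p>Some other paragraph</p>
  </div>
  <div class="sidebar"><p>ignore me</p></div>
</body>"""

root = ET.fromstring(html)

# Attribute predicate: only the div with this exact class value matches.
divs = root.findall('.//div[@class="col-md-9 content"]')
print(len(divs))  # 1

# Positional predicate: XPath counts from 1, not 0.
first_p = divs[0].find("p[1]")
print(first_p.text.splitlines()[0])  # 213 Henderson Road
```

Note that the predicate matches the class attribute's full string value, which is why copying the entire class attribute (spaces and all) works here.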
Now, we know that the p we want is inside this div, so let's add /p and see how many we have — still a lot of p tags. The data we are looking for is in the first p tag, so we'll take the first one. Remember that in Python indexing begins at zero; if we ask for index 0 here, we get nothing, because in XPath indexing begins at 1. Put in 1, and we get the data. Now that we have the p tag, we can look for the text inside it with text(). You can see we get multiple results — I did not call getall, so we are seeing selector objects. (Can you let me know in the comments if this zoom level is okay, or whether you want it smaller so everything fits on one line, or even bigger? I'd appreciate that. It looks fine to me, so I'll continue with this zoom level.) Let's call getall and see what we have: we got all the parts of the address. (There's a comment in Russian — I don't know what it says. Thank you for the confirmation.) You will notice one problem: the results contain \n. So we will take all these items, clean them up one by one, and then combine them into one final address.

Let's move on to the next problem: the telephone number. Let's look at its structure. It is again inside a p tag; inside a strong we have "Tel:", and right after that the telephone number is just lying around. That's the next problem we want to solve.
Thank you! So now we need to extract this. There is no class and no specific id — and if we look at the next parts, the fax number is structured the same way, and so is the email. So how do we find these elements? The answer is actually very simple: we look for the text "Tel:" itself. Let me show you how. We write response.xpath and this time target the strong elements whose text equals "Tel:" — in square brackets we write text()="Tel:". Note that this is case-sensitive: if I write it in lowercase (and at first I also forgot the closing bracket and closing double quote), it returns nothing, because the actual T is capital. So now we have a selector that reaches that particular element.

There is one more thing I want to show you. Sometimes the text won't match exactly — there may be line breaks above and below it, or some additional text. In that case you can make use of the contains function. Let me put this in Notepad to show you. Right now this is our path, and when "Tel" comes with line breaks or extra text around it, you want a partial match, because written as an equality it looks only for an exact match. For a partial match, you call the contains function inside the square brackets. The contains function takes two parameters, and this is plain XPath — wherever you use XPath, whether in Scrapy, in Selenium, or in any other tool, it is a standard and works the same; there is nothing Scrapy-specific here.
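The same label-anchored trick can be sketched with the stdlib parser. ElementTree has no contains() function, so Python's "in" operator stands in for the partial match, and a loop over the p tags stands in for walking from the matched strong label to the data beside it. The markup, number, and address here are all hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical detail-page fragment.
html = """<div>
  <p><strong>Tel:</strong> +65 6123 4567</p>
  <p><strong>Email:</strong> info@example.com</p>
</div>"""
root = ET.fromstring(html)

def field(label):
    """Find the <p> whose <strong> child's text contains `label`,
    then return the text sitting right after that label."""
    for p in root.iter("p"):
        strong = p.find("strong")
        if strong is not None and label in (strong.text or ""):
            # strong.tail is the text between </strong> and </p>.
            return (strong.tail or "").strip()
    return None

print(field("Tel"))     # partial match, like contains(): +65 6123 4567
print(field("Email:"))  # exact label text also matches: info@example.com
```

The equality form in XPath corresponds to comparing the label with ==, which is why the exact colon matters there but not in the partial form.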
So, this contains function takes two things. The first is where you want to look for the text — here we call the text() function — and the second is the text we are looking for: "Tel". Note that I'm doing a partial match: I've removed the colon. Why write it like this? Because whenever we want to get text from an element, we use the text() function. ("So", "and", "okay" — those are my pillar words; I use them a lot.) Let me show it in action with getall: I have provided only "Tel", and it still found the element. With the earlier equality form, we have to provide the exact text — if we remove the colon there, it returns nothing. So: if you know exactly what you are looking for, use equals; otherwise, use the contains function. I hope that clarifies a few things.

Let's move on to the selector we were creating — in fact, we already created it. This is our selector, and calling getall is just to make sure we are getting exactly one element. Now that we have this element, we can simply append /.. — and why two dots? It's very similar to changing directories: cd .. takes you one level up, and this does the same. One level up from the strong is the p tag, so now we have reached the p tag, and once we have it, we can call the text() function. How about that — we got it! The fax, the website, and everything else work the very same way. The fax is empty here, so let's take the email: if I take the same selector and paste in "Email" instead of "Tel", we have the email. And what else do we have?
The website — oops. Come back here and, instead of email, put in "Website". It has to be exact, so I have to include the colon as well — and we have the website. So that was the general idea. Now let me show you one more trick, and then I'll show you the complete code and how it works — and of course I'll share the code.

We were looking at the address. This is the address selector, so let me store it in a variable, address. The objective is to clean it up. I'm going to import scraper_helper as sh. What is scraper_helper? It's a collection of useful scraping functions which I wrote and published on PyPI; you can all download and use it, and it's also available on GitHub — in fact, I invite you to contribute functions that you think are commonly used and useful for others. The latest addition was the headers function — not related to this, but just to show you: it gives you a set of standard request headers.

So we have the address parts, and now we have to run a loop. We can use a Python shortcut — a list comprehension: sh.cleanup(x) for x in address, and it has to be in square brackets. This is just a shorthand for running a loop; you can see we are calling cleanup on every item of the list, and we get a completely clean address list, which we can assign back to address. Now, if you always have exactly three items, you can put them in separate columns — address 1, address 2, and the city, or something like that — or we can call another useful method: join. This is a string method, and it takes a list.
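Here's a hedged sketch of that cleanup-and-join step. I'm assuming sh.cleanup essentially trims whitespace and newlines, so plain str.strip stands in for it, and I filter out the pieces that were pure whitespace before joining; the address parts themselves are hypothetical:

```python
# Raw text() parts as XPath might return them, newlines included (hypothetical).
address = [
    "\n",
    "213 Henderson Road\n",
    "#04-09 Henderson Industrial Park\n",
    "Singapore 159553\n",
]

# str.strip stands in for sh.cleanup here (assumption: cleanup trims whitespace).
parts = [x.strip() for x in address]
# Drop the entries that were nothing but whitespace.
parts = [x for x in parts if x]

print(", ".join(parts))
# 213 Henderson Road, #04-09 Henderson Industrial Park, Singapore 159553
```

The list comprehension applies the cleanup to every item, and join glues the surviving parts into one cell-ready string.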
With an empty string, join gives one output; with a comma, another; with a comma and a space, yet another. This is a function I personally use a lot, and combined with the address parts it completes your selector — it gives you the clean address in one cell.

Now I'm going to bring the complete code on screen, because typing everything out and copy-pasting would take a lot of time. Let me walk you through what I have written; this is the final code, which I'm going to share. We import scrapy and scraper_helper, and this is the starting URL — the first page, the one we started from — in my start_urls. This site worked without passing a user agent, so I didn't bother. Then comes the first selector we created: it collected all the links, but not as absolute URLs, so I called response.urljoin, ran a loop, and yielded a scrapy.Request for each link with a new callback method, parse_detail. Everything in it I have covered in this video: we have the address parts and the cleanup method, then name, address, phone, fax, email, and website, and I'm yielding them in a dictionary. This is a standalone spider, and let me see if I can show you the output.

By the way, one more tip: when you open CSV files, don't open them in Excel, because Excel will take the phone numbers and mess them up — it treats them as numbers and spoils everything. The program I use to open CSVs is LibreOffice; it's an open-source office alternative, and the good thing is that whenever you open a CSV file, you are presented with a dialog where you can choose the character set (UTF-8 here).
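The Excel warning is about how the viewer interprets the file, not about the file itself: a CSV stores everything as text, and Python's csv module round-trips phone numbers as plain strings, as this sketch with a hypothetical row shows:

```python
import csv
import io

# Hypothetical scraped row, standing in for the spider's yielded dict.
rows = [{"name": "13 Honeys Pte Ltd", "phone": "+6561234567"}]

# Write the rows to an in-memory CSV...
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "phone"])
writer.writeheader()
writer.writerows(rows)

# ...and read them back: the phone number is still the exact string.
buf.seek(0)
back = list(csv.DictReader(buf))
print(back[0]["phone"])  # +6561234567
```

Any mangling happens only when a spreadsheet program coerces the column to a numeric type on import, which is exactly what LibreOffice's column-type dialog lets you prevent.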
The separator is a comma, and — the best part — I can select the columns here: for example, the phone number column, and set its column type to Text. Or you can select the first column, shift-select the last, and set everything to Text. Now open it. I did not run the spider completely — I stopped it mid-run — but I still got 42 rows; if I let it run, it would give the complete result. There you have it: name, address, phone number, fax, email. I always like to keep the link as well — and how do you get the link? response.url. If something looks wrong in the CSV file, I can always go and take a look: for example, this row doesn't have a website, so I can take the link, go to that exact page, and verify that, yes, the website is actually not on the page — so it's not a problem.

That's it for today. If you have questions, you can drop them in the chat, and I always welcome suggestions. That's all — I'll see you in the next one. Till then, have a great time!
Info
Channel: codeRECODE with Upendra
Views: 1,210
Rating: 5 out of 5
Keywords: python web scraping tutorial, Python Web Scraping, selectors in scrapy, web scraping python, how to scrape data, browser scraping, scrape web pages, website scraping, python scraping, screen scraping, data scraping, Python Scrapy, web scrapping, CSS Selector, web scraping, web crawler, web spiders, webscraping, scrape, scraping, pandas web scraping table, pandas tutorial, web scraping with python, python projects for intermediate, python tutorial, python webscraping
Id: l0chQxDJJWU
Length: 33min 22sec (2002 seconds)
Published: Wed Mar 03 2021