CSS Selectors for Web Scrapers | Scrapy, Selenium, BeautifulSoup

Captions
So let's start. All right, what we are going to do today is talk about CSS selectors. We covered XPath selectors in the previous video, so if you want to have a look at XPath selectors, you should be looking at that live video, which is on my channel.

Now, with CSS selectors the purpose is the same: we want to locate a specific element. So for example, we want to locate this specific element and we want to extract this information, right? Now, how do we extract this information? We'll be creating some CSS selectors, and we need a way to verify that the selector that we have written is good. So how do we do that?

There are some sites like this. In this site you can actually paste in your HTML, so you can see one example, and we can write the selector here. So for example, if we write a, which is the name of the tag, this is what is written. And if we are looking for, let's say... there is no class or anything, so let's add a class here, something like class equals bold. So if you want to select all the elements which contain this bold class, you will write something like this, and it will give you the result. But when you are working with real, live web pages, this is not really helpful, so this is something I really don't like to use.

Now, we will be working with two things: number one, a browser, and number two, a console, so the console is the command prompt, right, the terminal. In the browser you have multiple places where you can create these selectors and verify them: the first is Console, the second is Elements. So let me zoom this up, because this bottom part is not important. This is a very simple HTML page, and this is what we can see in Elements. If you want to test the selector, we can come here, make sure that Elements is active, and press Ctrl+F. So you have this box at the bottom which says find by string, selector, or XPath. This is where you can actually test your selector. So if you want to select
you know, a tag, or let's take this class, heading. So if you want to select all the elements which contain this class heading, we can just paste this right in here. Note that I have not yet put the dot, because when you're looking for a class you need to put a dot. Still, it is finding it, because if you remember from the tooltip, it says you can find by string, so you can type any text and it will work. So this is not really helpful, because in this case it's also searching for the text heading. Or let's say I want to search for h1: now it is giving me two results here, but this result is just text which is contained inside this document.

So what we can do is use the console. If you have anything on the console, you can just clear everything, and now in the console you can test your XPath, your CSS selectors, everything. So how do we do that? Let me zoom this in even further. I'm typing double dollar, and as we start typing you can see that there is $, $$, $x, $0, $_. Just focus on $$ and $x: if you want to test an XPath you use $x, and if you want to verify a CSS selector you use $$. After the double dollar, just put the parentheses, then an opening and a closing quote for the string, and then write in your selector. So now you can see that it's giving me exactly one element, which is the h1 with class heading. And if I press the up arrow it comes back, and now I can look for .heading, which is another selector, so now it's giving me four elements. So this way we can verify from here.

Now, there is one problem: if you want to look at the text, it will actually give you more information than probably what you need. So in that case, let's say you want to look at the second h2, right? What you can do is put in square
brackets the index one, because the counting of the indexes starts from zero. So zero was this h1 heading, one is this h2 heading, two is the h2 with heading dark, and so on. If you write it like that, you will get the complete element printed.

So now we have talked about all the things that we can do with the browser: we can use an online tester, paste in HTML, and type in the selector there, or we can come to developer tools, Elements and Console; we have seen both of these. And of course, finally, when we are writing the code... if you are working with Scrapy, and that's what I'm assuming right now, that most of you are working with Scrapy. However, the content of today's video will be applicable for Scrapy, for Selenium, for BeautifulSoup, everything. So if you are working with Scrapy, you can open the Scrapy shell, and here you can actually type in the exact command that you are going to write.

This is the page that we are going to work with. Let me come back here and clear everything. I am going to open this page here by using the fetch command, and what we can do here is write response.css, so this is the structure, and don't forget getall. Once we have this, we can write h1 (not j1, h1) like this, or we can write .heading like this. So these are the various ways of actually testing whether the selectors are working correctly or not.

The usual approach that I like to follow, especially when I'm working with Scrapy, is to test my selector in the shell. The reason: what you see here in the browser may not really be what Scrapy is getting, because this part can be manipulated, can be changed by JavaScript. So if there is some JavaScript which is actually changing this, you will not get your real selector, right? So this was about verifying that your selector is working fine.

Now let's talk about the various types of selectors. We just actually looked at one, and in fact two, of the
selectors. If you want to select any element by the tag name itself... so we just saw here that if you want to select h1... By the way, anytime, you can clear it, then put the cursor here, like that. So if you want to get all the h2 elements, you can just put h2, just like that; it's just the name of the tag. If you want to look at all the paragraphs, you can just write p and it will get you everything. So this is the first thing: we are selecting on the basis of tag name, and sometimes this works fine, actually.

Take the testing site, for example, Books to Scrape. What's the exact URL? Yeah. So here, for example, if I take this and press F12... By the way, I'm playing around with the Edge browser today. This is based on Chromium, so everything looks exactly like Chrome, but this is actually Microsoft Edge. Let me show you first, actually: A Light in the Attic, this is the header, and if I inspect this, it is an h1. And if I write $$ and h1, which is coming from history, there is only one h1; there is no confusion here. So sometimes it actually just works, and you don't have to go beyond this.

At other times you will have to look for the class names. So for example, this is h2 heading. If I want to look for all the elements which contain this class heading, I can look for this here, or we can come to Elements and paste it in there. So now it is showing me that it matches three. Let me remove this h2 and make it simpler. So what are we writing? We are writing .heading. What the dot will do is go and look for all the elements which contain this class. So here, this is an h1, and this is an h2 which contains one class. Now here, this is interesting: this h2 contains two classes, heading and dark. But even if we are looking for one class, this element will be selected, so you don't actually have to supply all the classes. So this should be, you know, easy
to search for. Now, in case you are wondering, what if I want to select exactly this h2 element which contains two classes? We will come to this in just a moment; before that, I want to talk about a few other things.

So we have talked about tag name and class name. Let's look at this one, the id. This id is plus. Now, id is usually used for front-end JavaScript development, so there will be some JavaScript which will be using this id. If you know a little bit of JavaScript, there is this getElementById function, which is used very commonly. So usually an id will appear only once. If we look at this id, id equals plus, this is usually going to be only one; in fact, if it is more than one, it is a problem from the developer. But practically, there will be sites which contain the same id more than once. Still, whenever you see an id you should be happy and quickly use it.

So this file, for example, is a good one. What we are looking for is hash: if we say #plus, this is going to look for all the elements with that id. Here, see, it is returning four, so what I'm going to do is again go to the console, and this is where I'm going to check it. It's going to look for all the elements which contain this id. So plus is just the name of the id, and it is going to give you exactly this.

Let's look at one more example: this phone number, right here. Let's see how we can select this. Here we can see that this also has one interesting id, which is phone. So what if we look for id equals phone? See, we got that element. So id is usually good. Now, in this case you may also see that there is something called phone as the class name as well. This is also something that you will see on a lot of sites: the id and class names can be the same, but they will produce different results. So hash id, meaning #phone, means the element where id is
phone, and here, with the dot, it means all the elements which contain this class name. So now it is returning two items, and we will have to look at both the items one by one. I'm going to put zero to get the first, and I'm going to put one to get the second. We can see that both of them contain this class phone; that's why they are being returned. So this was about the id.

Now let's talk about one more interesting thing, which is attributes. What are attributes? For example, this id equals phone: id is the attribute, and phone is the value of that attribute. Here, class is the attribute and phone is the value of this attribute. For id and class we have shortcuts, hash and dot, but we don't have shortcuts for other attributes. For example, look at this one: we have style equals color black. Here we do not have any specific shortcut to select this attribute, but we can still do it.

So how? First of all, we need to copy it. How do we copy? We cannot select it like that, right? There are many ways; one of them is to right-click and click on Edit as HTML. Now we can easily copy, so I'm selecting the attribute name as well as its value, pressing Ctrl+C, and pressing Escape. Now let's go to the console, and I'm going to paste this attribute name and its value here, but surrounded by square brackets. So what we are doing here is supplying this attribute and its value in square brackets. If we have to select anything by attribute, we will use square brackets. This is something very easy to remember.

One more thing to remember is that you can actually use more than one of these four basic selectors together. So you can combine tag name and class name. For example, here you can already see this h1 heading. I need to write it properly using double dollar. So it has to be
double dollar, and then I have to write it like that: heading. We can write it like this, or we can write it like this. What it selects will change, but both are valid. This will obviously depend on the page structure: sometimes they will select different things, sometimes they will select the same thing.

Attributes are actually very interesting, so I want to talk a little bit more about them. Let's look at all these Google link values. Right-click and Inspect, and here you can see that this one is going to google.com, and if we inspect this one, it is going to google.co.jp. So the link text is the same, but each goes to a different variation of Google. So now, this is something you may want to select. These are all links. Then we have this phone number here; because I created this page, I remember that this is also an anchor tag, so it is also a link, but there is a difference. There are multiple differences, actually. So let's try to find a selector which can select only these links.

Notice this custom attribute here, data-country equals something. So this one has data-country equals spain; here also we have a custom attribute, data-country equals japan, and all that. What we can do is make use of the attribute selector but skip supplying the value. So in the square brackets we can write data-country. Now we have five results here, and if we press Enter, we can see that it is selecting all the five elements that we want to select. So this is a good selector, a very easy selector. Here we are supplying only the attribute name; we are not supplying any of the attribute values.

Now, this is again something which is practically useful. Let me show you one quick example. I'm opening up Amazon, and let's look for anything. Let's look for laptop; a laptop is on my mind, I need to update my laptop. So now we want to select these listings.
So how do we select them? If you right-click and do Inspect... because I've already spent some time on this, I'll directly take you to the best selector. So you see this. Now, notice the page; maybe I can zoom this out so the page is smaller and you can have a look. This is one block, this is another block, this is another block, right? All these blocks are divs, and they contain this attribute, data-asin. So if I want to select all these divs, each of which contains a specific listing, this is my very easy selector: [data-asin]. And see, it's working perfectly. Now, of course, I can fine-tune and go inside it, but if I have to run a loop over all these listings, which is typically what we will be doing if you are scraping all these products, then this is a very easy selector and it works perfectly fine.

Hi Rashidul, hi, good to see you. All right, so let's come back here. This was one trick about the attribute selectors. Now let's go back to the practice document. Ah man, I've been saying element, elements so many times. Anyway, there are some interesting things that you can do with attributes.

So let's try to find this mobile phone. This is something you will also see at many places, that you have to click on something and then this element will become visible. If I right-click and click on Inspect, just notice this style. Right now it says display block, and if I click on Hide, it says display none. And this text is also changing; these two elements are changing as I click them. The point that I'm trying to make here is that what you see in the browser will be different from what you actually get using the requests module or Scrapy. The document that you actually get may be very different. So it's better to work with whatever you have directly. If you're not sure what you have directly, you can press
Ctrl+U, and what you see here is the real document that Scrapy is going to get, if we exclude the server-side changes. Based on your headers and user agent there may be a few other changes, but broadly speaking this is what it is going to be.

Oh, by the way, one good trick. For example, here: now this is something which, if you look for the elements, you will not find, but the data is right there, hidden in this format. If you want to select these kinds of things, you can also create CSS selectors and they will work. For example, I'm just selecting this type equals application/json, pressing Ctrl+C, going to the console, and using double dollar. Then I'm pasting this, and because this is an attribute I have to surround it with square brackets. And there we have it: if you look at the first one, we have the complete script.

So now, if you are using Scrapy... let's do this using Scrapy, actually; it will be a good exercise. So here is Scrapy: I've opened the Scrapy shell and we are pointing to this particular page, and this is our selector. So let's copy this, and here I'm going to write response.css. Remember that CSS was created to change the look and feel, but what we are doing here is extracting: this script element with type application/json is what we are extracting. And if we want to get the content, how do we get it? Double colon text, ::text. Now, this ::text is not part of standard CSS; it is a Scrapy shortcut. And there we have it, we have the complete thing.

Then import json, and I'm taking a shortcut here; this is the raw data. I just want to quickly show you that we can convert this, and there we have it. If we check the type, this is a dictionary. Once you have this dictionary, you can look up any of the keys. So if you have JSON, you should be very happy: wow, this is an easy task. You just need to know this trick.
And it's really, really useful, and I've saved tons of time with it, actually. So probably I'll create more videos about scraping dynamic websites, where the information is right there. You don't need to make another request, no XHR request, nothing; the data is directly embedded in the page. So anyway, this was again something useful, a little trick, actually. This always happens to me: whenever I'm doing live videos, there are new ideas which come up, so I'll give you more information than I originally planned for.

Okay, so now what we want to do is look for this mobile number, and this mobile number is here. There are multiple ways to look at this, so there is one specific way...

Yeah, JSON. JSON is amazingly cool, Mario. I think everyone should spend some time on some of the modules apart from Scrapy; you should always spend time on json. JSON sounds very difficult. I've actually received this feedback from one of the students: "JSON, I don't want to work with JSON, because JSON is JavaScript." And true, JSON actually means JavaScript Object Notation, but it just means that it is an object. What the json module does, with the json.loads method that we just saw, is convert the JSON object into a Python object. We just saw that it converted it into a dictionary. Depending on the structure, it can convert into a dictionary or a list, so primarily these are the two things that you will see. Once you have a dictionary or a list, you can manipulate it like a normal Python object.

All right, so coming back here, you will notice something. Of course there are multiple ways to select it, but I want you to focus on href. This href starts with t-e-l, tel. This tel is a comparatively new format which was created for responsive websites. So if you open this website on a mobile phone and this href contains tel colon, then
when you click on this element, your dialer will open; it is treated as a telephone number. So this we can use. What we can do is write a specific selector which will look for all the elements whose href starts with tel. And how do we do that? This is href, so first of all we need to write it in square brackets, because we are working with an attribute. We are looking for an href which begins with tel, so what we need to do is use the symbol which is on top of the 6 key, called the caret (^), and then we can write our selector. It will go and look for all the elements where the href begins with tel. So here, href begins with tel. This way, even if you don't know the phone number and you are writing a bigger scraper, you can easily get all the telephone numbers. So this is a good way to extract all the telephone numbers.

Now, if there is a begins-with, there will be something for ends-with. The ends-with operator is dollar ($). So something that ends with, let's say, two. No, two is actually not a good example. And we have to surround this in a string: remember that if you are putting in a number, you have to surround it in double quotes so that it is considered a string. A good example would be... yeah, for example, if you want to find all the .co.uk sites, or let's say my task is to extract all the links which are pointing to a .ca domain or .uk domain or something like that. Here that can be very useful: if I write it like this, it will go and look for all the hrefs which end in ca.

And of course, if we have one operator for ends-with and one operator for begins-with, then we have an operator for contains. This star, or asterisk (*), is the operator which will allow you to search for a value which contains something. So any href which contains ca will be selected; any href which contains tel will be selected. So you should use them depending on the
scenario, actually. And by the way, there is a simpler example right here. If I write div > p (div greater than p), this is going to look for only the p tags which are directly inside a div. If I write just p, or let's write div p with a space... so this p is inside a div, and this p is inside a div; we have only two p's here anyway. But if you put a space, it will go down any number of levels, and the greater-than one will look only for direct children.

Now there is one more interesting one. We talked about parent, children, grandchildren, and then there is this plus (+). This is called the adjacent sibling combinator. What it is going to do is look for all the p's which are immediately after a div, and they are siblings. Note that there is only one result selected, and it is selected because there is only one p which is coming immediately after a div. So that is the only selection which is made here. This is something to remember.

Then we have one more operator, the tilde (~); this is called the general sibling combinator. See what it is doing: it will look for all the p tags which are siblings of a div tag and come after it. In this scenario I have four results, and these are all siblings of this div. Notice this p: it is inside this div, and it is not being selected, because it is not a sibling of a div. That is the reason it does not come up here. So this can also be something good.

Now I want to talk about the last few things, actually. I want to talk about using indexing. Let's look at this example; probably this is easier to understand. Let's say that I want to select all the countries. (Yes, Mario, I completely agree with you, and that's why I'm trying to cover all the basics.) Now, if I just write a selector td, it is going to select all
the tds, right? So all six tds are being selected. Now, this one is actually a th, so that's why it was not selected. These are the six tds which are selected here.

Now what I want to do is select only the first td. So what can we do? We can use a colon and then write nth-child(1). What it is going to do is look for all the tds which are the first child of their parent. Now, this statement is a little confusing, and that's why I would like you to pay a little bit of attention to this: it is going to select the first child of its parent tag. This td's parent is tr. So in this tr, the first td is selected; this tr also has a td as its first child, so that is selected. I am putting emphasis on this one. Let's see one more example: if I put td:nth-child(2), what will it be doing? It is going to look for all the tds which are the second child of their parent. So all the second tds are being selected here. Now, this particular page is very simple and it is working nicely, so try it out and make sure that you understand this concept.

By the way, if you're looking for the first child, you don't necessarily have to write it like this. You can use a shortcut: instead of nth-child you can use first-child. But you are already saying first, so you need to remove these brackets. This will give you all the first tds. Similarly there is last-child; it will give you the last td. I think if you know only this much, it should be enough.

There is one very interesting trick that I want to show you, and this is a very Scrapy-specific trick. So let me bring up Scrapy, and let's bring up a selector. Let's say that I want to find this mobile phone, this particular text. Let's assume that this is something else.
Let's say that this is a secret message, something like that. By the way, I've made the changes only on the front end; the actual source code has not changed. Anyway, what I want to do is find this particular text. I have the id here, so I can quickly locate this anchor tag. So let's come back to css and let's look for hash phone, #phone. We got this element. Now, I don't want this; I want this. And this is inside this div, right? So what I want to do is go one level up, but in CSS there is no concept of going one level up.

So how do we go one level up using Scrapy, right here? What we can do is actually mix and match with XPath. Guys, this is a very interesting trick that works. How's this? What we did right now is use XPath syntax, a double dot (..), and in fact let's call text as well, and there we have it. So what we are doing here is chaining CSS with XPath. You can actually mix and match, whatever gets the job done fast. In some cases you can look for a CSS selector first and then use XPath to go up and down, or, let's say, you want to use the normalize-space function of XPath; you can do that right here. So there are many advantages of using this, and there is no limit: you can use as many levels of css as you want.

Let's say you want to use something like this: div, then you want to look for p; you can do that. Then you want to use one more. Now here, be careful: if you want to select something from the current path in XPath, put a dot here. If you don't put the dot, it's not going to help a lot. So let me clear everything and run this again. See what we are doing here: we are looking for div, and inside that div we are looking for p, and we are getting the text. This is a p inside a div, so that is what we are getting. So the idea here is that you can chain as many .css and .xpath calls as you want, and
you can mix and match; there is no limitation. These basics should be enough to get you started.

Now, about today's CSS selectors: if you want to learn more, I like the w3schools pages on CSS. So, CSS on w3schools, this is a good one. Here you will find almost all the things. Not the complete CSS, but you can look at the CSS combinators, so these four operators, and you can look at the pseudo-classes. I have not covered all of them; I've covered only the ones which are relevant for Scrapy, for selecting elements. In fact, you can look for w3schools CSS selectors; this is another good page on the same site, and a good reference. Here you will find a lot of useful things. Not everything is useful for us: for example, this :visited is good when you are writing CSS, but it is not useful when you are writing your selectors for scraping.

So that's all for today. If you have any specific links and you want me to find a selector for them, let me know. Thank you very much, and I'll see you in the next stream.
Info
Channel: codeRECODE with Upendra
Views: 767
Rating: 5 out of 5
Keywords: python web scraping, css selectors, css selector tutorial, css for beginners, css selectors guide, css selector chaining, css tutorial, css guide, css beginner, css selector beginner, css beginner guide, css beginner tutorial, css selector guide, learn css, css tips and tricks, css advanced, css advanced selectors, css, web scraping with python, scrapy css selector tutorial, css attribute selectors, css selector selenium, css tutorial for beginners, python scrapy
Id: tgOiEtq0Rns
Length: 41min 22sec (2482 seconds)
Published: Wed Apr 07 2021