XPath Tutorial (and How to Use them for Web Scraping)

Captions
XPath is a combination of two words: XML and Path. It can be easily understood as the "path" to find target elements within XML or HTML documents. It is no wonder that it’s so powerful and useful for web scraping. XPath helps you deal with incorrect data, inaccurate pagination, endless loops and so on. As a query language, it needs to be understood, learned and practiced. I cannot practice it for you. But I can teach you the keys to make your learning a little faster.

My name is François. I’m not a developer. But I’ve been learning web scraping for the past two years. And because I am still a bit of a beginner, you will learn things from scratch. You will learn how to read an HTML document, what the different categories of XPaths are, and which one to choose depending on the circumstances. Then, you will discover the cheat sheet, meaning the different XPaths you will be able to write. Finally, you will be able to use your XPaths for web scraping through a tool named Octoparse.

Before we begin: if you have any trouble understanding my very French accent [speaks some French], you can turn the subtitles on. You can also check the timecodes and the Octoparse download link in the description.

Here is a website designed for web scraping. You can easily access the corresponding HTML document by hitting the F12 key. Or you can simply right-click and click on “Inspect”. HTML has different levels of elements, just like a tree structure. If we assume that this is level 1, it means that this is level 2. That this is level 3; 3; 3; 3. Then 4, 5 and so on. If you are having trouble understanding how it works, think about how we go about finding a particular file on our computer.

So, this is an HTML element. An HTML element consists of a start tag and an end tag. And text within angle brackets is called a tag. The start tag is “div” and the end tag is “div” too.
If we take another example: the start tag is here. It’s “a”. And the end tag is just here. A bit more difficult. This one: the start tag is “div” and the end tag is located just here. It’s also “div”. These tags have a meaning. I don’t know them all. But for example, an “a” tag represents a link. An “h1” tag marks a title. An “img” tag marks an image.

Each element is divided into different components. And you can find up to 4 components. First of all, you’ve got the tag: the start tag and the end tag. Secondly, there is the attribute. The tricky thing is that you can have 0, 1 or multiple attributes. For example, in this element, we have one attribute. But for that one, we have 2 attributes. The first attribute always comes right after the tag name in the start tag. In that case, the first attribute is “href”. And the second attribute is “style”. We can tell because they share the same color. Thirdly, an attribute usually comes paired with a value. In that case, the value is always written between 2 quotation marks. If we take the same element as an example, the value of the attribute “href” is “/”. And the value of the attribute “style” is “text-decoration: none”. Last, there is the text content. It’s the only component which is also directly visible to the internet user. We can see here that it’s written “The world as we have created it is a process of our thinking.” Blablabla… which is the same text as the one we can see on our screen.

Now, let’s figure out how all of these elements are related to one another. When an HTML element is contained within another element, the element that contains it is called the parent. Let’s take an example. This element has a parent. The parent is that element. Same thing for that one. They share the same parent. And same thing for the last one. They all share the same parent, which is “div class quote”.
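The components described so far (tag, attributes, values, text content) and the parent relationship can also be inspected from code. The sketch below is only an illustration: it uses Python's standard `xml.etree.ElementTree` module, and the markup is a hypothetical, simplified stand-in for the quote element shown on screen (the "Home" link text and the "tags" child are assumptions, not taken from the video).

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the quote element shown in the video.
SNIPPET = """
<div class="quote">
  <span class="text">The world as we have created it is a process of our thinking.</span>
  <a href="/" style="text-decoration: none">Home</a>
  <div class="tags">placeholder</div>
</div>
""".strip()

parent = ET.fromstring(SNIPPET)


def describe(element):
    """Return (tag, attributes with their values, text content)."""
    return element.tag, dict(element.attrib), (element.text or "").strip()


link = parent.find("a")           # the child element whose tag is "a"
tag, attributes, text = describe(link)
print(tag)                        # the name shared by the start tag and the end tag
print(attributes)                 # each attribute paired with its value
print(text)                       # the text content visible to the visitor

# Elements contained in the same parent are its children, and siblings of each other.
children = list(parent)
print([child.tag for child in children])
```

Here the "a" element has two attributes, each paired with a value, and the three elements inside the "div" all share it as their parent.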
So, the contained element, or the contained elements, are children of the parent. Which means that each element has one parent, but it can have zero, one or multiple children. Finally, elements that have the same parent are called siblings. All 3 elements are siblings. They are all children of the element “div class quote”. Which also implies that this element is the parent of all of these.

Now, let’s say a word regarding the different kinds of XPaths. To begin with, I am going to show you something specific. Let’s take an element as an example. As you can see, if I click on “…”, I can “copy” and I can copy the XPath. Or I can copy the full XPath. And I am going to explain to you why you should never ever click on either of them.

In order to specify the location of an element, XPath uses “/” to connect the different tags from the top to the bottom. Which means, if we want to select this item, it will be something like: /html/body/div[1]. And we will be able to select that element. So far, so good. This is what we call an absolute XPath. However, an absolute XPath can quickly become a lot trickier. Because, if we want to select this element instead, the absolute XPath will be something like this. And it’s way longer. Let’s be honest. This XPath looks weird. We should rename “Absolute XPath” to “Long & Confusing XPath”. So, by clicking on “copy xpath” or “copy full xpath”, you will automatically get an absolute XPath. And you will almost never use an absolute XPath for web scraping.

But what do we have to use instead? We have to use the short XPath. The short XPath uses “//” instead of “/” to reference the element we want to start the XPath with. As a result, if we want to select this element, we can write something like this: //span[@class="text"]. And we will be able to select this element, among others. It looks better than the previous one.
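To make the contrast between the absolute and the short XPath concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. Two assumptions to keep in mind: the page markup is hypothetical, and ElementTree only supports a subset of XPath that is always evaluated from an element, so "/html/body/div[1]" is written "./body/div[1]" from the "html" root and "//" is written ".//".

```python
import xml.etree.ElementTree as ET

# Hypothetical page; the real site has many more levels.
PAGE = """
<html>
  <body>
    <div class="header">Quotes to Scrape</div>
    <div class="row">
      <div class="quote">
        <span class="text">First quote</span>
      </div>
      <div class="quote">
        <span class="text">Second quote</span>
      </div>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element

# Absolute style: every level is spelled out from the top of the tree.
header = root.find("./body/div[1]")

# The absolute path to a deeper element gets longer, and it breaks
# as soon as one level of the layout changes.
second_quote = root.find("./body/div[2]/div[2]/span")

# Short style: "//" jumps to matching elements at any depth.
quote_texts = root.findall(".//span[@class='text']")

print(header.text)
print(second_quote.text)
print([span.text for span in quote_texts])
```

The short form does not depend on how many "div" levels sit above the "span", which is why it tends to survive layout changes better.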
Overall, as I will explain further later in the video, we can combine “//” and “/” depending on the situation. So, if I want to select this element, I can start with the parent. So, something like this. Then, I add a “/”. “//” at the beginning, but a “/” to select the following tag. And I select the first “span”. Just like this. So, a “/” stands for “selecting the direct child”, whereas a “//” means that we select any element below it, at any depth. What we call a descendant.

In this part, you are going to learn the list of XPaths. I tried to categorize them in order to make your learning as easy as possible. Another thing is that, if you want to verify your XPaths the same way as I do, you can use XPath Helper, which is a Chrome extension. The first use is a bit tricky. You may need to reload your pages. But otherwise, it’s a great tool.

Let’s start with the tag selectors. And the first XPath you are going to learn is //TAG. Let’s assume I want to target this element, which is an “h1” element. This element only has a start tag and an end tag. It doesn’t have any attribute. So, I can simply type //h1. And it should do the job. There is one element selected. I can also do it another way. I can use //TAG_1/TAG_2. Which means we jump directly to TAG_1. Then, we select its direct child, which is TAG_2. If I still want to select the h1 element, I can see that the parent is a “div” element, with a “div” tag. Therefore, I can type //div/h1. And it’s the very same thing. You can do the same thing but replace the “/” with a “//”, if you no longer want to select a direct child but a descendant. So, I can simply type //div//h1. And I will end up with the same result. Finally, there is something specific. If I type //*, I will get all the elements of my HTML code. Because “*” means “everything”. So, I am selecting all the elements, whatever their tags.
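The tag selectors above can be tried out with a short sketch. It relies on Python's standard `xml.etree.ElementTree` module, which handles these simple forms, with "//" written as ".//" because ElementTree paths start from an element. The markup is a hypothetical reduction of the listing page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, reduced version of the listing page.
PAGE = """
<html>
  <body>
    <div class="page_inner">
      <h1>All products</h1>
      <section>
        <h3>A book title</h3>
      </section>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

h1_anywhere = root.findall(".//h1")          # //TAG
h1_child = root.findall(".//div/h1")         # //TAG_1/TAG_2: direct child only
h3_child = root.findall(".//div/h3")         # empty: h3 is not a direct child of div
h3_descendant = root.findall(".//div//h3")   # //TAG_1//TAG_2: any descendant
everything = root.findall(".//*")            # //*: every element below the root

print(len(h1_anywhere), h1_child[0].text)
print(h3_child, h3_descendant[0].text)
print([element.tag for element in everything])
```

The "h3" case shows the difference between the two separators: "/" misses it because a "section" sits in between, while "//" still reaches it.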
What about the attribute and value selectors? Let’s assume I want to select the “image_container” element. Meaning, all the images from the first page. I can write //TAG[@ATTRIBUTE]. It’s not accurate enough. So, what should we do? We should also mention the value. And the value is “image_container”. Not “image_container xh-highlight”, because “xh-highlight” only appears while XPath Helper is highlighting the selected element. But only “image_container”. In that case, I type //TAG[@ATTRIBUTE="VALUE"]. And it’s more accurate. I only get 20 elements. And if I want to select the link of the images, I can add “/a”, because it’s the direct child.

Let’s take something a bit more difficult. What if we want to select the rating? As you can see, we have an element with a “p” tag, an attribute of “class” and a value of “star-rating Three”. So, I can write something like this: //p[@class="star-rating Three"]. But I will only get 3 elements. And we have 20 books on the page. And indeed, there is something wrong. It’s because the value is “Three”. Because the rating is 3 out of 5. But if we select the following rating, it’s “star-rating One”. Then, it’s “star-rating Four”. And so on. So, what should I do if I want to select all the ratings, no matter what the rating is? In that case, I can use this kind of XPath: //TAG[contains(@ATTRIBUTE, "VALUE")]. There is something common to all the values, which is the text “star-rating”. So, if I type “star-rating”, I should be able to get all 20 elements.

I can also replace “contains” with “starts-with”, if I want to. In that case, it means that the value has to start with this text. Last but not least, you can also write “ends-with”, which is the same thing but the value has to end with this text. Unfortunately, it doesn’t work with XPath Helper. And it doesn’t work with Octoparse either. So, I just want to tell you that it’s a valid XPath.
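Here is a small sketch of the attribute and value selectors, under a few assumptions. It uses Python's standard `xml.etree.ElementTree` module, which understands [@ATTRIBUTE] and [@ATTRIBUTE="VALUE"] but not the contains(), starts-with() or ends-with() functions, so those three are imitated with plain Python string tests. The listing markup and the `where_class` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing with 3 books instead of 20.
LISTING = """
<ol>
  <li><article>
    <div class="image_container"><a href="book_1.html">cover 1</a></div>
    <p class="star-rating Three">3 stars</p>
    <p class="price_color">price 1</p>
  </article></li>
  <li><article>
    <div class="image_container"><a href="book_2.html">cover 2</a></div>
    <p class="star-rating One">1 star</p>
    <p class="price_color">price 2</p>
  </article></li>
  <li><article>
    <div class="image_container"><a href="book_3.html">cover 3</a></div>
    <p class="star-rating Three">3 stars</p>
    <p class="price_color">price 3</p>
  </article></li>
</ol>
""".strip()

root = ET.fromstring(LISTING)

with_class = root.findall(".//div[@class]")                     # //TAG[@ATTRIBUTE]
containers = root.findall(".//div[@class='image_container']")   # //TAG[@ATTRIBUTE="VALUE"]
cover_links = [a.get("href") for a in root.findall(".//div[@class='image_container']/a")]
three_stars = root.findall(".//p[@class='star-rating Three']")  # exact value only


def where_class(elements, test):
    """Keep the elements whose class value passes the given string test."""
    return [element for element in elements if test(element.get("class", ""))]


paragraphs = root.findall(".//p")
# Stand-ins for contains(@class, ...), starts-with(@class, ...), ends-with(@class, ...).
all_ratings = where_class(paragraphs, lambda value: "star-rating" in value)
starting = where_class(paragraphs, lambda value: value.startswith("star-rating"))
ending_in_one = where_class(paragraphs, lambda value: value.endswith("One"))

print(len(containers), cover_links)
print(len(three_stars), len(all_ratings), len(starting), len(ending_in_one))
```

The exact-value selector only finds the three-star books, while the "contains" style test finds every rating and still leaves the price paragraphs out.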
But, because “ends-with” comes from a newer version of XPath (2.0), it may not work in tools that only support XPath 1.0.

I selected the first book. So, I ended up on the detail page. And now, we are going to talk about the text selectors. It’s paramount to remember this kind of XPath, because you will use it quite often. Let’s assume I want to select the “Availability” text. In that case, it’s a bit tricky because there is not much info. Only a tag. So, I can do //th. But that’s not accurate enough. I have 7 elements instead of a single one. If I want to select “Availability”, I can say that I want the element whose text is equal to “Availability”. Which is the reason why I use this XPath: //TAG[text()="CONTENT"]. I can do the same thing but with “contains” instead. If I want to select the price, I can do //td[text()="£51.77"]. And I will select 2 elements. Let’s say it’s fine for the example. But let’s assume I want to select the elements which contain the “£” sign. In that case, I will create a “contains” condition: //TAG[contains(text(), "CONTENT")]. There is something wrong. It’s because I didn’t close the last parenthesis. And now, I’ve got 3 elements. Each of these elements contains a “£” sign. And same thing as we have seen previously: I can replace “contains” with “starts-with”. Unfortunately, for the “text” kind of XPath, it seems that “ends-with” doesn’t work either.

Order selectors. The order selectors mainly work if the elements you want to target are all siblings, which is the case in our table. If you take a look, you can see that all the “tr” elements are siblings: they are children of the same parent, “tbody”. Now, let’s assume I want to select the first row. I can say that I want to target XPATH[1]. Then, if I want to select only this element, I can add a direct child. So “/td”. And it should do the job.
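The text selectors and the first-row selector can be sketched as follows, with some caveats. The example uses Python's standard `xml.etree.ElementTree` module, which does not know text(): its closest equivalent is [.="CONTENT"], which compares the whole text of the element. The contains() and starts-with() conditions are again imitated in plain Python. The table is a hypothetical copy of the product table, and the UPC value is a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical product table with 7 rows, like the one on the detail page.
TABLE = """
<table>
  <tr><th>UPC</th><td>hypothetical-upc</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Price (excl. tax)</th><td>£51.77</td></tr>
  <tr><th>Price (incl. tax)</th><td>£51.77</td></tr>
  <tr><th>Tax</th><td>£0.00</td></tr>
  <tr><th>Availability</th><td>In stock</td></tr>
  <tr><th>Number of reviews</th><td>0</td></tr>
</table>
""".strip()

table = ET.fromstring(TABLE)

headers = table.findall(".//th")                         # //th: too broad
availability = table.findall(".//th[.='Availability']")  # like //th[text()="Availability"]
prices = table.findall(".//td[.='£51.77']")              # like //td[text()="£51.77"]

# Stand-ins for contains(text(), "£") and starts-with(text(), "£").
with_pound = [td for td in table.iter("td") if "£" in (td.text or "")]
starting_with_pound = [td for td in table.iter("td") if (td.text or "").startswith("£")]

# Order selector: first row, then its direct "td" child.
first_row_cell = table.find("./tr[1]/td")

print(len(headers), len(availability), len(prices))
print([td.text for td in with_pound])
print(first_row_cell.text)
```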
Now, if I no longer want to select the first one but the last one, no matter how many elements, how many siblings there are, I can replace “1” with “last()”. And I will get the number of reviews. If I want to select not the last one but the one before it, so this row instead, I can do “last()-1”.

A bit trickier. What if I want to select all the rows from my table but the first one? In that case, I can say that I want to select XPATH[position()>1]. And it doesn’t work at first because I have to open and close the parentheses. And in that case, I have 6 elements. Of course, I can replace this sign “>” with that one “<”. In that case, I am selecting the first 2 “tr” elements.

I want to end with something a bit specific. What if I want to select the first element which contains a “£” sign? Here it’s a bit different. Why? It’s because this element is not a sibling of this element or that one or the last one. So, the process should be slightly different. I am going to write the XPath as usual. So we want to target //p[contains(text(), "£")]. And here it is. And it’s not what I expected. So, I’m going to replace the “p” tag with “*”. In that case, I have 4 elements. Let’s assume I want to select the first one. I cannot do something like this. What I have to do instead is (XPATH)[1]. And I should be able to select the first one. Or the second one. Or the third one. Or the last one.

The sibling selectors are also very important selectors you have to remember. They are particularly useful if you want to target a variable text next to a static text. Let me explain this part. OK, let’s take an example. Let’s assume that I want to select the “Books” text. The “Books” text is a sibling of the text “Product Type”. So, what I am going to write is an XPath with a text which is equal to “Product Type”. Then, I select the following sibling, which is “Books”. So, let’s do this one.
I select XPATH/following-sibling::TAG_SIBLING. And if I want to select a specific following sibling, the first one, I simply add an order selector. In that case, it doesn’t matter because there is only a single sibling. So, no matter what you do, you will only get one element. You can also select the preceding sibling, which is the same thing but backward. This is a kind of XPath you will rarely use. The following sibling is much more frequent. But let’s assume I want to select the “Poetry” text, based on that element. So, I select this element based on that one. As you can notice, this is an “li” element with an attribute of “class” and a value of “active”. I write it this way: XPATH/preceding-sibling::TAG_SIBLING[1].

What about the parent selectors? Let’s assume we start with this element, with this XPath. So, we should be able to write something like //th[text()="UPC"]. And I want to select the “table” element. So, I want to select the parent of the parent of the parent. If I want to select the parent of an element, I type “XPATH/..”. And I will be able to select this element. As you can notice, it’s now “xh-highlight”. It has a value of “xh-highlight”, which means it has been selected by XPath Helper. I repeat the same process another time. And I only have access to one element. And this element is a “table” element. I can also directly select the ancestor. Meaning, I no longer need to repeat the same process 3 times. In that case, I write XPATH/ancestor-or-self::TAG_ANCESTOR.

The child selectors. What I mean by child selectors are specific selectors which target elements based on the children they contain. Let’s come back to our listing page. And let’s assume I want to select the “article” tag. I can write //article. And I will get 20 elements. If I add the condition //TAG[*], it means that I want to select the “article” elements which have at least one child. Which is the case here.
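The order, sibling and parent selectors covered so far can be sketched in a few lines, with explicit limits. Python's standard `xml.etree.ElementTree` module, used here, supports [last()], [last()-1] and "..", but it has no position() comparisons and no following-sibling or ancestor axes. Those are replaced below by a list slice, by a hypothetical `following_siblings` helper, and by chaining "..". The table is a shortened, hypothetical copy of the product table.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened product table wrapped in a container element.
PAGE = """
<div>
  <table>
    <tbody>
      <tr><th>UPC</th><td>hypothetical-upc</td></tr>
      <tr><th>Product Type</th><td>Books</td></tr>
      <tr><th>Availability</th><td>In stock</td></tr>
      <tr><th>Number of reviews</th><td>0</td></tr>
    </tbody>
  </table>
</div>
""".strip()

root = ET.fromstring(PAGE)
ROWS = ".//tbody/tr"

last_cell = root.find(ROWS + "[last()]/td")        # XPATH[last()]
before_last = root.find(ROWS + "[last()-1]/td")    # XPATH[last()-1]
all_but_first = root.findall(ROWS)[1:]             # stand-in for XPATH[position()>1]


def following_siblings(tree, path):
    """Python stand-in for XPATH/following-sibling::*."""
    element = tree.find(path)
    parent = tree.find(path + "/..")
    children = list(parent)
    return children[children.index(element) + 1:]


# The variable text ("Books") sits next to the static text ("Product Type").
books = following_siblings(root, ".//th[.='Product Type']")[0]

# Parent selector: one ".." per level, from th up to tr, tbody and table.
row = root.find(".//th[.='UPC']/..")
table = root.find(".//th[.='UPC']/../../..")

print(last_cell.text, before_last.text, len(all_but_first))
print(books.text, row.tag, table.tag)
```

In a full XPath engine, the helper and the chained ".." would be replaced by the following-sibling and ancestor axes described in the video.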
It also means that, if we select another element… For instance, that element, which doesn’t have any child… I will type something like this. I’ve got 20 elements. But if I add my previous condition, I will no longer select these elements. Because my condition is no longer met. Let’s come back to our “article” element. I can also specify the tag of my child. So, I can say “p” and it will work. Indeed, if we take a look at the “article” element, I’ve got a “p” child. Just here.

These 2 selectors are a bit specific. This is one of the first times I’ve seen them. So, I hope I will be able to explain them to you in a clear way. I want to select the entire element, the element which contains all 20 elements. So, I’ve got a “section”. I’ve got an “ol” element. That’s great. I can say that I want to select //TAG[count(TAG_CHILD) + CONDITION]. And I will get the “ol” element because the “ol” element has got 20 “li” tags. But if I want to count a descendant and no longer a direct child… For instance, the “article” elements, which are below it. It will no longer work. In that case, I have to add a “//”.

Finally, let’s say a word regarding the operator selectors. Meaning we can create an XPath with a combination of factors. We can create 2 conditions: a CONDITION_1 and a CONDITION_2. We can say I want to select this condition or the other one. We can select an element with a value which is different from the one we have mentioned. Same thing with numeric values. We can say we want to select everything but the elements matching a condition we have specified. And we can end with selecting 2 XPaths at the same time. So, let’s jump into it.

Let’s assume I want to select this warning element. As you can notice, this element has 2 attributes and 2 values. So, I can say something like //TAG[CONDITION_1 and CONDITION_2]. I can replace the “and” with an “or”.
Meaning, I will select the elements that match this condition or that one or both at the same time. Let’s come back to our listing page. And let’s assume I want to select the ratings. But not the “one star” ratings. I can write this XPath: //TAG[@ATTRIBUTE!="VALUE"]. And I’ve got 54 elements, which is way too many. Maybe I can create an “and” condition. So, “and”. It will be something like “and contains(@class, "star-rating")]”. Et voilà. It sounds good to me.

As I have mentioned, there is also the XPath for numeric values. To be honest, you don’t need to remember that one. I just want to show you an example thanks to ChatGPT. But it’s really a unique case that you will almost never see.

What can we do next? We can write the same XPath but with “not(CONDITION)”. I do XPATH[not(CONDITION)]. And I’ve got 54 elements. So, if I still want to select the ratings but not the one-star ratings, I will add “and contains(@class, "star-rating")]”. This is another way of writing the very same XPath. That’s actually a pretty good exercise, because you can write different XPaths and still end up with the same result. Et voilà.

Finally, I can write 2 XPaths at the same time. So, let’s change things a little bit. I want to select the ratings. But only the four-star and the five-star ratings. So, that shouldn’t be too complicated. Here is an example of a five-star rating. So, here we go: XPATH_1|XPATH_2. And we’ve got 8 elements. It sounds correct.

Now, we are going to find out how to write XPaths from scratch for web scraping. We are not going to code. We are not going to use Python. But Octoparse instead. Let’s dive into how to write XPaths for paginations, for loop items and for Extract Data steps. Before we begin, I should remind you of the difference between a short XPath and an absolute/long XPath. Actually, there is something specific related to Octoparse.
The definition of an absolute XPath in Octoparse is different from the one we have talked about previously. There is a small change. Octoparse contrasts absolute XPaths with relative ones. An absolute XPath is used when you want to extract data from the web page directly. And a relative XPath is used when you want to extract data from a loop item. It means that in Octoparse, an absolute XPath can be short, can be concise and can be formatted the same way as a short XPath.

Now, let’s see how we can create our own workflow thanks to Octoparse. We are going to create a pagination and a loop item from scratch. And we are going to select a single element, which will be the title of the book from the detail page. So, let’s jump into it. I copy my URL. I paste it here. And I click on “Start”.

As we have said, the first thing we are going to do is the pagination. The pagination is a loop. So, I add a step. I create a new loop. This loop will be renamed “Pagination”. And I’m going to select a single element. The reason is that the pagination is always the same element. We have to click on the “next-page” button each time, in order to go from one page to the next. We are looking for an “a” element. Because an “a” represents a link, as we mentioned at the beginning of the video. We can proceed in different ways. I’m going to start with the “li” element. It will be something like this. I always verify my XPath with XPath Helper. I’ve got one element. And I select the direct child which has an “a” tag. I end up with this. I copy my XPath. I paste it here. I’m going to add a bit of timeout here. And the element is selected.

Now, I’m going to click on it. I click on the element. I’m going to call it “click to paginate”. And here it is: the difference between an absolute and a relative XPath. In that case, we are targeting a relative XPath, because it’s relative to the XPath we have written for the pagination.
I click on “Relative XPath to the Loop Item”. I’m going to “Wait before action”. And I’m going to load the page with AJAX, with a timeout of 10s. Maybe I will create another video in order to explain the usefulness of AJAX to you. And if we click on “pagination” and “click to paginate”, we should end up on page 2. So, as you can notice, the pagination works. We have succeeded. Congratulations.

Now, we can create our loop item. When I say “loop item”, we hear the word “loop”. Which means we have to create another loop, which will be located between the pagination and the “click to paginate” button. This one is a bit more specific. We are going to select “Variable List” as the loop mode. Which makes sense because we are looking for a list, a list of elements. We are going to write an XPath which targets all 20 books from the page. Here, it’s pretty straightforward, because each book has an “article” tag. Let’s write that one. I’ve got 20 elements. Here we go.

Now, I have to do something else. I have to click on the URL in order to go to the detail page, for each element. So, I’m going to write a relative XPath, an XPath relative to the loop item, to this XPath. So, let’s see if I type “//a”. I’ve got 40 elements. And I only expect 20 elements. So, let’s try to be more specific. So, “//article//h3/a”. So, this is the loop item. I select a descendant with an “h3” tag. And I select a direct child with an “a” tag. And I’ve got 20 elements, which is good. In other words, I select this part of the XPath. And I’m going to add another “click item”, which is relative to the loop item. I paste my XPath. I click on “apply”. And let me check something. OK. I have to select “Open in a new tab”. Let’s check it out. Here we go.

And now, we select the title which is, I’m pretty sure of that, an “h1” element. So, I add an “Extract Data” step. I click on “Add custom field”, “Capture data on the page”. I write my XPath here.
Not “//a”, sorry. It’s “//h1”. And I name my field, which will be “Title”. I “confirm”. I add a slight timeout. And I click on “run”. And we are going to find out together whether the workflow we have just created works. And it sounds good to me. It’s going to scrape each element, every 3s.

This is the end of the video. I hope you have enjoyed it. If that’s the case, you can give it a thumbs up and subscribe to the channel. If you want to scrape B2B leads, you can also click the link in the description. Finally, if you want to get any kind of web scraping services, you can ask me for a quote by sending me an email. See you next time.
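The video builds this workflow in Octoparse, without code. Purely as an illustration of the same three steps (pagination, loop item, Extract Data), here is a rough Python sketch. Everything in it is an assumption: the "site" is a hypothetical in-memory dictionary instead of real pages, the `li[@class='next']` pagination XPath is a guess (the video does not spell it out), and `scrape_titles` is an invented helper that says nothing about how Octoparse works internally. It uses the standard `xml.etree.ElementTree` module, where "//" is written ".//".

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory "website": URL -> well-formed markup.
SITE = {
    "page-1.html": """<html><body>
        <ol>
          <li><article><h3><a href="book-a.html">Book A</a></h3></article></li>
          <li><article><h3><a href="book-b.html">Book B</a></h3></article></li>
        </ol>
        <ul><li class="next"><a href="page-2.html">next</a></li></ul>
      </body></html>""",
    "page-2.html": """<html><body>
        <ol>
          <li><article><h3><a href="book-c.html">Book C</a></h3></article></li>
        </ol>
      </body></html>""",
    "book-a.html": "<html><body><h1>Book A</h1></body></html>",
    "book-b.html": "<html><body><h1>Book B</h1></body></html>",
    "book-c.html": "<html><body><h1>Book C</h1></body></html>",
}


def scrape_titles(site, start_url):
    """Walk the listing pages and collect the title of every detail page."""
    titles = []
    url = start_url
    while url is not None:
        page = ET.fromstring(site[url])
        # Loop item: one element per book on the listing page.
        for article in page.findall(".//article"):
            # Relative XPath: evaluated from the current loop item, not the page.
            link = article.find(".//h3/a")
            detail = ET.fromstring(site[link.get("href")])
            # Extract Data: the title field on the detail page.
            titles.append(detail.find(".//h1").text)
        # Pagination: follow the "next" link until there is none left.
        next_link = page.find(".//li[@class='next']/a")
        url = next_link.get("href") if next_link is not None else None
    return titles


titles = scrape_titles(SITE, "page-1.html")
print(titles)
```

The inner `article.find(".//h3/a")` call plays the role of the relative XPath: it only searches inside the current loop item, which is why it returns one link per book.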
Info
Channel: François from Octoparse
Views: 2,220
Id: dQByAdJOrr4
Length: 36min 1sec (2161 seconds)
Published: Wed May 17 2023