XPath Tutorial (and How to Use them for Web Scraping)

Captions

XPath is a combination of 2 words: XML & Path. It can be easily understood as the "path" to find target elements within XML or HTML documents. No wonder it’s so powerful and useful for web scraping. XPath helps you to deal with incorrect data, inaccurate pagination, endless loops and so on. As a query language, it needs to be understood, learned and practiced. I cannot practice it for you. But I can teach you the keys to make your learning a little faster.

My name is François. I’m not a developer, but I’ve been learning web scraping for the past two years. And because I am still a bit of a beginner, you will learn things from scratch. You will learn how to read an HTML document, what the different categories of XPaths are, and which one to choose depending on the circumstances. Then, you will discover the cheat sheet, meaning the different XPaths you will be able to write. Finally, you will be able to use your XPaths for web scraping through a tool named Octoparse.

Before we begin… If you have any trouble understanding my very French accent [speaks some French], you can turn the subtitles on. You can also check the timecodes and the Octoparse download link in the description.

Here is a website designed for web scraping. You can easily access the corresponding HTML document by hitting the F12 key. Or you can simply do a right click and click on “inspect”. HTML has different levels of elements, just like a tree structure. If we assume that this is level 1, then this is level 2. These are level 3; 3; 3; 3. Then, 4; 5 and so on. If you are having trouble understanding how it works, think about how we go about finding a particular file on our computer: we open a folder, then a folder inside it, and so on.

So, this is an HTML element. An HTML element consists of a start tag and an end tag. A tag is a name written between angle brackets. The start tag is “div” and the end tag is “div” too. If we take another example: the start tag is here. It’s “a”. And the end tag is just here.
A bit more difficult. This one: the start tag is “div” and the end tag is located just here. It’s also “div”. These tags have a meaning. I don’t know them all. But for example, an “a” tag represents a link. An “h1” tag marks a title. An “img” tag marks an image.

Each element is divided into different components, and you can find up to 4 of them. First of all, you’ve got the tag: the start tag and the end tag. Secondly, there is the attribute. The tricky thing is that you can have 0, 1 or multiple attributes. For example, in this element, we have one attribute. But, for that one, we have 2 attributes. Attributes are always written inside the start tag, right after the tag name. In that case, the first attribute is “href” and the second attribute is “style”. We can tell because they share the same color. Thirdly, an attribute usually comes paired with a value, and the value is always written between 2 quotation marks. If we take the same element as an example, the value of the attribute “href” is “/”. And the value of the attribute “style” is “text-decoration: none”. Last, there is the text content. It’s the only component which is also directly visible to the internet user. We can see here that it’s written “The world as we have created it is a process of our thinking.” Blah, blah, blah… which is the same text as the one we can see on our screen.

Now, let’s figure out how all of these elements are related to one another. When an HTML element is contained within another element, the element that contains it is called the parent. Let’s take an example. This element has a parent. The parent is that element. Same thing for that one. They share the same parent. And same thing for the last one. They all share the same parent, which is “div class quote”. So, the contained element, or the contained elements, are children of the parent. Which means that each element has one parent but can have zero, one or multiple children.
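
The components and relationships described above can be explored in a few lines of Python. This is only a minimal sketch using the standard library's `xml.etree.ElementTree`; the snippet below is a made-up, well-formed stand-in for the quote block shown in the video (only the quote text and the “href”/“style” values come from the video, the rest is assumed).

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely shaped like the "div class quote" block in the video.
SNIPPET = """
<div class="quote">
  <span class="text">The world as we have created it is a process of our thinking.</span>
  <span>by <small>someone</small></span>
  <div class="tags"><a href="/" style="text-decoration: none">sample link</a></div>
</div>
"""

quote = ET.fromstring(SNIPPET)

# One element, its components: tag, attributes (name/value pairs), text content.
link = quote.find(".//a")
print(link.tag)     # the tag name
print(link.attrib)  # the attributes with their values
print(link.text)    # the text content

# The direct children of the parent element; they are siblings of one another.
children = list(quote)
print([child.tag for child in children])

# ElementTree elements keep no pointer to their parent, so build a child -> parent map.
parent_of = {child: parent for parent in quote.iter() for child in parent}
print(parent_of[link].get("class"))

# Siblings of the first child: every other child of the same parent.
siblings = [c for c in parent_of[children[0]] if c is not children[0]]
print([s.tag for s in siblings])
```

Real web pages are often not well-formed XML, so a dedicated HTML parser would be needed outside of this toy example.
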
Finally, elements that have the same parent are called siblings. All 3 elements are siblings. They are all children of the element “div class quote”. Which also implies that this element is the parent of all of these.

Now, let’s say a word regarding the different kinds of XPaths. To begin with, I am going to show you something specific. Let’s take an element as an example. As you can see, if I click on “…”, I can “copy” and I can copy the XPath. Or I can copy the full XPath. And I am going to explain to you why you should never ever click on either of them.

In order to specify the location of an element, XPath uses "/" to connect the different tags from the top to the bottom. Which means, if we want to select this item, it will be something like: /html/body/div[1]. And we will be able to select that element. So far, so good. This is what we call an Absolute XPath. However, an absolute XPath can quickly become a lot trickier. Because, if we want to select this element instead, the absolute XPath will be something like this. And it’s way longer. Let’s be honest. This XPath looks weird. We should rename “Absolute XPath” to “Long & Confusing XPath”. So, by clicking on “copy xpath” or “copy full xpath”, you get an automatically generated path of this kind (“copy full xpath” always returns an absolute XPath). And you will almost never use an absolute XPath for web scraping.

But what do we have to use instead? We have to use the Short XPath. The short XPath uses "//" instead of "/" to jump directly to the element we want to start the XPath with. As a result, if we want to select this element, we can write something like this: //span[@class="text"]. And we will be able to select this element, among others. It looks better than the previous one. Overall, as I will explain further later in the video, we can combine the “//” and the “/” depending on the situation. So, if I want to select this element, I can start with the parent. So, something like this. Then, I add a “/”.
“//” at the beginning but a “/” to select the following tag. And I select the first “span”. Just like this. So, a “/” stands for “selecting the direct child”, whereas a “//” means that we select any element below it, at any depth. What we call a descendant.

In this part, you are going to learn the list of XPaths. I tried to categorize them in order to make your learning as easy as possible. Another thing is that, if you want to verify your XPaths the same way as I do, you can use XPath Helper, which is a Chrome extension. The first use is a bit tricky. You may need to reload your pages. But otherwise, it’s a great tool.

Let’s start with the tag selectors. The first XPath you are going to learn is //TAG. Let’s assume I want to target this element, which is an “h1” element. This element only has a start tag and an end tag. It doesn’t have any attribute. So, I can simply type //h1. And it should do the job. There is one element selected. I can also do it another way. I can use //TAG_1/TAG_2. Which means we jump directly to TAG_1. Then, we select its direct child, which is TAG_2. If I still want to select the h1 element, I can see that the parent is a “div” element, with a “div” tag. Therefore, I can type //div/h1. And it’s the very same thing. You can do the same thing but replace the “/” with a “//”, if you no longer want to select a direct child but a descendant. So, I can simply type //div//h1. And I will end up with the same result. Finally, there is something specific. If I type //*, I will get all the elements of my HTML code. Because “*” means “everything”. So, I am selecting all the elements, whatever their tags.

What about the Attribute & Value Selectors? Let’s assume I want to select the “image_container” element. Meaning, all the images from the first page. I can write //TAG[@ATTRIBUTE]. It’s not accurate enough. So, what should we do? We should also mention the value. And the value is “image_container”.
Not “image_container xh-highlight”, which only shows up because XPath Helper is currently highlighting the element. But only “image_container”. In that case, I type //TAG[@ATTRIBUTE="VALUE"]. And it’s more accurate. I only get 20 elements. And if I want to select the link of the images, I can add “/a”, because it’s the direct child.

Let’s take something a bit more difficult. What if we want to select the rating? As you can see, we have an element with a “p” tag, an attribute of “class” and a value of “star-rating Three”. So, I can write something like this: //p[@class="star-rating Three"]. But I will only get 3 elements. And we have 20 books on the page. Indeed, there is something wrong. It’s because the value is “Three”. Because the rating is 3 out of 5. But if we select the following rating, it’s “star-rating One”. Then, it’s “star-rating Four”. And so on. So, what should I do if I want to select all the ratings, no matter what the rating is? In that case, I can use this kind of XPath: //TAG[contains(@ATTRIBUTE, "VALUE")]. There is something common between all the values, which is the text “star-rating”. So, if I type “star-rating”, I should be able to get all 20 elements. I can also replace “contains” with “starts-with”, if I want to. In that case, it means that the value has to start with this text. Last but not least, you can also write “ends-with”, which is the same thing but the value has to end with this text. Unfortunately, it doesn’t work with XPath Helper. And it doesn’t work with Octoparse either. So, I just want to tell you that it’s a valid XPath. But because “ends-with” only exists from XPath 2.0 onwards, and browsers only support XPath 1.0, it may not work depending on the circumstances.

I selected the first book. So, I ended up on the detail page. And now, we are going to talk about the text selectors. It’s paramount to remember this kind of XPath, because you will use it quite often. Let’s assume I want to select the “availability” text.
In that case, it’s a bit tricky because there is not much info. Only a tag. So, I can do //th. But that’s not accurate enough. I have 7 elements instead of a single one. If I want to select “Availability”, I can say that I want the text which is equal to “Availability”. Which is the reason why I use this XPath: //TAG[text()="CONTENT"]. I can do the same thing but with a “contains(text())” instead. If I want to select the price, I can do //td[text()="£51.77"]. And I will be able to select 2 elements. Let’s say it’s fine for the example. But let’s assume I want to select the elements which contain the “£” sign. In that case, I will create a “contains” condition: //TAG[contains(text(), "CONTENT")]. There is something wrong. It’s because I didn’t close the last parenthesis. And now, I’ve got 3 elements. Each of these elements contains a “£” sign. And same thing as we have seen previously: I can replace “contains” with “starts-with”. Unfortunately, as with attributes, “ends-with” is not available for the “text” kind of XPath.

Order Selectors. The order selectors mainly work if the elements you want to target are all siblings, which is the case in our table. If you take a look, you can see that all the “tr” elements are siblings, children of the same parent “tbody”. Now, let’s assume I want to select the first row. I can say that I want to target XPATH[1]. Then, if I want to select only this element, I can add a direct child. So “/td”. And it should do the job. Now, if I no longer want to select the first one but the last one, no matter how many siblings there are, I can replace “1” with “last()”. And I will get the number of reviews. If I want to select… not the last one but the one just before it. So, this row instead. I can do “last()-1”. A bit trickier. What if I want to select all the rows from my table but the first one?
In that case, I can say that I want to select XPATH[position()>1]. And it doesn’t work because I have to open and close the parentheses. And in that case, I have 6 elements. Of course, I can replace this sign “>” with that one “<”. With a limit of 3, for example, I am selecting the first 2 “tr” elements.

I want to end with something a bit specific. What if I want to select the first element which contains a “£” sign? Here it’s a bit different. Why? It’s because this element is not a sibling of this element or that one or the last one. So, the process should be slightly different. I am going to write the XPath as usual. So we want to target //p[contains(text(), "£")]. And here it is. And it’s not what I expected. So, I’m going to replace the “p” tag with “*”. In that case, I have 4 elements. Let’s assume I want to select the first one. I cannot simply add [1] at the end, because that index is applied among siblings, not to the whole list of results. What I have to do instead is (XPATH)[1]. And I should be able to select the first one. Or the second one. Or the third one. Or the last one.

The Sibling Selectors are also very important selectors you have to remember. They are particularly useful if you want to target a variable text next to a static text. Let me explain this part. Ok, let’s take an example. Let’s assume that I want to select the “Books” text. The “Books” text sits in a sibling of the element with the text “Product Type”. So, what I am going to write is an XPath with a text which is equal to “Product Type”. Then, I select the following sibling, which is “Books”. So, let’s do this one. I select XPATH/following-sibling::TAG_SIBLING. And if I want to select the first following sibling, I simply add an order selector. In that case, it doesn’t matter because there is only a single sibling. So, no matter what you do, you will only get one element. You can also select the preceding sibling, which is the same thing but backward. This is a kind of XPath you will rarely use. The following sibling is much more frequent.
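
The order selectors and the “variable text next to a static text” idea can be tried out in Python. This is a rough sketch under stated assumptions: the table is a made-up, simplified stand-in for the product table in the video (only “Product Type”, “Books” and “£51.77” come from the video), and it relies on the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath. In particular it has no `position()>1`, no parenthesised `(XPATH)[1]` and no `following-sibling` axis, so those three are replaced by plain Python or by a lookup through the shared parent row, which is a different technique reaching the same cell.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample table; the values are placeholders, not scraped data.
TABLE = """
<table>
  <tr><th>UPC</th><td>sample-upc</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Price</th><td>£51.77</td></tr>
  <tr><th>Availability</th><td>In stock</td></tr>
  <tr><th>Number of reviews</th><td>0</td></tr>
</table>
"""

table = ET.fromstring(TABLE)

# Order selectors among sibling rows: [1], [last()] and [last()-1].
first = table.find("./tr[1]/td").text
last = table.find("./tr[last()]/td").text
before_last = table.find("./tr[last()-1]/td").text

# Stand-in for tr[position()>1]: take every row, then slice in Python.
all_but_first = [row.find("td").text for row in table.findall("./tr")[1:]]

# XPATH[1] versus (XPATH)[1]: the index inside the path is applied per parent,
# so ".//td[1]" matches the first cell of EVERY row ...
first_td_per_row = table.findall(".//td[1]")
# ... while indexing the result list gives the first match overall.
first_td_overall = table.findall(".//td")[0]

# Stand-in for //th[text()="Product Type"]/following-sibling::td:
# pick the row whose th reads "Product Type", then take its td.
product_type = table.find("./tr[th='Product Type']/td").text

print(first, last, before_last, all_but_first, product_type)
```
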
But let’s assume I want to select the “Poetry” text, based on that element. So, I select this element based on that one. As you can notice, this is an “li” element with an attribute of “class” and a value of “active”. I write it this way: XPATH/preceding-sibling::TAG_SIBLING[1].

What about the parent selectors? Let’s assume we start with this element, with this XPath. So, we should be able to write something like //th[text()="UPC"]. And I want to select the element “table”. So, I want to select the parent of the parent of the parent. If I want to select the parent of an element, I type XPATH/.. (a slash followed by two dots). And I will be able to select this element. As you can notice, it’s now “xh-highlight”. Its class has a value of “xh-highlight”, which means it has been selected by XPath Helper. I repeat the same process, one level at a time. And I only get one element. And this element is a “table” element. I can also directly select the ancestor. Meaning, I no longer need to repeat the same process 3 times. In that case, I write XPATH/ancestor-or-self::TAG_ANCESTOR.

The Child Selectors. What I mean by child selectors are selectors which target elements depending on the children they contain. Let’s come back to our listing page. And let’s assume I want to select the “article” tag. I can write //article. And I will get 20 elements. If I add a condition, //TAG[*], it means that I want to select the “article” elements which have at least one child. Which is the case here. It also means that, if we select another element… For instance, that element, which doesn’t have any child… I will type something like this. I’ve got 20 elements. But if I add my previous condition, I will no longer select these elements. Because my condition is no longer met. Let’s come back to our “article” element. I can also specify the tag of my child. So, I can say “p” and it will work. Indeed, if we take a look at the “article” element, I’ve got a “p” child. Just here. These 2 selectors are a bit specific.
This is one of the first times I’ve seen them. So, I hope I will be able to explain them to you in a clear way. I want to select the entire element, the element which contains all 20 elements. So, I’ve got a “section”. I’ve got an “ol” element. That’s great. I can say that I want to select //TAG[count(TAG_CHILD) + CONDITION], where the condition is a comparison, for example a count equal to 20. And I will get the “ol” element because the “ol” element has got 20 “li” children. But if I want to count descendants and no longer direct children… For instance, the “article” elements, which are below it. It will no longer work. In that case, I have to add a “//” (strictly speaking “.//”, so that the count starts from the current element rather than from the top of the document).

Finally, let’s say a word regarding the Operator Selectors. Meaning we can create an XPath with a combination of conditions. We can require 2 conditions at once, a CONDITION_1 and a CONDITION_2. We can say I want to select this condition or the other one. We can select an element with a value which is different from the one we have mentioned. Same thing with numeric values. We can say we want to select everything except the elements matching a condition we have specified. And we can end up selecting 2 XPaths at the same time. So, let’s jump into it.

Let’s assume I want to select this warning element. As you can notice, this element has 2 attributes and 2 values. So, I can say something like //TAG[CONDITION_1 and CONDITION_2]. I can replace the “and” with an “or”. Meaning, I will select the elements matching this condition or that one or both at the same time. Let’s come back to our listing page. And let’s assume I want to select the ratings. But not the “one star” ratings. I can write this XPath: //TAG[@ATTRIBUTE!="VALUE"]. And I’ve got 54 elements, which is way too many. Maybe, I can create an “and” condition. So, “and”. It will be something like “and contains(@class, "star-rating")]”. Et voilà. It looks good to me. As I have mentioned, there is also the XPath for numeric values. To be honest, you don’t need to remember that one. I just want to show you an example thanks to ChatGPT.
But it’s really a unique case that you will almost never see. What can we do next? We can write the same XPath but with “not(CONDITION)”. I do XPATH[not(CONDITION)]. And I’ve got 54 elements. So, if I still want to select the ratings but not the one-star ratings, I will add the “and contains(@class, "star-rating")]”. This is another way of writing the very same XPath. That’s actually a pretty good exercise, because you can write different XPaths and still end up with the same result. Et voilà. Finally, I can write 2 XPaths at the same time. So, let’s change things a little bit. I want to select the ratings. But only the four-star and the five-star ratings. So, that shouldn’t be too complicated. Here is an example of a five-star rating. So, here we go: XPATH_1|XPATH_2. And we’ve got 8 elements. It looks correct.

Now, we are going to find out how to write XPaths from scratch for web scraping. We are not going to code. We are not going to use Python, but Octoparse instead. Let’s dive into how to write XPaths for Paginations, for Loop Items and for Extract Data steps. Before we begin, I should remind you of the difference between a Short XPath and an Absolute/Long XPath. Actually, there is something specific to Octoparse. The definition of an Absolute XPath in Octoparse is different from the one we have talked about previously. There is a small change. Octoparse contrasts Absolute XPaths with Relative ones. An Absolute XPath is used when you want to extract data from the web page directly. And a Relative XPath is used when you want to extract data from a loop item. It means that in Octoparse, an Absolute XPath can be short, can be concise and can be formatted the same way as a Short XPath.

Now, let’s see how we can create our own workflow thanks to Octoparse. We are going to create a pagination and a loop item from scratch. And we are going to select a single element which will be the title of the book from the detail page. So, let’s jump into it.
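
The video itself stays inside Octoparse. Purely as an illustration of the absolute versus relative distinction (in the Octoparse sense), here is a small Python sketch built on the standard library's `xml.etree.ElementTree`. The listing markup, book names and links are invented for the example. Because ElementTree has no `contains()`, `starts-with()`, `ends-with()` or `not()`, the rating conditions are written as ordinary Python string tests, which is a stand-in for the XPath predicates and not the XPath itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing page with 3 books instead of 20.
LISTING = """
<ol>
  <li><article><p class="star-rating Three"/><h3><a href="book-1.html">Book 1</a></h3></article></li>
  <li><article><p class="star-rating One"/><h3><a href="book-2.html">Book 2</a></h3></article></li>
  <li><article><p class="star-rating Five"/><h3><a href="book-3.html">Book 3</a></h3></article></li>
</ol>
"""

page = ET.fromstring(LISTING)

# "Absolute" in the Octoparse sense: evaluated against the whole page.
loop_items = page.findall(".//article")

rows = []
for article in loop_items:
    # "Relative": evaluated from the current loop item only.
    link = article.find(".//h3/a")
    rating = article.find("./p[@class]")
    rows.append((link.text, link.get("href"), rating.get("class")))

# Python stand-in for [contains(@class, "star-rating") and not(...One...)].
not_one_star = [
    title for title, _, css in rows
    if "star-rating" in css and not css.endswith("One")
]

# Python stand-in for XPATH_1|XPATH_2 (four-star or five-star ratings).
four_or_five = [title for title, _, css in rows if css.endswith(("Four", "Five"))]

print(not_one_star, four_or_five)
```

Evaluating the link path from each `article` rather than from the page is what keeps one link per book, which mirrors the reason for using a relative XPath inside a loop item.
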
I copy my URL. I paste it here. And I click on “Start”. As we have said, the first thing we are going to do is the pagination. The pagination is a loop. So, I add a step. I create a new loop. This loop will be renamed “Pagination”. And I’m going to select a single element. The reason is that the pagination is always the same element. We have to click on the “next-page” button each time, in order to go from one page to the next. We are looking for an “a” element. Because an “a” represents a link, as we mentioned at the beginning of the video. We can proceed in different ways. I’m going to start with the “li” element. It will be something like this. I always verify my XPath thanks to XPath Helper. I’ve got one element. And I select the direct child which has an “a” tag. I end up with this. I copy my XPath. I paste it here. I’m going to add a bit of timeout here. And the element is selected.

Now, I’m going to click on it. I click on the element. I’m going to call it “click to paginate”. And here it is: the difference between an absolute and a relative XPath. In that case, we are using a relative XPath, because it’s relative to the XPath we have written for the pagination. I click on “Relative XPath to the Loop Item”. I’m going to “Wait before action”. And I’m going to load the page with AJAX, with a timeout of 10s. Maybe, I will create another video in order to explain the usefulness of AJAX to you. And if we click on “pagination” and “click to paginate”, we should end up on page 2. So, as you can notice, the pagination works. We have succeeded. Congratulations.

Now, we can create our loop item. When I say “loop item”, we hear the word “loop”. Which means we have to create another loop, which will be located between the pagination and the “click to paginate” step. This one is a bit more specific. We are going to select “Variable List” as a loop mode. Which makes sense because we are looking for a list, a list of elements.
We are going to write an XPath which targets all 20 books from the page. Here, it’s pretty straightforward, because each book has an “article” tag. Let’s write that one. I’ve got 20 elements. Here we go. Now, I have to do something else. I have to click on the URL in order to go to the detail page, for each element. So, I’m going to write a relative XPath, a relative XPath to the loop item, to this XPath. So, let’s see if I type “//a”. I’ve got 40 elements. And I only expect 20 elements. So, let’s try to be more specific. So, //article//h3/a. So, this is the loop item. I select a descendant with an “h3” tag. And I select a direct child with an “a” tag. And I’ve got 20 elements, which is good. In other words, I select this part of the XPath. And I’m going to add another “click item”, which is relative to the loop item. I paste my XPath. I click on “apply”. And let me check something. Ok. I have to select “Open in a new tab”. Let’s check it out. Here we go.

And now, we select the title which is, I’m pretty sure of that, an “h1” element. So, I add an “Extract Data” step. I click on “Add custom field”, “Capture data on the page”. I write my XPath here. Not “//a”, sorry. It’s “//h1”. And I name my field, which will be “Title”. I “confirm”. I add a slight timeout. And I click on “run”. And we are going to figure out together whether the workflow we have just created works. And it looks good to me. It’s going to scrape one element every 3s.

This is the end of the video. I hope you have enjoyed it. If that’s the case, you can give it a thumbs up and subscribe to the channel. If you want to scrape B2B leads, you can also click the link in the description. Finally, if you want any kind of web scraping service, you can ask me for a quote by sending me an email. See you next time.
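
For readers who think in code, the three steps of the workflow (pagination, loop item, extract data) can be mirrored in a short Python sketch. This is not how Octoparse works internally and nothing is downloaded: `SITE` is an invented in-memory stand-in for a website, the file names and titles are placeholders, and the `li` class “next” used for the pagination link is an assumption, since the exact pagination XPath is only shown on screen in the video.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory "site": URL -> well-formed markup.
SITE = {
    "page-1.html": (
        "<html><body>"
        '<article><h3><a href="book-1.html">B1</a></h3></article>'
        '<article><h3><a href="book-2.html">B2</a></h3></article>'
        '<ul><li class="next"><a href="page-2.html">next</a></li></ul>'
        "</body></html>"
    ),
    "page-2.html": (
        "<html><body>"
        '<article><h3><a href="book-3.html">B3</a></h3></article>'
        "</body></html>"
    ),
    "book-1.html": "<html><body><div><h1>First Book</h1></div></body></html>",
    "book-2.html": "<html><body><div><h1>Second Book</h1></div></body></html>",
    "book-3.html": "<html><body><div><h1>Third Book</h1></div></body></html>",
}


def scrape_titles(start_url):
    """Follow the "next" link page after page and collect each book title."""
    titles = []
    url = start_url
    while url is not None:                                # pagination loop
        page = ET.fromstring(SITE[url])
        for article in page.findall(".//article"):        # loop item
            link = article.find(".//h3/a")                 # relative to the loop item
            detail = ET.fromstring(SITE[link.get("href")])  # "click item"
            titles.append(detail.find(".//h1").text)       # extract data
        # Assumed pagination selector: an "a" that is a direct child of li.next.
        next_link = page.find(".//li[@class='next']/a")
        url = next_link.get("href") if next_link is not None else None
    return titles


print(scrape_titles("page-1.html"))
```

The loop stops on the last page simply because the pagination selector finds nothing there, which is the same condition that ends a pagination loop built around a “next page” button.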

Info

Channel: François from Octoparse

Views: 2,220

Id: dQByAdJOrr4

Length: 36min 1sec (2161 seconds)

Published: Wed May 17 2023