XPath query, IMPORTXML & Google Sheets - Advanced Tutorial

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
since I recorded my import XML function video I had a lot of requests from people to do more detail tutorial about import XML function there are a couple of things I want to clarify about this so a lot of people think about import XML function as an alternative for them to use as a web crawler and crawl information from a page it may work in a very simple cases and it's very useful but if you really want to create a real powerful web crawler let me tell you that import XML function is not best way to go so you probably want to learn some sort of programming language if you want to go more in-depth into crawling web pages maybe start with Python or nodejs is a great option as well a c-sharp you know depending on what you're trying to accomplish if you want something simpler I guess started Python and go from there but as far as import XML so there is really not much to tell about the function itself import XML is very simple function now the part that's more difficult about it is learning about the XPath query which is the query you used to actually find the element on a page you're looking for to learn about XPath I'm going to show you some things but again XPath is there's a lot involved in this right so and I'm going to show you some common things that you probably need and if you really need more in detail understanding of XPath you really need to start looking into documentation so here I'm looking at Microsoft's documentation of XPath reference and that's what I'm going to be using today to explain you how I'm actually deciding to use the path that I'm deciding right so in this particular example what I'm going to try to do I'm going to try to use the channel the YouTube channel and I will try to extract all the titles for the videos and how many views each video had how many days ago it was uploaded as well as the image for each video something like that so we could also do the link for each video and so on right but I'm sure at some point you're going to get the idea of where this is going well that's what I'm going to try to accomplish so therefore the URL where I'm going to be crawling the information is going to be the channel URL so I'm going to go ahead and copy that and then I'm going to go to my Google sheets and type my function which is import XML as you may already know and I do the code because this is a string put the URL for the channel because that's where I'm getting the information from and finally the second one is our XPath query in quotes as a string so I'm going to close this so in between this prints not parens within this between these codes we have to provide that XPath query to actually harvest the information we need out of the page now to find out what information we need we'll have to do well first of all you have to know some HTML right so you may look into HTML if you want to really understand what I'm going to be doing here but first of all I'm going to go under documentation for Microsoft XPath reference and I'm going to go under syntax to hopefully show you where we're going with this so you can see how this is the documentation so now if you're new to Microsoft documentation this may be a little intimidating if you look at this at first because they're not the most probably user friendly documentation writers in Microsoft but it's once you kind of get a hang of it it's not really that bad so I'm going to go here under so I went other XPath syntax right so what I'm going to do I'm going to navigate under probably these operators is where I'm going to find most of the information for the beginning I need so there are these operators and special characters right so you can see that our operators the main two operators we need to know here first of all is this slash right so you can see that the slash that represents a child element and two slashes is what we call a recursive descent I'll try to explain you what that means in just a little bit now what that means in this particular case so let's try to do this I'm going to right click and inspect this video to see what that looks like and I'm going to find the box in my HTML that holds the entire video so I'm going to scroll up see this now if I look at the background it selects all the videos I want one that selects an individual video that this li seems to select an individual video on that page and this each Li as you can see is a video listing on that particular page so right so first of all I'm going to try to find what I need I need to find this li right and inside of this ul and inside of that a ul inside of the Li first of all let's try to find the title of this so its UL li and then there is some sort of div element right and there's another div and another div and then we have a span box and then there's a link of some sort there's an image down here and video time we have a button where is the actual text - let's see or the Textus for this thing it's video time overflow content so there it is it's inside of this content div box so I'm going to close this one to kind of get to a little easier to understand way so there it is that text title here is inside of this h3 box and inside of that h3 you see there's this hyperlink which is that title by clicking on so if I open that hyperlink you'll see that the I / link text is the title of that particular video so to get to this if I was trying to get to it from an arbitrary place let's say I want to start from this particular ul right and then inside of that you well I'm going to list this Li so I'm going to start with ul and Li and as you remember this means a child right so that means this list element needs to be a child of ull another by the way when you do a single slash that means a direct child so what a direct child mean is that this you li needs to be right within this ul directly so to give you an example of that seat right under this Li we have this div and then we they have this other div and then with this other div and then we have this other div and then inside that other div we have this h3 so now if I'm trying to say that I want to grab the h3 inside of this li because this h3 is not a direct child of the list element if I simply just do /and h3 it should not work so if i do / h3 which would represent the direct child and do another slash after that right after h3 see there is an a element which is what we're trying to get we're trying to get the text within that a element so I'm going to select that a element so this should not work if I'd enter see I'm getting imported content is empty the reason that it's empty is because this h3 is not a direct child of our list element because h3 it's within a div and div is within a div and div within a div so if I wanted to do with simple slashes I would have to do UL li div div div and h3 and a now I don't want to do all of that so I want to say that insiders Li on some level within it there should be an h3 element so this is where I do this two slashes that means on some level within it it's going to be an h3 and directly inside of that h3 should be a hyperlink which is our a element which we can see it is right there so at this point when I change my XPath query to this I'm going to hit enter and now you can see we're actually getting all the titles of our videos and apparently there are 30 videos right now on the channel now we can see all the titles for each one of the single videos so hopefully what you learn here is what the slash means the slash means a child two slashes means it's a child not a direct child a single slash means it's a derailed direct child so that's what we have for a start with our XPath query okay so so far so good we were able to get all the titles now the next thing I want to see is how many views each one of these you head on a separate column right so let's try to see where this 35 views is I'm going to go here so there's this 35 used it's within this list element which is within this unordered list element and which is again inside of something else and so on right but what I want to get into right now is a lot of times when you're trying to extract an element from the web you may need to use classes to find the element that you're looking for so in this particular case if you look at this unordered list which has this meta information for views and how many days ago it was uploaded you can see that the class for that unordered list item is this which is YT - well you can see it I'm not able to read that thing but that's going to be the class so what I'm going to show you is how to find something by the class of the element so to make my life easy I'm going to simply try to copy this existing import XML escape and we're going to put it right next to it that's going to be our views now we have to modify this thing right here to find the views so now what I'll have to do is find let's say what I'm let's see what I'm trying to find so I'm trying to find an unordered list element which is the UL element which has a class of this right okay perfect so I'm going to say so there should be ul and that you will need to have an attribute and the sign of the attribute is the ad sign so if I go back to the documentation see that's the sign of the attributes the SI attribute prefix for an attribute name perfect so that means that if you put it but that means we're looking for an attribute and in this case the class again if you're know HTML you know that the class here is an attribute of our unordered list element so the class the target that I'm going to use the attribute sign and I'm going to say the attribute we're trying to find is class and that should be equal to and I have to put what it should be equal to in single quotes I can't do double quotes because if I did that that would interfere with my double quotes for this function string so right here I'm going to say what I'm looking for so I'm looking for this class go ahead and copy that and put it within those single quotes so I'm going to look for an element unordered list which has a class that equals to this and then inside of that element see we have a list item which has that 35 use and then we have another list item which has did three days ago so right now now I'm going to go Li which is one level below but at 35 use so again going back to this the list item is a direct child so that means I'm going to do one slash and Li and extra hit enter and we can see how we get 35 views and three days ago thirty-four views five days ago and so on so that kind of made not okay I guess once you're looking at it but what happens is because we have to list elements here which are direct childs of our this class element that we were able to find on the page it extracts this and also it extracts this as well and it puts it right below on the next row right here so what I need to extract right now is just the views and the views is the first list element within that unordered list with this class so instead of getting all the list elements I just want the first list element so to say that I just want the first one not all of them I can simply open my square bracket and say one that means get me the first one only so I'm going to hit enter and you can see how we were able to get all the views so if I scroll down you see there it is they're all different views for each one now I also need how many days ago it was uploaded and I want that to go in a separate column three days ago that one so I'm going to simply copy this because a lot of this is going to be the same I'm going to paste that and the only thing I really have to change here is because here I was getting the first list element that will direct child of this now I need the one right next to it which is the second one which is three days ago so instead of just saying Li one I'm simply going to say Li two may hit enter and again control arrow key down to make sure that works seems to work fine we get exactly the same number of things we have for our videos and usually what I do this I usually scroll the way down and if the number of records doesn't match that's a good indication that I'm not doing something right so I should go and check what's happening but right now I say 35 use three days 34 views three days there it is so 35 you is 34 and so on that's what we have and by the way this views may start changing as the time goes by because you know I'm not refreshing the page but so far so good so we get confused we got how many days ago we got the title of the video now the next thing we may possibly want to do again the significantly the most common things you may want to look for like so far you've learned how to find an element right here and how to find a direct child or just a child and also now which is a very common something you want may want to look for a class so that would be a way to look for a particular class on the page great so the next one is going to be what are we going to get we're going to want to have I'm going to try to find the link the image link for this video right because each video has a thumbnail now what to get the link to the thumbnails so let's see so each one of these videos so we found each part of these videos if you remember below by saying you al Ally and then within that we found an h3 that's what we did and it kind of works so we didn't even have to class which is kind of nice I don't like to complicate things if I don't have to so that's our list elements it selects that now inside of that list element we have a div and we have this div and then we have this div and then we have this right and then we have this link and there it is there seems to be the image if I keep scrolling down that's going to be the image and this image the actual link to the image file is the source attribute so if I just go inside of the store sadly the source attribute link and I simply just go here and paste should be able to see that that opens that link now this link also has some parameters attached after this question mark which we which shouldn't really matter which is which we should be able to open the JPEG there it is so we're able to find the actual image file by doing this fair enough so now how are we going to get that image itself right so what I'm going to do I'm going to say that there should be an unordered list and list item and inside of that list item on some level in it there is going to be well I could just say an image in this particular case because there seems to be no other images but I guess I'm going to have to try to make this a little more complicated so I can show you a couple of other things that you may need and that's gonna be what do I want here let's try to find something interesting you will Li okay so this is what I'm gonna do because I want to show you something else here so I could simply just say ul a lie an image and then get the attribute but that's going to be I guess too easy or I guess I let's start with that and then I'll show you the more complicated way of doing things so I can show you how to actually get partial class because sometimes you're going to have multiple classes attached to it so you like in this case I have this list item and one class is channels content item and then there is the white tea shelf grid item two classes and you may want to be just looking for one of those classes and if you do equals it's not going to find just one of them it's going to look for both of them together which is not going to work so we'll try to find a single class within multiple class element so for now I'm just going to go with ul Li unordered list list which is what we're done and inside of there since there are no other images I don't really have to swipe to watch so I'm going to go the image that's what I'm going to find and I don't want to find the text in this image right so first of all let me go to that level so I'm going to copy this paste this up copy the formula escape back here paste now ul Li so far I said inside of that Li there should be an image on some level it's not a direct child so we'll use two slashes so we have the image element within that right so inside of this image we want we don't want to get an image with a certain attribute so for example before we got an L you unordered with lists that had a class of this but we didn't want to find the actual class we wanted to find an element with that class so now what we want to do is to find the actual attribute not an element with the certain attribute value but the actual attribute of the eleven what I want to find is the inside of this image we know there is a source attribute I want to get the actual value for that attribute which is going to be that URL for the image so for that I'm going to do another slash so I'm going to say instead of this image and pay attention because before when I was saying that image has an attribute I was directly putting this right answer to the image now what I'm saying I'm saying inside of this image there is an attribute which is my slash again and the attribute sign that I'm looking for so the attribute is SRC that was the name of the attribute but but I can type SRC because that means an element so the attribute sign is the @ sign so SRC we are looking for the element attribute element image and inside of that image there is an attribute called source and we want to get the value of it so SRC so there it is I'm going to hit enter let's see what we get so as you can see we get some images but we're getting too many of them so apparently we we're also able to get some extra things in there there should not be a part of it it seems like the videos really start from here from this blue one on the thirteenth row so we should probably have like thirteen thirteen extra records down here which seems like that's exactly what we have but it turns out that there are some images inside of you else instead of inside of allies in other places so this is where we have to be more specific so I guess that brings me to my next thing that we're going to actually get it by a class so what I'm going to say I'm going to say not only it's going to be a ul and Li but I'm going to say that that is ul that Li that I'm targeting needs to have a particular class and that class is going to be its from the name of it it seems like channels content item that's going to be a class for each video so that's exactly what I'm going to use I'm going to say that this text needs to be inside of the class now you can see that the class I could probably get this whole thing and see equals to dad but I'm thinking that maybe they are not using for each video the second shelf grid item although they might be but since I want to show you an example of a partial class this is what I'm going to grab so class so we need the list element that we're grabbing to have this particular class so I'm going to go back and go here now what I have to do is say that it it's not so before what I was doing I was saying add class equals to the value in quotes whatever that class was but right now it does not equal to that we have to find something that contains that piece of class that we need and again I'm going to go back to documentation here to make sure that I'll show you how you actually find this thing and I'm going to go under XPath functions to show what functions I have available so I'm going to go under expec functions so here since we're dealing with text I'm going to use string functions so I'm going to go inside of that see what functions we have so here some functions to behave so we can see we have start with function we have substring afters and all the stuff one of those is contains so I'm going to click on that contains one to see how the function actually works so see that function it says it's a boolean the string one a strength too so it basically contains that's the function we open the parentheses in there string 1 string 2 and it says that string 1 is a string that might contain the second argument so the first one is basically where we're looking for and the second one say my contain that might be contained the second one is the actual things that we're looking for and this is the text inside of which we're looking for so the contains functions function in this the XPath works this way so the first one is where we were looking for that string and the second one is the actual string we are actually looking for and probably there's an example so we're done here well apparently as I said not the best documentation probably but there it is at least we have this so I'm going to go back and inside of this square brackets I'm going to say contains and as this is a function we're going to use our function opening and closing the first one was where are we looking for that string that should be contained and I'm looking for that inside of the class attribute so I'm going to do attribute class that's where I'm looking and after the comma I'm going to have to show what I'm not trying to find inside of it so that's going to be a string so therefore single quotes and that's what I'm trying to find within this attribute and see if it contains it so I'm going to go ahead and do this and hit enter and you can see that now we're not getting that add those extra items because I was more specific I found the class that contains that right and then I returned all the links to those images and we got exactly 30 so that's the way you can so now we not by watching this video you should not only know how to target something with a specific class that equals to something you also should know how to target something that contains a particular class in it so that's our contains function and again as you can see if I was trying to see find something that begins with something I would do start with function right now would go here and see what does the function how does it work it turns true if the first argument string starts with the second argument string so again the same kind of thing so it's going to be the own instead of contains it's going to be start - width and then the first one is going to be where it could be so if it was inside of a class you would do the attribute sign class if it wasn't something else they would do add whatever that attribute is and then that whatever you're looking inside of that string so that's our string right here so that should give you a preview about how you should use this documentation and find something so some functions you can use here so getting back to here there is a little extra cleanup I want to do here so if you remember I said that well this gets me the link that was in here so that's the link right here for the image but I don't want to get this parameters after with the question mark and after because all the URLs may have parameters which you pass after the question mark after the URL I don't want that I just want the straight link to the image so I'm going to do a little cleanup here and I'm going to start by using equal split function so I'm going to split this image link text and I'm going to split it using that question mark because as I said parameters are after the question mark so if I enter you see how it separates this and the whatever's after the question mark it puts it in a right cell now I don't really want to keep this stuff on a right so I'm going to simply put this inside of our index function and after comma just returned the first one is my or in my array which should be the actual link so I'm going to double click to send this down that should get us the extra image so if I want the actual image to show up in my spreadsheet I could use a function called image and you can see that function called image the first argument is the URL for the image so I could use this I could just put that whole thing inside under the image function - I guess there could be a better idea but to show you what's going to happen I'm just going to close that the rest of the arguments are optional so you can see how we put the actual image right in our spreadsheet by doing this now I'm probably going to go back here and modify this and make sure I just put my image function right around this so I don't have to use another cell this is weird how this doesn't actually send down when there are actually elements below it's kind of stupid but anyways there it is we have all our images for each one of our videos now nice three nicely extract it to our spreadsheet so again if you were working with e-commerce you could same way get the prices for product the titles the image for the product as you can see and so on so I guess the last thing I may want here is just a link to the actual video itself so the link to the video itself is this hyperlink right here so that's gonna be that's gonna be so again that's the same you ll I side of the channel but I don't need the image I need the hyperlink pretty much so and inside of the hyperlink will find this the link okay that's not going to be a big deal so we should be able to use most of what we did here without having to recreate this whole thing so I'm going to simply just highlight this whole thing copy hit escape come back here paste the function so the beginning should be fine this is on order list and the list item that contains the class this so that we can use all of that without changing a thing because that's still the same so far so inside of that list element we want to have a hyperlink element instead we said we want an image before now I want a hyperlink which is our a element so I'm going to say instead of IMG it's going to be a and inside of that hyperlink we should have an attribute and the link is age ref hyperlink reference attribute there it is that that's what I want to get the value for this age ref attribute so again that should not really be much of any different than our image instead of SRC attribute which is the source for the image we're going to get H ref attribute which is the hyperlink value so it enter that gets us a link to the video now this is not a full link as we can see this seems to be a relative link and there seems to be some extra stuff here so I'm going to have to figure out why do we get 60 instead of it says it seems like we're we get the same link twice so I'm going to have to figure out why exactly I'm getting the same hyperlink twice instead of getting it once so I got here that's our hyperlink right here okay fair enough so that's our a element now let's see is there anywhere else that it mentions to hyperlink it forces it to do it twice that's our a right here so let's see if there's another hyperlink somewhere below here this is a button so seem like it's there probably this one yep that's got to be this one so see there are there there's a hyperlink I know it's probably not because it's not inside of it that's not going to be it or it could be with because vol Li that's the channel and we're looking inside that so on the same level uh-huh yes so this hyperlink falls inside of that too so that's not good so what we that means is that we have to be a little more specific with that ul Li now instead of just directly jumping to this image right after that we could say well it should be in this div that has this class of YT lock up thumbnail right so let's pick that in a as a middle step so I'm going to copy that class to make sure I have it and that's a div element with this particular class so after the ally we're going to do another step which is going to be the div and then we're going to go to the link great so I'm going to go back here ul Li after the Li it's not a direct child but it's a child nonetheless so it's a div element and it should have a attribute class which need to be equal to again in single quotes that so I'm going to hit enter and I'm going to try to expand this so you can see what I did here so there it is so I did see you on order list list item that contains this class and then that should inside of it have a child div element which should have a class attribute that equals to this we is this whitey whatever thumbnail and then inside of that at some level that should be child hyperlink and we want to get the href attribute of that hyperlink so let's see there it is we only have thirty now that's the correct number of links now the only problem I guess if we could call this a problem is that apparently these are not full hyperlinks there are relative links so if I opened one of these videos let's stop that so you can see how instead of getting the full one that says HTTPS youtube.com and so on I'm just getting the slash watch the question mark and all that stuff after that so to make this a full hyperlink will simply have to do a little joining here so I guess I'll just use concatenate function in this case so I'm going to use concatenate and the first piece of my concatenate is going to be that first part that was missing which is a domain name and comma and I want to concatenate the rest of this hyper link after that close my parentheses that should now give us the full link let's just control arrow key down now if I click on one of these let's say this one should be this view look up with array formula function so I'm going to click on it click on this that should open our video okay and it does no problem so you can see how we were able to extract the link to the video the image for the video the link for the image for the video the view count how many days ago it was uploaded the title of the video and every single thing we needed from this particular page so if we were trying to be very specific about it if I assume like this minutes or seconds whatever those things are see there's like how long the video is you could also extract this obviously using the same method so if I go under the thumbnail right under the thumbnail there should be a spin which has another spin which has another spin or we could go with a class and again I usually tend to try to not to complicate things that can be easy to do so for example you know we could for example just use the video time class with the span element and if there are no other span elements on this page with video times that should probably just take care the whole thing without having to go this within that within that within that within that so let's try to do that I'm just going to try to get everything this every span element that has a class video time which has a span element right inside of it right so that should be something we could do so I'm going to go ahead and copy this let's grab let's try to extract video times as well since we went this far so I'm going to just paste it here for a second and I'll just copy this whole link escape back here paste so we want so as I said I don't want that I want to try to not to complicate this so I'm going to look for all the span element that have a class again it's an attribute class so each class equals what was it video time I guess and see I keep doing this but you have to use single quotes because it's going to break your quotes here video time and close this so that's going to target that span elements with video time then inside of that there should be another span if you remember so we had expand class video time inside of it we have another spin the reason I didn't use this directly because it doesn't have any particular identifiers on it so therefore I have to use this one so that should hopefully work let's see unless there are other elements in there and video time so I got that right so there it is that gets made a video time for each one of those videos so again if you know HTML this should definitely make a lot of sense to you and shouldn't be difficult from this point on to understand so again to get the documentation and understand the references if you need something in particular make sure you go to XPath reference and so there are other places where you can find XPath documentation from what I've seen this is the one that was the fullest reference out of everything else I was able to find so far if you have a better documentation please let us know where this but here you're going to find everything under either XPath syntax or functions so most of the functions you're going to need that you're going to have to just find something on a page it's probably going to be your string functions that's where you want to concentrate because there are things like number functions and things like that in XPath where you can actually do some numerical operations and stuff like that but if you're just harvesting a page you may need it in some cases but I guess you're not going to need it if you get to that level I hope you'll figure that out we also have this note set functions as you can see again could be interesting to find out especially like this last function for example so the so the last function an example of where you could I guess use it if you remember like here I believe we should be able to use that when I said let's get the second list item right so what if I just set the less function is that it gets me the same thing so the difference between like using number two and using this last function which is another interesting function is that when I was looking at those list elements this was the first one this was the second one right now if you had like multiple list elements in each one of these videos and there was no consistent there was no consistent way of knowing how many you're going to have but the first one was 35 views and the last one out of the list item was like three days ago and there was something in the middle but sometimes it was in the middle sometimes it wasn't it could lose the last one to just grab the last one which is what I did just switching to the last function which is an interesting function again could be handy sometimes so again just look at this documentation if you need more details right and if you really want like a full XPath tutorial maybe if there are enough requests I'll do one that's going to cover a lot more detail about all these different syntaxes and operators you can use here but what I've just covered in this video should give you most of what you need to do most of the extraction out of a web page so again as I said in the beginning of the video they don't recommend actually relying on XPath to extract things out of the webpage because the more forward we go especially with new frameworks like react view and you know where the actual Java Script is handling the content on a page this is not going to be able to handle it because when this import XML goes to the page the page needs to be loaded in the HTML if it then reactively loads it on a page it's not going to be able to get those anyways let me know if you liked this video and you know if you have any suggestions just leave a comment and thanks for watching
Info
Channel: Learn Google Spreadsheets
Views: 116,058
Rating: undefined out of 5
Keywords: XPath, query, IMPORTXML, Google Sheets, Advanced Tutorial
Id: pwZ44kAeiOo
Channel Id: undefined
Length: 49min 7sec (2947 seconds)
Published: Tue Jun 13 2017
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.