Web Scrape in Google Sheets: IMPORTXML Function (Part 2)

Captions
Welcome back to part two of web scraping with Google Sheets. In the last video we covered IMPORTHTML, IMPORTDATA, and IMPORTFEED, but in this video we're going to cover the final command, IMPORTXML, which can help us extract data we otherwise couldn't before.

IMPORTXML takes two arguments: a URL and an XPath. When you call IMPORTXML, it fetches an XML document with the content of the website you're trying to parse, and the XPath lets us navigate through the elements and attributes of that document. If you're unfamiliar with XPath, that's totally fine, since we won't be going too in depth in this video, but it is pretty powerful, so if you want to learn more I'd highly recommend checking out the W3Schools tutorial that walks through the basic syntax; I'll link it in the description. I'm going to assume you don't know much XPath, so we'll be using a Chrome extension called SelectorGadget, which you can download, and which will actually fetch the appropriate XPaths for us.

Once you've got that installed, we can get started. The first example we're going to work with is Craigslist, and we want to scrape all of these listings. We'll start by grabbing the URL at the top and pasting it into our Google Sheet; then we can go back to Craigslist and start grabbing our information. With SelectorGadget installed, you should see its icon at the top right of your Chrome window. Click it once, and as you move your mouse over the page you'll see different elements being highlighted. Essentially, as you start clicking on data, SelectorGadget tries to interpret what you're selecting and generates a selector that matches all of that information. For this example we want the listing name, the price, the date, and the location in parentheses, and the quickest way to get all of it, instead of breaking it up into four separate selections, is to click on the parent node that contains all of that information.
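Before going further, here is the basic call pattern as a minimal sketch. The URL is just a placeholder; the second argument is an XPath query, here selecting every h2 heading on the page:

```
=IMPORTXML("https://example.com", "//h2")
```

The first argument can also be a cell reference, which is handy when several formulas scrape the same page.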
Clicking on just the text here would only get us the names of the listings, not the other information, so I'll clear that selection and instead click on the container that holds all of it, which looks like it's down here. Click it once and you'll see all these yellow boxes populate, covering the name of the listing, the date, the price, and the location. Once the right data is selected, go down to the bottom right where it says XPath, copy the expression from the box that pops up, hit OK, and paste it into a cell back in the spreadsheet.

Now we can call IMPORTXML, passing the URL as the first parameter and the XPath as the second, and all the listings populate in our spreadsheet. We've got the date, the name of the listing, the price, and the location, but it's all in one cell per listing. If I want the price as its own column, I could write a regular expression, but I'm going to show you how to get just the price with SelectorGadget instead. First I'll format this column so it cuts off the overflowing text, and we'll add the prices in a new column over here.

Back on the Craigslist page, clear the selection and click once on a price. You'll see the price gets selected twice for each listing: once at the top and once down in the description. To explain a bit more about how SelectorGadget works: clicking anything that's yellow deselects it, and clicking anything that isn't yellow tries to select it. We want only the bottom prices, so I'll click the top one, and now only the bottom prices stay selected. Then we go back down to the XPath button, copy the new expression, and paste it into the sheet.
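Putting the pieces together, a sketch of the first formula, assuming the Craigslist search URL sits in cell A1 and the copied XPath targets the listing container (the exact expression SelectorGadget generates may differ from this hand-written equivalent):

```
=IMPORTXML(A1, "//p[@class='result-info']")
```

Each matching container becomes one row of output, which is why all four pieces of information land in a single cell per listing.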
We write essentially the same command as before, IMPORTXML with the URL and this new XPath of just the prices, and now we have all the data we wanted from those listings.

Theoretically you don't need SelectorGadget to scrape this information; you could open Inspect Element and grab the XPath manually, so I'll show you what that looks like. Over here I'll right-click and hit Inspect, and you can see there are four tags, a span, a time, an a tag, and another span, all within this paragraph tag. This is what I meant by selecting the container. If we didn't have SelectorGadget and wanted to replicate this result, we would just specify that we want all the information inside this p tag with the class result-info. Back in the spreadsheet, I'll get rid of this for now and write the XPath as //p[@class='result-info']. The double slash essentially says we're not starting at the root of the node tree; it brings us down to any node that fits this description, a p tag whose class is result-info. And you can see that even though these XPaths are different, they give you the same result.

One thing you need to be wary of when scraping websites is that the data you scrape might not contain all printable characters, and that's a little hard to see, because if characters aren't printable they won't show up. Essentially, one thing you should do is wrap your IMPORTXML in the CLEAN function, and then wrap that in ARRAYFORMULA, which you can insert with the Cmd+Shift+Enter shortcut. When you hit Enter it should give you the same output, but with all of those unprintable characters removed.

The next example is TechCrunch, and I just want to solidify the idea of selecting the container. We want to grab all of these articles with the headline, the date and author, and the description, so I'll open SelectorGadget again, and instead of clicking on just one of those attributes, I'm going to try to find the container that has all of them.
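As a sketch, the cleaning wrapper looks like this, with the URL cell and XPath as placeholders from the earlier example:

```
=ARRAYFORMULA(CLEAN(IMPORTXML(A1, "//p[@class='result-info']")))
```

CLEAN strips non-printable characters from a value, and ARRAYFORMULA applies it across the whole range that IMPORTXML returns rather than just the first cell.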
That looks like it's up here, so I'll click once, and for some reason it's only selecting the premium-content articles. I'll click on another article that isn't selected, and now the rest of the articles turn yellow too. We come back down to the bottom right where it says XPath, copy the expression, and paste it into our Google Sheet. We write the same command as before, passing in the URL and the XPath, and you can see the spreadsheet populates with all the articles from TechCrunch.

For the last example we're going to combine the power of the SelectorGadget tool with the power of knowing some XPath to scrape this Yelp page. Just like before, we start by opening SelectorGadget and clicking once on the big container with all the information we want, but now everything on the page gets selected. We'll click somewhere to deselect, and SelectorGadget is pretty smart, so it figures out we just want the restaurant listings. We follow the same process as before, grabbing the XPath and copying it into our spreadsheet. We also need to grab the URL, so let's copy and paste that too, then write IMPORTXML with the URL and the XPath. This time, though, Google Sheets has scraped all the information for each restaurant into one cell per listing. We could do some text cleaning to separate that out into multiple columns, but there's a better way.

Clear the selection and start by selecting just the restaurant names. Clicking one also selects this "More" link, so I'll deselect that; now it has selected "Burgers", so I'll deselect that too; oh, it got some page links at the bottom, so let's deselect those as well. Now grab this XPath, copy it, and paste it here; these are going to be our restaurant names. Next, let's grab the phone number and address.
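The pattern here is one IMPORTXML per column, each with its own XPath. A sketch, with A1 holding the Yelp URL; the class names below are hypothetical stand-ins for whatever SelectorGadget actually copies. The first formula would fill a column of restaurant names, the second a column of contact details:

```
=IMPORTXML(A1, "//a[@class='business-name']")
=IMPORTXML(A1, "//div[@class='secondary-info']")
```

Placed at the top of adjacent columns, the two result ranges line up row by row.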
We could split it up again, doing the phone number separately from the address, but it looks like they both live in one element, so let's try selecting the div tag that seems to contain both nodes, grab the XPath, and copy it. We'll make a column just to the right, and so this isn't confusing I'll cut off the overflowing text, paste this new XPath for the address and phone number, and write our IMPORTXML formula right here. Great, that cooperated pretty well.

The last thing I want, and this is where XPath really comes in, is the rating. If I just grab the rating with SelectorGadget and follow the same process as before (we'll cut these off, paste the XPath here, and write the IMPORTXML command), we get an empty cell: the command is written, but nothing populates. So let's go further into the code and see what's going on. Right-click and hit Inspect Element, and it looks like this is the div tag we selected. Inside it there's the image of the stars, but there's actually no text or anything else that indicates what the star rating is. We do see, though, that there's an attribute called aria-label equal to "4.5 star rating", and if we inspect the one down here, it's "4 star rating". So the data is there; we just need to figure out how to get it.

Again, this is where knowing XPath helps, because we can go back to our formula and select the attribute by adding a slash at the end, then an @, then the name of the attribute whose text we want, which is aria-label. Hit Enter, the formula reruns down here, and after a second to load, there we go: it gives us all the star ratings. If we just want to extract the actual numbers, we can run the REGEXEXTRACT function and pass in \d, then possibly a decimal point, then possibly another digit, and you can see we get our data.
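A sketch of the attribute trick and the number extraction. The div's class name is a placeholder, D2:D11 is an assumed range for the scraped labels, and the regex follows the pattern described above, a digit, an optional decimal point, and an optional second digit:

```
=IMPORTXML(A1, "//div[@class='rating']/@aria-label")
=ARRAYFORMULA(REGEXEXTRACT(D2:D11, "\d\.?\d?"))
```

Appending /@aria-label to an XPath returns the attribute's value instead of the element's (here empty) text content.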
So I think SelectorGadget and knowing XPath, together with this IMPORTXML command in Google Sheets, can be really useful. I certainly won't say this is always the quickest solution, but in some situations it will definitely be quicker than opening up Python or R, and you can do a lot of easy data cleaning right within Google Sheets, which is another huge pro. If you're still curious about the capabilities of web scraping, either in Google Sheets or even in R, I'd recommend checking out my other videos right here and here. Thank you for watching, and I'll catch you in the next one.
Info
Channel: Dataslice
Views: 61,726
Keywords: web scrape in google sheets, web scraping with google sheets, importhtml, how to automate web scraping using google sheets, get data from website into google sheets, importfeed, importdata, how to automate web scraping, web scraping, web scraper, web scraper intro, web scraper tutorial, data scraper, web scraper video tutorial, online web scraper, web scraping software, web scraper chrome tutorial, google sheets, extract data website, web scrape without code, scrape craigslist
Id: 4A12xqQPJXU
Length: 8min 57sec (537 seconds)
Published: Tue Aug 04 2020