Web Scrape in Google Sheets: IMPORTHTML, IMPORTDATA, & IMPORTFEED Functions (Part 1)

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
you might be wondering why you'd want to use google sheets to web scrape well here i've got a giant table of apple species and by the time this sentence is over i'm going to web scrape the entire thing from scratch using google sheets with just one line of code web scraping from google sheets isn't always the best solution because it requires some level of messing around but it can be a really fast solution given the right circumstances so my goal for this video is to give you a broad overview of how web scraping with sheets works there are four main commands import html import data import feed and import xml we're going to cover the first three in this video but we'll cover import xml in a future video because it's kind of tricky to work with so let's get started so the first command we're going to cover is import html which allows you to scrape both tables and lists so i've got some data here from this wikipedia page a list of earthquakes and it has a few different tables on the page but the one i want are the largest earthquakes by country or territory so the first thing i need to do is figure out which table index this is on the page because there are a lot of different tables here i need to figure out which one this is so i've got some code for you in the description that you're going to go ahead and copy once you're on the page you can right click go to inspect open up this console window and paste this line of code and essentially it'll pull out all the different tables on the page and as you hover over them it'll start highlighting different tables that are actually on your web page like i mentioned i want to grab this table of the largest earthquakes so once i pasted that line of code i'm going to go ahead and hover over these different table rows until i get the one that i want to be highlighted and we see here that the index number on the left is three so we just have to remember that number but now we can go ahead and open up a new google sheet so i'm going to type sheets.new which will quickly bring me a new untitled spreadsheet this is a little tip you can also do docs.new or slides.new to open up a new google doc or new slide but anyway now i can call that import html command and it's going to take in three arguments so the first is the url of the website so i'll grab that next is either a table or list depending on what you're trying to scrape so here i'm scraping a table and then last is that index number we figured out that it was index three that highlighted the table we want so we'll put three right in here once you hit enter it should take a few seconds to load your data but it'll suddenly all appear inside of your spreadsheet now we could be done at this point and just download it as a csv put it in our python and do whatever text cleaning and analysis we need to do but what's great about google sheets is it does have a lot of functionality for doing text cleaning and data cleaning so right now our magnitude and date columns look kind of messy so what we're going to do is before we start modifying them is hit command a to grab all of our text we're going to command c to copy it and then do command shift v to paste just the text values we're first going to go to our date column which you can see is kind of messed up right now and format it as a date which should fix a lot of that for this magnitude column right now the numbers aren't super clean we've got some text down here some dashes what we're going to do is create a new column on the right we're actually going to format this to text first just plain text we're going to call this magnitude dot cleaned we're going to use our regex extract function to first select this column and we just want to grab a number that might have a period and then another number right after it we can double click on this and you can see it populates our column with the magnitudes and we could continue to clean this up or do more analysis but i think you get the picture so import html doesn't just work on tables it also works on lists although i will say it's a little less useful so here i've got a recipe for some noodles and let's just say i want to grab the ingredients and the directions into my google sheet well i'm going to follow a similar process i'm going to start by grabbing that code from before except i'm going to change the table to be ul comma ol which stands for unordered list or ordered list here we have an unordered list these are just bullet points and down here we have an ordered list but again this will be in the description so i'll go ahead right click and inspect go to the console window and paste this line of code and like i said i want to grab the ingredients and the steps to make it so i'll hover over these different rows to figure out ingredients we see that's number 14 and then if i scroll down steps to make it is 15. so i'll go back over to our spreadsheet and we'll call the import html command again we'll pass in the url as our first argument now we pass in list as the second argument and again we pass in our index so we'll start with 14 to get the ingredients so we've got our ingredients i'm going to go ahead and expand the column and i'll copy this paste it here but now replace it with 15 to get our directions so i guess this could be practical if you had a lot of different recipe urls that all followed the same standard procedure but that's honestly pretty unlikely so you can see how this would be a little less useful since you have to manually figure out the index now ideally you would use one of these id tags to figure out what are the ingredients or what's the main recipe and you can actually do that with the import xml command which i'll touch on in a later video but for the time being if you're looking for a more robust method i'd recommend checking out my tutorial series on web scraping with r where i cover a few different examples so now we're going to move on to the next command which is import data and that allows you to scrape csv or tsv files directly into your spreadsheet so i've got this csv file from github that i found i'm going to go ahead and go to the raw just so we have the actual csv file and i'll copy this link go back to our google sheets and then write this line of code import data and paste our link so when i hit enter it'll take some time because this is a longer csv file but eventually it'll populate my whole sheet with all the data and i mentioned earlier how google sheets is great because you can do a bit of text cleaning or data cleaning well it can also make some charts really easily so here i had some population data and if i wanted to visualize what this looks like over time i can make a new chart i'll set the x axis to year aggregate the data make it a line chart and just like that we have our population every time so pretty easy to do and then the last command we're going to look at from google sheets is import feed which can scrape an rss or atom feed directly into your spreadsheet so here i've got the nasa rss feed they have one for breaking news so i'll go ahead and open this up i can copy this link go back to google sheets and write import deed paste our feed in and again it'll take a few seconds to load but eventually the feed will populate into our spreadsheet and we'll be able to see all of our data here now again i don't know how useful or practical these commands would be for you personally but i think it's worth knowing that the functionality exists and i think that there are certainly some scenarios if you have the right type of data where google sheets can really be helpful so again i'm going to cover that last import xml command in another video because it can be pretty powerful although it is pretty confusing to use but anyway thanks for watching and i'll catch you in the next one
Info
Channel: Dataslice
Views: 139,117
Rating: undefined out of 5
Keywords: web scrape in google sheets, web scraping with google sheets, how to automate web scraping using google sheets, importhtml, importfeed, importdata, get data from website into google sheets, how to automate web scraping, web scraping, web scraper, web scraper intro, web scraper tutorial, data scraper, web scraper video tutorial, online web scraper, web scraping software, web scraper chrome tutorial, google sheets, extract data website, Web scrape without code
Id: Q5yVoTqD06M
Channel Id: undefined
Length: 6min 27sec (387 seconds)
Published: Tue Jul 21 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.