Web Scraping in Google Sheets || IMPORTHTML, IMPORTXML, IMPORTFEED, IMPORTDATA

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hello everyone in this video I will show you several ways to scrape data from websites the easiest way to do so is to use a function named import HTML here I am trying to scrape this Google doc site if you scroll through this site you will notice it contains multiple tables like this one I want to scrape it and properly organize it in my spreadsheet let's see how to do this I have used the import HTML function and passed two arguments table and its index number let me explain this if you click on the help icon you will see it says it needs the URL then query type the query type could be either table or list in our case we have seen that the doc site contains tabular data that's why I have provided table as the query type next is its index number the site can contain multiple tables so providing two means it will give a second table similarly I have done for other tables this one is 14th this one 15th and this one is 16th if you are wondering what to provide for the index number I would suggest start with one and keep guessing there is a short script that you can write in the console which will output all the tables along with its index number let me show that to you right click on the page and then select inspect now go to console and type the script here you can see it has listed all the tables along with its index numbers similarly you can do this for list also but believe me it's not not worth to do so since it will not allow to paste the script you will have to type in the entire script anyway I will provide the script in the description now let's web scrape this news site from Tesla I have inserted a similar formula but it's not working let me explain to you why this import HTML expects that the site should have a table or list in it let me show you quickly if it has a table or list in it or not I will rightclick and click on inspect now here you can see there is no HTML tag for table or list if it contained tables then you would be seeing table tags TR tags TD tags if it contained a list then you would be seeing Li tags or o tags so how do you web scrape such sites in such cases you will have to use import XML it requires a URL and X paath query XPath query dictates how to Traverse through structured data and what to pick from the structured data so in this case the main objective is to get the X path or query path let me show you how to get that if you notice this blog you will see it has got a post title date and description let's first pick up the title I will select the title and then rightclick and then select inspect you will see that the respective HTML elements get highlighted with the selection highlighted rightclick and then go to the copy menu and then select copy X path paste it somewhere to avoid doing it again now head to the spreadsheet and insert insert formula import XML pass the URL of the website and then paste the X path also convert the double quotes surrounding ID into single quote to avoid parsing issues and here's the title of the first post now how to get the rest of the posts before that we will have to understand how it works first notice the structure of the query path it starts with this ID followed by section section div div form div section and finally span element now let's check whether the same is being followed in the inspector page we will start with this matching ID then section section div div form div section and then span it's following what we have seen in the query path also notice there are two span elements now it's clear that to fetch the post date we will have to grab the text within the second span tag so let's do that in the spreadsheet sheet I will simply copy the formula to the right and edit the span index to two meaning second span element and it has loaded the date correctly let me quickly add a few headings to the sheet now let's grab the post short description either you can repeat the whale process or you can simply see the structure and amend the formula accordingly I will show both ways I am first copying the query path as done previously back to the spreadsheet then I will paste that inside this formula we could have done the same thing just by looking at the structure closely and amending the corresponding formula let's inspect the site again if you look at this then you will see that you have got a div and then P tag inside this section so we can simply replace the span element with div and P now that we have got data for the first post let's do that for all the posts we will have to make this section index Dynamic we can do that easily by splitting this query path and inserting the appropriate cell reference let me do this now you can simply drag this to load all the posts if you are liking this video then please take a moment to like And subscribe this will really help me moving on let's grab the link that will point to the individual post I will simply replace this with a tag and append hre to it I will have to add the base URL to it also Let's test this out it's working now another function on the list is import feed RSS or atom feeds are an easy way to stay up to date with your favorite websites such as blogs or online magazines if a site offers an RSS feed you get notified whenever a post goes up and then you can read a summary or the whole post let's search the BBC website to get its feed URL I will simply copy this now come back to spreadsheet I will type in import feed and paste the copied feed URL and now you have the whole news feed in your spreadsheet you can quickly open the individual news or post by clicking these links next on the list is import data if you happen to know the URL that gives data in CSV or tsv format then you can directly use it with the import data function here CSV stands for comma separated values and tsv stands for tab separated values let me search for popular movies around the world in CSV format here I got the link if I view it raw here's how it looks it's comma separated so we will use commas as a delimiter in the import data function copy this URL and come back to the spreadsheet type in the formula import data and paste the URL for delimiter use comma done here is all our data that was it maybe in future I will make another part that will make use of scripting to handle more advanced use cases and also automate it please consider subscribing to my channel and hit the like button thanks for watching and see you in the next one
Info
Channel: TechLever
Views: 3,149
Rating: undefined out of 5
Keywords: importxml google sheets, import html, import csv, importxml
Id: Z8LwfBWVbQM
Channel Id: undefined
Length: 11min 16sec (676 seconds)
Published: Thu Jan 04 2024
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.