Web Scraping Tutorial - HTML Tables - Python & Selenium [+ Excel File]

Video Statistics and Information

Captions
Hello everybody, and welcome back to the next scraping tutorial. In this one I would like to show you how we can use Selenium and Python to scrape tables and store the results in one Excel file. I've already created one video that is also related to scraping table elements, but in this tutorial I want to show you how we can use a pandas DataFrame to take a look at the data before we save it in our Excel file at the end. So let's just jump into the project; this one will be very straightforward.

First, I want to use the Jupyter Notebook as our IDE for this project, so let me create a new file and name it scraping_tables. First and foremost, let's initialize our web driver so that we can reach the page we want to scrape the data from. Let's import the web driver from the selenium package by writing from selenium import webdriver. This is the very first step; let's run that. Now we can provide a driver variable and initialize our web driver. Again I want to work with Chrome, and inside the brackets I need to provide the location where my Chrome driver is stored. Don't forget the quotes at the beginning and at the end. Now our driver is initialized successfully, and we can call the get function to reach the page. One second, it was this window here; we want to get access to this homepage. Back in our IDE, let's paste the URL string inside the get function. Afterwards let's, as always, maximize the window, and now let's run this code for the first time and check whether we have access to the table we want to get the data from.

Okay, everything is loading fine so far. Let's say we just want to get access
to these countries, to the population, the yearly change and the world share. So we don't want the whole data inside this table, just the four mentioned columns. Let's see how we can get access to this specific data, starting with the countries. Right-click and inspect, and now let's take a look at how we can get access to the country elements. Before we start, I would always recommend starting from the table itself. You see here we have this table tag with an id attribute and the value example2. Let's build our XPath expression, starting with table, at id equals the value I have just copied, which is example2. You see here this is the unique element representing this table, and this is our starting point.

Now let's traverse all the way down until we find the data we want. First we want to go down to the tbody, because you can see here on the left side that inside this tbody is the entry of the first row. So let's go down to the tbody, and in the next step we need to refer to the tr. So now we are here: tbody, then down to tr. In the next step we want to get access to the second td tag, because inside this td tag we can go down again, and in this anchor tag we finally have the text element of the first country. Let's do this: tr, then the second td, then the anchor tag. You see here that we have 235 results in all, and if you scroll all the way down you see that we have 235 countries. If you click through the matches for this expression, you see that we have access to the countries. This is what we want.

Now let's store this expression inside a variable. But we can make everything a little bit easier: for example, if you remove the first part and provide the double slash at the beginning, you see here
that we have exactly the same results. So this expression is a little bit shorter than the previous one. Let's copy it and go back to our Jupyter notebook. Let's provide one variable for the countries: countries equals driver.find_element_by_xpath, or actually it's elements, because this is a list, so find_elements_by_xpath. Let's provide the expression here. Let's say I want to put this expression in this block and write one short for loop to see if we have access to all these countries. So: for country in countries, let's print out country.text. Let's run this code, and you can see that we have the text elements which are inside this expression. So here we have every country element, and we want to do the same for the remaining columns we are looking for.

Okay, for now let me just remove this one, and let's take a look at how we can get access to the population. We don't need to change very much here. We again start with tbody, then we traverse down to the tr tag, and in the next step we go for the td. Maybe I can scroll up a little: we started here from tbody, then we went down to tr and then to the td tag. Now we don't want the td we used before; counting number one, number two, number three, we want to reach this td tag. That means we need to provide the number three inside the brackets here. Now let's take a look, and you see that we have access to the population. Let me again copy this short expression and go back to the Jupyter notebook. Let's provide one variable: population equals driver.find_elements_by_xpath, with the expression inside the brackets. Okay, great. The next requirement is to get access to the yearly change, so let's take a look at how to get access to this, and
now, as mentioned before, with this expression we are at the third td tag, and we need to go to the next one: not number three but number four. That means we just need to replace three with four, and voilà, we have access to the yearly change. This is quite straightforward. Copy this expression, go back to the Jupyter notebook, and let's provide one new variable: yearly_change equals driver.find_elements_by_xpath, so it's elements, and inside the brackets let's provide our expression. Now we should be good to go again.

The last thing we want to get access to is the world share. Let's take a look at what kind of modification we need. Now we are located at td number four, and if we go all the way down to this last td tag, you will see that we have access to the world share. On the left side you can see that the world share is marked. If this is number four, where we have the yearly change, then we go down to td number 5, 6, 7, 8, 9, 10, 11, 12, 13. So let's replace number four with number 13. And I guess it's not 13; is it 12? Okay, let's try number 12. So this was my bad, sorry for that: this is td tag number 12, and this is the expression we need to use. Copy that one, go back to the Jupyter notebook, and let's provide our last variable, the world share: world_share equals driver.find_elements_by_xpath. Paste your expression here, and now you have successfully grabbed the four entries.

All right, the next step is that we want to loop through each and every list here in order to get the text elements of the countries, of the population, of the yearly change and of the world share.
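The four locators described so far can be collected in one place. The following is only a minimal sketch under stated assumptions: the dictionary name COLUMN_XPATHS, the function name find_columns and the key names are hypothetical, and the population, yearly change and world share expressions are assumed to stop at the td tag, since only the country expression is described as ending in an anchor tag. The video calls driver.find_elements_by_xpath; newer Selenium releases dropped that helper, so the sketch uses the generic find_elements call with the "xpath" strategy instead.

```python
# XPath expressions from the walkthrough, one per column of interest.
# XPath positions are 1-based: td[2] holds the country link, td[3] the
# population, td[4] the yearly change and td[12] the world share.
COLUMN_XPATHS = {
    "country": "//tbody/tr/td[2]/a",
    "population": "//tbody/tr/td[3]",
    "yearly_change": "//tbody/tr/td[4]",
    "world_share": "//tbody/tr/td[12]",
}


def find_columns(driver):
    """Return {column name: list of matched elements} for every XPath.

    `driver` is assumed to be a Selenium WebDriver that has already loaded
    the page. "xpath" is the string value behind selenium's By.XPATH.
    """
    return {
        name: driver.find_elements("xpath", xpath)
        for name, xpath in COLUMN_XPATHS.items()
    }


# Usage against a live browser (not run here; needs selenium and Chrome):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get(page_url)   # page_url: the address of the table page
#   columns = find_columns(driver)
#   for country in columns["country"]:
#       print(country.text)
```

Keeping the expressions in one dictionary means a wrong column number, like the 13 versus 12 mix-up above, is corrected in a single place.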
Let's take a look at how we can do this. First let's create one empty list, because afterwards we need to store our results inside it. Let's provide this variable: population_result equals one empty list. In the next step let's write our for loop, because again we need to loop through these four lists. So let's write for i in range, and let's take a look at what the range is. I showed you before that we have 235 elements inside every list. For example, let's grab the length of yearly_change and see what the number is. The length of yearly_change is 235, so I can write for i in range(235), or I can provide this expression with the length function instead. I've just copied it; let's paste it here, and this is our range. Let me delete this stuff here.

Now let's loop through the lists and put everything into one dictionary. Let's say this is the temporary data, because first I want to show you the results in a pandas DataFrame, and finally I want to save everything in one Excel file. Let's start with the country. Here I want to get access to every element inside the countries list, so write countries[i].text, because we want to get access to the text elements. Sorry for that; again, temporary data. No, this is not what I want to do, sorry. So countries[i].text; I don't know what the editor keeps suggesting to me here. Now we should be good to go. Okay, this is number one. Then let's write down the population. In this dictionary this is our key and this is the value, and we will see in
a few seconds what the result looks like. Now let's get access to the text elements inside the population list; this is number two. Number three is the yearly change, so let's provide the name yearly change, with the value yearly_change[i].text. Finally, in the last step, let's go for the world share: the name world share, and then the value, which is world_share[i].text. Now I guess we should be good to go.

In the next step we need to append everything to the empty list that was created in this step here. So let's go for population_result.append, with the temporary data. Let's take a look at what this looks like. First of all, let's run the cell again; this is number one. You see this is loading right now. In the next step we want to put the results into a pandas DataFrame. And now it's finished, because you see the cell is numbered 6 and we no longer see the star that we saw a few seconds ago while the cell was executing.

To get access to the pandas DataFrame we need to import the pandas library, so let's write import pandas as pd and run the cell again. In the next step we want to take a look at the results, but in a pandas DataFrame. Let's provide one variable here, let's call it df_data, equals pd. Sorry, pd is the alias for pandas, because here we have provided the pd alias for this package. So let's write pd, a dot, and now we need the DataFrame, and inside this DataFrame we want to put population_result, which was initially the empty list; in the next step we appended the entries from the dictionary to this empty
list here. I guess it would be a bad idea to run this line in the same cell, because if I run that cell again, everything would load from the beginning; we can save some time if I put this in the next cell. Okay, let's take a look at what this looks like. Let's run this, and you see here that we have 235 results; this is what we want. We have four columns: country, population, yearly change and world share. This is exactly what we wanted to get.

Now, in the last step, let's write everything to the Excel file. This is quite simple: write df_data.to_excel, and here let's provide one name for the Excel file, population_scraping_result.xlsx. Then let's write index=False, because we don't want the index elements in our Excel output file. Let's run the cell and take a look at our final result. You see here that the population_scraping_result.xlsx file was created. Let's double-click on it and see what we have. Now let's make everything a little prettier: click on Format as Table, go for the red one, and make sure you don't use this filter button. And voilà, as you can see, we have our result for the table. Let's scroll down and cross-check that we have all the entries: we are at line 236, and we started from line number two, so we have our 235 rows of data and four columns. This is the scraping tutorial on how we can scrape tables using Selenium and Python.
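The loop, DataFrame and Excel steps described above might be sketched as follows. This is an illustration under assumptions, not the exact notebook code: the function names build_rows and save_rows are hypothetical, the dictionary keys are guessed from the column names read out in the video, and the sample values are invented so that the row-building part runs without a browser. It uses zip instead of the index loop over range(len(...)) from the video, and it assumes pandas plus an Excel writer such as openpyxl are installed for the export step.

```python
from types import SimpleNamespace


def build_rows(countries, population, yearly_change, world_share):
    """Combine four parallel lists of elements into one dict per table row.

    Each element only needs a `.text` attribute, as Selenium elements have.
    zip() stops at the shortest list, so lists of unequal length cannot
    cause an IndexError.
    """
    rows = []
    for country, pop, change, share in zip(
        countries, population, yearly_change, world_share
    ):
        rows.append(
            {
                "country": country.text,
                "population": pop.text,
                "yearly change": change.text,
                "world share": share.text,
            }
        )
    return rows


def save_rows(rows, path="population_scraping_result.xlsx"):
    """Load the rows into a DataFrame and write them to an Excel file."""
    import pandas as pd  # local import so build_rows runs without pandas

    df_data = pd.DataFrame(rows)
    # index=False leaves the DataFrame's row index out of the spreadsheet.
    df_data.to_excel(path, index=False)
    return df_data


def fake_element(text):
    """Stand-in for a scraped element; the values below are made up."""
    return SimpleNamespace(text=text)


sample_rows = build_rows(
    [fake_element("Exampleland"), fake_element("Samplestan")],
    [fake_element("1,000"), fake_element("500")],
    [fake_element("1.0 %"), fake_element("0.5 %")],
    [fake_element("60 %"), fake_element("40 %")],
)
print(sample_rows)
```

With a live driver, the four lists returned by the XPath lookups would be passed to build_rows in place of the stand-ins, and save_rows(rows) would then produce the Excel file.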
All right, guys, again, thank you very much for your time and your attention, and I'll see you in one of the upcoming sections. Thank you very much, take care, and bye-bye.
Info
Channel: LX_schlee
Views: 6,580
Id: JLDbAx6LAdo
Length: 21min 23sec (1283 seconds)
Published: Sun Dec 06 2020