How To Web Scrape Data And Store It In A CSV File

Captions
[Music] Do you want to create a side project that uses some content from other websites? Or do you want to collect information and check when it updates, such as checking whether the price of a product you want fluctuates over time? Maybe you want to store information from a website and manipulate it within a CSV file. These are all applications of programming that involve web scraping. Web scraping is an extremely powerful and versatile application of programming that allows you to collect, store, and manipulate information presented on a website.

Today I'm going to explain how to make a web scraper in Python, using the Beautiful Soup and requests libraries, that then formats the collected data into a CSV file. The website I'll be demonstrating this with is the postage rates listed on the Canada Post website, a simple use case where you may want to quickly check what the current postage rates are, especially if they change during the course of the year.

So, to begin the program, let's first import the libraries we are going to be using today: the requests library, the Beautiful Soup library (specifically Beautiful Soup 4), and of course the csv library. The requests library allows us to perform HTTP operations and set up HTTP requests. The Beautiful Soup library allows us to get data from HTML files and interact with it.

Now, naturally, let's immediately get the website information loaded into the program. To do that, we use the requests library and its get method, and simply paste in the URL of the website we want to scrape. Let's store this in a variable called source; we can then manipulate this variable to access other methods that we will need.

For the next line, let's create a Beautiful Soup object that allows us to interact with the HTML script of the website. HTML stands for HyperText Markup Language, and it is a standard language for documents designed to be displayed in a web browser. We imported Beautiful Soup as soup; however, you can 
use whichever name you want to call it. Then, inside the brackets, we pass in two arguments. The first one is the website, so we put in source.content. The second one is mainly to avoid a user error that would otherwise be outputted: while the program would still output the correct output, it would like the parser to be specified, so we put features equals "html.parser".

Now, before we move on, it is good practice at this point, when beginning a web scraping program, to see if the HTML will actually load properly. In recent years, websites have been designed to prevent web scraping, and you can check for this before you continue the program. Let's write the line print and, in brackets, webpage.prettify. We will use this line just to print the HTML script of the webpage. If the program is able to access it, you will see the same HTML script as the one displayed in Inspect Element in Google Chrome. The prettify method helps to display this HTML script better, so it is properly indented and formatted. As you can see, the HTML script is identical to what is shown in Google Chrome, so we can continue with this program.

Now let's create some arrays to store the information we want to collect. Based on this table, we want to collect weights and prices, so let's create weight and price arrays to store this information.

Now we want to create the for loop that cycles through each line of this table and stores the relevant information into the arrays. Essentially, in this for loop we specify that the HTML class we want to filter for is cpc postage table, which is also the class that we see the tables under in the HTML script when we click Inspect Element. The find_all method helps us make sure we are searching for every occurrence of the cpc postage table class in the HTML script. Now we want to store just the text displayed on the website in a string variable. Therefore, we create a string variable called string and then store the text itself using the .text method. We can also 
pass a parameter such as p, which is the paragraph tag used within the cpc postage table class to write the weight amounts that are then displayed. This parameter lets us extract only that text, and not the other text in the class that is separated using its own tags. Now we want to put the string value of the text we want into the weight array, so we use the .append method and pass the string variable as the only argument in the brackets. The .strip method lets us remove the empty spaces before and after the text.

We use the exact same logic for the next for loop, which we will use to extract the prices. However, I want to point out some key differences that I had to implement, and you will likely also have to make some modifications when you are web scraping a website, based on its HTML code. Unfortunately, the prices are not separated into their own distinct tag that would help differentiate them from the weight amounts. To circumvent this, this time I did not include the paragraph tag as a parameter to filter on. Instead, I just extracted all the text from within that class, and the price amounts were within that text. However, since I just want the price amounts, I created a new variable called substring and then used the .partition method to cut out the part of the text that I wanted, which is the price amount. These price amounts have a dollar sign in front of them, so I passed the dollar sign symbol as the argument. I also wanted to extract the characters after the dollar sign, which would be the prices, so I put the number 2 in the square brackets next to this expression, which selects the part of the text that comes after the dollar sign. And once again I appended this into the array, but this time I appended the substring variable.

The rest of this program is fairly standard. I create a string variable to store the name of my CSV file. I then use the methods from the csv library to set up an Excel file and to write in it using the write methods. I used the 
.writerow method to create the header of the table at the top of the file, so I did index number, weight, and price. I then used a for loop to cycle through the arrays I created and print out each string value located at each index into a row of the Excel file. And that's it. Let's run the program and then open the Excel file to see if I did it correctly. And it worked.

To summarize what we did today: we loaded in the webpage we wanted to web scrape using a combination of the Beautiful Soup and requests libraries. We then created arrays to store the information we wanted from the webpage. The web scraping occurred in a for loop, where we passed in parameters such as the name of the class and the paragraph tag to narrow down which specific text we wanted to extract, and stored it into these arrays. For the price amounts, we had to do a partition of the string because the HTML script was difficult to filter through, which resulted in a more manual approach. Then we created a CSV file and used a for loop to cycle through the arrays and write the data we scraped into it. Thank you for watching, and I'll see you in the next video.
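
The steps described in the captions can be sketched roughly as follows. This is a minimal reconstruction under stated assumptions, not the presenter's actual code. The URL in RATES_URL and the class name in ROW_CLASS are hypothetical placeholders, since the captions give neither exactly, and the function names are made up for illustration. The sketch assumes each matching element holds the weight in a p tag and a price preceded by a dollar sign. It merges the two loops into one and takes the first whitespace-delimited token after the dollar sign as the price. Note that str.partition returns a three-part tuple (before, separator, after), so index 2 is the text after the dollar sign.

```python
import csv

import requests
from bs4 import BeautifulSoup

# Hypothetical placeholders: the captions do not give the real URL or the
# exact spelling of the class name, so both values are assumptions.
RATES_URL = "https://example.com/postage-rates"
ROW_CLASS = "cpc-postage-table"


def fetch_html(url):
    """Download a page and return its raw bytes."""
    source = requests.get(url, timeout=10)
    source.raise_for_status()  # stop early if the site refused the request
    return source.content


def parse_rates(html):
    """Return two parallel lists: weights and prices found in the HTML."""
    # Naming the parser explicitly avoids the "no parser specified" warning.
    webpage = BeautifulSoup(html, features="html.parser")
    weights, prices = [], []
    for item in webpage.find_all(class_=ROW_CLASS):
        paragraph = item.find("p")  # assumed to hold the weight text
        # partition gives (before, "$", after); index 2 is the "after" part.
        tokens = item.get_text().partition("$")[2].split()
        if paragraph is None or not tokens:
            continue  # skip elements that lack a weight or a price
        weights.append(paragraph.get_text().strip())
        prices.append(tokens[0])
    return weights, prices


def write_csv(filename, weights, prices):
    """Write a header row, then one numbered row per weight/price pair."""
    with open(filename, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["Index", "Weight", "Price"])
        for index, (weight, price) in enumerate(zip(weights, prices), start=1):
            writer.writerow([index, weight, price])


def main():
    """Fetch the page, extract the rates, and save them to a CSV file."""
    weights, prices = parse_rates(fetch_html(RATES_URL))
    write_csv("postage_rates.csv", weights, prices)  # hypothetical filename
```

Calling main() would fetch the page and write the CSV file. Because parse_rates accepts any HTML string or bytes, it can also be tried offline on a saved copy of a page before pointing the script at a live site.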
Info
Channel: buckmasterinstitute
Views: 13,479
Id: OK5c0JD4NwM
Length: 6min 53sec (413 seconds)
Published: Mon Jun 20 2022