How to web scrape data using no code with Octoparse

Video Statistics and Information

Captions
The World Wide Web contains an enormous amount of data, and to harness that information you have to extract it through a process known as web scraping. Essentially, you extract information from web pages and structure it into spreadsheet form, such as a Microsoft Excel or CSV file. Traditionally, you would do this with Python libraries such as Beautiful Soup and Selenium (a minimal sketch of that code-based approach appears a little further down). But perhaps you're wondering whether there is a no-code solution that lets you just point and click to perform web scraping. In this video I'm going to show you how to do exactly that, with just a few clicks of the mouse, using a piece of software called Octoparse, which happens to be the sponsor of this video. So without further ado, let's get started.

Before beginning, it's worth noting that Octoparse is currently holding a Black Friday sale running from November 17 until December 3, and the link will be provided in the video description.

The first thing you want to do is head over to the Octoparse website (I'll provide the link in the video description). As you can see, you can try the software for free for 14 days, and as mentioned on the website, you can essentially perform web scraping in just three steps: first, copy the URL of the website you want to scrape; second, select the particular information you want to extract; and third, run the workflow. I'm going to show you all of that in this video. Go ahead and download the software by clicking the download link; because I'm using a Mac I'll download the Mac version, and if you're on Windows you can download the Windows version.

After you've installed Octoparse, fire it up. Upon logging in you'll see the Octoparse home page. If you're new to Octoparse, you can either watch this video to the end or, as a refresher, watch the videos provided by Octoparse showing how to use the template mode and the advanced mode; both tutorials are at the beginner level, so you can follow along step by step.

When I first started using Octoparse I tried out their templates, and as a YouTuber the first one I tried was the YouTube template, so let me show you. Click on YouTube and you'll see the templates provided here. Let's say I want information about data science videos from YouTube, so I'll click on the video information template and read the instructions. It's essentially three steps: click "Try it" to launch the template, enter the keywords, then save and run. Let's do it. First click "Try it", then enter the keyword — I'll type "data science" so I know that's the keyword I used. I'll leave the task group at the default and click "Save and Run", and that's it. After a moment you'll be asked to choose one of two options: run the task locally on your own computer, or run it in the cloud, where, as mentioned here, you can have up to 20 cloud servers working for you.
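As a quick aside, here is roughly what the traditional, code-based approach mentioned at the start of the video looks like. This is only a minimal sketch using requests and Beautiful Soup; the URL and CSS selectors are hypothetical placeholders rather than any real page structure, and it is not the workflow that Octoparse builds for you.

```python
import csv

import requests
from bs4 import BeautifulSoup

# Hypothetical listing page; replace with the page you actually want to scrape
url = "https://example.com/books"
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Hypothetical selectors for each item's title and price
rows = []
for item in soup.select("div.book"):
    title = item.select_one("h2")
    price = item.select_one("span.price")
    rows.append({
        "title": title.get_text(strip=True) if title else "",
        "price": price.get_text(strip=True) if price else "",
    })

# Structure the extracted data into a CSV file, the same end result as an export
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)
```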
For this tutorial I'm going to run it in the cloud, which means I can close my computer and the task will still be running; I can open it up several minutes later and the scraped data will be ready. It essentially runs in the background, which is very useful if you're performing a large-scale scrape that might take hours, and it's also convenient because you can run several tasks at the same time. So let's click "Run in the Cloud", and note that the task name here is "video information data science".

Now I'll click on the Dashboard. This is the task I just created, "video information data science", and it's currently in the running status. You'll notice that completed tasks show "Completed" in green text, and the cloud symbol means the task was run in the cloud, so the data from my earlier query is already ready for viewing. Let's refresh. It's still running, so give it some time. All right, it's starting to extract information — one video so far. Refresh again: six videos. Let me refresh one more time: eight videos now. If I click here I can see a preview of the data, which is pretty cool: we already have information extracted from YouTube based on the provided template, all in a structured format. You can see the video title (and you can expand or collapse the column), the link to the video, the channel name, the channel link, the total views for each video, and the published date; the description was not scraped, and there are a few more columns. Refresh again and we now have over 42 videos. It will keep running in the cloud, so you can give it some time and check back later for the completed task.

In the meantime, you can start other tasks, so let me show you how to scrape information from Twitter. Click on Twitter — it asks whether I want to exit the prior task, so I'll confirm — and then click on the Top Tweets template. Before proceeding, let me show you the information in the tabs here. Looking at the sample: this is example output for the keyword "data", and you can see the URL of the web page, the tweet's web address, the author who posted the tweet, the author's profile URL (essentially the author ID), the timestamp of the tweet, the actual tweet content, the image URL if the tweet contains any images, the number of likes, the number of retweets, and a final column with the number of replies. So that's the sample information. Let's click "Try it". I'll enter a keyword such as "data science", and it asks how many times to scroll, so I'll go with 20. I type in "data science", click "Save and Run", and run it in the cloud — that's it. After you've submitted the task by clicking "Save and Run", you can close the window, and that won't disrupt the task.
This window is from the prior run, so I'll just close it — closing it doesn't delete the information. Let me refresh the page. You can now see that we have two running tasks: the YouTube task we created a moment ago is still running and has extracted information on about 154 videos, and the tweets task using the "data science" keyword has already extracted 132 lines, so the tweet data seems to be coming in much quicker.

While those are running, let me show you another example: how to perform web scraping by simply copying and pasting the URL of a web page you're interested in. This can be any web page you'd like to scrape, and although I know there's a template for Amazon, I'm going to show you the manual way in this tutorial. Let's go to Amazon and say I want to search for books about data science. I'll copy the URL, paste it into Octoparse, and click "Start". You'll notice that Octoparse performs automatic detection of the contents on the web page — give the auto-detect some time. All right, it has detected the data, and the information is appended below: the titles of the books, the URL of each book's cover image, the URL of each book, the author names, and other information such as the star rating of each book, the number of people who rated it, the price, the shipping price, and so on. You can extract only the information you like.

You'll notice that each detected element on the web page is highlighted with a faint colored box, but not everything detected will be something you want to scrape. Let me show you: say I don't want the image URL, so I delete that column. This column containing just the word "by" — I'll delete it. This URL — should I keep it? I'll keep it, since it's the URL of the book. The author names I'll keep, and these others I'll delete. You do have to delete them one by one, but the nice thing is you can save the task and run it again at a future time. I'll keep the star rating, and this one as well. What's this? The URL of the book again — we already have it from before, and since there seem to be multiple instances of the book URL I'll delete the redundant one. This column says whether the book is a paperback or hardcover, so I'll keep it. "32 09" is the price shown as $32.09, which is pretty much redundant with the price column, so I'll delete it. $34.95 is the original price, so I'll keep that. This column I'll delete, and this one too. "Get it as soon as December 2" is the estimated shipping time; the shipping cost — should I keep it? I'll just keep it. I'll delete this one. "There's 22..." — I guess that's the "more buying choices" count — I'll delete it, delete this column, and delete the remaining ones as well.

Now I have the information I want, and I can rename the columns. I'll click the pen icon and call this one "URL", this one "Author" (or "Authors"), this one "Rating", this one "Rating Count", this one "Book Form", this one "Price Sale", this one "Price Original" (since it's the full price), and this last one "Shipping" for the shipping cost.
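If you ever wanted to do the same tidy-up in code instead, after exporting the raw auto-detected table, the equivalent select-and-rename step in pandas might look roughly like this. The file name and raw column headers below are assumptions for illustration, not the exact names Octoparse produces.

```python
import pandas as pd

# Raw export of the auto-detected Amazon results (file name is hypothetical)
raw = pd.read_csv("amazon_raw_export.csv")

# Keep only the columns selected in the walkthrough and give them tidy names,
# mirroring the point-and-click delete/rename steps described above
columns = {
    "Title": "title",
    "Title_URL": "url",
    "Author": "author",
    "Stars": "rating",
    "Ratings": "rating_count",
    "Format": "book_form",
    "Price": "price_sale",
    "List_Price": "price_original",
    "Shipping": "shipping",
}
books = raw[list(columns)].rename(columns=columns)

print(books.head())
```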
Now that the columns are tidied up, click on "Create Workflow". Next, let's perform some processing of the information. Say I don't want the word "shipping" in the shipping column: I click the three dots, then "Clean Data", then "Add Step", then "Replace". I enter the text " shipping" (with a leading space) and replace it with nothing, then click "Evaluate", and you'll see the "shipping" text is removed. Before confirming, I'd also like to take away the dollar sign, so I add another step: replace "$" with nothing, click "Evaluate", and now only the number remains. Click "Confirm" and then "Apply", and the shipping cost column now contains only numbers.

We'll do the same for the sale price and the original price: Clean Data, Add Step, Replace "$" with nothing, Evaluate — you can see the dollar sign is removed — Confirm, Apply; then repeat for the other price column. All of this without a single line of code, just point and click. For the rating column I want to remove " out of 5 stars", since it's the same in every row: click the three dots, Clean Data, Add Step, Replace, paste in the text (notice there's a space before "out of"), replace with nothing, Evaluate — the text is gone — Confirm, Apply. There you go, you're left with just the rating.

We're good to go, so I'll click "Save". It says the task name already exists, because I've previously run this task once, but in that prior run I hadn't edited the parameters like I've done now. So I'll cancel, double-click the task name to edit it, add "pre-processed" to the name, and click "Save" again. Now the task is saved, so I'll click "Run" and run it in the cloud.
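As a side note, if you'd rather do this clean-up step in code after exporting the data, here is roughly the equivalent of those Replace steps in pandas. The file name and column names below are assumptions based on the renaming above, not necessarily the exact headers in the exported file.

```python
import pandas as pd

# Exported Amazon book data (file name is hypothetical)
df = pd.read_csv("amazon_book_data_science.csv")

# Strip the "$" sign and the trailing " shipping" text, mirroring the
# Replace steps configured in the Clean Data dialog
for col in ["price_sale", "price_original", "shipping"]:
    cleaned = (
        df[col]
        .astype(str)
        .str.replace("$", "", regex=False)
        .str.replace(" shipping", "", regex=False)
    )
    df[col] = pd.to_numeric(cleaned, errors="coerce")

# Keep only the numeric part of the rating, e.g. "4.5 out of 5 stars" -> 4.5
df["rating"] = pd.to_numeric(
    df["rating"].astype(str).str.replace(" out of 5 stars", "", regex=False),
    errors="coerce",
)

print(df[["price_sale", "price_original", "shipping", "rating"]].head())
```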
Back on the dashboard, click "Refresh" and you can see that the two tasks we had running earlier are now completed: 211 lines were extracted for the data science videos and 136 for the tweets, while the Amazon data is currently at 16 lines. If I click "Refresh" again we'll see the number update; it's taking some time, so let's go back to the YouTube video information task from a moment ago. Click on it to see the data — this is what we've extracted — and then click "Export Data". You can select whether to export it as an Excel file, a CSV, or an HTML file. I'll click "Export All Data", choose CSV, and click "OK". You could also export to a database or link it with Zapier, but for this tutorial we'll just use CSV. It asks for a location, so I'll save it to the desktop and call it "youtube data science video.csv" — actually I didn't need to type ".csv", because the extension is added automatically, so you can just type the name of the file. Let's open the file: right-click, "Open With", Microsoft Excel. This is the data, organized into its respective columns.

Let's close this task and refresh: the Amazon data now has about 112 lines. Now let's take a look at the tweet data. Click on it — this is the tweet data for the keyword "data science" — and export it, this time as an Excel file. Click "OK", save it to the desktop, and call it "top tweets data science". This time I just typed the file name and it automatically appended ".xlsx" as the file type. Open it from the folder and double-click the Excel file. The formatting here looks a little better than the previous one we converted from a CSV, because the row heights aren't oversized like in the prior run, and you can expand the column widths by double-clicking the column borders. You can analyze this data in Excel, or you can import it into a Python Jupyter notebook and perform your analysis of the tweet data there. For example, I can click "Filter", filter based on the authors, untick all, and select only Data Science Dojo; now only tweets from Data Science Dojo are displayed, with the tweet content right here. This comes in handy when you want to bulk-analyze information from Twitter, and from YouTube as well.

Let's have a look at the YouTube data: click "Filter" and look at the channel names — the videos come from various channels posting on the topic of data science. Here are some of the Data Professor's videos, so let's filter on that. We see two videos: "The Art of Learning Data Science" and "How to Build Your First Data Science Web App" (the Streamlit tutorial, part 1). That Streamlit video is the best-performing video on my channel: it was released 9 months ago and has almost 50,000 views. You can perform other analyses as well — sort in ascending or descending order, or analyze the views for various keywords — so if you're a YouTuber, you can analyze this kind of data too.

Close this and have a look at the Amazon data. Refresh: we now have about 256 lines of data. In a prior run it collected about 1,250 lines, of which 122 were duplicates, and Octoparse can also remove duplicates for you, which comes in very handy.
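If you'd rather do that kind of filtering and sorting in a Jupyter notebook instead of Excel, a minimal pandas sketch might look like the following. The file name and column names are assumptions based on the export described above, so adjust them to match your actual file.

```python
import pandas as pd

# Exported YouTube results (file and column names are hypothetical)
videos = pd.read_csv("youtube data science video.csv")

# Make the view count numeric in case it was exported as text like "49,000"
videos["total_views"] = pd.to_numeric(
    videos["total_views"].astype(str).str.replace(",", ""), errors="coerce"
)

# Keep only one channel, mirroring the filter applied in Excel
one_channel = videos[videos["channel_name"] == "Data Professor"]

# Duplicates can be removed in Octoparse, but pandas can do it too
one_channel = one_channel.drop_duplicates(subset="video_link")

# Sort by views, descending, to find the best-performing videos
top_videos = one_channel.sort_values("total_views", ascending=False)

print(top_videos[["video_title", "total_views"]].head())
```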
Now let's take a look at the Amazon data. I'll export the data up to what we have right now, which is 282 lines, and export it to Excel. Click "OK", call it "amazon book data science", save it to the desktop, open the folder, and click on it. Looking at the shipping column, some rows show free shipping — no shipping cost — so we could actually delete that column as well. Let's have a look at the rest of the data: these are the book titles, then the URLs of the book cover images, the author names, the ratings of the books, the rating counts (the number of times each book was rated), the format of the book (whether it's paperback, Kindle, or hardcover), the sale price, and the original price of the book. This comes in handy when you want to bulk-analyze all of the books on data science from Amazon, instead of having to scroll through the web pages the traditional way.

So, as you've seen, in practically a few clicks of the mouse you can extract information in bulk using Octoparse. As I mentioned earlier in the video, Octoparse is free to try for 14 days — I'll provide the link in the video description — and they're currently having a Black Friday sale, so if you're interested, check it out via the link in the video description. Let me know in the comments down below what kind of information you're web scraping with Octoparse. Thank you for watching until the end of this video. If you're finding value in it, please support the channel by smashing the like button, subscribing if you haven't already, and turning on notifications so you'll be notified of the next video. And as always, the best way to learn data science is to do data science — please enjoy the journey.
Info
Channel: Data Professor
Views: 39,293
Keywords: web scraping, webscraping, web scrape, webscrape, no-code web scraping, no-code webscraping, no code web scraping, no code webscraping, octoparse, octo parse, data extract, data extraction, scrape web data, scrape website, website scrape, website scraping
Id: F6CEXNb54TI
Length: 19min 55sec (1195 seconds)
Published: Thu Nov 18 2021