Web Scraping with ChatGPT Code Interpreter is Mind-Blowing!

Captions

In this video I want to show you how to easily do web scraping using the ChatGPT Code Interpreter. We're going to be able to scrape any website with a method that I'm going to show you, and we're going to do this in a few minutes, so let's get started.

All right, unlike other methods that I showed you in the past to do web scraping with ChatGPT, this one is very straightforward. We're not going to use any plugin or any other method that I showed you before. Instead, we're going to go to the website that we want to scrape. For example, I'm going to start with Amazon, and here I searched for TVs. The first thing we're going to do is press Ctrl+S (or Command+S if you're on a Mac) to save all of this as an HTML file. This one is going to be amazon.com TVs, so I press Save, and now I have this file on my computer.

Then we're going to upload this file to the ChatGPT Code Interpreter. We go to GPT-4, select Code Interpreter, and upload the file: I press this button and select the HTML file that we just saved. If I look at the preview, we see the website, but now in this simple HTML format. We have to upload this file to tell ChatGPT to extract some elements from it, and we're going to use the following prompt: "From the HTML file, extract the name of the product and price, put the data in a table, and export it to a CSV file." So basically we're going to extract only the name of the product, for example Amazon Fire TV, and then the price, 349, and then we want to put this in a table and export it to a CSV file.

Sometimes this is enough, but in this case I'm going to add more details and help ChatGPT by giving it the element where the name of the product and the price are in the HTML file. So here I'm going to right-click and press Inspect, and we're going to get the developer
tools. Here, what we're going to provide is the element where the name of the product is located. In this case it's this element: as you can see, if I select this name, the element is highlighted in blue. So I'm going to copy it and paste it here, and now we have the element of the name of the product. Now we continue with the price. I select the price, and this is its element, so I press Ctrl+C and paste it below. I'm going to tell ChatGPT that this is the element of one of the products, so it helps it get the right element. Here I type "Here is the element of one product," and then I tell ChatGPT that this other one is the element of the price.

Finally, to finish this prompt, which is pretty long, I'm going to deal with the missing data, because as you can see there are some TVs that don't have a price, and if we don't tell ChatGPT what to do, it may duplicate the price of another product. For example, here we have Samsung Electronics, and this one doesn't have a price, so it's probably going to duplicate the 157 from this product or the 1696 from this other TV. So how do we deal with this missing data? I tell it this: "In case the price of the product is missing, just leave that price as null data." And with this we're done.

I'm going to send this message and see how ChatGPT and the Code Interpreter do all of this. As you can see, ChatGPT is extracting the names of the products. For example, first we have the Amazon Fire TV, 43 inches, and it's 349, and as we can see that's the correct price and the correct product. Now it's doing some more work, and now it's creating the table that we wanted, with two columns, the name of the product and the price. Then it provides this CSV file with all the data scraped. I'm going to download this CSV
file and I'm going to open it up. Here I have the products.csv that ChatGPT generated, and now you can see all the data scraped: the Amazon Fire TV 43-inch, Toshiba, Insignia, and more brands, and here we can see the price. If we go here, we see that all the data was correctly extracted. If that's not the case, you can just do some prompting and tell ChatGPT what the mistake was, so you get the right data and the data isn't corrupted.

Now what I'm going to do is extract the same data from the second page. I go all the way down and click on the second page. To show you how you can do this for all the pages, I'm going to do an example with the second page. I close this one and repeat the same process: I press Ctrl+S to save the second page, and now I name it amazon.com underscore TVs 2. I download this, and now I have this file. Then I go to ChatGPT and upload this second page to extract all this data. I press Upload, upload the second page, and then I go to the first prompt that I typed, copy it, and paste it. I'm going to tell ChatGPT that this is the second part of the website, so I type: "This is the second page of the previous website. Use the HTML file to extract the data following the same steps I described before." These are going to be the same steps, but this is just the second page. I press here, and we're going to extract the data from this second page.

Now, as you can see, it successfully extracted the names of the products and the prices from the second page, and now it's concatenating the two pages into one data frame to export it into a single CSV file. So I click on "Download products combined," and now we have this file. I'm going to open this file and look at the preview, and as you can
see, we have more rows. Let's see: the second page starts with Vizio 40-inch, and we should be able to find this product here. So it's here, Vizio 40-inch D-Series Full HD, and the price is 168 if I'm not wrong, and here we can see the same product name and the same price. So we successfully scraped not only the first page but also the second page, and you can continue this process with the third, the fourth, and the fifth page, and as many pages as you want. This is how you scrape data from Amazon using this approach with the Code Interpreter.

Now I'm going to show you another example, and in this second example I'm going to show you a slightly different approach to web scraping. All right, now we're on the Glassdoor website, and we're going to extract the data that you see on the left. Here I typed "data scientist" to find jobs for data scientists. What we're going to do is something similar, but with a different approach. I go to ChatGPT again and open a new chat, and then I go to Glassdoor and save the page as HTML. I press Ctrl+S, then I have this Glassdoor job search file, and I keep it as HTML and add underscore DS, for data science, to the name. Then I save it, and now we have this file. Once we have the file, again we go to GPT-4 Code Interpreter and upload it, so I open Glassdoor jobsearch.html.

We're going to use the following prompt, which is very similar to the previous one, but with some other things that I'm going to add: "From the HTML file, look for the elements with the IDs below and extract the data." So I'm going to use the ID as the identifier for the elements that I want to extract. The elements that I want to extract are the name of the company, then the job title, in this case data scientist, then the location, and finally the job salary. I can use the same approach I
used before, which is right-click, Inspect, and copy the element I want to extract, but sometimes it might not work. In case it doesn't work, you can use a different approach, analyzing just the element that you're using. In this case I have this element, and as you can see there is an ID, and this ID has a very clear name, job title, which represents the name of this job, data scientist. So if I copy the name of the ID, I can just list all the IDs I want to extract. I started with the job title; then I continue with the name of the company, so I go to job employer and copy and paste it again: instead of job title, job employer. Then I continue with the location and the salary. I select the location and use its ID, which should be here, job location, and just copy and paste it. We have one more, the salary, so again I select the salary and then the ID.

Actually, you can use another attribute, not necessarily the ID. I'm just using the ID because it's unique, but you can use the data-test attribute or the class. It might work or it might not; it depends on the website that you're scraping. So finally I have job salary, and as you can see these IDs have numbers, which are not necessary, so I can delete them. Again, you can simply copy the whole element and paste it as we did before, but in this case we're only using these words. Now I tell ChatGPT to put the data in a table and export it to a CSV file. There are some companies that don't show the salary, for example this one, so in case there is missing data, just leave it as null data.

Now I press Enter. ChatGPT says that there is no element with the specified IDs, and that's true, because I deleted the rest of the ID and only left these words. What I can tell ChatGPT is to use regex to match that
part of the ID. So I type the following: "Those are parts of the ID. Use regex to match that part of the name of the ID." With this, ChatGPT is not going to match the whole name of the ID exactly; if the ID contains these words (job employer, job title, location, and salary), that's going to be enough to extract this data. So I send this, and hopefully we're going to be able to extract the data that we wanted.

As you can see, it extracted all the data, and it even told me that there is a mismatch in the number of elements found for each category. So hopefully it left the missing data points as null and didn't duplicate any job title, location, or salary. We're going to verify this. I download the CSV file and open a preview, and as we can see we have this data, so I go to Glassdoor to verify that it's the same data. We have, for example, UCLA Health, data scientist, LA, and 26.20 an hour, and yes, it's exactly this one. Then, for example, we have Snapchat, 205,000 per year, and yes, it's the same. So we successfully extracted all the data. In case it duplicated some rows by mistake, you can tell ChatGPT that it duplicated the rows and that it should leave the missing data as null or as NaN.

All right, let me know in the comment section if this approach to scraping websites using the Code Interpreter is easy, and whether you were able to scrape the website that you wanted. All right, that's it for this video. I'll see you in the next one.
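
A rough Python sketch of the Glassdoor step described above: match IDs by a partial name with a regular expression, leave missing fields empty instead of copying a neighbouring value, and combine several saved pages into one CSV. This is not the code ChatGPT generated in the video, which is never shown, and it uses only the standard library rather than whatever the Code Interpreter picked. The ID layout (a field name followed by a listing number, such as job-title-101), the sample markup, and the names PartialIdParser, extract_rows, and to_csv are all assumptions made for illustration.

```python
import csv
import io
import re
from html.parser import HTMLParser

# ASSUMPTION: IDs look like "<field>-<listing number>", e.g. "job-title-101".
# A real saved page may use a different layout; adjust the pattern to match it.
ID_PATTERN = re.compile(r"(job-(?:employer|title|location|salary))-(\d+)")
COLUMNS = ["job-employer", "job-title", "job-location", "job-salary"]


class PartialIdParser(HTMLParser):
    """Collect the text of elements whose id contains one of the field names."""

    def __init__(self):
        super().__init__()
        self.rows = {}        # listing number -> {field name: text}
        self._target = None   # (listing number, field name, tag) being captured
        self._depth = 0       # open tags with the same name as the target
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._target is not None:
            if tag == self._target[2]:
                self._depth += 1
            return
        match = ID_PATTERN.search(dict(attrs).get("id") or "")
        if match:
            field, number = match.groups()
            self._target = (number, field, tag)
            self._depth = 1
            self._chunks = []

    def handle_data(self, data):
        if self._target is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if self._target is None or tag != self._target[2]:
            return
        self._depth -= 1
        if self._depth == 0:
            number, field, _ = self._target
            text = " ".join("".join(self._chunks).split())
            self.rows.setdefault(number, {})[field] = text or None
            self._target = None


def extract_rows(html_text):
    """Return one row per listing; fields that were not found stay None."""
    parser = PartialIdParser()
    parser.feed(html_text)
    parser.close()
    return [[fields.get(col) for col in COLUMNS] for fields in parser.rows.values()]


def to_csv(pages):
    """Concatenate the rows of several saved pages into one CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(COLUMNS)
    for html_text in pages:
        writer.writerows(extract_rows(html_text))  # None is written as an empty cell
    return buffer.getvalue()


# Made-up markup standing in for two saved pages (not real Glassdoor HTML).
PAGE_1 = """
<div><span id="job-employer-101">UCLA Health</span>
<a id="job-title-101">Data Scientist</a>
<div id="job-location-101">Los Angeles, CA</div>
<div id="job-salary-101">$26.20 Per Hour</div></div>
<div><span id="job-employer-102">Example Corp</span>
<a id="job-title-102">Data Scientist</a>
<div id="job-location-102">Remote</div></div>
"""
PAGE_2 = """
<div><span id="job-employer-201">Snapchat</span>
<a id="job-title-201">Data Scientist</a>
<div id="job-salary-201">$205,000 Per Year</div></div>
"""

print(to_csv([PAGE_1, PAGE_2]))
```

Grouping fields by the listing number in the ID is what keeps a missing salary from being filled with the next listing's value, which is the duplication problem mentioned in the video. If a site's IDs carry no shared number, the rows would have to be grouped some other way, for example by the enclosing job card element.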

Info

Channel: The PyCoach

Views: 216,290

Id: B89Cf4pLNds

Length: 12min 45sec (765 seconds)

Published: Fri Jul 21 2023