Web Scraping with Python and BeautifulSoup is THIS easy!

Captions
If you're looking for the easiest way to scrape data with Python and Beautiful Soup, then you've clicked on the right video. After watching this video you will know exactly how to scrape multiple pages with Python and Beautiful Soup. And make sure to watch till the end, because at the end of this video I will show you how you can use a proxy server to avoid exposing your IP address. Let's start right away.

OK, let's start with making sure that we have the necessary libraries installed. I have VS Code in front of me; I'm going to go to Terminal > New Terminal, type "pip install requests beautifulsoup4 pandas" and hit Enter. Once you've installed these three libraries, we can start creating the actual script.

Let's start with importing some libraries. I'm going to import requests, which is the library that's actually going to fetch the data from the web page. Then, from bs4, I'm going to import BeautifulSoup, which is the library that's going to extract the data we actually need. Then I'm going to import pandas as pd, and we're going to use pandas to store the data in either CSV or Excel format.

Now let's navigate to the website we're going to scrape: books.toscrape.com, a sandbox site full of books. With the Next button you can scroll through the different pages. Scroll to one of the next pages, then in the URL change the page number (3, in my case) to 1 and press Enter; this is the URL we're going to copy. In the script, write url = and paste the URL between double quotation marks. To retrieve the actual page, write page = requests.get(url), then print(page.text) and see what we get: the entire HTML of the web page. It's possible that you only see a part of it, and that's because the HTML is too long.

As I mentioned earlier, we're going to use Beautiful Soup to extract only the parts we actually need: soup = BeautifulSoup(page.text, "html.parser"). The second argument is the parser; you can use any parser, just check the Beautiful Soup documentation to see which parsers are available.

Let's try to get the title: print(soup.title.text), and run the script. I see "All products | Books to Scrape - Sandbox", which matches the actual title of the page.

This website contains 1,000 products and every page lists 20 products, so there are 50 pages in total. If I change the page number to 50 you'll still see books here, but if I change 50 to 51 you won't see any books, and you get a 404. Look at the source of that page and you'll see "404 Not Found" in the title. I'm going to copy that string, because it's possible that this website gets more books, and of course you don't want to just stop at page 50; you want to stop scraping at the first empty page, to make it more dynamic. So: if soup.title.text equals "404 Not Found", we stop.
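Here is a minimal sketch of the script at this point. The URL pattern and the two title strings are what books.toscrape.com returns at the time of writing; verify them on your end:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Page 1 of the catalogue; the page number in this URL is what
# we will increment later to walk through all the pages.
url = "https://books.toscrape.com/catalogue/page-1.html"

page = requests.get(url)                        # fetch the raw HTML
soup = BeautifulSoup(page.text, "html.parser")  # parse it

print(soup.title.text)  # "All products | Books to Scrape - Sandbox"

# An out-of-range page (e.g. page-51) comes back with the title
# "404 Not Found" -- that is our signal to stop scraping.
```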
Here I set proceed = True, and if we get a 404, proceed becomes False; everything else goes into an else branch. Then I'm going to put this in a while loop: while proceed. You could write something like while proceed == True, but if you remove the comparison it means the same thing. Next we move the url, page, and soup lines inside the loop, and we also add current_page = 1 above it. In the URL, I remove the hard-coded 1 and add current_page, converted to a string with str(). So current_page starts at 1, we build the URL from the current page, we fetch the entire page, and we check whether the title is "404 Not Found"; if it isn't, we proceed. At the bottom of the while loop I write current_page += 1, so we add one to the current page. (We're getting an error because the else branch is still empty; the else is where we're going to do the actual data extraction.)

Let's take one more look at the page: right-click and click Inspect. You see that all the products are part of a list, but there are also other lists on this page. If you search for list items, you'll see that the navigation is also built with a list. So we have to tell the script a bit more than just "extract list items", because otherwise it's also going to extract the navigation menu.

The other thing worth noticing (I'm going back to page 1 here): the book "A Light in the Attic" is displayed with three dots, because the visible title is cut off. But if you inspect the image, you'll see that the entire title, "A Light in the Attic", is part of the alternative text of the image element. So we can fetch the title of a book even if it's longer; we just have to use a slightly different method.

Let's navigate back to VS Code. What we're going to do is loop through all the books on this page, and for every single book fetch the attributes we want. First, let's get all the books. I create a new variable: all_books = soup.find_all(...). What we want to find is list items, but the list items need to have the classes you saw in the inspector; that's how you distinguish books from, for example, navigation. So add a comma, then class_=, and paste in all the classes.

That's our method to extract all the books. Then we loop through them: for book in all_books, and for every book we create a dictionary called item, which I'm going to fill with all the attributes. Let's start with the title (I'm using capitalized keys here): item["Title"] = book.find("img").attrs["alt"] searches for the image and takes the attribute I want, the alt tag, the alternative text. Then I also want the link, which is nothing more than a link to the detail page where you can find more information on this book (if you click a book, that's where you end up). So: item["Link"] = book.find("a").attrs["href"]; this time we find the a tag, and the attribute we want is the href.
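As a sketch, here's what the loop looks like so far. The class string on the product list items is the one books.toscrape.com uses at the time of writing; copy it from your own inspector:

```python
proceed = True
current_page = 1

while proceed:
    # Build the URL for the page we're currently on.
    url = ("https://books.toscrape.com/catalogue/page-"
           + str(current_page) + ".html")
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")

    if soup.title.text == "404 Not Found":
        proceed = False  # first empty page: stop scraping
    else:
        # Only list items with these classes are products; a bare
        # "li" search would also match the navigation menu.
        all_books = soup.find_all(
            "li", class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")
        for book in all_books:
            item = {}
            # The full title lives in the image's alt text, even
            # when the visible link text is truncated with "...".
            item["Title"] = book.find("img").attrs["alt"]
            item["Link"] = book.find("a").attrs["href"]

    current_page += 1
```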
Then let's navigate back to the page, because I also want the price. Right-click it and click Inspect, and you'll see that it's a p element with the class price_color. Let's scrape that one as well: item["Price"] = book.find("p", class_="price_color").

One thing that's important here, and I'm going to show it: the first thing I do is temporarily set proceed = False inside the loop, because otherwise the script is going to scrape all 50 pages on every run, and I just want to run it once. Now let's print the price on the screen with print(item["Price"]) and run it. You'll see that it gets the prices, but there is one thing you still have to do, and that's add .text. Do that (price_color, then .text), run the script again, and the surrounding tag disappears; with .text you only get the content of the HTML element. But you'll also see that before the actual price there are two other characters. I'm going to use string slicing to remove them: I only want the characters from index 2 onward, so I skip character zero and character one. Run it again and now we get clean prices.

Something else I want is the stock status, "In stock". Click Inspect; this is also a p element, so let's copy its class as well: item["Stock"] = book.find("p", class_="instock availability").text. Let's see what we get for the stock: we do get "In stock", but there is a lot of whitespace around it. To avoid that, I add the strip method: .text.strip(). Run it again and the whitespace has been removed.

Then I'm going to add a list with the name data: data = []. At the end of every book iteration I say data.append(item), so I add the item to the actual list. You'll notice that it takes quite a while to scrape 50 pages, so I'm also going to print the page I'm currently scraping: print("Currently scraping page " + str(current_page)), with str() again to convert the number to a string. OK, let's remove the temporary proceed = False and run the script.

Then it's time to save your data, either to an Excel file or a CSV file. I create a DataFrame here: df = pd.DataFrame(data). Then you can either save it to an Excel file with df.to_excel("books.xlsx"), or save it to a CSV file with df.to_csv("books.csv").
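Putting everything together, the assembled script looks roughly like this (a sketch under the same assumptions as above; note that to_excel also needs the openpyxl package installed):

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

data = []            # one dictionary per book
proceed = True
current_page = 1

while proceed:
    print("Currently scraping page " + str(current_page))
    url = ("https://books.toscrape.com/catalogue/page-"
           + str(current_page) + ".html")
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")

    if soup.title.text == "404 Not Found":
        proceed = False
    else:
        all_books = soup.find_all(
            "li", class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")
        for book in all_books:
            item = {}
            item["Title"] = book.find("img").attrs["alt"]
            item["Link"] = book.find("a").attrs["href"]
            # [2:] drops the two stray characters that precede
            # the actual price.
            item["Price"] = book.find("p", class_="price_color").text[2:]
            item["Stock"] = book.find(
                "p", class_="instock availability").text.strip()
            data.append(item)

    current_page += 1

df = pd.DataFrame(data)
df.to_excel("books.xlsx")  # keep whichever of these two you need
df.to_csv("books.csv")
```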
I'm showing you both options; just remove the one you don't want to use. Let's run the script, then reveal the folder we're working in (right-click in VS Code and choose Reveal in File Explorer), and you'll see that both an Excel file and a CSV file have been created. Open them and you'll see there is a title, a link, a price, and the actual stock.

Now, in the link column (column C) we have relative URLs, not absolute ones. If you want entire URLs, copy the base part of the site's URL and put it in front of the link in the script, then run your script again, and you'll see that instead of a relative path you now have an absolute URL in your Excel file.

At the start of this video I promised to tell you a bit about using proxies in web scraping. What you see here is another script: I'm scraping a website that simply shows the IP address of whoever requests it, and at the moment I'm not using any proxy yet. You can see that I'm calling requests.get with the URL and also with proxies=proxies, but that line of code is commented out for now. If you run the script, you'll immediately see my current IP address.

By using a proxy, all my traffic is redirected via another server, and that's why the target website cannot see my current IP address. When I uncomment the proxies line, you'll see that my IP address changes: the target website is only able to see the IP address of the proxy server, and my IP address is hidden.

It's possible to get access to some free proxy servers on the internet, but if you want to get serious about web scraping, and if you want to avoid a lot of hassle, I definitely recommend going with a paid proxy service. The product that I've bought is residential proxies, and if you navigate to my account you can see that you can buy this product for 10 bucks, pay as you go, so there is no subscription; for 10 bucks you can scrape a lot of websites without exposing your IP address. After buying residential proxies, navigate to the product, go to Users, and create a user: click New User, provide a username and a password, and click Create User. Those are the username and password that you then provide in the script, and from now on you can scrape the internet without the risk of your IP address being exposed.

If this video was helpful for you, please give it a thumbs up, and don't forget to subscribe to my channel. I'll see you in my next video!
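For reference, a minimal sketch of that proxy script. The proxy host, port, and credentials are placeholders (your provider supplies the real ones), and httpbin.org/ip stands in for the IP-echo site used in the video:

```python
import requests

# Placeholder credentials -- substitute the username, password,
# host, and port from your own proxy provider.
proxies = {
    "http": "http://username:password@proxy.example.com:8080",
    "https": "http://username:password@proxy.example.com:8080",
}

url = "https://httpbin.org/ip"  # echoes the caller's IP address

# Without proxies=... the site sees your own IP address; with it,
# it only sees the proxy server's.
response = requests.get(url, proxies=proxies)
print(response.text)
```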
Info
Channel: Thomas Janssen | Tom's Tech Academy
Views: 21,516
Keywords: web scraping, python scraping, web scraping with python, web scraping with beautifulsoup, python beautifulsoup, data scraping, artificial intelligence, machine learning, ai, artificial intelligence openai, artificial intelligence openai chat gpt 3, chat gpt 3, chat gpt 4, machine learning tutorial, machine learning python, machine learning tutorial in python, machine learning tutorial for beginners, python, uipath, rpa, coding
Id: nBzrMw8hkmY
Length: 15min 51sec (951 seconds)
Published: Wed Dec 27 2023