Web Scraping to CSV | Multiple Pages Scraping with BeautifulSoup

Captions
Hello, and welcome to this video on web scraping. In this video we are going to scrape 50 pages of a website, grab the title of each book, its price, and the number of stars it has, and export this information to a CSV file, all in a few lines of code. We are going to scrape this website, and you can see it has 50 pages, so we will scrape all of them and get the information we want. That is what web scraping is all about: you grab the information you need from a website, because copying and pasting all of it manually would be tedious. So we are going to write a program that does that for us.

But how does it do that? Well, we need to go to this website called books.toscrape.com. This is actually a playground for web scraping where you can practice. It looks just like a normal web shop, like Amazon or other e-commerce stores, and all of its information is stored in an organized way. So how do I know which information I can access? On the page I just right-click and press Inspect, which shows me the whole HTML structure of the page. For example, if I hover over an article, you can see this part of the page is highlighted, and for the next one, this part is highlighted. Let me open some of them: if I hover over this one, this card is highlighted, and this one, too. All of these are cards, each one a list item that contains an article tag with a class of product_pod.

So if I want to get the title, the stars, and the price, I need to send a request to get this page and then target the tag that is responsible for displaying this information. In our case that is the article tag inside the ordered list. You can see that when I hover over it, all of it gets highlighted, which means everything we need is inside this ordered list, ol, and then you can see article. If I
expand the article, you see we have an image container, and it gets highlighted when I hover over it. There is a star rating, as you can see. There is an h3 tag, which has the title in it; well, part of the title, because it's a bit long. And then we have a product price, which is this part, and you can see I can find the price here under a paragraph tag with the class of price_color.

Now the issue is that under the title I can't see the full title. So where can I get the full title of this book? I realized that under the image container, inside the a tag, there's an img tag with an alt attribute, which is more like a description of the image, and this alt attribute has the full title. So we are going to take the value of this alt attribute and use it as the title of our book. The same is true for the next one; it's the exact same process. If I go to the second list item and its article, under the image container and under the a tag there's an image, and this image has an alt of Tipping the Velvet, which is the name of this book.

Now what about the stars? How do I get the number of stars? Inside the article tag we had the image, and now the star rating. You can't find any useful information inside it; the useful part is the class name Three. This Three, which is one of the class names of this paragraph tag, refers to the number of stars the book has. So what we can do is grab the class name of this paragraph tag and use it as the star rating. For example, this book, Tipping the Velvet, has one star. If we go down to its star rating, you see p class star-rating One, so it has only one star. That is how we are going to grab the star rating, and you also saw how we do it with the price tag.

So again, one last time: we need to get access to this web page, then go into this ol and grab the whole list, and then loop through each item to find the
article, and inside every article we are going to find the image alt, the star rating, and the price.

Now let's get to coding. I've opened Google Colab. This is an online IDE, so I don't need to install any libraries on my own machine; I do it here. The first thing we need to import is the requests library. If I press Shift+Enter, it runs the cell and moves to the next one. Now I have access to this library, so I can send requests to grab this web page. Let me show you how. If I go down and press next, for instance, you can see we are on page two. I want to go back to page one, and I will tell you later why I've done that. I am going to copy this URL and put it inside a url variable, in double quotes.

Now I'm going to send a request to get a response from it, so I say response = requests.get(url). If I run this I don't get any output, but if I print response, you see I get a code of 200, which means the request was successful, so we have access. If we got something like 404, it would mean the page was not found. There's a list of all these response codes you can find online. I don't need this code itself; what I need is the content of that response, so I'm going to save response.content into response and run this. Now if I display response, you see a huge bytes object. It looks like a string, and all the HTML code is in here, but it is not parsed HTML yet; you can see this b prefix here. So I need to turn this into HTML that I can navigate.

This is where Beautiful Soup comes in. We write from bs4 import BeautifulSoup, with an uppercase B and an uppercase S. Let's run this. Beautiful Soup is the library we are going to use to scrape the website. The first thing you need
to do is take the response that we got; let's copy it down here. We're going to make a soup variable (you can call it whatever you want), and we use the BeautifulSoup class that we just imported and pass in two arguments. The first one is the response that we just got. But we also need to specify that this is an HTML page, so the second one is html.parser, to parse this HTML for us. Now I run this, and if I display soup, you can see it is the HTML code, with the list items and all that.

If I go further down, under section, this is the ordered list we are going to grab, which contains all the list items and articles. Inside the article we have the image with its alt attribute, which is the name of the book. Then for the stars, under the p tag there is a class of Three, which means it received three stars. Then for the price, we again have a p tag, with a class of price_color, and this is the price. So we need only these three pieces of information.

Okay, so this is soup, as you just saw. I said we need to grab the ordered list within that soup, because all the cards, all the books, are list items inside this ordered list. So we're going to find the ordered list first. We can just call it ol, and then we say: go inside the soup, that is, all the HTML structure that we just grabbed, and use the find method, which finds the first instance of ol, written inside double or single quotes. It goes inside the soup, the HTML that we just saw, and finds the first ordered list, and you can see this is the first one anyway. Now that we have saved this inside the ol variable, we
are going to find all the articles, because you can see the articles are all here, and inside the articles you have the price and everything else. I'll make another variable, let's just call it articles, and say: go inside the ol and find_all, which finds all the instances of the article HTML tag. If you think there are some other article tags as well and you want to be more specific, you can say, okay, I want to find only the article tags with this class. We can specify that here after a comma with class_. If I just wrote class, that would be the Python keyword class, but this is a CSS class, so it is class_ equals the name of the class. Now we have all the articles that are inside this ol. Let's run this.

Now what shall we do? We have a list of all these articles. I can print out articles for you to see: it is a list that starts here with an article, the article ends somewhere down here, and then another one starts.

This time we are going to loop through all the articles and grab what we just talked about: the image alt value, the star class, and the price. So let's say for article in articles (or i, or whatever you like), where articles is the list that we just made, and then we need to do something. Let's first grab the image. Where is the image? It is inside the individual article, so we say article.find, which finds the first img tag (there's only one anyway), and save it inside image. Now we need to get access to the alt attribute, so I would say, let's just call it title, equals image, the image that we just found, dot attrs, which stands for the attributes of that image. And what is the
attribute we're looking for? alt. That gives us the title. Now if I print title inside the for loop, let's see what happens: you can see I have access to all the titles of the books on the first page. So that is how it works: we found the image first, and then we said, now that we have the image, I want attrs, alt, and save it inside title.

Okay, now we have the title. What about the stars? Let's find the star rating again. It was here, a p tag with this class, and remember this is the first p tag inside the article, so it's easy to find. I would say star equals article.find p, that is, find a p tag. But there is an issue here: if I print this out, it prints the whole tag. I want the class name it has. So I redefine it: star equals star, the same star, and this time get me class. The tag works like a dictionary here, so if we ask for class, it gives us the class names as a list, since there can be more than one. Now if I print star, you can see we have a list of these classes: star-rating Three, star-rating One, star-rating One. But I don't need the word star-rating; I just need the rating itself, which is Three, One, One. So how do I access that? I ask for index 1. This one is index 0, so I say index 1. Now I only get the ratings, you can see: Three, One, One, Four. Great.

Okay, so now we have the stars as well. The price should be much easier, I guess. Where can we find the price? It's inside a paragraph tag with this class. You just need to zoom in and narrow it down as much as possible to be specific. So it is a p tag with the class name price_color: price equals article.find, a p tag, and the class name is
equal to that. That's right, but there is an issue: if I print price now, it prints out the whole HTML tag, p class and then the price inside. I just need the text inside it, so at the end of this find I add .text, and now you see we have just the price. If you're happy with this, then it's fine, but sometimes you need numbers, so that you can add them up or sort them from high to low or low to high, and this is still a string. So let's do something about it. I can say price equals price from index 1 onwards. The symbol is at index 0, and I want only the part from index 1 onwards. That gets rid of the Great British Pound symbol. But this is still a string, so I can turn it into a float; a float is a number with decimals. I wrap it like this, and it looks the same, obviously, but now it's a number that you can add to or subtract from.

Okay, we now have our title, stars, and price, so let's put them all together. Let's define a list of books here: books equals an empty list, and I'm going to add all this information to it. I say books.append, so we add each book to this list: the title, star, and price. I put them inside another list, so that will be title, then maybe price first, and then star. Now I can print books down here, and you will see it's a list of lists: this is the first book, for example, this is the second, the third, and so on.

So far so good, but what we have been doing is only for one page, the first page. What about pages 1 to 50, all of them? Well, the trick is that here in the URL we have page-1. If I change it to, for example, I don't
know, 44, then it shows me that page, and I can go up to page 50 like this. So I just need to update this number. First I want to put everything in one cell: this URL, which has the number one only, then the request, then the content, the soup that we made, the ordered list we grabbed, and the articles inside the ordered list.

What we need to say now is: you did this process only for page one, so do the same process for page two, page three, page four. That is why we are going to put it inside a for loop. I would say for i in, and give it a range of 1 to 51, that is, up to but not including 51, so from 1 up to and including 50, because we know there are 50 pages. So for pages 1 to 50, do something. Do what? Let's indent everything with a tab under this loop. Now what is i? i is going to be 1 and the loop runs all of it, then i will be 2 and it runs all of it again. So instead of this page-1, we can now put i, and at the beginning let's put an f to make it an f-string, so the page number is now a variable. The first time, i will be 1, so it is page one, and we send a request to page one and do all of that. When it's done, i will be 2, so it is page two, and it goes all the way up to page 50.
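The steps described so far can be put together as one short script. This is a minimal sketch under stated assumptions, not the exact code typed in the video: the full URL pattern in BASE_URL, the function names parse_books and scrape_pages, and the hand-written SAMPLE_HTML (including its prices) are made up for illustration, and the real page markup may differ in details.

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL pattern: the video only shows "page-1" in the address changing per page.
BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"


def parse_books(html):
    """Return one [title, price, star] row per book card found in a page."""
    soup = BeautifulSoup(html, "html.parser")
    ol = soup.find("ol")  # first ordered list, which holds the cards
    articles = ol.find_all("article", class_="product_pod")
    rows = []
    for article in articles:
        image = article.find("img")
        title = image.attrs["alt"]  # the alt attribute carries the full title
        star = article.find("p")["class"][1]  # e.g. ["star-rating", "One"] -> "One"
        price = article.find("p", class_="price_color").text
        price = float(price[1:])  # drop the leading currency symbol
        rows.append([title, price, star])
    return rows


def scrape_pages(last_page):
    """Fetch pages 1..last_page and collect every row (needs network access)."""
    books = []  # created once, outside the loop, so rows accumulate
    for i in range(1, last_page + 1):
        response = requests.get(BASE_URL.format(i))
        books.extend(parse_books(response.content))
    return books


# Hand-written stand-in for one page, so the parsing can be tried offline.
SAMPLE_HTML = """
<ol>
  <li><article class="product_pod">
    <div class="image_container"><a href="#"><img src="a.jpg" alt="Sample Book"></a></div>
    <p class="star-rating Three"></p>
    <h3><a href="#">Sample ...</a></h3>
    <div class="product_price"><p class="price_color">£20.00</p></div>
  </article></li>
  <li><article class="product_pod">
    <div class="image_container"><a href="#"><img src="b.jpg" alt="Tipping the Velvet"></a></div>
    <p class="star-rating One"></p>
    <h3><a href="#">Tipping the ...</a></h3>
    <div class="product_price"><p class="price_color">£10.50</p></div>
  </article></li>
</ol>
"""

sample_rows = parse_books(SAMPLE_HTML)
print(sample_rows)
```

Calling scrape_pages(5) would fetch the first five pages, and scrape_pages(50) all of them. Keeping the parsing in its own function is a design choice of this sketch; it lets you check the tag logic on a small HTML snippet before sending 50 requests.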
But then we don't need to have this books list here inside the for loop, and we don't need this one either; I will tell you why. Okay, so this is basically the indentation. It is the same process; it just goes over it for 50 pages and saves everything inside a list.

But what about a CSV file? How do I export it? Very good question. Let me get rid of these now. Here let's import pandas as pd, and we're going to use the pandas library to export this information to a CSV file. If you're using your own machine, you have to install these first: pip install pandas, pip install beautifulsoup4, and pip install requests. Don't forget that. Now that we have access to pandas, let me run this so that everything from all 50 pages is saved inside books. You see it's running, and it's going to take some time, because it's 50 pages, obviously. Or I can stop it and just go for four or five pages. Yeah, I'm going to stop this; it's just too much. Let's do it for five pages, for example, and it should be faster. Yes, it's done, and everything is inside this list of books.

pandas works with data frames, and data frames are basically tables, like CSV files or spreadsheets. So we need to create a data frame, like df (you can call it whatever you want), and we use pd, which is pandas as we imported it, dot DataFrame. We are going to turn something into a pandas data frame, and that something is the books that we have. Remember, books is a list of lists, and it doesn't have any column names at the moment, so we can give it some: columns equals
another list, and the first one should be title, the second was price, and then the star rating. These are going to be the columns for these lists. I press Shift+Enter to run this.

Now that I have the data frame, I can turn it into a CSV file, and it's easy: I say df.to_csv and give it a name, like books.csv for example, and press Shift+Enter. Where does it go? It goes here, and you can see I have it. I can download it if I want, or I can display it right here. Let's see what we have: title, price, and star rating. Since the price is a float now, I can click on the column and order the books from the most expensive to the least expensive, or low to high as well. They're all here, and you can show, I don't know, 50 of them, for instance.

So that was what we did. It was awesome and it was easy, in less than half an hour. I hope you liked it. It really matters if you like this video, share it with friends, or leave a comment so that it gets seen. Thank you very much for watching and listening.
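The export step can be sketched as follows. It is a minimal example, assuming books is a list of [title, price, star] rows like the one built in the video; the two rows here are made-up placeholders, and index=False is a choice of this sketch (it leaves the row numbers out of the file) rather than something shown in the video.

```python
import pandas as pd

# Placeholder rows standing in for the scraped list of lists.
books = [
    ["Sample Book", 20.0, "Three"],
    ["Tipping the Velvet", 10.5, "One"],
]

# Name the columns in the same order as the values in each row.
df = pd.DataFrame(books, columns=["Title", "Price", "Star Rating"])

# Write the table to books.csv in the current working directory.
df.to_csv("books.csv", index=False)

# Because Price is a float, the rows can be sorted numerically.
by_price = df.sort_values("Price")
print(by_price)
```

Passing ascending=False to sort_values would list the most expensive books first.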
Info
Channel: Pythonology
Views: 107,845
Keywords: WEB-SCRAPING, web_scraping, webscraping, beautifulsoup, python web scraping, web scraping tutorial, web scraping, web scraping multiple pages
Id: MH3641s3Roc
Length: 29min 5sec (1745 seconds)
Published: Mon Nov 07 2022