Web Scraping for Beginners with Python and Selenium 4

Video Statistics and Information

Captions
This video is sponsored by Brilliant. In this video, we'll learn how to web scrape with Python and Selenium 4. We'll scrape audiobooks from the Audible website and learn Selenium from scratch. We'll also see some of the changes between Selenium 3 and Selenium 4. So let's get started.

All right, the first thing we're going to do is open up a terminal. Here I'm going to open the terminal inside PyCharm, so I click here, and now we're going to install Selenium. We type pip install selenium and press Enter, and with this we install Selenium. In this case, I already have Selenium installed on my computer and, if I'm not wrong, I have version 4.12.

Once you have Selenium, the next thing we have to do is install ChromeDriver. I'm going to close the terminal, go to Chrome, and open this website, chromedriver.chromium.org/downloads, which I'm going to leave in the description. Here, what we have to do is download a ChromeDriver compatible with our version of Google Chrome. To find our version of Google Chrome, we go to the three dots in the right corner, click on them, go to Help, and then click on About Google Chrome. Here we see the version that we have, in this case 121. Then we go back to the ChromeDriver page and we see different versions, 114 and 113, and then there's this message that says that if we're using Chrome version 115 or newer, we have to check here. So we click here and then we download ChromeDriver.

Once we've downloaded ChromeDriver, what we have to do is copy the path of the chromedriver file. In my case, here's my path. This path is going to be different on your computer, and that's fine.

Now let's start with the tutorial. The first thing we're going to do is import webdriver from Selenium, so we type from selenium, which we already installed, import webdriver. Then, in the case of Selenium 4, we have to import Service too, so what we're going to type is from
selenium.webdriver.chrome.service import Service, and this is an additional step we have to do in Selenium 4.

Then we have to create a variable, which we're going to name web, and it's going to hold the link of the website that we want to scrape. In this case, we're going to scrape this website, audible.com, which has information about audiobooks. What we're going to do is just copy this audible.com/search link and paste it here. All right, so far so good.

Now we have to create a variable named service, and here we're going to use Service, which we imported before, with executable_path equal to path. So executable_path=path; this is basically the variable that we created before. Now we create a driver variable, which we're going to use to interact with the website and extract data from it. Here we type webdriver, then .Chrome, and inside we type service=service. That's pretty much it. That's how we start working with Selenium.

Now, just to do a simple test, I'm going to open a window. To do this, we type driver, which as I said before we're going to use a lot: driver.
get and, inside, the variable web. What this means is: open ChromeDriver and open the link of this website, audible.com/search. So I'm going to execute this and see what happens. As we can see, a new window opens, and here we have the message that says that Chrome is being controlled by automated test software. This indicates that we're using ChromeDriver. Now I'm going to quit this, and that's how you go to a website with Selenium.

Now let's go to the website that we want to scrape and examine the data that we want to extract. Here we are on audible.com, and what we're going to extract is the information of these audiobooks. As we can see, we have 20 items on this page. If we want to extract data from this page, or from any page, what we have to do is right-click and then click on Inspect, and we get the developer tools.

Now we have to locate the elements that we want to extract. In this case, what we want to extract is the name of the book, the name of the author, and the length of the audiobook. To do this, we're going to locate the item of the audiobook, and to locate this item we click on this button that we can see on the left. We click here and then we can select an element, and as we hover over it, Chrome developer tools shows us at the bottom the HTML element that represents the element we're hovering over.

In this case, I'm hovering over this first item, Mother Faker, which is the name of the audiobook. If I click here, we get the HTML element that represents this item. This is a div, and if we go one element above, we get this other element, which is basically the same element, but this one has the tag li. This li tag stands for list item. If I minimize this, we can see that there are multiple li elements: here is the first, the second, the third. If we
scroll down, we see the third, the fourth, the fifth, and so on. So what we can do is create the XPath of this li item to extract all the items on this page, that is, all the audiobooks. We're going to create the XPath right now.

To create an XPath, and also to test whether it's working, we press Ctrl+F and we get this box where we can type anything, including an XPath, and see if it works. Usually, to create an XPath, we need a tag name, in this case li; an attribute name, in this case class; and the attribute value, in this case bc-list-item productListItem, the text in orange. Those are the three main elements.

So let's start creating this XPath. We start by typing a double slash, then li. Well, we lost the element, so again I'm going to press this button on the left and select it again, and we're here again. So we have li, and then we open square brackets, then we type @, then the name of the attribute, in this case class, then equals, then we open quotes and type the attribute value. Here again I'm going to select this item, and then we just copy the value in orange, so I double-click, press Ctrl+C, and then Ctrl+V to paste it. And this is the XPath.

This usually works when we have one attribute value, but in this case we have multiple attribute values. As you can see, this is the first attribute value and this is the second. Sometimes XPath gets tricky when we have multiple attribute values, and to deal with this we can use a function called contains. What we're going to do is locate an element if it contains the attribute value productListItem. So we delete all of this and write everything inside the square brackets again. Here we use the contains function, as I said before: I type contains, then parentheses, and then @, in this case @ with
the name of the attribute, in this case @class. Then we type a comma, and then the attribute value we're interested in. In this case, we're only interested in the second value. Just let me locate this item again. The second value of this attribute is productListItem, so here I paste productListItem. As you can see, this XPath is now locating 20 elements. Here you can see 1 of 20, so 20 elements were located, and each element is the item that contains all the information of the audiobooks we want to extract. So we get the 20 audiobooks.

Now that we have this XPath, we go to PyCharm and paste it. This is the XPath that we created, and it locates each item. Now let's tell Selenium to locate this XPath. To do this with Selenium, we type driver again and then find_elements. There are two options, find_element and find_elements, and we're going to use find_elements, in plural, because we want to locate multiple elements.

Then, inside, we have to type the method that we want to use. This is for Selenium 4. In Selenium 3 we had a different way to locate elements, something like find_elements_by_xpath, but in Selenium 4 what we do is something similar: we just type by equal to and, inside, the method, in this case "xpath". Then we type value, and inside value we put the XPath that we created. So we cut this and paste it inside, and that's pretty much how we locate an XPath using Selenium 4.

Now let's assign this to a variable. I'm going to type an equals sign and create a variable that I'm going to name products. This products variable represents the audiobook items on the page, so with this we're locating all the items on this page.

All right, we'll continue extracting this data in a few seconds, but first I'd like to take a moment to talk about today's sponsor, brilliant.org. Today we're learning
how to extract data from websites. Once you extract this data, there are many things you can do with it, and with Brilliant you can learn some of them. Brilliant is the best way to learn data analysis interactively. It has thousands of lessons, from math to data analysis, with new lessons added monthly. I recommend you take Brilliant's data analysis path to build a solid foundation in data analysis, with visualizations and data transformation. The best part is that Brilliant has interactive exercises that will help you develop your analytical thinking, which is better than just memorizing formulas or equations. I like the Brilliant app because you can play with the variables using your phone and see how the output and graphs change in real time. To try everything Brilliant has to offer free for a full 30 days, visit brilliant.org/ThePyCoach. The first 200 of you will get 20% off Brilliant's annual premium subscription. Thank you, Brilliant, for sponsoring this video, and now let's go back to the video.

All right, we have the 20 items, so we have each of the items on this page. Now what we have to do is get the three elements that we said we want to extract from these items: the name of the audiobook, the name of the author, and the length. To do this, we click again on this button on the left and first click the name of the audiobook, in this case Mother Faker. We click here and we get this h3 element, and this h3 element contains the name of the audiobook. If we open it, we see that the name of the audiobook is here, inside this h3 element.

We can also see that this h3 element has multiple attribute values: bc-heading, then bc-color-link, then bc-pub-break-word, and one more. So we have to use the contains function again, and we're going to do this right now. I'm going to use the XPath that we had before as a template for this new XPath. So here, instead of li, which is the name of the tag, we're going
to type h3. I delete li and type h3, and we keep the contains function. The name of the attribute is class, the same as before, and here we have to type one of the attribute values. In this case, the most representative one, in my opinion, is bc-heading. I think it represents an audiobook title well because it's the heading, while the others, bc-color-link and bc-pub-break-word, may be in other elements, so that's why I'm not using them. I'm going to use just the first one, bc-heading.

So here I paste bc-heading and, as you can see, now we have 24 elements. We have four above, and then we start with the first, Mother Faker, then Defiance of the Fall 12, and so on. So there are four extra elements that we don't want. They're not audiobooks; they're just some hidden elements in the page. Don't worry about this, because we already took care of it when we created the first XPath. With the XPath that we created before, we located the items first, so each audiobook item, and with this new XPath we're only going to extract data inside those items. So we're going to ignore those four extra elements that were located.

So here we have the titles of the audiobooks, and I'm going to copy this XPath and paste it below. We have the first XPath, and now let's create the same kind of XPath for the name of the author and for the length of the audiobook. I'm going to quickly locate the name of the author, and we get this span element, and one element above is this li element with authorLabel. Let's quickly create this XPath: instead of h3 we type li, then we keep contains and class, and here we paste authorLabel. In this case we have only 20 elements, and this successfully located the name of the author. So we copy it and paste it here.

Finally, we're also going to extract the length of the audiobook. So here we click on the length and we get
this span element, and one element above it there is this li element with the attribute value runtimeLabel, which represents the length of the audiobook well, so I'm going to use this one. I copy runtimeLabel and paste it here, and with this we successfully located the length of the audiobook. So I'm going to copy this XPath and paste it here.

All right, now we have these three XPaths and we've almost finished scraping this website. We only have to type the Python code to locate these elements. So here I'm going to type products, then a dot, and use find_element: products.find_element. With this, we're going to locate the new elements inside the XPath that we located before. Here we have the items, and we're going to focus our extraction inside these items; anything outside these items is not going to be located.

All right, now we have to type by again, then xpath, then value again, and here we put the element that we want to locate. We start with the name of the audiobook, so we copy its XPath and paste it inside. Something else that we have to add to this XPath is a dot at the beginning. This dot indicates that we want to use the products variable as the reference for this new search. So again, we're going to search inside the product element, inside this first XPath; basically, the dot represents this whole element. Now I'm going to type .text to extract only the text inside this element.

Now I'm going to duplicate this line to put the other XPaths inside value. So here, instead of this, I paste the XPath for the name of the author, and then, instead of this, I paste the XPath for the length of the audiobook.

All right, now we have to do an extra step, because this products variable is a list. Why is it a list? Because we used the find_elements function here, and find_elements always returns a list, since we extract multiple elements that are stored in a list. So what we have to do is loop through
this list. We type for product in products, and we put these three lines inside the for loop. Instead of using the products variable, we use the product variable, in singular, so with this we have product.find_element and the rest. Now, to make a test, I'm going to print the first element, which is the name of the audiobook, and to finish I'm going to type driver.quit(). So we type driver.quit(), we run this, and we should get the names of the audiobooks listed on this website. It opens the website audible.com/search, and now it's going to print the first element, and here we have the names of the audiobooks printed in our terminal. With this, we successfully scraped this website.

Now I want to quickly show you how to store the data that we extracted in a CSV file. To do this, we open up the terminal and install a library called pandas. So we clear this and type pip install pandas. After we install it, we have to import pandas, so we type import pandas as pd, and then we can use pandas to store all this data in a CSV file.

What we do first is create some empty lists where the data is going to be stored, in this case book_title, book_author, and book_length. Then I'm going to append the elements to these empty lists: first book_title, so book_title.append and then parentheses, and the same with the rest of the empty lists. Finally, I'm going to create a DataFrame with these lists, which hold the name of the audiobook, the author, and the length. I'm going to name this DataFrame df_books, and finally I'm going to export it as a CSV file named books.csv.

So now I'm going to run this, and we should get a new file named books.csv in our directory. All right, the process finished successfully. Then we check our directory, and here we have books.
csv. We double-click on it and we see that we have a CSV file with three columns, title, author, and length, and here we see the data of each column. With this, we can see that we extracted the 20 elements, and we successfully created our web scraper with Selenium. That's how we extract data using Selenium. In the next videos, we're going to use the data that we extracted to start a data analysis project in Python. And that's it. I'll see you in the next video.
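
The steps described in the captions can be sketched as one short Python script. This is a reconstruction under stated assumptions, not the video's exact code: the class tokens (productListItem, bc-heading, authorLabel, runtimeLabel) are read off the spoken captions, the helper contains_class_xpath and the function names scrape_audiobooks and save_books are hypothetical, the full https://www.audible.com/search URL and index=False are assumptions, and Audible's markup may have changed since the video was published.

```python
def contains_class_xpath(tag, class_token, relative=False):
    """Build an XPath matching <tag> elements whose class attribute contains class_token.

    With relative=True the expression starts with a dot, so the search is
    limited to the element that find_element is called on.
    """
    prefix = ".//" if relative else "//"
    return f'{prefix}{tag}[contains(@class, "{class_token}")]'


def scrape_audiobooks(driver_path, web="https://www.audible.com/search"):
    """Open the page with ChromeDriver and collect title, author, and length per item."""
    # Imported here so the XPath helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # Selenium 4 style: the driver path goes through a Service object.
    service = Service(executable_path=driver_path)
    driver = webdriver.Chrome(service=service)

    book_title, book_author, book_length = [], [], []
    try:
        driver.get(web)
        # find_elements (plural) returns a list with one entry per audiobook item.
        products = driver.find_elements(
            by="xpath", value=contains_class_xpath("li", "productListItem")
        )
        for product in products:
            # The leading dot keeps each search inside the current item, which
            # skips matching elements elsewhere on the page.
            book_title.append(product.find_element(
                by="xpath",
                value=contains_class_xpath("h3", "bc-heading", relative=True),
            ).text)
            book_author.append(product.find_element(
                by="xpath",
                value=contains_class_xpath("li", "authorLabel", relative=True),
            ).text)
            book_length.append(product.find_element(
                by="xpath",
                value=contains_class_xpath("li", "runtimeLabel", relative=True),
            ).text)
    finally:
        # Close the browser even if an element is missing and an exception is raised.
        driver.quit()

    return {"title": book_title, "author": book_author, "length": book_length}


def save_books(columns, filename="books.csv"):
    """Store the scraped columns in a CSV file with pandas."""
    import pandas as pd

    df_books = pd.DataFrame(columns)
    # index=False is an assumption; it leaves out the row-number column.
    df_books.to_csv(filename, index=False)
    return df_books
```

To run it, one would call save_books(scrape_audiobooks("path/to/chromedriver")), replacing the placeholder path with the location of the downloaded chromedriver file. Wrapping the loop in try/finally is a design choice added here so the browser window does not stay open if one item lacks an author or length element.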
Info
Channel: The PyCoach
Views: 6,082
Id: lM23Y1XFd2Q
Length: 21min 21sec (1281 seconds)
Published: Fri Feb 23 2024