Web Scraping in Python using Beautiful Soup | Writing a Python program to Scrape IMDB website

Video Statistics and Information

Captions
Hey guys, in this video I'm going to show you how we can perform web scraping in Python using the Beautiful Soup module. If you don't know what web scraping is, it's basically the process of extracting data from a website programmatically. In this video I'm going to write a Python program that accesses a particular piece of information on the IMDb website and then loads it into an Excel file. You may already be aware of the IMDb website, which contains ratings for movies. One piece of information available on this website is the top rated movies of all time, and that is what I want my Python program to access, extract, and load into an Excel file. I'm going to write this program from scratch, so you can follow along with me. As a prerequisite we just need to install two modules: one is the requests module and the other is Beautiful Soup. We need the requests module to access the website, and we need Beautiful Soup to parse the HTML of that website. Beautiful Soup makes it very easy to parse the HTML and provides many different methods to access the different tags in the HTML. You don't need to be an expert in HTML; you just need to know the basics, that is, in HTML the data is all available within tags, so you should know how a tag is represented and how you can find the attributes of a tag, which is pretty straightforward. I hope you will get that as we write this program, so even if you are not comfortable with HTML you will still be able to perform web scraping and follow along with this video. Before I start writing the program, if you have not subscribed to the channel and you are interested in SQL, Python, data science, and data analytics content, then please make sure to subscribe to the channel, and if you like 
this video and find it useful, then please make sure to share it with your friends and colleagues and also like the video. Thank you, and let's begin. The first thing we need to do is install the two modules, Beautiful Soup and requests. To do that, just go to your terminal and type pip3 install and the module name. Let's install the requests module first. I have already installed it, so it tells me that the module is already installed, but if you have not, it would tell you it was installed successfully. I'm using a Mac, so I'm using the command pip3 install, but if you're using Windows the command would be pip install and the module name. To install Beautiful Soup, the module name is actually bs4, so you just need to type pip3 install bs4. I have already installed bs4, so it tells me it's already installed, but if you have not, it would tell you it installed successfully. Let me close this, and now let's start writing the program. Let me open a new file, and I'm going to name it something like scrape imdb, with a .py extension. The first thing we need to do is import the modules, so I'm just going to say from bs4 import BeautifulSoup, and next, import requests. We have imported the modules. Next, let's go and look at the website we are trying to scrape. Let me go into my browser and just type imdb. You will see the first website here, and inside it you have something like Top Rated Movies, so you can just click on that, and this shows you the top rated movies of all time. Here it's showing 250 movies, so all 250 movies should be listed here. In the Python code that we are going to write now, we are going to access this website and then access these 250 movies. We are going to access each movie's rank, its name, its year 
of release, and the IMDb rating, and we are going to extract all of this information and load it into an Excel file. Now let me copy this URL and go back to my program. The first thing we need to do is use the requests module to access this website. We can just say source = requests.get and pass in the URL. This accesses the website and returns a response object, which is stored in my variable source. This response object has a few things, and one of them is the HTML source code of this particular web page. If you want to see the HTML source code of this web page, you can right-click anywhere in the page and click Inspect, and you will see some HTML text here; this is basically the source code that we want to extract using the requests module. I could proceed, but one thing I want to mention is that whenever I use the requests module, I always try to put it inside a try and except block. Here is the reason. If I execute this, you will see there is no problem, because this is a valid URL. But let's say I give an invalid URL, so I just type one two three four five six, something like this. This website does not exist: if I copy this, go to my browser, and enter it, it shows a page not found error. But requests.get on its own does not capture that error. To capture an error from your response object, source, you have a method called raise_for_status. raise_for_status is going to throw an error in case there is some issue with this URL, for example it's not able to access the 
website or something like that. If I save this and execute the program, it throws an error saying that this URL is not found. So this is a good practice: whenever you're using requests.get, always try to use raise_for_status so that you capture an error if the website is not reachable. Since it throws an error and would crash my whole program, I'm going to put all of this inside a try and except block. I'm just going to say try, and then except Exception as e, and print the exception, and I'm going to write all my code inside the try block. From this response object I can extract the HTML source code and then pass it to Beautiful Soup so that it can parse it. I'm just going to say soup = BeautifulSoup and pass the HTML text of that response object, so I can just say source.text. source is my response object, and to access the HTML text from that response object I need to use .text, which returns the HTML content of this particular web page. Once I have that, I need to parse this HTML, so I need to pass a parser. I am going to use html.parser, which is the default parser that comes along with your Python installation. There are other parsers as well, like lxml and a few others. You can use any other parser if you want; it doesn't really matter which parser you use, at least for this IMDb website. Beautiful Soup takes the HTML content of this web page, parses it using this particular parser, and returns a BeautifulSoup object, which is stored in my variable soup. So 
if I now print this, so I just say print(soup), to see how the parsed HTML looks, I can execute it. I'm getting an error because I need to give the proper website name. Now you can see that it's returning the parsed HTML from this BeautifulSoup object. We need to extract the movie details present in this web page. This web page has a lot of different content, and we don't need all of that, so let me close this. This is my whole web page, and it has a lot of details; I just need this section where the movie details are. To extract that, just go to the section that you want to extract, right-click on it, and click Inspect. This shows you the exact place in the HTML from which this particular content is being displayed. When I point my cursor at a piece of HTML, it tells me exactly which section of the web page is shown using that HTML. If I move my cursor slightly down and keep it here, it's now highlighting 9.2, meaning that this is the piece of code responsible for displaying 9.2. If I keep it somewhere else, say here, this is the poster column, so this poster is being displayed by this particular code, and so on. Now I want to extract this particular movie name, and then all the other movie names, ranks, years, and IMDb ratings, and I know that it's coming from this particular piece of code. This is a tag. Just to give you an overview, in HTML all the data is present within tags, and you can identify a tag by this symbol: it's the less-than symbol, then the tag name, and then the greater-than symbol. This is the beginning of a tag, and the end of the tag just has a forward slash 
prior to the tag name, so it has a less-than symbol, a forward slash, the tag name, and then the greater-than symbol. This is how a tag is represented in HTML. Some tags have additional attributes. Here, if you see this td tag, this is the beginning of the tag, but there are also some attributes mentioned here: class is one of the attributes of this td tag, and it has a value like poster column. And then this is the end of the tag. Just remember this. You don't need to know a lot of HTML, only these basics: how a tag is represented in HTML, and how you can tell whether a tag has attributes and what they are. If you know this, you should be good enough to perform web scraping. Now we want to extract the movie name. If I right-click and click Inspect, you can see that it's coming from here. This particular section belongs to the a tag, and a has several attributes: there is an href attribute and, I think, a title attribute. The text given between the beginning of the tag and the end of the tag is the text associated with that tag, so in this case The Shawshank Redemption, the movie name, is the text of this a tag. If I write code that accesses the a tag and then accesses its text, it should return this movie name. I know that this a tag is actually inside a td tag. If I click here, you can see that it's coming from this td tag, depending on how I move my cursor: if I move it up, it points to something else on the screen, and if I move it to this title column, the td tag that has the class title column, this is the tag from which the movie name, the rank, and the 
year are shown. If I move the cursor slightly down to this td tag, where the class is rating column imdb rating, this is the tag from which the IMDb rating is shown. So just by moving the cursor, or by right-clicking on the content that you want and clicking Inspect, you can identify the exact piece of HTML from which a piece of information is shown on the screen, and then you just need to write your script to point at that particular tag in order to extract the data. We want to extract the movie name, and I know that it's coming from this td tag, and this td tag is inside a tr tag. If I minimize that, you can see that this tr tag is responsible for the whole of this first movie. You can tell by what gets highlighted: when I move the cursor to the first tr, it highlights the first movie; when I move it to the second tr, it highlights the second movie; when I move it to the third tr, it highlights the third movie, and so on. So there are 250 tr tags here, and each tr tag holds the details of one particular movie. Now we know that to extract all 250 movies we just need to extract these tr tags, and from each tr tag we will be able to extract all the data that we want. These tr tags are inside a parent tag, tbody. If I keep my cursor on tbody and minimize this, you can see that it highlights this whole section. If I go to the bottom, keep the cursor here, and move it back to the tbody tag, you can see that all 250 movies, the whole section where movies are listed, are inside the parent tag tbody. It has the class lister-list, and inside tbody is where 
you have all these tr tags. So we need to first access the tbody tag, and once we have the tbody tag we will access the tr tags. Let's write our code to access the tbody tag. Let's go back to our script; let me close this. I'm going to remove this print, as I don't need it. I'm going to say movies = soup.find, and I'm going to find that particular tag. The tag is tbody, so I just mention the tag name here, tbody, and it has a class, so I'm going to say class equal to the exact name here, which is lister-list. find is a method available on your BeautifulSoup object, and it fetches the first match. In this case I'm trying to match a tag with the name tbody and a class with the value lister-list. Since class is a reserved keyword in Python, we cannot use class directly here, so we add an underscore and write class_. If I had to specify any other attribute, say I have a div tag with an id attribute, I could just say id equal to something in this find. But whenever you specify a class you need to use the underscore. So I have created a variable, and I'm assigning it whatever soup finds: a tbody with this particular class. Now if I do a print of movies and execute this, there are some errors. No, it's fine: the file name I was using to run this program was wrong, but I just re-executed it and it's working fine. What I have done is write the code to access this particular tbody tag. From here I need to access all the tr tags, and from 
each of the tr tags I need to access the td tag in order to get the movie name and other details. There is only one tbody tag in this HTML, hence I can use find. But I know that there are many tr tags within this tbody tag, so to find multiple tr tags I cannot just use find; instead I have a method called find_all. I could write this find_all on the next line, but I will just chain it here. This find resulted in the tbody HTML, everything that was inside tbody, and from that I'm trying to find all the tr tags. To do that I'm just saying .find_all and passing the tag name, tr. Then I need to check for attributes: it does not have any attributes, so I can just leave it as tr. This should find all the tr tags, and I know that there are 250 movies, so there should be 250 tr tags. find_all always returns a list, so it should return a list with 250 tr elements. To confirm that, I'm going to check the length, so len of movies, and when I execute this it returns 250. That means I have successfully pointed my code at each of these tr tags. Let me remove this print and open a loop, because I want to iterate through each tr tag, and from each tr tag I want to access the td tag that has the movie information I'm looking for. To iterate through each tr tag I can just say for movie in movies, because I know that movies is a list holding the 250 tr tags. From movies I'm going to fetch one movie at a time, because I know that each tr tag holds the details of one movie. So I'm going to iterate over each movie, and what I'm going to 
do is first print this movie, to see what it has. I'm going to add a break, because I want to check only the first movie. If I execute this, it returns tr, and this is the end of the tr tag, so it's returning the whole tr tag. It has returned the first tr tag and everything that was inside it. Now, if I right-click on this movie and click Inspect, what I want is this particular section, the movie name, which is inside the a tag, which is inside the td tag with the class title column. So from this tr I need to access the td tag that has the class title column. The first thing I'm trying to access is the movie name, so I'm going to say name = movie.find, with the tag as td, because I'm trying to find the td tag that has the class title column, so I'm going to say class_ equal to title column. This should fetch the td tag. Let me print this name to see what we currently have inside it. If I execute this, you can see that it's now returning the whole td tag, the whole contents of the td tag. Inside this td tag I want to find the movie name, which is inside the a tag as its text. I could use find again, but I can also say dot and the tag name, which in this case is a. If I now execute this, it should return only the contents inside the a tag, and that's what it's printing: only the details of the a tag. As I told you, you have a few attributes here, but the text between the beginning and the end of the tag is the text of that tag, and I want to access that text. To access the text of a tag I can just 
say .text, that is, tag.text. If I execute this, you can see that it's returning the movie name, which is exactly what I wanted. Just by writing this simple code I'm able to access the exact movie name from the tr tag by entering the td tag, then the a tag, and then fetching the text of the a tag. So we have achieved one goal, which is to extract the name of the movie. Now let's go to the next one: I want to extract the rank of the movie, which is 1 here. If I right-click on that 1 and click Inspect, it points me to this particular piece of HTML. It's telling me that it's in this td tag with the class title column, the same tag where we found the name, but here the 1 is the text of the td tag itself. So we need to access the text of the td tag. To do that I'm going to say rank equal to, and I can copy the same thing here, because I again want to access the td tag with the class title column, and from there I want to access the text of the td tag. I'll print this rank and see what happens. I'm getting the 1, but along with that I'm also getting a few other details: the movie name, the year, and everything. The reason is that when I use the text attribute on a tag, it returns the text of that tag as well as the text of any tag inside it. Here I have an a tag, so the text of a is also returned, and then I have another tag, span, and the text of the span is also returned. I'm not sure if you are able to see this, so let me zoom in. I'm sorry, maybe you were not able to see this properly, but hopefully now you can. This is my td tag. It has the text 1, but it contains another tag, a, which has some text, the movie name, and then there is another tag, span, which also has text, 1994, and 
that is why when I just use .text it returns the text not only of that tag but also of all the tags within it. That is why I'm not going to use text; I'm going to use another method, called find text. It finds the text, but here I can pass some arguments. One argument I'm going to pass is strip=True. strip removes all the newline characters, tab characters, and spaces, and I'm setting it to True, so it strips all of that. Now if I execute this, I get a NoneType object error. I think it's not find text; it's get_text. If I execute this now, I am able to get the text, and all the spaces and newline characters are removed, so this is much better. Still, this is not what we want; what we want is just the 1. I can see that after the 1 there is a dot, and if I go back to the web page, you can see that after each rank there is a dot: 1., 2., 3., 4., 
etc. So if I split this whole text on the dot, I should be able to extract just the first value, that is, whatever comes before the dot. To do that I can use the split function, passing the dot character, so it splits this text on the dot. That returns a list. If I execute this, you can see that it's returning a list: all the characters before the dot are at index 0, and the characters after the dot are at the next index. I'm only interested in the 1, which is at index 0, and since it's a list I can just pass the index 0. Now when I execute this it returns only the 1. That's all: using this simple line of code I'm able to access the rank. So we are now able to access the name as well as the rank. Let's try to access the next one, the year. If I go back to the web page, right-click on the year, and click Inspect, it points me to the exact place where this HTML is written. I know that this is inside a span tag, that 1994 is the text of this span tag, and that this span tag is inside my td tag with the class title column. This is the same tag where we found the movie name and the rank, so again I can use the same tag here and just copy-paste it. Next I need to access the span tag, so I can just say .span, and I know that I need the text of that tag, so I can just say .text. Let me print it here. If I execute this, you can see that I'm able to extract the year, but there are parentheses at the beginning and the end, and I want to remove them. To remove them I can just say .strip and pass those parentheses. Let me execute that now. This strip is 
basically another string method that you can use with any string, and it strips out any characters that you mention here from the ends of the string. I'm passing the parentheses, so it removes the parentheses and I get the year. So I have the year, the rank, and the name, and the last thing I want to extract is the IMDb rating, so I'm going to call this variable rating. Let's go back to our website. This is the rating that I want to extract, so I right-click on it and click Inspect. It tells me this particular rating is coming from this piece of HTML, and I can see that the tag is strong and the rating is the text of the strong tag. This tag is inside another tag, td, and this td is not the same td that we accessed previously; it is another td, which has a class like rating column imdb rating. So now we need to write code to access this td tag. I'm going to say movie.find. I know the tag name is still td, but it has a different class, so I'm going to provide that class name. I can go here, right-click on this, click on the attribute, and copy the attribute value, which copies this class name, since I don't want to make any typos. I have provided this, so it should access this particular td tag. Let's print this rating. If I execute this, it should print the td tag with this particular class, and yes, it's printing the td tag with the class that we provided. Inside that I need to access this strong tag, so I can just say .strong, and then I need to access its text. If I execute this, that works too: I'm able to extract the IMDb rating. So I have been able to access the four pieces of information that I wanted from this website for the first movie. Since I'm using 
a break here, this loop is iterating only once; it's extracting all of these details only from the first tr tag. Now let me print all four values: rank, name, year, rating. If I execute this, it prints everything here. Now let me remove this break and see if it works for all 250 movies. If I execute this, you can see that the details of all 250 movies are now being printed on the screen, which is exactly what we wanted. So with just this much code, I think around ten lines, we have been able to access the IMDb website and extract all these movie details. The last thing I want to do is load all of this information into an Excel file. There are a few modules that we can use to create an Excel file and load data into it. The module that I am going to use is openpyxl. I'm not going to explain in detail how to work with openpyxl; I'm just going to cover enough for you to create a new Excel file and load some data into it. The first thing we need to do is import this module, openpyxl, and the next thing is to create a new Excel file. I'm going to comment out this part for now, until we figure out the code for creating and loading the Excel file. First we'll create a new Excel file. To do that I can use any variable name; I'm going to call it excel, and say excel = openpyxl.Workbook(). That's all; this creates a new Excel file. We know that an Excel file can have multiple sheets, so let's see how many sheets this Excel file 
has. To do that I can just say print(excel.sheetnames) and execute this. It returns just Sheet, meaning that this Excel file has only one sheet; this is the default sheet and this is its default name. I want to change the sheet name, but before that let me assign the active sheet to a new variable, sheet. This Excel file currently has only one sheet, but I think on some other operating systems, and depending on the version of Python you are using, there may be one or three sheets, so we want to make sure that we are working on the active sheet. The active sheet is the sheet where we are going to load the data. To do that we can just say excel.active. Now I'm going to use this variable sheet to change the title, so I can just say title equal to some different name for the sheet, let's say Top Rated Movies. Now if I print the sheet names again, let's see if the sheet name has changed. If I execute this, you can see that initially the sheet name was Sheet, and now I have changed it to Top Rated Movies. So I have created the Excel file and changed the sheet name. Now let's load a heading row into this Excel file. I know that it's going to have four columns: name, rank, year, and rating. Let's create the column names for that. I can just say sheet.append and pass a list with all the values. I need four columns here, so I'm going to pass four values: Movie Rank, then Movie Name, then Year of Release, and then IMDB Rating. So I have created an Excel file, changed the sheet name, and added a row to that sheet with four columns; this is my column header. Now what I want to do is, every time my program below is 
going to extract a value from the website, I want it to also load that value into this Excel file. To do that, just after this print I'm going to say sheet.append again and pass a list with all four values, so I'm going to copy this and pass it here. Now these values, the details of each movie, are loaded into this Excel file during each iteration of the loop. Finally, we need to save this Excel file; only then will it be created. To save it I just need to say excel.save and pass an Excel file name, so I'm going to pass IMDB Movie Ratings.xlsx. That's all. Before I execute this program, let me show you what I have in this folder. You can see that I have nothing in this folder except the .py file, the program that I have created. Now when I execute this program, ignore this print that I had here, and you have the 250 movies printed here, so that is fine, and there is also an Excel file that has been created here. Let's look at that Excel file. Let me open it, and you can see that the Excel file has been created and has all the details. Let me zoom in on this, and you can see that the column names are exactly what I gave, and there should be 250 movies; yes, the 250 movies have been loaded here. So this is how simple it is to create an Excel file in Python using openpyxl and load some data into it. Hopefully this program gave you some idea of how to perform web scraping in Python using the Beautiful Soup library and how to scrape the IMDb website. You can apply similar logic to scrape other websites as well, but of course other websites will have different tag names and different attributes, so you will have to 
adapt it to that specific website. Hopefully it was helpful. If you like this video, please make sure to like and subscribe to the channel. Thank you, and see you soon in the next one. Bye.
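The steps walked through in the captions can be pulled together into a rough sketch. This is not the exact script from the video, only a reconstruction from the narration under some assumptions: the function names (fetch_html, parse_movies, save_to_excel) and the SAMPLE_HTML snippet are made up for illustration, the class values "titleColumn" and "ratingColumn imdbRating" are guessed from the spoken "title column" and "rating column imdb rating", and the second sample row is invented. The live page may look different from what the video shows, so the parsing is demonstrated on the sample markup and not on a real request.

```python
import requests
import openpyxl
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure described in the video:
# tbody.lister-list > tr > td cells. The second row is invented sample data.
SAMPLE_HTML = """
<table><tbody class="lister-list">
<tr>
  <td class="posterColumn"></td>
  <td class="titleColumn">
      1.
      <a href="#" title="sample">The Shawshank Redemption</a>
      <span>(1994)</span>
  </td>
  <td class="ratingColumn imdbRating"><strong>9.2</strong></td>
</tr>
<tr>
  <td class="posterColumn"></td>
  <td class="titleColumn">
      2.
      <a href="#" title="sample">Hypothetical Second Movie</a>
      <span>(1972)</span>
  </td>
  <td class="ratingColumn imdbRating"><strong>9.1</strong></td>
</tr>
</tbody></table>
"""


def fetch_html(url):
    """Download a page; raise_for_status turns HTTP errors into exceptions."""
    try:
        source = requests.get(url, timeout=10)
        source.raise_for_status()
        return source.text
    except Exception as e:
        # As in the video, print the problem instead of crashing the program.
        print(e)
        return None


def parse_movies(html):
    """Return a [rank, name, year, rating] list for every tr in the table body."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("tbody", class_="lister-list").find_all("tr")
    movies = []
    for row in rows:
        title_cell = row.find("td", class_="titleColumn")
        name = title_cell.a.text
        # get_text(strip=True) gives e.g. "1.The Shawshank Redemption(1994)";
        # everything before the first dot is the rank.
        rank = title_cell.get_text(strip=True).split(".")[0]
        year = title_cell.span.text.strip("()")
        rating = row.find("td", class_="ratingColumn imdbRating").strong.text
        movies.append([rank, name, year, rating])
    return movies


def save_to_excel(movies, filename):
    """Write a header row plus one row per movie, then save the workbook."""
    excel = openpyxl.Workbook()
    sheet = excel.active
    sheet.title = "Top Rated Movies"
    sheet.append(["Movie Rank", "Movie Name", "Year of Release", "IMDB Rating"])
    for movie in movies:
        sheet.append(movie)
    excel.save(filename)


if __name__ == "__main__":
    for movie in parse_movies(SAMPLE_HTML):
        print(movie)
```

To try it against a real page, the text returned by fetch_html(url) could be passed to parse_movies and the result to save_to_excel(movies, "IMDB Movie Ratings.xlsx"), with the tag and class names adjusted to whatever Inspect shows for that page.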
Info
Channel: techTFQ
Views: 2,282
Rating: 5 out of 5
Keywords: python, python tutorial, web scraping, python web scraping, python web scraping tutorial, scrape website in python, beautiful soup, beautifulsoup, beautifulsoup4, bs4, requests, python web scraping using beautiful soup, web scraping in python using beautiful soup, scrape imdb website, scrape imdb, openpyxl, python programming, web scraping tutorial, python for beginners, web crawling, python requests, web scraping using python, web scraping using python beautifulsoup, techtfq
Id: LCVSmkyB4v8
Length: 37min 30sec (2250 seconds)
Published: Mon Jul 05 2021