Automating With Python - Tutorial

In this course you will learn how to automate a bunch of different things with Python: web scraping, automating downloads, extracting tables from PDFs, automated image processing, building an automated news summarizer, and more. The instructor for this course is Abdul, also known as 1littlecoder; he has been creating courses for a while and is a great teacher.

Welcome to section one of these hands-on projects to automate stuff in Python. In this section we'll build a Hacker News headlines emailer. We'll begin with the basics of web scraping, then set up our environment by installing the required Python packages, then move on to the project architecture, then start scraping the Hacker News front page, and finally complete the email section so that the tool can send us the Hacker News headlines.

In this video we'll learn about the project architecture of the automated Hacker News headlines emailer. The architecture starts with getting the content of the Hacker News front page: we use the requests package to send a GET request and extract the content of the website. Once we have the content in place, we use the BeautifulSoup package to scrape the required components; by required components I mean things like the title, the link, the score, and the domain name. The next step is to build the email body from the scraped content: we'll arrange it so that each entry has a number, a title, and a link, so the email body reads like a news digest. Once the email body is ready, we move on to the authentication section, where we use the smtplib package to set up SMTP authentication with our Gmail credentials. Once authentication is set up and we have provided our email ID and the other necessary information, we finally send the email using the body we built. So ultimately we are extracting content from a web page, taking the required components, using them to build an email body, composing an email around that body, and sending it to the required users. In the next video we'll learn how to set up our Python environment so that we have all the required packages.

In this video we'll set up our Python environment so that we have all the required packages. These are the packages we will be using in this project: requests for HTTP requests, BeautifulSoup for web scraping, smtplib for email authentication and the email transaction, email.mime for creating the email body, and finally datetime for accessing and manipulating dates and times. Of these, smtplib, email.mime, and datetime come by default with your Python installation, which leaves two external libraries, requests and beautifulsoup4, that we have to install in our Python environment. So let us go ahead and see how to install those two required packages.
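To keep that architecture in mind, here it is condensed into a commented outline; extract_news is the name that actually appears later in the project, while the other steps are written inline in the video:

    # The flow we will implement:
    #   1. content = extract_news(url)  -> fetch the front page (requests)
    #                                      and scrape title/link/score (BeautifulSoup)
    #   2. build the HTML email body from that content (email.mime)
    #   3. authenticate with Gmail and send the email (smtplib)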
First, open your terminal and make sure you already have Python installed. Then run pip3 install requests. Once you press Enter, pip looks up requests, fetches the package from PyPI (the repository where all the packages are available), and installs it on your machine. To check that requests is installed, invoke the python3 console and run import requests; as you can see, it imports successfully, so let us exit and move on to the next package.

The next package to install is beautifulsoup4, with pip3 install beautifulsoup4. Again the package is downloaded from PyPI and installed. Let us invoke the python3 console and check: even though the package name is beautifulsoup4, when importing it in your Python session you have to use bs4. As you can see, import bs4 succeeds. You can also check that a particular object imports: from bs4 import BeautifulSoup, where both the B and the S are capitalized. It imports successfully, so let us exit the Python environment.

We have now installed beautifulsoup4 and requests, the two external libraries required for this project, and all the other packages, smtplib, email.mime, and datetime, come built into our Python setup. But let us make sure those are available as well. Clear the terminal, open python3 once more, and run import smtplib: successfully imported. Run import email.mime: successfully imported. Run import datetime: also successfully imported. This tells us that all the required packages are available in our Python environment, so we are good to go ahead with the project.

In this video we'll start coding the project script. For code editing I'm using PyCharm Community Edition, but you can use any IDE of your choice; as long as your Python installation is proper and Python is added to the system path, you can use any code editor to do what I'm about to show you. We'll start by importing all the required packages, the ones we installed in the previous section: import requests for our HTTP requests, then BeautifulSoup from bs4, then smtplib, then two objects from email.mime, and finally datetime. Once all the imports are finished, we extract the current date and time from the system. The reason we use datetime here is to create an email subject line that shows the date the email was sent, so that each day's email doesn't get folded into the same conversation and we can tell we are receiving a new email from the automated emailer every day. The next step is to create an empty string object, which will be used as the placeholder for the email content. Once this is ready, we can start creating a new function to extract the Hacker News components that we need.
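Collected together, the header section just described looks roughly like this sketch (the two email.mime objects shown are the ones used later in the video, MIMEMultipart and MIMEText):

    import requests
    from bs4 import BeautifulSoup
    import smtplib
    from email.mime.multipart import MIMEMultipart
    from email.mime.text import MIMEText
    import datetime

    now = datetime.datetime.now()  # current system date/time, used in the subject line
    content = ''                   # empty string placeholder for the email body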
First, let us create a function called extract_news, which takes one argument, the URL it needs. To keep the user updated, we print a message saying "Extracting Hacker News Stories...". Then we create another temporary placeholder, again an empty string; this temporary placeholder will be used to assign a value to content, the actual email body that we want. The first line we want in the email body says "HN Top Stories" (HN stands for Hacker News) as bold text, followed by line breaks and a row of asterisks, just to make it more readable.

Once that line is defined, we get the content of the URL, the one we pass to this function when we call it. We use the requests package's get function to fetch the URL and store the result in a response object. What get returns is an HTTP response body, which contains the thing we actually need: the content of the web page. We call the content attribute on the response object to store the actual page content in a variable named content. Remember, there is a global object called content and a local object called content inside this function, and the two are different: the scope of this local content lies only within the function, so do not confuse the two. We then take the content we extracted from the response body and run it through an HTML parser to make a soup out of it.

From that soup, what we are interested in are the components this project requires, and to understand what those components are we have to look at the website structure. In the next video we'll examine the structure of the Hacker News front page to see which components we need to extract with this BeautifulSoup step.
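Up to this point, the opening of the function looks roughly like this sketch (the exact header markup is an approximation of what the video types):

    def extract_news(url):
        print('Extracting Hacker News Stories...')
        cnt = ''                                      # temporary local placeholder
        cnt += '<b>HN Top Stories:</b><br>' + '*' * 4 + '<br>'
        response = requests.get(url)
        content = response.content                    # raw HTML; local, not the global content
        soup = BeautifulSoup(content, 'html.parser')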
In this video we'll see which components we need from the Hacker News front page. As you can see from the code, this is the URL we are going to use, the Hacker News front page, so let us go to our browser (in this case Mozilla Firefox) and open it. This is how the site looks. It is one of the most popular websites on the internet; it was started by Paul Graham, a very famous personality and internet entrepreneur who also runs an incubator called Y Combinator, and it is read by thousands and thousands of people every day. Our objective is to extract this content and automatically send it to ourselves by email, so that we only go to the website when there is something important or interesting to see; that is the objective of this entire project. On the site we can see a header, or navigation bar, and below it a list, very similar to how Reddit-style websites look. Within this layout there are components we are interested in: first, the actual title of each link, and then the points, which show how popular a particular link is.

To know what we should scrape from this website, we first open the web inspector: either press F12 on your keyboard or right-click in the browser and click Inspect Element. That opens a panel that gives you a sense of how the web page is designed. Let us increase its size slightly and choose the pick-an-element tool. With this tool we can find the components we need from the Hacker News front page, the ones our code will extract from the content we already fetched. As you hover around the page you can see the CSS selector values change: a .storylink here, a .hnuser there, a score class, and so on; you don't even have to click, just hovering shows how the markup changes. The first area of interest for us is the title, so hover over it and click. What you get is an anchor tag with the class storylink, sitting inside a table cell with the class title. So the first thing we are looking for is that anchor text inside the title cell.

Now let us go back to our code and look at what we have written. In HTML, td is a cell inside a table: a table is created with the table tag, a table row with tr, and the actual cells inside a row with td. In this step we are telling BeautifulSoup to find all the td elements in the soup we just created, using the function find_all, but not every td: only those where the class attribute is title. That is exactly what the soup.find_all call with the td tag and the HTML attributes does: the attribute class should have the value title, and the attribute valign, which you can also see in the markup, should be empty. So the first attribute, class, must hold the value title, and the second attribute, valign, must have nothing in it. To see it once more: click the inspector tool, hover over the web page, click on a title you want, and you can see that the td's class attribute holds the value title and the valign you see on other cells is not present here. This is to eliminate the junk and extract only the components we want.
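In code, that filtering step comes down to a single find_all call; a sketch, continuing the function from above:

        # every story title sits in a <td class="title"> cell; requiring an
        # empty valign weeds out the lookalike cells the video calls junk
        titles = soup.find_all('td', attrs={'class': 'title', 'valign': ''})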
Once we have those cells, we want to convert each of them to text, and to see that we have to understand one more thing. What we have so far extracts the td with class title, but what we need is all the links, all 30 of them on this page. For that we put the whole thing in a for loop, and we use the function enumerate for one simple purpose: in the final email we want the numbers 1, 2, 3, and so on up to 30. enumerate gives us both the index value and the value of the output, so we loop over the scraped output saying, in effect, give me every extracted tag along with its index. Inside the loop we build the actual email content. As we just discussed, we create the row number from the index i, and since Python is a zero-indexed language we say i + 1, which gives 1 for the first row, then 2, 3, and so on up to 30. Then, to have a nice-looking format, we add a separator between the index number and the actual title. Next we convert the extracted tag into text: we are saying, you have given me the tag, but I don't want the entire tag, only the text inside it, the value inside the tag, and tag.text does exactly that. Then we need a line break, so we append br, the HTML tag for a line break.

One more thing is required here. If you look at this page in the web inspector again, you have all the titles, but at the end of the page there is one more td with class title and no valign whose value is "More". To avoid this "More" getting captured in our final email body, we eliminate it by saying we want every row except when the tag's text equals More: give me everything as long as the tag is not equal to More, and concatenate it row by row. The for loop executes for every row, every row's value gets added to cnt, and at the end of the function we return cnt, the object that started as an empty placeholder string.

To recap this function: it extracts the front-page links, titles, and the components we want. We create a function called extract_news that takes the URL; we create a nice heading that says HN Top Stories; we fetch the content and make a soup out of it with BeautifulSoup; we use soup.find_all to find all the td tags with the class attribute title and an empty valign; we build rows from them with the enumerate loop; and since we noticed there is one final row whose value is More, which we do not want, we exclude it and finally return the entire cnt object as the result of the function.
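Continuing the sketch, the loop and the return look roughly like this (the separator between number and title is an approximation):

        for i, tag in enumerate(titles):
            if tag.text != 'More':                    # skip the trailing "More" link
                cnt += str(i + 1) + ' :: ' + tag.text + '<br>'
        return cnt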
In the next video we will see how to call the function, and then we'll move on to composing the email.

In the previous video we learned how to build the custom function that extracts the news from the Hacker News front page. In this video we'll see how to call that function, how to finish the email content, and then how to start with email authentication. The function we built in the previous section is called extract_news and takes one argument, a URL; to invoke it, we write extract_news and pass the URL of the Hacker News front page as a string. When this line runs, the function executes and whatever it returns in cnt gets assigned here. As we saw in the previous section, the cnt inside the function is a local object whose scope is the function, while content is a global object whose scope is the entire script. Once we have cnt, we append it to the content placeholder we created: content += cnt, which is equivalent to content = content + cnt. At the end of the email body we then add empty lines with dashes to denote that the email is finished, and finally two more lines that say "End of message". This is just to make the email more professional and easier to read, so it is clear where the email starts and where it actually ends.

With that finished, we move on to the email-composing step, and the first part of that is to define the parameters required for email authentication, as we saw in the project architecture section. There are five important parameters: first, the SMTP email server you are going to use; second, the port number; third, the from address, the email address you want to send the email from; fourth, the to address, where you want the email delivered; and finally the password of the from address. One thing to keep in mind is that the to address can actually be a list, so the email goes to multiple recipients; in this project we'll send the email to ourselves, to stay updated with the Hacker News headlines every day, but you can send it to several people provided you supply a list of email IDs, and by list I mean an actual Python list. We are going to use a Gmail account here, so the server is smtp.gmail.com, the SMTP server for Gmail, and for Gmail the port number is 587. Next is the from email ID, given as a string; for the sake of this project I'll use my own Gmail account as the sender, and the same address as the recipient. And finally there is the password used to log into that account; entering it completes the parameters we need.
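Put together, the call and the parameters sketched from this description look like this (the addresses and password are placeholders you would replace with your own):

    content += extract_news('https://news.ycombinator.com/')
    content += '<br>' + '-' * 30 + '<br>'
    content += '<br><br>End of message'

    SERVER = 'smtp.gmail.com'           # Gmail's SMTP server
    PORT = 587                          # Gmail's SMTP port
    FROM = 'your.address@gmail.com'     # placeholder sender address
    TO = FROM                           # can also be a Python list of recipients
    PASS = 'your-password'              # placeholder; remove before sharing the script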
Once we have these in place, the next step is to create the message body. The message body we want is a MIME multipart message, so we create an empty object using MIMEMultipart, and then add the subsequent components of an email. An email is supposed to have one important thing, a subject. There are several ways to create a subject; the naive one is a title that never changes, but the disadvantage is that in an email client like Gmail or Outlook, if every email has the same subject, each next day's email gets folded into the same conversation: instead of separate emails you have one thread, with the subsequent emails added as a conversation. To avoid that, and so that we can tell when we received a particular email, we create a dynamic subject. As you may remember from the previous videos, we created a Python object called now from the datetime package, which returns the current system date. So the subject says something like "Top news stories, an automated email", and then we append the date components: str(now.day) gives the day, str(now.month) the month, and str(now.year) the year, so the subject line carries the day, month, and year. We assign this to the MIMEMultipart object as Subject, then the from address as From, then the to address as To. Once that is done, we attach the email body we created as the message, with msg.attach. Notice that we are making this an HTML email: if you remember, we used HTML tags like b for bold to make the email look a bit better than a plain-text one, and that is why we use MIMEText with content and 'html', attaching that content to the email. With this, our email body is ready.

Now we move on to the authentication section, where we first print the message "Initiating server". We call the SMTP function from the smtplib package, saying: here is my server and here is my port, and assign the result to server. The function set_debuglevel controls whether we want to see debug messages: if the server has a problem connecting, or the authentication fails, do you want to see the error messages or not? Set 0 if you do not want them, or 1 if you do, which helps with debugging. After that we initiate the session with a hello (EHLO), then start a TLS connection, which is a secured connection, and once that is done we log in with the from ID and the password we provided.
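The composition and sending sequence being described comes together as this sketch (the exact subject wording is an approximation):

    msg = MIMEMultipart()
    msg['Subject'] = 'Top News Stories HN [Automated Email] ' + \
        str(now.day) + '-' + str(now.month) + '-' + str(now.year)
    msg['From'] = FROM
    msg['To'] = TO
    msg.attach(MIMEText(content, 'html'))   # HTML body, so <b> and <br> render

    print('Initiating server...')
    server = smtplib.SMTP(SERVER, PORT)
    server.set_debuglevel(1)                # 1 = show debug messages, 0 = silent
    server.ehlo()                           # greet the server
    server.starttls()                       # switch to a secured TLS connection
    server.login(FROM, PASS)
    server.sendmail(FROM, TO, msg.as_string())
    print('Email sent')
    server.quit()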
Once the login is successful, we finally send the email that we composed, from this ID to this ID, with the message we created passed as a string using the as_string function. Once the message is sent successfully we print a user message, "Email sent", and finally quit the server we initialized. To recap: we initialize the server with the server and port details we just defined; we set debug level 1 so we can understand any error messages; we initiate the transaction with the server, starting with an EHLO and then starting TLS; we log in with the from email ID and the password; we send the email from this ID to the ID, or set of IDs, we defined, passing the message through the as_string function; and finally we print a user message and quit the server. In the next video we'll see how the email actually looks and how we execute the script.

In the previous section we completed the code required for this project, but before we execute it there is one change you have to make if you are going to use a Gmail account. If you use a custom SMTP server, say your company email or your own email server, you probably won't need this; but with a Gmail account this step is mandatory for sending automated email, and without it your script will throw an authentication error. What you have to do is go to myaccount.google.com/security and open the Security tab. Scroll down and you will see something called "Less secure app access". Right now I have it on, but for you it should ideally be off, so turn it on. Read the message carefully: Google is telling you that you are enabling your email sign-in for less secure technology. It calls this project less secure because it doesn't use two-factor authentication, unlike your mobile phone or a mobile app; Google is letting you know that you are giving access to your Gmail login to a less secure app, and this is how the message reads for any email automation project you might do. That is completely fine, but note that if you have two-factor authentication enabled this approach is not going to work, and you will probably have to find another way, which you can look up in Google's forums. For a normal login, this is all you have to do: go to myaccount.google.com/security, click the setting, which is off by default, and turn it on. Once you turn it on you'll see the message and a yellow indicator. Let me refresh the page; it can take Google a little while to refresh the setting from off to on.
So you may have to wait before executing the script until that refresh happens. Now you can see the page has refreshed, and the setting shows with an exclamation mark, a slight warning sign that you have enabled your Gmail login for less secure apps. Once you're done with this step, you can go ahead and execute your code.

Let us go back to the PyCharm Community Edition we were using and open the terminal. Note that you can do this either in your system terminal or in the PyCharm terminal; for the first time we'll use the PyCharm terminal, to see whatever error messages we might get. If there are no error messages, then we can later automate the whole thing with the Windows task scheduler, or with a bash script that simply runs in your terminal or shell. To start, let us see what files we have; this is the file we are interested in. We run python3 followed by the file name to execute it. As you can see, these are the error and user messages we were printing, and because we enabled debug level 1, the whole transaction is shown: first extracting Hacker News stories, composing the email, initiating the server, the IP-address-related details, starting TLS, the SMTP session starting, the from email sending to the to email, the email body beginning with the automated subject, the email finishing, the "Email sent" message, and the connection closing.

Now we'll do the same thing from your own terminal. Open it (in my case the Mac Terminal), navigate to the folder where the code lives, check which files you have, copy the file name, and run python3 followed by the file name. It executes: first extracting the news, composing the email, and the same set of messages we saw before. So in this video we learned how to enable the Google setting that allows us to send automated email through Gmail, and we saw how to execute the script both in PyCharm and in the terminal, using python3 followed by the file name. In the next video we'll see what the email actually looks like.

In this video we'll see how the email we sent with the automated script actually looks. Let us go to Gmail. As you can see, Google has sent me a critical security alert, which just notifies me that I tried to enable my account login for less secure apps; that is completely fine for us to look at and ignore. Next we can see the emails we sent. You have received two, because the first was sent from the terminal inside PyCharm and the second from the shell, the actual terminal of your OS. Let us open the email and zoom out slightly to see how it actually looks.
As you can see, this is the subject of the email: the static part we wrote, plus the current system date. Then you can see the bold email title (not the subject, the title inside the body), and you can see that the email was sent from this address to this address. You can also notice one more important thing: this email landed in your inbox, not in your spam folder. If you do not set up all the email components, the MIME components we configured, your email might end up in spam instead of the inbox, so make sure you have all the email components described in the code in place. Now you can see the title, the formatting we did, the number generated from the enumerate index i, the separator, the story title, and finally the domain name with the link; there are 30 items, and at the very end it says "End of message". Let us open the second email too: the first item is this, the second is this, all 30 items are there, and it finishes with "End of message".

Let us go to the Hacker News site once and verify. "Bullshitters" is the first item on the site and the first in the email; "setting up an ad blocker" matches as well; one item is not there, but it has probably changed in the time since we sent the email; "how to hide from surveillance" matches too. This confirms that the email we sent reflects the actual Hacker News front page at that date and time. So in this video we saw how the email sent by our script looks, and in this section we learned how to build an automated Hacker News headlines emailer, which you can further extend with the Windows task scheduler, a bash script, or a cron job to send it every morning automatically. At the end of the previous video we learned how to execute the script, but once you automate it with a task scheduler, a bash script, or a cron job, you won't have to open or run the script every day; it will be executed automatically. There is one important thing to keep in mind before we close this section: this script contains your email password, so before you upload it to GitHub or share it with your friends, make sure you remove the password so that no one else learns it. In this entire project we learned how to scrape a website, how to extract the components we want, how to build an email, and how to send that email automatically from our Gmail account. Hope you enjoyed this section; see you in the next one.

In section two we'll learn how to build a TED Talk video downloader. We'll see how to install the requests package and understand how to use it for HTTP requests; with that we'll build a basic script that downloads the video of a given TED talk and stores it on our local machine; then we'll generalize the code to download any TED talk video given its URL, and ultimately package the script as a CLI tool. In this video we'll look at all the packages we'll use in this project. The first and foremost package is requests.
requests is the package that helps us fetch web content, and the name comes from the HTTP request, which is how communication between a server and a client happens in the HTTP protocol: the client sends a request to the server, and the server responds, sending the response back as the result. A typical HTTP request contains the actual request line, then header lines, for things like authentication, and then an optional message body; sometimes the information that needs to be passed along travels in that message body. The Python package we use for all of this is requests. Let us see how to install it on our computer: open your terminal and, as you might have seen before, because we are using Python 3 we use pip3 install requests, which installs the requests package from PyPI. Now that it has installed successfully, we can verify it by opening the python3 console and running import requests to see that it is in place. Thank you for listening; in the next video we'll look at another package, BeautifulSoup.

In this video we'll look at BeautifulSoup, another important package we'll be using in this project. BeautifulSoup is used to extract data out of HTML and XML; primarily, it is used for web scraping. The requests package we saw in the last video gives us the content of a web page, but as you might have guessed, that content is in HTML or XML, the formats in which websites are written. BeautifulSoup is the package that gives us formatted content, extracting whatever we want from the HTML or XML we fetched with the GET request. Let us see how to install the BeautifulSoup package in our Python environment; it goes by the package name beautifulsoup4.
So open your terminal and type pip3 install beautifulsoup4, everything in lowercase. Once you press Enter, beautifulsoup4 is collected from PyPI and installed on your local machine. To verify the installation, open the Python environment with python3 and run import bs4. The package we installed is named beautifulsoup4, but when we call it we use bs4, and the specific object we'll use from it is imported with from bs4 import BeautifulSoup. So: when you install it, you call it beautifulsoup4; when you import it, you call it bs4; and BeautifulSoup is the specific object we'll use inside our code. Let us get into the next video, where we'll build the basic code. Thank you.

In this video we'll build the first version of the TED Talk video downloader. As we saw in the previous videos, we have successfully installed both the requests package and the beautifulsoup4 package, so let us start with the code. First we import the required packages in the header section. The first package is requests, so import requests; the second, as we saw in the BeautifulSoup video, is the BeautifulSoup object imported from the bs4 package. After that, the next package we'll need is re, for regular-expression manipulation; a regular expression, as you may know, is just for pattern matching. Finally, we import sys, for argument parsing, which is what will let us generalize the code to handle any URL as part of a packaged tool.

With the required packages imported in the header section, let us move on. We'll add the exception handling you see here in the next video; meanwhile, we use a hard-coded URL of one TED talk, defined in the object url. The first step is to use the requests package to send a GET request for the content of the URL and store the result in the object r. Because downloading is going to be a long process, it is also good to print a message so the user of this tool knows the download is about to start. Once requests has fetched all the content from the TED talk URL, we use BeautifulSoup to make a soup out of it. Notice that the response of the GET request is stored in r, which is a Python response object, but when we hand it to BeautifulSoup we say r.content: the response contains a lot of things, such as the status code and the other results of the request, and in that whole response body the only thing we care about is r.content, the actual content of the website at the URL we fetched. We pass it to BeautifulSoup and assign the result to the soup object. The next step is to identify the exact location of the mp4 link; to understand that, let us look at the actual source of the talk page. When you open a talk and press Ctrl+U, you get the page source.
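Before digging into the source, here is what the script looks like so far (a sketch; the actual talk URL is hard-coded in the video and elided here):

    import requests
    from bs4 import BeautifulSoup
    import re
    import sys

    url = 'https://www.ted.com/talks/...'   # hard-coded for this first version
    print('Downloading is about to start...')
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')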
In this page source we have to find where the mp4 lives, so press Ctrl+F and search for mp4. This is the place where the mp4 appears, and this is the URL we want to extract. Before that, though, we have to see exactly where in the page this mp4 sits: scrolling up, we see that this entire content is inside a script whose text starts with talkPage.init, and that is exactly what our BeautifulSoup code looks for. The whole page is in soup, and we say: inside soup, find every script, and within it, using a regex, a regular expression, look for that particular word, and store the result in result. As you can see, the script contains a lot of text, and the only part we are interested in is a proper mp4 file, so we build a regular-expression pattern saying the URL should start with https and should contain mp4, and we assign the matches to result_mp4. At this point you may have several results, so we split everything on one separator and take the first output after the split as the proper mp4 URL; alongside the mp4 you actually see medium-, low-, and high-quality variants, and, not bothering about the quality of the video, we simply take the first URL after the split. Then we print a message that we are going to download the video from that URL. We also need a file name, and to get it dynamically we use the talk title that is also present in the URL. As the final step, we again use a GET request to fetch the content of the URL, which is now the mp4 file itself, and we write that content out with f.write, saving it in the output file, which should end with .mp4. Then we print a message that the download process is finished. So in this video we built the entire first, generic version of the code that downloads the mp4 video of a TED talk from ted.com using requests and bs4. Thank you for listening. In the next video we'll see how to generalize this code so it can be packaged as a CLI tool, where anyone can pass a URL instead of hard-coding it and download the video.
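Assembled from the steps in this video, the extraction and download portion looks roughly like this sketch (the regular expression and the file-name handling are approximations of what the video builds):

    # locate the <script> whose text mentions talkPage.init
    result = soup.find_all('script', string=re.compile('talkPage.init'))

    # pull the https...mp4 URLs out of that script text
    result_mp4 = re.findall(r'https[^"\']*\.mp4', str(result))

    # several quality variants come back; keep the first one
    video_url = result_mp4[0]
    print('Downloading the video from:', video_url)

    # name the output file after the talk title in the URL
    file_name = url.split('/')[-1] + '.mp4'

    video = requests.get(video_url)
    with open(file_name, 'wb') as f:
        f.write(video.content)              # save the mp4 locally
    print('Download Process Finished')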
In this video we'll see how to generalize the code we built in the last video into a better CLI tool. What do I mean by a CLI tool? It means you can run the code we developed as one line in your terminal and get its output, without having to open PyCharm and without having to edit the code. As you saw in the last video, we hard-coded the URL as part of the code, but that is not going to help us in the long term, because you don't want to open a text editor, paste in a URL, and re-execute the whole Python file every time. For that purpose we generalize the code, and, as we saw in the last video, this is exactly why we import the sys module.

First we need to check whether someone supplied a URL as part of the execution command, and for that we include this exception-handling block in the code; an exception is just an unexpected error, hence the name exception handling. In this section we check sys.argv, where argv refers to the arguments passed along with the code execution: if the length of sys.argv is more than one, we take the first argument passed with the execution; otherwise we call sys.exit with the message "Error: Please enter the TED Talk URL". To demonstrate, let us go to the terminal, take the code we saved in the previous video, run it with python3 and nothing else, and see what error we get: "Error: Please enter the TED Talk URL", which is exactly what this block prints when someone executes the file without passing any argument.

To check that the code executes properly, let us take a proper TED talk URL, now that this is becoming a CLI tool, and save the code. Copy the video URL we hard-coded in the previous video, go to the terminal, run python3 with the TED talk downloader file name, paste the actual talk URL, and press Enter. You see the messages we added in the previous video: the download is about to start, the download has started, the URL is extracted, the video is stored under this particular name, and then it says the download process has finished. At the start of this video we had only four files in this folder, but if you run ls now you can see one extra file, an mp4, which is what we downloaded; going into Finder, we had these four files, and now we also have this one, which is the actual TED talk. So: we first built a draft version of the code without argument parsing, with the URL hard-coded; in this video we generalized the code so that the URL is passed as an argument and used to download the video, and we also saw how to handle the exception when someone does not give a URL as part of the execution in the terminal. Thank you for listening.
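For reference, the argument check described in this video boils down to a sketch like this at the top of the script:

    import sys

    # take the URL passed on the command line, or exit with a message
    if len(sys.argv) > 1:
        url = sys.argv[1]
    else:
        sys.exit('Error: Please enter the TED Talk URL')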
In this section we'll learn how to build a table extractor for the PDF file format. PDF is one of the most prevalent file formats we deal with in our daily lives, and anyone who works in data science knows that extracting tables from PDFs is one of the most boring manual tasks one has to deal with. In this section we'll start with the basics of the PDF file format; then we'll learn how to install the required Python modules; then we'll do the actual coding to extract a table from a PDF; and finally we'll learn a bit about the pandas DataFrame and use it to write the table we extracted into a CSV file. Thank you; in the next video we'll start with the basics of the PDF file format.

In this video we'll learn the basics of the PDF file format. PDF stands for Portable Document Format, a file format developed by Adobe in the 1990s. It was developed to present documents, including text, graphics, and images, independent of software, hardware, and operating system: whether on an Apple Mac or on Microsoft Windows, a document should look the same on both operating systems and on both kinds of hardware, and hence PDF was developed. The first version of PDF, 1.0, was introduced in 1993. PDF is based on the PostScript language, and each PDF file encapsulates a complete description of a fixed-layout flat document; the way text and graphics are embedded in a PDF is based on layout, not on any structured format.

The general structure of a PDF file is composed of four main components: the header, the body, the cross-reference table, and the trailer. The header contains just one line, which identifies the version of the PDF; for example, %PDF-1.5 indicates that the PDF is version 1.5. The trailer contains pointers to the cross-reference table and to key objects contained in the trailer dictionary, and it ends with %%EOF to mark the end of file (EOF stands for end of file). The cross-reference table contains pointers to all the objects included in the PDF: it identifies how many objects are in the table, where each object begins, and its length in bytes. The body contains all the object information, objects such as fonts, images, words, bookmarks, form fields, and so on; these objects are mapped through the cross-reference table, and together this forms the structure of the PDF. So far we have learned the basics of the PDF file format and the general structure of a PDF file; in the next video we'll learn how to install the required Python packages for this project.

In this video we'll learn how to install the required Python packages for this project. We need three: the first is Jupyter; the second is Camelot; and the third is Seaborn, which we'll use for data visualization. To install Jupyter Notebook we have to open our terminal and use the installation command, but before that, a bit about Jupyter Notebook. Jupyter Notebook is an open-source web application that allows you to create and share documents containing live code, visualizations, and narrative text. It is one of the most preferred notebook IDEs in the data science community, and the reason is that it lets you write code and narrative text, in the form of Markdown, in the same file. Jupyter also lets you upload the notebook's rendered file, which is Markdown, to the web, so if you maintain a Markdown-based blog you can export the notebook's Markdown and publish it. Alternatively, if you want a plain Python file rather than a notebook, to share with your peers or for automation, Jupyter lets you download the notebook as a .py file. So let us go ahead and install Jupyter Notebook. Open the shell or terminal where you do your installations: if you are using a Mac, open Terminal, and if you are using Windows, open the Command Prompt.
Once your command prompt is open, remember from the previous videos that with Python 3 you type pip3 to install any Python package, followed by install and jupyter. Press Enter and this command installs Jupyter Notebook on our machine. It appears the installation finished, so let us validate it: type jupyter notebook and press Enter to launch it. As you can see, Jupyter Notebook was installed successfully. To shut the notebook down, go back to the terminal and press Ctrl+C; when it asks whether you want to shut down, press y, and you will see the shutdown confirmation.

The next package we want to install is Camelot, the Python package we'll use to extract tables from PDFs. Camelot is an open-source package available on PyPI, so just as with Jupyter Notebook, we can use pip to install it; Camelot is the package we have chosen in this project for table extraction. One thing to note: you might type pip3 install camelot, but instead of just camelot you need to install camelot-py. The reason is that there is already an unrelated Python package named camelot, so these package developers decided to publish theirs under the name camelot-py; even though the package is called Camelot, we have to install it as camelot-py. Press Enter and the package installs on the local machine. To verify that Camelot installed successfully, open the Python REPL and try import camelot: it imports without any error, which means the installation succeeded, so let us exit the Python console.

The next package we are interested in is Seaborn, which we'll use for data visualization: as part of this project, once we extract the table from the PDF we are going to visualize it, so that the data science workflow is complete. Open the terminal, clear the screen, and type pip3 install seaborn; press Enter and Seaborn installs. Seaborn requires matplotlib as a dependency, so if you already have matplotlib on your machine Seaborn won't install it again, and if you don't, no problem, matplotlib gets installed as well. Clear the terminal and verify the installation: open python3 and run import seaborn; it imports successfully, which means the Seaborn installation worked. So in this video we installed the three required Python packages: jupyter for Jupyter Notebook, camelot-py for extracting tables from PDFs, and seaborn for data visualization. In the next video we'll start coding the table extraction. Thank you for listening.
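A quick way to confirm everything from this video is in place, as a sketch (run inside a python3 console after the pip3 install jupyter, pip3 install camelot-py, and pip3 install seaborn commands above):

    import camelot    # installed as camelot-py, imported as camelot
    import seaborn    # pulls in matplotlib as a dependency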
In this video we'll learn how to extract a table from a PDF file. Before we start with the coding part, let us try to understand what other Python modules are available for the same purpose. The first is Tabula, one of the most widely used PDF extraction libraries; Tabula is actually based on a Java library of the same name, and the one we are talking about here is the Python binding for that Java library. The next is pdfplumber, then pdftables and pdf-table-extract. All of these libraries are available as alternatives to the one we picked for this project, but even so we selected Camelot, for the following reasons.

The first and main reason is that you are in control. Unlike other libraries and tools, which either give you a nice output or fail miserably, with no in-between, Camelot gives you the power to tweak the table extraction with its parameters, which means that if you get no output, you can adjust the parameters and get at least some output, so that not everything in PDF table extraction becomes manual. Everything in the real world is fuzzy, and PDF table extraction is fuzzy too, so you need control over the parameters that decide how the table is pulled from the PDF. The second reason is that bad tables can be discarded based on metrics like accuracy and whitespace: Camelot reports these metrics, so you don't have to look at every table manually to keep the good ones and throw away the bad ones. The next reason is that the table Camelot outputs is a pandas DataFrame, one of the most widely used Python modules for data analysis and data science, which means Camelot's output can be seamlessly integrated into any ETL or data-analysis workflow that already uses Python. The last reason is that Camelot lets you export the extracted table into multiple file formats, including JSON, Excel, and HTML. Say the table you extracted from the PDF needs to be published online, which ultimately means an HTML file: instead of sitting and hand-coding an HTML table, Camelot lets you export the table you just extracted straight to HTML. So Camelot keeps you in control, lets you discard bad tables, hands you a pandas DataFrame that slots into an existing data-analysis workflow, and exports to other file formats; this is why we picked Camelot over the other packages we just mentioned.

Let us move ahead and learn how Camelot will help us extract tables from a PDF. The PDF we want to extract data from is a fact sheet of economic and human development indicators for India, downloaded from the UN website. It is a fact sheet with multiple tables: you can see one table here and another here, multiple tables with multiple columns. For this project we are interested in extracting the values in rows 20, 21, and 22 of this table, which hold the literacy rate, so let us go ahead and see how to extract that particular table and then do a little data visualization with it. Because this is the first time in this course that we use Jupyter Notebook, let us first get a bit of an overview of Jupyter Notebook.
open your terminal, which is the windows command prompt or the mac terminal, and type jupyter notebook. once you type jupyter notebook it will internally start a server and the jupyter notebook interface will open. to create a new jupyter notebook, click new and then select python 3. once you enter, this is how the structure of a jupyter notebook looks. this is the title, which you can edit to say, okay, my first jupyter notebook. once we have renamed it, this is how the jupyter notebook looks. this box in the jupyter notebook is called a cell, and a cell can have primarily two types: it can be a code cell, where you write your python code, or a markdown cell, where you write your narrative text or documentation. so let us start with the documentation and say this is my first jupyter notebook, and let's make this a heading. once you are done, this is how it looks. now let us go ahead and write a small python code. as you all know, python can also be used as a calculator, which means you can do basic arithmetic operations, so let us do a little bit of arithmetic that says 3 * 3. once you are done with this code you can press shift enter, like this, and the output will be displayed, or if you do not want to use the keyboard shortcut you can say 4 - 3, which is 1, and click the run button here, which shows us the output 1. this way you know you can have documentation or narrative text and code in the same file, and this is the advantage of jupyter notebook and one of the reasons why we prefer it for this particular project. now let us go ahead and start executing the actual code we would like to write for extracting the table from the un report we just saw. to start with, we should name the jupyter notebook, which is a good practice; in this case we can name it extracting table from pdf, or whatever you would like to name it. in the first cell, let us start by importing the camelot package. in this case i'm importing the camelot package with an alias, cm, which will help us easily invoke that package. let us press shift enter. once the package imports successfully you get no error. let's say you had made a mistake and instead of camelot you typed cam: you would get an error that the module is not found, because there is no package called cam in this particular python environment. once we successfully import the package we get no error and the package is available. the next step is to see what files are available in the environment. we see the files we have in that environment: we have the pdf file in the current folder, we have the csv and xlsx files which i executed before the project, and a bunch of other files. now let us go ahead and read the file. there are two ways you can read a pdf: you can read it directly from the web, from where we downloaded the pdf, or you can read it from your local machine. the first argument you give is the pdf file name. the second argument is flavor.
there are two ways camelot can parse your pdf file: one is called stream, the other is called lattice. both have different ways of parsing a pdf file, and in this particular case we will prefer lattice. then we are explicitly telling camelot that we have two pages, page one and page two. we are going to use the function read_pdf from camelot to read the pdf file, so let us execute this with shift enter. we are storing the result in the python object input_pdf, and we can see this has executed without any error. now let us see what is inside input_pdf. this gives you a table list object with four values inside it, which means four tables have been extracted from this pdf by this function and put inside input_pdf as a table list. for us to know the individual dimensions of each extracted table, we iterate with for n in input_pdf to see what is inside it. let us execute this. it shows we have four extracted pdf tables: the first table has dimension four by three, which means four rows and three columns, the second is fifteen by three, the third is fourteen by four, and the fourth is thirteen by three. our area of interest is the last part of the first page, as you saw in the pdf display earlier, so we are going to pick the third one. since python is a zero-indexed language we say input_pdf[2], and then we ask for it as a data frame. once we write this, this is what we get: as you can see, this is our area of interest, the literacy rate, which sits at index values 11, 12 and 13.
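as a minimal sketch of those steps, assuming the downloaded fact sheet is saved locally under the hypothetical name india_factsheet.pdf and that the camelot-py package is installed:

    import camelot as cm

    # lattice flavor suits tables with ruled lines; pages are given as a string
    input_pdf = cm.read_pdf("india_factsheet.pdf", flavor="lattice", pages="1,2")

    print(input_pdf)        # a TableList reporting how many tables were found
    for n in input_pdf:     # inspect the dimensions of each extracted table
        print(n)

    df = input_pdf[2].df    # third table (zero-indexed) as a pandas DataFrame
    print(df)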
so what we are going to do now is say, okay, i want input_pdf[2], from that i want the data frame, and from that give me rows 11 to 13 and the three columns 1, 2 and 3. once we execute this we can see how the data frame looks, and this is how the extracted table looks. let us do a little bit of table formatting. before we do that, we have understood that the data frame camelot gives us is a pandas data frame, so let us have a little bit of understanding of pandas. pandas is the most widely used data manipulation package for python: pandas helps you read a csv, write a csv, read an excel file, write an excel file, and do a bit of reformatting, and in case you want to do some data analysis, pandas will help you with data preparation and data preprocessing. what we are going to do is use the pandas function reset_index to drop the index values 11, 12 and 13 and come up with our own index, which by default is 0, 1, 2. so let us say, for this data frame, reset_index with drop equal to true, and assign it back to the data frame. let us execute this and see how the output looks. as you can see, the index 11, 12, 13 has now become 0, 1, 2. there are three columns, but the column names are 1, 2, 3, which is not very intuitive if you want to write out a table, so we will manually set the column names. from the table you can see that this is 2001, this is 2011, and these are the kpis we are interested in, so we'll say the first column is kpi, the second is 2001, the third is 2011. let us execute this and see how the output looks: from 1, 2, 3 it has now become kpi, 2001 and 2011.
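a sketch of that cleanup, continuing from the data frame above; the exact row slice is an assumption based on what we just described:

    # slice out the literacy-rate block: rows 11-13 and columns 1-3
    df = input_pdf[2].df.iloc[11:14, 1:4]

    # replace the leftover index 11, 12, 13 with a fresh 0, 1, 2
    df = df.reset_index(drop=True)

    # give the columns meaningful names instead of 1, 2, 3
    df.columns = ["kpi", "2001", "2011"]
    print(df)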
then the next step: for us to do any kind of data analysis with this, we need to convert these values, which are actually strings, into a number format, and the number format we are going to use is float, because these are decimal values. so for 2001 and 2011 we convert everything from string to float. once we do this we reassign it to the same data frame, and even though the output looks the same, internally the values have gone from string to float. the next step is to write the output as a csv file, and we'll name it fact_output.csv. once we have written the csv we can use the ls command to see how the current working directory looks, and as you can see we have fact_output.csv, which is what we just wrote using the pandas function to_csv. once we are done with this, i would like to add that pandas not only lets you write a csv, it can also help you write an excel file. so let us use the function to_excel on this pandas data frame to write it as an excel file, say fact_output_excel with the extension .xlsx. once we execute this, let us see how the current directory looks: previously you had only fact_output.csv, but now you can also see fact_output_excel.xlsx, and this is how the excel file looks once written. now what we can do is read this csv back into the current python session so that we can do some data analysis. our objective in this project was to read the table and write it out as a csv, which we have already achieved, but as a bonus i would like to show you why we would need such a data frame in the first place: we want to do some data analysis and data visualization, which we cannot do directly on the pdf file, so we extract the table from the pdf as a data frame, convert it to a csv, and then do some data analysis with it. in this case we will build a bar graph. so let us call pandas, which we need to read the csv: we say pd.read_csv with the name of the file we wrote, fact_output.csv, assign it to a python object df2, and display how it looks. once we execute this we can see how it looks, with an index value, because we just read it. then we'll call the data visualization library seaborn. seaborn is one of the most widely used data visualization libraries, and it is actually built on top of matplotlib for better visualizations. so we'll go ahead and import seaborn with the alias sns, and once that is executed seaborn is imported. for us to build the data visualization we have to change the shape of the data frame, so we are going to use the pandas function melt, which converts a data frame from wide format to long format. let us execute this, and df_melt is now available. let us see how df_melt looks.
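a sketch of this whole stretch, the type conversion, the file writing, and the reshape and plot we are about to walk through, under the same hypothetical file names as above:

    import pandas as pd
    import seaborn as sns

    # convert the two year columns from strings to floats for analysis
    df[["2001", "2011"]] = df[["2001", "2011"]].astype(float)

    # write the cleaned table out as csv and as excel
    df.to_csv("fact_output.csv", index=False)
    df.to_excel("fact_output_excel.xlsx", index=False)

    # read the csv back into the session
    df2 = pd.read_csv("fact_output.csv")

    # reshape wide to long: 2001/2011 move from column names to row values
    df_melt = pd.melt(df2, id_vars=["kpi"], value_vars=["2001", "2011"],
                      var_name="year", value_name="percentage")

    # grouped bar plot comparing 2001 vs 2011 for each kpi
    sns.barplot(x="kpi", y="percentage", hue="year", data=df_melt)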
as you can see, that was the wide format and this is the long format, where 2001 and 2011 have moved from being column names to row values, and the value columns we have named are year and percentage. now that df_melt is available, let us go ahead and make a bar plot. we are going to use sns.barplot: the x axis should contain the kpi value, the y axis should contain the percentage, and the hue, which is the grouping variable, is the year, so we can compare how things have changed between 2001 and 2011. let us execute this. as you can see, it has generated a plot with two bars per kpi: the blue color represents 2001, the orange color represents 2011, and we have the three kpis we just built, overall literacy rate, male literacy rate and female literacy rate. as you can see, the gap between 2001 and 2011 for female literacy rate is huge, which means there has been tremendous growth in female literacy between 2001 and 2011. so this is what we have done in this project: we had a raw pdf file which was unreadable as structured information. we used camelot to read, or to be technical, parse the pdf; we extracted tables, specifically four tables; we went to the table of our interest, at index 11, 12 and 13; and then we did a little bit of data preprocessing using pandas. once we did the preprocessing, we wrote the data frame into a csv file, and we also experimented with writing it as an excel file. after that we did a little more preprocessing, reshaping the data from wide format to long format, and finally we explored the data with a visualization to pull a valuable insight out of the pdf. so in this video tutorial we learned how to build a table extractor for pdfs. we started with understanding pdf file formats, then installed the camelot and jupyter notebook python packages, then learned how to extract a pdf table, then saw the basics of a pandas data frame to write and read a csv, and then used seaborn to do some visualization. at the end of this project we have a successful visualization, we have the output table as csv and excel, and we have learned how to extract a table from any pdf. thank you for listening, i'll see you in the next section. in this section we'll learn how to build an automated bulk resume parser. going through resumes and extracting relevant information from them is one of the most essential tasks a manager has to go through before hiring new resources. in this section we'll learn how to build an automated bulk resume parser that can go through multiple resumes, extract relevant information from them, and convert it into a structured tabular format with a click of a button. we'll start this section by understanding different formats of resumes and marking the relevant information that we would like to extract, then a brief overview of the packages and their installation, then the basics of regular expressions in python and a basic overview of spacy functions, and then we'll move on to build the code to extract the relevant information and finally complete the script to make it a one-click command-line tool. let us go ahead and see the sections. in this video we'll learn the different formats of resumes and mark the essential information that we would like to extract in this project.
as you can see on my screen, i've got two different types of resumes. the first one is a single column, which has content one by one, and the second one is a double column, which means in one page there are two columns and the experience and other details are scattered across the columns. so a resume can be of multiple types; it is up to the creator of the resume, essentially the one seeking a job, to pick the format he or she likes, but it is essential for the recruiting manager to go through the resume completely to extract the essential parts of it. so what do we mean by essential parts? the first one, i would say, is the name: the name of the person the resume belongs to is the most essential part, because if you ever want to shortlist a resume you need to know whose it is. the second thing is that if you want to shortlist a resume, you do not just want the name, you also need to be able to contact the person, and the two key pieces of information for contacting a person are their email id and their phone number. as you can see, in this resume the name is mentioned in the top left, but in this resume the name is in the center; in this resume the email id is in the top right, but in this resume everything is centrally aligned. so we have listed three elements so far: the first is the name, the second is the email id, and the third is the phone number. these are the three significant pieces of information we would like to extract from a resume. but beyond this, we want a criterion by which to select a resume. for example, let us say you are recruiting for a position called data scientist: you need relevant resumes that have the essential skills of a data scientist, and that is the most important information we would like to see in this resume extraction project. for that purpose we are going to extract skills, specifically technical skills, from the resume. so the things we are going to extract are, most importantly, the technical skills, and then the name, the phone number and the email id, irrespective of how or where this information is present in a particular resume. before we move on, we have to understand one more thing: a resume itself is a file, and that file could have multiple formats. a resume could be a simple image like a jpeg or png, it could be a docx, which is microsoft word, or it could be a pdf. in this particular project we are going to deal only with resumes of pdf type, because once we have written a script for the pdf file format it is not very tough to convert every other format into pdf: you can convert a jpeg to pdf, and you can convert a docx to pdf. that is one of the reasons why we have picked pdf as the one condition this project is built upon. so pdf is the file format we are going to use, the resumes can be of different types, single or double column, and the information we are going to extract is skills, name, email id and phone number. in the next video we will see the architecture of this project, the required python modules, and how to install those modules.
in this video we'll see the architectural overview of this project. in this project we'll take three pdf files, resumes in three pdf files, and store them in a local folder. what we are going to do is take one pdf file from this folder, convert the pdf into text, and then do natural language processing and pattern matching to extract the relevant information that is required, the relevant information we saw in the previous video: name, email id, phone number and skills. then we'll use this relevant information to populate a structured tabular format, and finally write the output in csv format. meanwhile, while doing pdf to text, we will also save those pdfs as text files for future reference. we'll iterate this process until all the files in the current directory, which is the folder, are completed. so to repeat: we take pdf files, convert them from pdf to text, do natural language processing and pattern matching, populate the results in a structured tabular format, and write it to a csv. for this purpose, these are the packages we are going to use: pdfminer for pdf to text, spacy for natural language processing, re, the regex package, for pattern matching, and pandas for saving the output csv. meanwhile we will also use another package called os for operating system manipulation, which is highly required for us to iterate through multiple files in the current working directory and to save the output files in the required folder. so the packages we will be using are pdfminer, spacy, re, os and pandas. of these five packages, re and os come installed by default with your python environment, so we only need to install pdfminer, spacy and pandas. to start with, pdfminer is the package we are going to use to convert a pdf into text, and the package we are going to install is called pdfminer.six. the reason is that pdfminer is the original package, whose development stopped at python 2, so there was a requirement to support the latest version of python, and that is where this fork comes in: as you can see, this is a fork of pdfminer, and it is called pdfminer.six. so any time you are going to deal with pdfs, you want the pdf converted into text format, and you have a recent python version, mostly 3, you have to install pdfminer.six, not pdfminer. as we have always seen, to install a python package we'll use a terminal, a shell or command prompt, and pip. so let us open our terminal, or if you have a windows machine the command prompt, and type pip3 install pdfminer.six. as i already have this package on my machine, it shows that the requirement is already satisfied. now let us open our python terminal to see if pdfminer has been successfully installed, so let us try import pdfminer. you can notice this difference: when we install the package it is called pdfminer.six, but when we import it, it is just pdfminer.
the only reason they have got this .six is to differentiate between the older version of pdfminer and the newer version. so let us type import pdfminer and press enter, and you can see it has been imported without any error, which means pdfminer has been successfully installed. the next python package we would like to install is spacy. spacy is the library we are going to use for natural language processing; in fact, spacy is one of the most popular natural language processing libraries in python and it is widely used in the industry. there are a lot of features in spacy: the tokenization is very good, it has good named entity recognition, it has very good language support, with something like 49 languages supported, and it also comes with pre-trained models, which help us do a lot of natural language processing without training our own model. so let us go ahead and install spacy. to install spacy we will again use pip and do install spacy, and if you remember from our previous sections, we use pip3 because our python version is python 3; in case you have python 2 you have to use pip. so: pip3 install spacy. as you can see, the requirement is already satisfied because i've already got this package. let us clear our terminal and open python 3 to see if spacy has been successfully installed: python3, import spacy, and yes, spacy has been successfully imported. let us exit. but there is one more thing we have to do with respect to spacy, which is download the language model. natural language processing can work better only if a language model is available, which usually carries all the words, the part of speech information, the named entity recognition, all this stuff. for us to download the default english language model for spacy, we use the download command in our terminal, and because we have python 3 we do python3 -m spacy download en_core_web_sm. once this is done, our language model is successfully installed, and it also tells us that the way to load the language model is spacy.load with the model name. let us just check that once: open the python terminal, import spacy, and say nlp equals spacy.load with the model name. the language model loads successfully, which means we have spacy and the language model required for english natural language processing. the final library we are going to install is pandas. pandas is the go-to library for any tabular data manipulation, and one of the most widely used libraries in data science. it is much easier to install pandas, as the name suggests: let's clear the terminal and do pip3 install pandas. we can see that the requirement is already satisfied. let us open our python terminal and say import pandas; pandas is imported without any error, which means we have successfully installed the packages required for us, pdfminer, spacy and pandas. but we also saw there are two other libraries that come by default with the python installation we have got, so let us just verify whether those packages are available.
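a quick sketch of the whole installation check; the model name en_core_web_sm is an assumption, use whichever english model you downloaded:

    # shell (as described above):
    #   pip3 install pdfminer.six spacy pandas
    #   python3 -m spacy download en_core_web_sm

    import pdfminer   # installed as pdfminer.six, imported as pdfminer
    import spacy
    import pandas
    import re, os     # ship with the standard library, no install needed

    # load the english language model we just downloaded
    nlp = spacy.load("en_core_web_sm")
    print(nlp)        # confirms the pipeline loaded without error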
the first package we saw is re, which is for regular expression manipulation, so we'll do import re: it is successfully imported. the next package we saw is os, for operating system manipulation, to find files and walk through folders, so we'll do import os. as we can see, these two libraries import successfully, which means we are all set with all the libraries required for us to proceed with this project, and next we'll see how to code. in the next section we'll see the basics of regular expressions and an overview of natural language processing, and then we'll move on to coding. thank you. in this video we learn the basics of regular expressions. a regular expression is also called a regex or regexp; whatever you would like to call it, it is a sequence of characters that defines a search pattern. that search pattern is usually a combination of characters and metacharacters. the metacharacters are things like caret, dollar, dot, pipe and braces; these metacharacters define the syntax around characters to create the search pattern, which is what we call a regular expression. regular expressions are one of the tougher programming concepts and not very familiar to many people; there are a lot of memes about regular expressions on the internet, which you can refer to, to see how tough they can be. just to understand regular expressions we'll go through a little bit of the basics, but regular expressions are so old that any time you google for one you will find the answer you are searching for. the basics of regular expressions are primarily about understanding the metacharacters. the pipe operator defines a boolean or: if you want to match either of two spellings, g r a y or g r e y, you would use a pipe operator in your regular expression. then parentheses: let us assume you do not want to spell out the different words in full but just want to vary one particular character; then you would use g r, open parenthesis, a pipe e, close parenthesis, then y, which is to say i want gr, then either a or e, then y. this is what we call grouping. regular expression metacharacters also have something called quantification, which says how many occurrences of an element or token should be present in the pattern. for example, the question mark indicates zero or one occurrence of the preceding element, which means colou?r matches both color and colour. while the question mark is about zero or one occurrence, the asterisk, the star, is about zero or more occurrences, which means if you say ab*c it could match ac, abc, abbc, abbbc and so on; it doesn't restrict you to a single occurrence of the preceding character, you can have any number of them. this is the basic concept of regular expressions; you could refer to more material online to understand more, but for us to proceed with this project this is good enough to build a basic regular expression, or at least to understand a regular expression we might get from the internet. now, from the basics of regular expressions as such, we'll move forward to see how these regular expressions can be used in python for pattern matching.
as we saw in the previous video, the package we are going to use for regular expressions in python is re. re has these major functions: match, search, findall, split, sub and compile. of these, compile is what you would mostly use to create a regular expression, compile it, and then use that compiled expression for finding and searching everything else; match is what you use if you want to match at the first word; with search you don't care whether it is the first or the second word, you want to find it anywhere it is; and findall again has its own purpose. so let us open our terminal and see a little bit of regular expression. you can open your terminal, the python console in pycharm, or a jupyter notebook, but for simplicity i'm just invoking python from my terminal. then say import re, and let us create a sample text: okay, my text is going to say best python course. now i would like to see whether the word best occurs in my text. first i'm going to say re.match with best, which is my search pattern, in this case a regular expression, comma my text, and press enter. as you can see, it has replied with a match object with span 0 to 4, which means it found the word best, and the match is best. but let us say we want to test it with two different words, either best or good: we use the pipe operator to add good, and it says the match is best, because the text says we have got the best python course. let us create a second text, text2, and in this case we'll say good python course, and then use the same regular expression as before, best pipe good, which means either best or good, on text2. in this case it says it has matched good, over the same span. now, instead of that, we'll use good best python course as text2, and instead of match we'll use search to see what output it gives us. as you can see, it has matched good, because we asked for either best or good and text2 is good best python course. let us assume we now want to see both of these in the given text: we say re.findall, which returns both good and best. in this case we wanted to see whether both words are present, and because both are present, it returns both good and best. and in case the pattern we have given, the regular expression, is something that is not available in the text, we will not get any result, which is an empty list. these are the main functions we are going to use in this particular project. so this is a very brief introduction to regular expressions and to using regular expressions in python for pattern matching. in the next video we'll see the very basics of natural language processing using spacy.
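the same terminal session as a runnable sketch:

    import re

    text = "best python course"
    text2 = "good best python course"

    print(re.match("best", text))          # match object, span (0, 4), match 'best'
    print(re.match("best|good", text))     # matches 'best' at the start
    print(re.search("best|good", text2))   # finds 'good' anywhere in the string
    print(re.findall("best|good", text2))  # ['good', 'best']
    print(re.findall("fast", text2))       # [] -- no match gives an empty list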
in this video we'll learn the basics of natural language processing using spacy. so let us open jupyter and create a new notebook; you can do this by going to file, then new notebook, and selecting python 3. once you have this new notebook, let us see a little bit about spacy. as we saw in the previous video, spacy is one of the most popular natural language processing libraries, so we'll import it using import spacy. once that is done, the next step is to load our language model. if you remember, after installing the spacy library we downloaded the english language model, because a language model is essential for a lot of natural language processing tasks, so we are going to load the model we downloaded in the previous section. we use spacy.load and pass the name of the language model you have downloaded. for example, let us assume you have downloaded a language model for a different language, like german: you would have a different name instead of en, so in that case put the name of the model you actually downloaded. assign the result to nlp. once this executes successfully, you can define the text you have. the text i've got is a couple of sentences from the google wikipedia page, for us to explore this package, and i'm assigning it to text. if you print text you will see, okay, this is my text. now the entire thing is ready: we have imported the package, we have loaded the language model, and the input text on which we have to do natural language processing is ready. the first step in natural language processing is annotation: you let the language model annotate your input text, so that the text carries its part of speech tags, it knows where the named entities are, it is tokenized, word vectors are created, and a lot of other things. so the first step is to apply the language model, which you loaded and assigned to nlp, to the input text: the way you do it is nlp of text, and you store the result in doc. once you store the result in doc, the very first step any natural language processing task requires is tokenization. what is tokenization? tokenization is nothing but splitting the input text based on a token. in this case we are going to split the entire text word by word, so the word is the token for us here; you can also do sentence tokenization, or even paragraph tokenization, but in this case we are doing word tokenization. so we have our doc; let us print how it looks. once i print doc i'm not getting anything other than the text itself, because that is what it prints, but internally doc has been annotated by the natural language processing library and the language model we have got. so what we do is say for token in doc, print token: we iterate through doc and print each and every token, which by default is a word. what we see is the text, one by one, word by word, and the word tokenization has been done successfully. once the word tokenization is done we can do a lot of things: for example you can build a word cloud if you want, you can build unigrams, or if you want you can build bigrams too, combining the words.
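a minimal sketch of what we have done so far, assuming the en_core_web_sm model, and using a stand-in sentence since the exact wikipedia excerpt isn't shown here:

    import spacy

    # load the english model; the sample text stands in for the wikipedia excerpt
    nlp = spacy.load("en_core_web_sm")
    text = ("Google was founded in August 1998 by Larry Page and Sergey Brin "
            "while they were PhD students at Stanford University.")

    doc = nlp(text)      # annotation: tagging, entities, tokenization

    for token in doc:    # word tokenization: one token per line
        print(token)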
in case you want to visualize sentences, instead of doing word tokenization you can do sentence tokenization; the opportunities are many, so you can do whatever you want as you move ahead with the spacy library. what we can further do is this: as we saw, when you annotate your input text using the language model you have got, you also get part of speech tags. part of speech is to say, okay, this word is a noun, this word is a verb, this word is an adjective, all these things. so what we do now is say for token in doc, the same as before, but only if token.pos_, the part of speech, is noun, then print the token: print all the tokens which are nouns, that is what we are doing here. you see that you have got founder, money, angel, all these things. instead of noun, let's say we want verbs: once you execute this you get was, funded. and if you want adjectives, you put adj and you get all the adjectives you have got. this is how you extract a particular part of speech: for example, if you are doing text summarization, or any other text technique where you want to extract topics from it, you would go only for the nouns and adjectives. so to identify a particular part of speech, you iterate through the document and for every token you ask for its part of speech. the next important thing, in fact the most important thing for this particular project, is named entity recognition. what is named entity recognition? named entity recognition is nothing but identifying the entity of a particular word: for example, google is an organization, august 1998 is a date, when you have dollars it is money, when you see a person's name it is a person. so you are putting context around the word, instead of just saying whether it is grammatically a noun or a verb or an adjective; you are saying, okay, i have identified this entity. this is what we call named entity recognition, and it is much, much easier to do in spacy because of the language model we have got; spacy has made it so easy for us that it comes down to one simple attribute. what we do is say doc.ents, which gives the recognized entities in the doc. in the previous step we iterated over doc itself, because we wanted the actual words, but in this case we want to iterate through the entities, so we say doc.ents, iterate over it with entity, and print entity.text, the actual words, and entity.label_, the label of the entity that has been recognized, whether it is an organization or a date or whatever it is. once i print this, this is what i get: google is an org, august 1998 is a date, google is again an organization, jeff bezos, ceo of amazon, is a person, and stanford university is an organization. so this is the way we are going to take the text and identify the important entities present in it. this is the basic overview of natural language processing using spacy.
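continuing with the doc from the sketch above, the part-of-speech filter and the entity loop look roughly like this:

    # part-of-speech filter: print only the nouns in the annotated doc
    for token in doc:
        if token.pos_ == "NOUN":     # try "VERB" or "ADJ" for other tags
            print(token)

    # named entity recognition: the word plus the label spacy assigned to it
    for entity in doc.ents:
        print(entity.text, entity.label_)   # e.g. Google ORG, August 1998 DATE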
of course, natural language processing is an emerging area in research and development, and one of the most widely anticipated areas in machine learning and artificial intelligence, so learning natural language processing with just five cells in a jupyter notebook is completely impossible. i just wanted to give you a flavor of what we might be doing in this course, so that you have some idea of how to proceed further, and if you want to extend this particular project to include a different set of skills, or different variables we have not captured in this resume parsing project, you can use the understanding we have covered in this video to extend it to a wider objective. so in this video we successfully learned the basics of natural language processing using spacy: we saw a couple of spacy functions, how to load spacy and the english language model, and a little bit about tokenization and named entity recognition. in the next video we will actually start with the coding part of the resume parsing project, which is what we set out to do. in this video we will learn how to code the actual project of resume parsing. before we get into coding, a little bit of understanding of the folder structure is required, so let us see how our folders are organized for this particular project. we need to create two different folders that are essential for this project. the first one is resumes: resumes is the folder where we will have all the input files, all the resumes we have got for this project; in this project we have taken three resumes, and all three will be present inside this folder. then the next folder we need is the output folder, which will have two subfolders. the first one is txt, which will contain the text-converted versions of those resumes: once a resume pdf is read and converted into a txt, it is stored inside this txt folder. and once the entire parsing is done, all the content is converted into a structured tabular format, a csv, and stored inside the second subfolder, csv. so first we need the resumes folder, where we have all the pdfs we want parsed; second we have the output folder, and inside it two subfolders, txt and csv. once you are done creating these folders, the next important file we need for this particular project is pdf2txt.py. how do we get this file? it is a file present inside the pdfminer library, so we need to get it from there. let us open the pdfminer, to be precise pdfminer.six, github repository and go into the tools section. from the tools section you can see pdf2txt.py: click it, then click raw, and once you have the raw file you can press ctrl s, or command s if you are using a mac, or use your browser's save page as, to save the file to your local drive, the folder where you have got this project. once you have this file saved, we are ready to go further with the coding.
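if you prefer to create that layout from python instead of by hand, a tiny sketch, assuming you run it from the project folder:

    import os

    # input resumes go here; output/txt and output/csv hold the results
    for folder in ["resumes", "output/txt", "output/csv"]:
        os.makedirs(folder, exist_ok=True)   # no error if it already exists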
so there are three essential things: two main folders, output and resumes, with the txt and csv subfolders inside output, and the pdf2txt.py file from the pdfminer github repository stored in the current working directory, the folder where this project is set up. once this is done, let us go ahead and open our jupyter notebook. until now what we saw is how the folder structure is organized and the essential file we need, sourced straight from the pdfminer github repository. once you open jupyter, please create a new notebook by going to file, new notebook, python 3, and your new notebook will be ready. as a good coding practice, as in every project we have done, we will start by importing the libraries that are required, then create the functions that are required, and finally invoke those functions, and with that we have the entire project set up in the jupyter notebook. this is the flow we are going to follow while creating the notebook. first, we load all the packages required for this project. if you remember from the previous videos, we need five essential packages: spacy for natural language processing, pdfminer for pdf to text, re for regex, os for operating-system file manipulation, and finally pandas for the csv tabular format. let us execute this. next we import pdf2txt.py, the file we just downloaded from the github repository and kept in our current working directory, the project folder. so we simply say import pdf2txt and execute it; it executes successfully, which means we have the file in the right folder location. the next task, the first function we are going to create, is for converting pdf to text, and we'll call this function convert_pdf. this function takes one argument, the file name: the file name is passed into this function, and then we'll see what else the function does. the first thing is, once we get the file name that is passed in, we say, okay, split the file name and add .txt. imagine you have a file name, let us say abdul majed.pdf, which is how a typical resume might look. what we are trying to do is create the output file name, because when we convert the pdf into text we also want to save the text in the folder we just saw, so we want to build an output file name from the input file name. when we use os.path.splitext on the input file name, we get two items, the name and the extension; we take the first one, because python is a zero-indexed language, and then say plus .txt, which gives us the new output file name for the text file, and we assign this name to output_file_name.
so to recap: we got the input file name, a pdf file, we remove the extension .pdf, append the new extension .txt, and assign the resulting name to output_file_name. once this is done, we also have to define where we want to save this file. as we just saw in the folder structure, we have an output folder, and inside it all the txt files, so what we write here is: give me the path which is output/txt/ plus the output file name we just created. our output file is going to be saved in this file path, output/txt/ plus the file name with the .txt extension, and we assign it to output_file_path. the next thing: the pdf2txt we just imported has a main function, and it takes a couple of arguments. we want to save the file in that particular location, so we say, okay, the arguments i'm passing are the file name i just received, and then the output file name, passed with the argument --outfile, followed by the output file path where it has to be saved. this is what converts the pdf to text and saves it in the given location, which is what we created here: the output location is created here, the output file name is created here, the input file, the .pdf file, is given through this f, and the output file is saved. once that is done, we print a message to the user saying the file has been saved successfully, just for reference. and finally we return the output file path, read back in as text: we open the file name and call .read, which is the function to read any file in python. ideally we could have done this in two lines, reading the file first and then returning the file object, but to save space, and for simplicity, which is one of the core philosophies of python, we do it in the same line: we open the file path, read it, and return that. so in this function convert_pdf there are five things we are doing: first we create the output file name, second we create the output file path, third we convert pdf to text and save it in the given location, the output file path, fourth we print a user message to say this has been done successfully, and finally we return the opened and read file we just saved. with this, the function is done; let us execute it. this next cell is just a sample, we don't need it, so i'll delete it.
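a sketch of convert_pdf as described, assuming the pdf2txt.py you downloaded exposes an argparse-style main that accepts an args list with an --outfile flag (the exact signature depends on the pdfminer.six version):

    import os
    import pdf2txt   # the script we downloaded from the pdfminer.six repo

    def convert_pdf(f):
        # build "name.txt" from "name.pdf"
        output_filename = os.path.basename(os.path.splitext(f)[0]) + ".txt"
        output_filepath = os.path.join("output", "txt", output_filename)

        # hand the input pdf and the target path to pdf2txt's main
        pdf2txt.main(args=[f, "--outfile", output_filepath])

        print(output_filepath + " saved successfully!")
        return open(output_filepath).read()   # return the extracted text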
now we are going to use spacy to load the language model, which we do in this line, as we just saw. once we have the language model, we are going to create the output structure. we saw that the ultimate objective of this project is to capture four important components from each resume and make them a structured document: name, phone, email and skills, and these are going to be the four columns in our output tabular format. for that we are creating a python dictionary using curly braces; if you remember your basics of python, a python dictionary is created using curly braces and a python list using square brackets. so we create a result dictionary, and we also create four placeholder lists, names, phones, emails, skills, so that when we extract these pieces of information we can put each component into its respective list. for example, let us assume we have extracted the name from the first resume, the second resume and the third resume: those three names will go into the placeholder list names, the phone numbers will go into phones, the email ids will go into emails, and the skills will go into skills. with this we are ready with the placeholder, the shape of the output we would like to have. once this is done, we are getting into the core function, which is going to extract the content from the resume. with the placeholder output in place, we move forward to define the function that does the core four-component extraction for this particular project, and we will call it parse_content, because it parses the content we have; its argument is text. this function receives text: if you remember, what we returned from the convert_pdf function was the opened and read file, a text, and in this function we read that text, and that is how we interlink those two functions, as we will see shortly. the first thing required is to define the skill set we expect to extract in this project, the set of skills we would like to find in a resume. considering a data science setup, we say, okay, i want python, and if you remember your regex you might remember that a pipe operator signifies an or condition: python or java or sql or hadoop or tableau. these are the five things we are trying to extract as skills from a particular resume; we are trying to see whether a resume has any or all of these. the next thing is the phone number: we are creating a regex that should be able to capture phone numbers, and this phone number regex has been taken from a stack overflow answer, so i would like to give credit to that answer, which created a regex that can handle multiple different formats of phone number. so we compile the skills expression and save it in skill_set, and this is the regex to capture the phone number, which we compile and save in phone_number. then we take the text we received in this function and do the annotation: if you remember the basics-of-spacy video, the first thing we have to do is annotation.
once that is done, we extract two things from the annotated text, the text on which the natural language processing has been done, and that text is nothing but the resume content. from this parsed content we now do two things: the first is to extract the name, the second the email. in total we want four components, name, email, phone number and skill set; we just created the regular expressions for skill set and phone number, and now we extract the name and email id. to start with the name: named entity recognition has one particular label, called person, that signifies the name of a person. so what we do is say, if the entity label is person, give me that entity text and assign it to name. in python this is called a list comprehension, which means instead of writing a for loop over multiple steps you write it in a single step. so we do a list comprehension saying, whenever the recognized entity has the label person, give me that entity text. and what if a resume has multiple names? for example, the name of the person would be at the top of the resume, but maybe their dad's or mom's name appears somewhere too. with the assumption that the person's name is always at the top of the resume, we say, give me the first name you detect: we take the first name detected in the resume and assign it to name. next we also print the name, so we have some reference for whose resume we have parsed. next is the email id extraction. what we do is say: for every tokenized word in the document, if the word is like an email, and spacy has exactly such an attribute, like_email, which returns a boolean value, either true or false, then give me that word. and i say give me the first word: there could be multiple email ids, but i don't care about the second and third email id, at least for this particular project, so we take the first email id, store it in email, and print it, again for our reference. now that we have built the regular expressions for skill set and phone number, it is time to use the compiled expressions and extract. one thing to note is that we convert the entire text into lower case before even proceeding with the regular expressions. this is to solve the case issue: for example, someone could have written python with a capital p, someone could have written sql in capitals or in small letters, java could be with a capital j, so to solve all these issues we normalize all the text down to lower case. then we say regex findall with the phone number pattern, where the string to search is the text, the argument we received, after converting to lower case, and we assign the output to phone after converting it to a string object. so simply: we have the text, we convert it to lower case,
we find all matches of the phone regular expression everywhere in this resume, convert the result into a string, and assign it to the phone object. then we do the same thing for the skills: findall with skill_set on the lowered text, assigning the result to skills_list, because we will have multiple skills; one resume could have both python and java, and that is why we call it skills_list. one more thing to notice: in one particular resume we could have multiple instances of python or java. for example, let us assume someone has mentioned python in the technical skills section, but also used the word python while describing a project. what we would end up with in skills_list is python two times, and we don't want to record python twice, because we only want to see whether python is present or not. so we say, okay, convert it to a dict, which of course keeps only unique elements, then convert that into a string and assign it to unique_skills_list. so in this entire section we have defined the regular expressions for skill set and phone number, used spacy for annotation, used the annotated text to extract a person and assign it to name, checked whether anything looks like an email and assigned it to email, and used the compiled phone number and skill set regular expressions to find all matches, giving us the skills and the unique skills. right after this, we append these values to the placeholder lists we created earlier: those are the four empty placeholder lists, and we append everything we just extracted into them, and finally we print a message to the user to say the extraction has been completed successfully. for a small summary, this is the core function where we extract the four components: it takes one argument, the text of the resume; initially we compile the regular expressions for skill set and phone number; next we annotate the text document using spacy's natural language processing; then we extract the name, then the email, then the phone number, then the skills, making unique skills out of them; we append all those values to the placeholder lists we created in the previous cell; and finally we print a user message to say the extraction is complete. let us execute this; it has been executed successfully.
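putting the pieces together, a sketch of parse_content; the phone pattern here is a simplified stand-in for the stack overflow regex used in the video, a set stands in for the dictionary de-duplication trick, and the model name is the same assumption as before:

    import re
    import spacy

    nlp = spacy.load("en_core_web_sm")
    names, phones, emails, skills = [], [], [], []   # placeholder lists

    def parse_content(text):
        # skills to look for; pipe means "or" in the regex
        skillset = re.compile(r"python|java|sql|hadoop|tableau")
        # simplified stand-in for the stack overflow phone regex
        phone_num = re.compile(r"(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4})")

        doc = nlp(text)                  # spacy annotation

        # first PERSON entity, assuming the candidate's name tops the resume
        name = [entity.text for entity in doc.ents
                if entity.label_ == "PERSON"][0]
        print(name)

        # first token that looks like an email address
        email = [word.text for word in doc if word.like_email][0]
        print(email)

        text = text.lower()              # normalize case before matching
        phone = str(phone_num.findall(text))
        skills_list = re.findall(skillset, text)
        unique_skills_list = str(set(skills_list))   # drop repeats

        names.append(name)
        emails.append(email)
        phones.append(phone)
        skills.append(unique_skills_list)
        print("extraction completed successfully!")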
At this point we have two main functions: one converts a PDF to text (and saves the text file into the output folder along the way), and the other takes that text content and extracts the four components we want: name, phone number, email id, and skills. What remains is to make the project work on multiple files rather than a single PDF, and that is the entire objective of an automated bulk resume parser. You would not write code just to parse one resume, since a human being could do that better; but assume you are a manager who has received 50, 100, or 200 resumes. That is the case where we want an automated bulk resume parser, and it is exactly what we do here. We list all the files inside the resumes folder; as we just saw, there are three resume files in it. We iterate through the listing, calling each entry file, and first validate that the file name ends with .pdf. In the same folder you could, say, have forgotten to convert a .docx file into a PDF, so this is an extra layer of validation: the code is built to work only with PDFs, and if you have a .docx you would have to convert it to PDF manually (there are plenty of scripts online for that). If the file name does end with .pdf, we first print a message that we are reading the file, then read it by invoking the convert_pdf function we created earlier, passing the path to the file; in the first iteration that is the name of the first file, in the second iteration the second, and in the third iteration the third. As we saw, convert_pdf opens the file at that path, reads it, and returns the text, which we assign to an object called txt. We then pass txt to the function we just built, parse_content, which takes one text object as its argument.
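A sketch of that driver loop, assuming the convert_pdf and parse_content helpers defined earlier and a resumes/ folder next to the notebook:

import os

for file in os.listdir("resumes"):
    if file.endswith(".pdf"):               # skip anything that is not a PDF
        print("Reading file:", file)
        txt = convert_pdf(os.path.join("resumes", file))  # PDF -> plain text
        parse_content(txt)                  # extract name, phone, email, skills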
Once we run this, the first resume is read: Alison Parker Wright's. The name and email id are printed, the text is saved successfully, and extraction completes. (A disclaimer: these are all hypothetical resumes with hypothetical names that do not correspond to any living human being.) So the first PDF is read, its text is saved, we see the name and the email id, and finally the message that extraction was successful. The same happens for the second resume, John Dominic: read, saved as text, name, email id, extraction completed. And finally Ashley Miles: read, saved as text, name and email id extracted (we did not print the phone number and skills), and extraction completed successfully. Now that all three resumes have been read, the placeholder lists are populated: names holds all the names, phones all the phone numbers, skills all the skill strings, and emails all the email ids. The next step is to assign these lists into a dictionary, which is a key-value structure: against each key we put the corresponding list as its value. So the value of the name key should be names, the value of the phone key should be phones, the value of the email key should be emails, and finally skills against the skills key. To see how it looks, we execute the cell and display result_dict, the result dictionary: it opens and closes with curly braces, with each key followed by all of its values, which is how a typical Python dictionary looks. Finally we have to convert this into a tabular format, which we will do with a pandas DataFrame. So in this video we saw how to import the libraries, define two important functions, one converting PDF to text and the other the core function, the engine of the entire project, which parses the content and extracts the required components, and finally store those components in a Python dictionary, which in the next video we'll convert into a tabular format using pandas and save as a CSV file. In this video we'll see how to convert the dictionary we created into a tabular format using pandas, and how to save this whole thing as a script that runs in bulk over a folder of files. To start, this is where we left off in the previous video: result_dict, a Python dictionary holding all the essential content extracted from the three resumes in the resumes folder. To move ahead we use pandas, which we imported at the start as import pandas as pd, where pd is an alias. A pandas DataFrame maps very naturally onto a Python dictionary of columns, so it is easy to convert result_dict into a DataFrame just by invoking pd.DataFrame (note the capital D and capital F) and passing result_dict, assigning the result to result_df; df stands for DataFrame, though you can use any name you like. Printing the DataFrame shows the name, phone number, email, and skills of each candidate: Alison has Python, Tableau, and Java; John Dominic has Hadoop, Python, and Java; Ashley Miles has SQL and Tableau. The next step is to save this into a CSV file. As we saw in the previous video, the output folder has two subfolders: txt, which stores all the text files converted from PDF, and csv, where the output CSV goes. So we save the output CSV there, and we can open the folder and confirm that it is present. So far we have taken the dictionary, converted it into a tabular pandas DataFrame, and saved that DataFrame to a CSV file.
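In code, that step looks roughly like this (the column names are illustrative; the video's exact keys may differ):

import pandas as pd

result_dict = {"name": names, "phone": phones, "email": emails, "skills": skills}
result_df = pd.DataFrame(result_dict)     # a DataFrame from a dict of equal-length lists
print(result_df)
result_df.to_csv("output/csv/parsed_resumes.csv", index=False)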
This Jupyter notebook is good for prototyping, but the real objective is different: we have a folder full of resumes, and we want to hand this tool to someone who cannot write proper Python, so that they can still convert all those PDFs into structured content. For that purpose we convert the notebook into a Python script and use that script to process the PDFs. Go to File, then Download as, and choose Python (.py); this gives us the Python script. Before we do that, though, we should remove the cells that are not required, for example the places where we printed intermediate results, which are quite unnecessary in a script. You can press x to delete a cell, or use the menu and choose delete cell; so delete, delete, delete, and also delete the cell that displays the dictionary. We can save and checkpoint just for reference, then go to Download as, Python (.py). Save the file and make sure it ends up inside the current project folder with all the other files. You should now have the folders we defined: resumes, with all the input PDF resumes we want parsed; the output folder with its two subfolders, csv and txt; the pdf2txt.py file; the notebook we used to create this entire project; and finally the Python script downloaded from that notebook, resume_parsing.py. With this we are good to go: we'll use this script at the command level, as a CLI tool, to automate the whole process of converting PDFs into a structured format of valuable content. Before that, let us delete the files that were created when we ran this code inside the Jupyter notebook, both the converted text files and the output CSV we produced, so we can make sure the project succeeds from a clean state. Now let us validate: we have three resumes, Alison Parker, Ashley Miles, and John Dominic, and you can notice they are in three different formats, one double-column, one a single-page resume, so we have a variety of layouts. The output folder is empty apart from its two empty subfolders, and we are all set to test our script on the resumes folder, to see whether the code we created in the Jupyter notebook, now downloaded as a Python script, can be used for automation. Open the terminal, or if you are using Windows open your command prompt, and make sure you are inside the project folder. My project folder is called resume parsing; I can use ls to check (on Windows, use the corresponding command to see your current folder), and it shows that I have the resume_parsing.py file in place, the output and resumes folders, and the pdf2txt.py file. Now we run it: not just resume_parsing.py, but python3 resume_parsing.py, the same file name prefixed with python3. Once you press enter you see all the steps we saw before: the first resume, Alison Parker's, is read, the txt output is saved, name and email are extracted, and extraction completes; the second resume, John Dominic, is read, saved as txt, name and email id extracted, completed; and the third resume, Ashley Miles, is read, saved as txt, name and email id extracted, completed successfully. Running ls at this level shows that nothing has changed here.
Let us enter the resumes folder and run ls: there are the three files we used as input. We come out with cd .. (note the space before the two dots), run ls again just to validate where we are, then get into output. Inside it, cd txt and ls show three text files; before we executed the script these txt files did not exist, so the conversion of PDF to txt has been successful. Now let us also validate that we got the structured information in the form of a CSV: cd .. to go back up into output, get into csv, and ls shows parsed_resumes.csv. The script has done exactly what we did with the notebook, which means the automation is successful: it picks up every resume in the resumes folder, converts each one into txt and saves it inside the txt folder, and finally creates one CSV file, the parsed resumes file. So let us go to output, open csv, right-click the CSV, and open it with Microsoft Excel. In Excel you can see the columns: name, phone number, email id, and skills. How can you use the output of this project? Assume you are an HR manager or a recruiting manager with, say, a hundred names like this, and right now you have a requirement for a SQL developer. Format the header a little, apply a filter on it, and filter for SQL: you see Ashley Miles has SQL in her skill set, and additionally Tableau, so maybe you pick up Ashley Miles's email id and mail her to ask whether she would be interested in joining your company for an interview. Or assume another requirement where you want someone with Tableau: apply the filter for Tableau, and you see that Alison Parker Wright and Ashley Miles both have it in their skill sets, so you prioritize those two resumes and call them for an interview. That is the main objective of this project: if you have a lot of resumes, even 50, it is nearly impossible for one human being to go through them all, but with the automated bulk resume parser we have built you can extract the essential skill set you care about (and you can change the skills in the regex pattern we built to whatever you want), then filter in Excel and say these are the resumes I will focus on, instead of going through all 50 at random.
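If you would rather do that filtering programmatically instead of in Excel, a small pandas sketch (column names assumed to match the CSV above) could look like this:

import pandas as pd

df = pd.read_csv("output/csv/parsed_resumes.csv")
# keep rows whose skills string mentions sql, ignoring case and missing values
sql_candidates = df[df["skills"].str.contains("sql", case=False, na=False)]
print(sql_candidates[["name", "email"]])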
So in this section we learned how to build an automated bulk resume parser using natural language processing and regular expressions; along the way we learned the basics of regular expressions and how to implement them in Python, and we got an introduction to spaCy and natural language processing with spaCy. Once we had successfully built the script that converts a resume, a PDF in unstructured form, into a structured tabular format, we saved it as a CSV, and we saw how that CSV can be used to prioritize resumes and select the right one for a job requirement. Thank you for watching this video, and we'll see you in the next section. In this section we'll learn how to build an image type converter. Converting images from one type to another, PNG to JPG, JPG to PNG, or BMP to PNG, is one of the tools every one of us wants to have handy. To build such a tool we'll start by learning basic image manipulation in Python, then look at the Python packages used for image manipulation, and finally build a tool that does image type conversion. This section contains the following topics: the different types of image files, what an image type converter is, an introduction to image manipulation in Python and the packages used for it, and finally a script project that performs image type conversion. In the next video we'll learn about the different image file formats and their details. In this video we'll learn about those different image file formats. Image file formats are standardized means of organizing and storing digital images, and a format is usually identified by the file-name extension. An image file format is required to store data in uncompressed, compressed, or vector form; once rasterized, an image becomes a grid of pixels, each of which has a number of bits designating its color equal to the color depth of the device displaying it. In general, an image file format defines how the image data is stored in that particular file. Image compression can be of two types: lossless and lossy. With lossless compression the file format changes but the compression loses nothing, meaning there is no information loss. With lossy compression the algorithm preserves a representation of the original uncompressed image that may appear to be a perfect copy, but is not; in exchange, lossy compression can often achieve smaller file sizes than lossless, and it is highly preferred when you need to transfer an image from one place to another and must compress it. What are the common image file formats? The format most widely preferred on the internet for image transfer is JPEG, which stands for Joint Photographic Experts Group; it uses lossy compression, meaning JPEG is a compression algorithm that stores image data in a compressed format. After JPEG, another highly preferred format is PNG, Portable Network Graphics, which was originally created as an alternative to GIF (or "jif", however you want to say it), the Graphics Interchange Format. In the next video we'll see the Python packages we'll use in this project. In this video we'll learn about those packages. For image manipulation we use the package called PIL, where the letters stand for Python Imaging Library. PIL is one of the most popular Python packages for image manipulation, and it is free;
however, PIL has had no recent support for Python 3, which means it works only with Python versions up to 2.7. For this reason a friendly fork of the PIL repository was created, called Pillow. Pillow supports the latest Python 3 versions; it was created by Alex Clark and contributors, and it is the library we use for image manipulation in this project. Pillow follows the same syntax as PIL, and you have to make sure that on a computer where you have installed Pillow you do not also have PIL. Pillow can be installed using pip, which we'll see shortly. The next package we use in this project is glob, which is simply Unix-style path matching: we use glob to identify the image files in our current folder, and then Pillow to convert them from one format to another. In the next video we'll learn how to install the required packages and load them into our project, and in this video that is what we do. As discussed, the package we need is called Pillow. Open your terminal, or if you are on Windows open your command prompt, and note that if your computer has Python 3 you should use pip3, while for Python versions below 3.0 you use pip. On my computer I have Python 3.7, so I run pip3 install Pillow (the course writes Pillow with a capital P) and press enter; the installation starts, and you can see that Pillow has been installed successfully. To verify, open a Python console with python3 and try importing. Importing pillow under that name is not found, and the reason is what we saw earlier: Pillow follows the same syntax as PIL because it is a simple fork, so to use Pillow you just import PIL. Make sure only one version of Pillow is installed on your machine so there is no clash among packages. That is the only package we needed to install; the other package we saw, glob, is already available, which we can verify with import glob. It imports fine, which means glob ships by default with Python 3.
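A quick check you can run in a python3 console, a minimal sketch of the verification just described:

import PIL      # installed as "Pillow", but imported as "PIL"
print(PIL.__version__)
import glob     # part of the standard library, nothing to install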
In the next video we'll start coding our image type converter script, and in this video that is what we do. To begin, open PyCharm or any other Python IDE you want to use for this project. In PyCharm, go to File and click New; from the options, select Python File to create a new file, and name it up front with something meaningful, like image_conversion.py. For ease of process I've already written the code, and I'll take you through it section by section. In the first section we load the library: even though the package we installed is called Pillow, because it is a fork of PIL we import from PIL, bringing in the Image class, and we also import glob, which identifies files with a particular extension. In the next step we print glob.glob with a small pattern, "*.png", which says: anything, followed by a dot, finally ending in png; in other words, we are telling Python to give us the list of files with a png extension. To understand this, look at our current working directory: it holds three PNG files, the first a Batman Lego image, the second a picture of girls, and the third tarmanjari.png, and all of these PNGs are displayed once we run the code. In the next section we iterate through the PNG files one by one, open each image file, and assign it to a new Python variable called im. Using im we apply a method called convert, converting the image into its RGB format; RGB stands for red, green, blue, the channels that together form the complete color we usually see. So in this step we take the PNG image we read and, after assigning it to a new object, convert it to RGB. Please note that image conversion can also involve RGBA, but RGBA is not something we use here: the A stands for alpha, which represents transparency, and JPEG is a file format that cannot retain transparency, so a JPG image has only the three RGB properties. (Going the other way, converting a JPG to a PNG, you would need RGBA so the new format carries the alpha attribute, but that is not required for our current use case.) Finally, while saving the image, we reuse the same file name we read and just replace the extension from png to jpg, and we also get the flexibility of setting a quality value depending on how large the image should be: if you want a more compressed image, say for uploading online, reduce the quality and the file size shrinks along with it. Because all this sits inside a for loop, it happens for every file we've got. Let us go ahead and run the code from PyCharm: as you can see, it executes completely, displays all the PNGs we had, and finishes with exit code 0.
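Putting that walkthrough together, image_conversion.py looks roughly like this (the quality value is illustrative):

from PIL import Image
import glob

print(glob.glob("*.png"))                    # list every PNG in the current directory

for fname in glob.glob("*.png"):
    im = Image.open(fname)
    rgb_im = im.convert("RGB")               # drop alpha; JPEG cannot store transparency
    rgb_im.save(fname.replace(".png", ".jpg"), quality=90)

One design note: convert("RGB") simply discards the alpha channel, which is why transparent regions come out black in the JPG; if you wanted them flattened onto white instead, you could paste the image onto a white Image.new("RGB", im.size, "white") background, using its alpha channel as the mask, before saving.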
As you might have noticed, when we first opened the Finder, the file explorer with all our files, there were only PNG files, but you can see now that new JPEG files have been created as well. To notice the difference, open the Batman Lego file: the PNG has no background at all, it is completely transparent, which is one of the attributes a PNG image can have; but when you look at the JPEG version, the entire background has been filled with black. That is what happens in this conversion, where the alpha attribute has been lost. So in this project we learned how to import the Python imaging library, how to find the PNGs in the current working directory, and how to convert an image from one file format to another with a chosen quality, or compression, level. In the next video we'll learn how to execute the Python project we just created from the terminal. In this video we'll do exactly that, using the terminal or command prompt. What we did in the previous video was import the Pillow package, iterate through all the PNG files in our current directory, and convert them into JPEGs; the problem with sharing that code is that the recipient has to know Python, open a text editor or PyCharm, and run it. To avoid that, we can treat the project as a Python executable file, which is something we have already created: the .py file from the previous video is what we'll use in our terminal, bash, shell, or Windows command prompt, executing it and doing all the conversion from within the shell itself. This way the entire project becomes a single-command command-line utility. So let us first get into the current working directory where we have all the files and the code; in my case the image type conversion folder is where image_conversion.py and the project files live. We can use the Linux command ls to see what is in the current directory: as you can see, we have the three PNG files and the Python file we created in the previous project. To execute this file, first copy the file name (right-click, copy), and remember whether you have python or python3: if your Python version is below 3.0 you use the command python, and if it is 3.0 or above you use python3.
In my case I have Python 3, so I use python3 as the first command, followed by my file name, image_conversion.py. This command executes image_conversion.py, which iterates through the PNG files and creates new JPEG files. Let us see: in a fraction of a second the entire code has executed, the PNG files are listed, and the run completes successfully. Now let us check whether the new files are in the current working directory, using the same ls command: as you can see, there are new files with the jpg extension. Go to the folder holding all these files and you see that where we initially had only PNGs, after this execution we have the JPGs as well. Just to verify, let us delete all these JPG files for a second, go back and check with ls that no JPG files remain in the current directory, and then run the same command again to execute the image conversion Python file (again with python3, since my Python version is above 2.7). Once it has executed, we can go to the current working directory and see the new JPG files, and as we saw last time, the Batman Lego PNG has no background, which is one of the properties of PNG files that we discussed under alpha, while the JPG version of the same Batman Lego image has a black background: the JPG has replaced the transparent background with black. So in this project we learned how to create an image type converter in Python using the PIL library, the Python Imaging Library, and we converted that code into a script we can execute in one line so that the entire conversion happens for every PNG in the current working directory. Thank you for listening; see you in the next section. In this section we'll learn how to build an automated news summarizer. The reason we call it automated is that a machine-learning algorithm performs the summarization for us, with no manual effort of going through long news text; news summarization is nothing but text summarization applied to news. We'll start with an introduction to text summarization and its techniques, then implement one such summarization procedure in Python; with that we'll have extracted the summarized text of the news, and our automated news summarizer will be in place. Let us go ahead and start the section. In this video we'll learn about text summarization. Text summarization is the process of extracting meaningful text, shorter in length, from a huge chunk of larger text, using algorithms powered by natural language processing and machine learning. It is actually one of the most exciting fields in machine learning and NLP: automated text summarization allows engineers and data scientists to create software and tools that can quickly find and extract keywords and phrases from documents, producing a summarized text. Text summarizers are implemented in a variety of web and mobile applications to deliver summarized content or news; one example is Inshorts, one of the most popular news apps in India, which delivers summarized text distilled from longer news articles.
Summarization techniques are of two types: the first is extraction, and the second is abstraction. With the extraction technique, the automated system or algorithm we have built extracts objects from the entire collection of text without modifying those objects: it pulls key phrases and sentences out of the text we give it, ranks the sentences by importance, and finally produces a summarized version of the text using only the most important sentences. In this case nothing present inside the provided text is modified. The second technique is abstraction. Under abstraction, instead of merely copying information from the given text, the system paraphrases it: it takes the entire text, identifies the key words and key phrases, and uses natural language processing to create new text that is more meaningful and still covers the context, finally giving us the summarized version of the original. So, two techniques: extraction returns a minimal version of the document built from its most important sentences, without modifying the objects; abstraction breaks the text down into multiple objects and builds new, meaningful sentences by paraphrasing the sections, and from that we get the summarized text. In the next video we'll learn which technique this project uses, which Python package we need for it, and how to install that package. Thank you. In the previous video we learned a bit about text summarization; in this video we'll see what kind of summarization we are going to use and the Python library it requires. Of the two types we saw, extraction and abstraction, this project uses the extraction method, and the Python package we use is called Gensim. The Gensim implementation of text summarization is based on a popular algorithm called TextRank. TextRank is a graph-based ranking model for text processing, and an important aspect of it is that it does not require deep linguistic knowledge, which makes the model highly portable to other domains and languages: if you have built a model for English, you can use the same algorithm for a different language without any major changes. Gensim itself is best known for topic modeling in Python: it is a Pythonic library for topic modelling, document indexing, and similarity retrieval over large text corpora, its target audience being anyone who works in natural language processing and information retrieval or extraction, and it is one of the most popular Python libraries in that space. Gensim was open-sourced by a company called RaRe Technologies. Let us see how to install it on a computer: open your terminal, or your command prompt on Windows, and type pip3 install -U gensim; as we have seen before, use pip3 if you are on Python 3, and the -U flag upgrades Gensim if you already have it. Once you type this and press enter, Gensim starts downloading onto your machine.
Because Gensim relates to topic modeling and much more, it comes with many pretrained models, including language models, so the package is quite heavy and takes some time to install. As we can see, Gensim has installed successfully, so let us clear the terminal and open the Python console to check: import gensim runs without any error, so Gensim is successfully installed and we can exit. Gensim is the library we'll use for automated text summarization; but to get the text itself, since we are summarizing news that is published on the internet, we need to extract the text from the news source, and for that we use Beautiful Soup. As we have seen in previous sections, Beautiful Soup is the library we use for extracting text from the internet, whether from HTML or XML; it is our web-scraping library. To install it, you type pip3 install beautifulsoup4: it has to be beautifulsoup4 because that is the latest version of Beautiful Soup. So let us once again open our terminal, clear the text there, and install. As you can see on my terminal, I already installed Beautiful Soup in a previous section, so the requirement is already satisfied; if you have not got it, it will be freshly installed on your machine. Let us open the Python console and check: from bs4 import BeautifulSoup, the object of our interest, and as you can see, BeautifulSoup from the package beautifulsoup4 imports successfully. Always remember this pattern: when installing, the package name is beautifulsoup4, in all lowercase letters; when importing, you import the BeautifulSoup object from the bs4 package. Let us exit the console. So in this video we learned a little about the kind of text summarization technique we'll use in this project, the package Gensim, how to install it, and a bit about what Gensim is; and we also saw that we need the Beautiful Soup package for web scraping, that is, to extract the text of the news article we are interested in from its web URL. In the next video we'll extract the news text from the internet using Beautiful Soup.
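The two console checks described above, as a minimal sketch:

import gensim                       # note: this project's summarizer needs Gensim 3.x
from bs4 import BeautifulSoup       # installed as beautifulsoup4, imported from bs4
print(gensim.__version__)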
In this video we'll see how to use the Python package Beautiful Soup to extract text from an internet news source. To begin, let us import the packages we want for text extraction: the first is Beautiful Soup, the web-scraping library we'll use to pull text from the news source, and the second is requests, which fetches the web content that Beautiful Soup then parses. So we write from bs4 import BeautifulSoup and from requests import get. The reason we import a particular object or function rather than loading the entire package is memory management: with a huge package like Gensim, importing everything would occupy a large chunk of memory, so it is better to import only the functions and objects of interest rather than the whole package; here, the BeautifulSoup object and the get function from requests are our interest. Once that is done, we create a custom function whose purpose is to extract only the text inside paragraph tags. On a typical web page the text is spread across different tags: it could be in a span tag, in an h1, h2, or h3, in a div, anywhere; the tag of interest to us is the paragraph tag, the HTML tag denoted by p. The first step in the function is to read the URL: it issues an HTTP GET request for the URL using the requests package and stores the result in page. Once it is stored in page, we use Beautiful Soup to parse that content with the lxml parser and store the result in soup. The next step is to pick out only the paragraph text and save it in text: because a particular page can contain many paragraphs, we use soup.find_all to find all the paragraph tags, map a lambda function over them to pull the text out of each paragraph we found, and finally use join to combine all that text into the text object. It is also always good to present the title, so apart from the text we extract the title as well. We use soup.title.strings because the title might sometimes come with escape strings, like \n, which denotes the end of a line, or \t, which denotes a tab space; to strip those out we use soup.title.strings, and again we join all the words we get and store them in title. Finally we return title and text together, which will be read as a tuple, and the custom function we wanted is done: its name is get_only_text, and the argument we pass is the URL we want to extract the text from. Once this function is defined, let us execute the cell. The next step is the URL from which we'll extract the text: the news comes from a very popular media publisher called Vox, and the title is "California is cracking down on the gig economy"; this is what we'll use to extract and summarize text. We take this URL and pass it, as a character string, into the function we just created, get_only_text, and execute.
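A sketch of that helper and its invocation (the full article URL is not spelled out in the transcript, so a placeholder stands in for it):

from bs4 import BeautifulSoup
from requests import get

def get_only_text(url):
    page = get(url).text                     # fetch the raw HTML
    soup = BeautifulSoup(page, "lxml")       # parse it with the lxml parser
    # collect the text of every <p> tag and join it into one string
    text = " ".join(map(lambda p: p.get_text(), soup.find_all("p")))
    # soup.title.strings avoids escape characters such as \n in the title
    title = " ".join(soup.title.strings)
    return title, text                       # returned together, as a tuple

url = "https://www.vox.com/..."              # the Vox gig-economy article from the video
text = get_only_text(url)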
As you can see, it finishes executing, and printing the text object shows that the text has been successfully scraped. Let us also see how large this text is, in other words how many words we've got: we take the string, split it into words using str.split, and use the length of the word list to count them. You might wonder why I used text[1] instead of just text. As I mentioned when we created the function, it returns a tuple, which is a different type of Python object from a list: a tuple is immutable, and the reason we returned one is that we wanted to return the title and the text in the same expression. If you check len(text), it shows 2, meaning a tuple with two objects in it, and you can also see it opens and closes with parentheses, as opposed to a list, which opens and closes with square brackets. text[0] is the title and text[1] is the actual article text we want, and that is why we apply the string split to text[1] and count the words in it. So in this video we learned how to use Beautiful Soup and requests to extract the text from the news source, and we saw the length of the extracted text, 1,625 words. In the next video we'll do the text summarization with Gensim, and in this video that is what we do: automated text summarization of the extracted text. As a first step we import the required Python packages: the package is Gensim, as we know, and within Gensim we are interested in two functions, summarize and keywords. So we write from gensim.summarization.summarizer import summarize and from gensim.summarization import keywords; once you have typed this in, execute the Jupyter notebook cell, and the package imports successfully. The next step is the text summarization itself. As complicated as it might sound, Gensim offers it in a single function: summarize, to which we pass the text, and since the result we extracted in the previous step is a tuple, that means passing text[1]. As its first argument summarize takes the actual text content; the second argument is ratio, and the third is word_count, and the thing to know is that you can use either ratio or word_count, not both. The ratio is nothing but the proportion of summarized text you want relative to the original text. Let us start with the first method, using word count: we call summarize, pass text[1], and set word_count to 100, meaning we do not want more than 100 words.
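In code, a minimal sketch (note that the gensim.summarization module was removed in Gensim 4.0, so this requires a Gensim 3.x install):

from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords

print("Title:", text[0])
print("Summary:", summarize(text[1], word_count=100))   # cap the summary at ~100 words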
In the previous video we saw that the article has 1,625 words in total, and here we extract a meaningful summary of only 100 of them. To make the output a little nicer cosmetically, we print "Title:" followed by text[0], and then "Summary:" followed by the summarized text. Executing this, we get the Vox title and then the summary, which says that the state assembly has passed the bill making it harder for companies to label workers as independent contractors instead of employees, which is what usually happens in the gig economy. Counting the result, we get 98 words: we limited the word count to 100 and received 98. That is the first method, where we extracted the summary using word_count as the threshold argument. In the second method we say: I do not want a word-count threshold; instead I give a ratio. The ratio can be anything between 0 and 1, and it is used, approximately, as a proportion of the original 1,625-word text, determining how much text we get back. We use the same function on text[1] and set ratio=0.1, and you can see we get the title and the summarized text. If you want to reduce the output further, instead of 0.1 you can say 0.07, which shrinks the text; if you are interested in reading more, you can say 0.2, which gives more text, while 0.1 gives less. So we have extracted the summary using a method other than specifying the word count; still, it is good to see how much it was reduced, so let us copy the call that produced the output text, assign it as summarized_text = summarize(...), and execute. Now we put summarized_text through our word-count check and get 217 words, versus the 98 words of the previous method, where we set the threshold using the word count; in this method the threshold was the ratio.
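The ratio-based variant, sketched (same Gensim 3.x caveat as above):

summarized_text = summarize(text[1], ratio=0.1)   # keep roughly 10% of the article
print(summarized_text)
print(len(summarized_text.split()))               # about 217 words on the video's run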
Sometimes the summarized text alone is not enough; you also want to see the keywords the algorithm found most important. For that we use the keywords function in much the same way: the first argument is the original text we pass in, the second sets a threshold using the ratio, and the third says whether or not to lemmatize. To understand what lemmatization is, let us first execute the function without it. You then see the critical keywords: drivers, courts, court, workers, worker, states, state, contractors, contractor, and so on. The keywords are there, but you can see repetition, court and courts, state and states, contractor and contractors, and this is because we have not lemmatized. Lemmatization is the process of taking a word and converting it to its root word: workers becomes worker, courts becomes court, contractors becomes contractor, which means we would no longer see duplicates that differ by just one extra letter. So let us run it again with lemmatize=True, a boolean flag, and now we get a new set of keywords, courts, uber, contractor, business, worker, with no plural and singular forms side by side, because lemmatization has been done. We can do the same for the other method as well.
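Sketched in code (in Gensim 3.x the lemmatize flag relies on the optional pattern package being installed):

print(keywords(text[1], ratio=0.1))                  # raw keywords; plural/singular duplicates
print(keywords(text[1], ratio=0.1, lemmatize=True))  # collapsed to root words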
So in this video we learned how to do automated text summarization with the Gensim summarize function, by two different methods, one using ratio and the other using word count, and alongside it we learned keyword extraction: one task is text summarization, the other is extracting the most relevant or important keywords. For example, if you were going to run a Google AdWords campaign for this particular text, you would need to understand its critical keywords and then run the campaign and do the bidding accordingly. In the next video we'll see how to make this a complete project, followed by a summary. In this video we'll see how you can take this further. As you can see, we used a Jupyter notebook to create this project, and a notebook is good when you want to incorporate narrative text as markdown alongside the code and its output, but it is not suitable for every purpose. For example, suppose you want to run this project as an automation: you want to take a particular URL and schedule it on your computer to fetch the summarized text from a news source every day, say every morning. In that case you can go to your Jupyter notebook and download it as a .py file, so that instead of a notebook you have Python code, which you can schedule with the Windows Task Scheduler or the Mac Automator to run every day at a certain time. The second thing you can do is use this notebook to create a more generalized version of the project and convert it into a command-line tool, which you invoke with a URL as an argument and which gives you back the summarized text. In this project we have hard-coded the URL in one cell, but instead of hard-coding it you can make the URL an argument passed at the command level, invoke the Python project from your terminal with the desired URL, and get the extracted text as output. So there are two ways to take the project further: extract the Python code and schedule it so you get the news in your inbox every day, or keep the same .py file in a more generalized format, without the hard-coded URL, so you get the extracted, summarized text for whatever URL you supply. And if you do not wish to do either of these things, you can simply keep the Jupyter notebook: whenever you want a different article, change the URL here, rerun it, and the notebook gives you the desired output, letting you read the news in a summarized format, just like headlines, the way the Inshorts app we saw, that most popular news application in India, sends a card of text that is a very condensed version of the actual news. So in this video we learned how to take this project further. In this section we learned how to build an automated news summarizer: we started by understanding what a news summarizer is and what techniques are available, then learned about the Python packages required, and finally performed the news summarization with Gensim on text extracted with Beautiful Soup. I hope you enjoyed the project. Thank you very much.
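As a closing illustration of the command-line variant described above, here is a minimal sketch; the argparse wiring is an assumption of mine, not shown in the video, and it reuses the get_only_text helper defined earlier in the same script:

import argparse
from gensim.summarization.summarizer import summarize

def main():
    parser = argparse.ArgumentParser(description="Summarize a news article from a URL")
    parser.add_argument("url", help="address of the article to summarize")
    args = parser.parse_args()
    title, body = get_only_text(args.url)    # scrape title and paragraph text
    print(title)
    print(summarize(body, ratio=0.1))        # extractive summary, ~10% of the article

if __name__ == "__main__":
    main()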
Info
Channel: freeCodeCamp.org
Views: 12,324
Rating: 4.9766383 out of 5
Id: s8XjEuplx_U
Length: 190min 29sec (11429 seconds)
Published: Tue Sep 21 2021