Web Scraping with Python - Beautiful Soup Crash Course

Hi everyone, and welcome to a special Python tutorial where we are going to learn how to perform web scraping. First of all, thanks to freeCodeCamp for giving me this opportunity to be a guest on their channel. I have a YouTube channel as well, named JimShapedCoding, where you can find tech-related topics such as programming languages, web development, and more; I upload once or twice a week, and you can find the link in the description. In this video I'm going to do my best to teach you everything related to web scraping, and I'm going to do that with the Beautiful Soup library, a library that lets you gather the information you want from pretty much any website you want. This website could be your bank account, a job-posting website like LinkedIn, Wikipedia, a sports website, or really anything you can think of. We will start by scraping a basic HTML page, just to understand the concepts, and then we will move on to scraping a real website. In the last 15 to 20 minutes of this tutorial I'm going to show you how you can store the information that we have pulled from this website. So let's begin.

Great, so this is the web page that we are going to start scraping, and I'm going to explain what is going on here. You can see that we have a basic title and then three card-like paragraphs. The first one has a title of Python, then a kind of secondary title, then a basic explanation about the course itself, and then a button that says Start, which will probably lead us to a different page if we click on it; you can see that it shows the price as well. The same structure is repeated three times, which also gives us the web development paragraph and the machine learning paragraph. Now, what we are currently looking at is
basically the behind-the-scenes of that page. This is the HTML code that is defined in order to show you that Hello, Start Learning page, and you can see that inside our HTML document all of the code is written with tags. Those tags are responsible for displaying different information for you. We have a big tag called html, and inside that html tag we have a head tag and then a body tag. You can see that we define a closing tag for each of our tags with the forward slash, and you are going to see that for the other tags as well.

Let's expand the head tag. Inside it we see some meta information that is not really relevant for us, a link tag, which is responsible for importing some styling for our page, and a title tag, which sets the name shown on the browser tab; that is why you see My Courses over here. Now I will close the head and expand the body. The body is responsible for what is displayed on the page itself. You can see that we have an h1 tag, and between the opening and closing tags, which is the area where you write the text for that tag, we see Hello, Start Learning. Then we have some div tags. A div is a very basic tag that groups content so that it can be styled. Here you'll see class equals card; what this attribute assignment does is apply the card styling, and that is why you see the card style for each of our paragraphs on that page. Inside that card div we have one more div with the class card-header, which applies the styling for the card header, and its text is Python. Then we have the card body, and inside it the h5 tag, which is a kind of smaller header, and if I scroll right
here, you can see the Python for Beginners text and then the closing h5 tag. We have a paragraph and then the a tag, which lets us link to another page; when you see an a tag, it is basically a reference to another page that you can visit. Now, this entire code that I'm currently marking (let's actually make our page a bit bigger here) is repeated three times, and that is why we see the page that we saw previously. This is quite important to understand, because we are going to scrape that page and pull some information with the Beautiful Soup library. If you are confused by the script tags here, don't be: those tags are responsible for importing some JavaScript libraries, and that is not relevant for us right now.

Okay, so we are going to switch to Python now in order to apply some basic scraping to that page. I will start working on my main.py file, and you can see that nothing is here yet. Before we actually start, we have to install some libraries, and one of them is Beautiful Soup. I will open my terminal, and since I'm working with my system's global interpreter, I will allow myself to install it here. I write pip install and then beautifulsoup4, as one word, with no spaces and no dashes. I hit Enter, and you can see that it installed successfully. The next thing that I want to install is something that Beautiful Soup is going to use, and that is the parser. When you work with Beautiful Soup, you have to specify the parser that turns HTML files into Python objects. There are different parsers for your HTML code, and I heard that the best of them could be the lxml parser, since the default HTML parser is not going to deal well with broken HTML code. So just go ahead and install the
lxml parser library; you can also do that with pip install, and then we are going to use it when we work with Beautiful Soup. So I write pip install lxml, and let's wait until it's finished. Great, so we are ready now to go back to Python and start working with the Beautiful Soup library.

First we have to import the library. It could be a little bit confusing, because the library's package is named bs4, so that is why we write from bs4 import BeautifulSoup. Once I have done that, I have to figure out how I'm going to access the content inside the home.html file that is right there inside my web scraping directory. In order to do that, we have to work with file objects. If you don't know how to work with files in Python, that is totally fine, because we are going to go over it, and it might also be worth checking my channel to see if I have already uploaded a video on working with files in Python.

So I'm going to write with open; this is a statement that allows me to open a file and then read the content of that specific file. As you can see from the autocompletion, I have to specify the file's name as my first argument, so inside the parentheses I write my HTML file's name. Since the Python file and the home.html file are in the exact same directory, it is okay to write just its name, so it will be home.html. The second argument is the mode in which you want to open that file. You have a couple of options when you work with files in Python: you can read them, you can write them, or you can do both. Since we only want to read the content, we open a new string and write r; what this tells Python is that I'm going to read that file only. And once I
have done that, I have to name a variable that is going to be used inside the code block that I just created, which is the with open block. So I use the as keyword and then create a variable name that is going to be used throughout the block; it will be html_file. Once I do that, I go inside the block and write content equals html_file.read(). By applying the read method I'm reading the HTML file's content, and to show you how this works, let's first print the content itself. I print the content and run main.py, and you can see that the information that is printed is exactly what we saw in home.html. So we did a great job reading this file. In my future episodes we are going to read HTML from real websites, but I just want to give you an idea of how web scraping works in a very basic way, because when you work with actual websites, the scraping and the information pulling is going to be quite a bit harder than with the HTML file that I have written to explain the idea of web scraping.

Okay, so I'm going to continue and use the Beautiful Soup library in order to prettify my HTML and work with its tags like Python objects. The way you can accomplish that is by creating an instance of BeautifulSoup. I create a new variable, let's call it soup, and make it equal to a new instance of BeautifulSoup. The first argument is the HTML that I want to scrape, which is the content variable that we created above, and the second argument is the parser that we want to use. We pass the parser as a string, and that will be lxml, which we installed previously. Now, once I go ahead and try to
print what is inside that soup instance, it will be something like the following. We create a print statement and go with soup.prettify(), which lets you see the HTML code in a prettier way. If I run this, you can see the HTML content, exactly the same as what we saw in home.html. So we have done a great job until now. Let's minimize our terminal, and now we are going to get more familiar with the methods that Beautiful Soup provides.

We delete the print from here and start looking at how we can grab some specific information. Let's assume that we want to grab all the HTML tags that are h5 tags, which is a kind of header tag. We create a new variable, let's call it tags, and go with soup.find. The find method searches for the specific HTML tag that I specify as a string, so if I write h5 here and then print the tags down below, the result is something like the following. You can see that we have the entire HTML for the h5 tag, and its text is Python for Beginners. But if you remember, we have more than one h5 tag inside our home.html file: there is one here, a second one over there, and a third one over there. What that means is that the find method searches for the first matching element and then stops searching. If you want to change this behavior and not only grab the first element, you have to change your method to find_all. That will search for all the h5 tags inside the content, and if I run that, you can see that the result here is quite different, as we
have here a list, and you can see that it has Python for Beginners, then Python Web Development, and then Python Machine Learning. That could be a great way to bring back all the course names from that web page, so you can rename the variable to courses_html_tags, since this is what the h5 tags actually hold. Now I can write some code that lets me see all the courses that are defined on our page: Python for Beginners, Python Web Development, and Python Machine Learning. We can work with courses_html_tags, which stores all the h5 tags, and write a nice program that displays all the courses. We iterate over courses_html_tags, because it is a list, so we write for course in courses_html_tags, and from each course tag that we iterate over we take only the text attribute, which holds the course text itself, so it will be course.text. Now if I run our program, you can see that we have a nice output listing all of the courses that are available on that page. This could be a nice starting point for understanding how you can scrape a web page to grab the specific information you want.

All right, so we were able to understand how we can apply some basic scraping to a web page, but when you deal with real websites, the HTML code is not going to be as friendly and simple as what we had here. In order to access the HTML code behind the scenes of a page, we have to use the browser's inspect tool. Let's say that you want to grab the price for each of the courses. It makes sense to hover over that button with your mouse, right-click on it, and look for the Inspect option. Once you open it, a new pane opens, and here we can see all the
HTML code that is responsible for displaying what is in the left pane. Let's make it a little bit bigger; that will be enough. You can see that we have div class card three times, which displays the different courses. When you move your mouse over different HTML tags, you can see that the browser highlights the part of the page that each tag is responsible for, so it is quite an important behavior to understand. Now, let's say that we want to grab the price for Python for Beginners. It makes sense to expand this tag and see what is inside. I search for that button, and you can see that this a tag is responsible for the button itself, and its text is Start for 20 dollars. So the price information is right there. Let's write a program that searches for Python for Beginners and grabs the price for that course, and then we will be able to write a nice program that lists all of the courses and the price of each one. So let's go back to PyCharm and write this program.

We delete everything from here, and the first step is to grab all the course cards, so it will be course_cards equals soup.find_all, because we want to bring back all the cards; this is why you have to use find_all and not find. I'm going to search for the div tags. It would be much nicer if we could filter for only the div tags that we actually want to grab and store in course_cards. If I go back to our courses page and expand all the div tags, you can see that there is something common to all of them: their class is equal to card. So I can filter my div tags by this attribute. I go back to PyCharm and I will write here class
equals card, but now you can see that there is an error, and it is quite an important behavior to understand: you have to add an underscore here, because class is a reserved keyword in Python, used to create Python classes. That is why you add the underscore, and then Beautiful Soup understands that you are referring to the HTML class attribute. Now, since we have all the course cards stored in this variable, we want to iterate over this list and look for the course name and the course price. Let's see how we can do that for each of our course cards.

We start with a for loop: for course in course_cards. Before we write more code inside our for loop, let me remind you what is inside each of our courses: you can see that we have an h5 tag in each of our course cards, and it makes sense to access this specific h5 tag. We can do that by using the h5 tag as an attribute. If I add .h5 and rerun my program, you can see that we were able to grab the h5 tag inside each course card, which is great. If I revert this back to course and run it again, you can also see that inside our a tags we have the text Start for 20 dollars, and that is repeated for all of our cards as well. So first of all, it makes sense to delete this and write something like course_name equals course.h5, and here we want the text attribute of that h5 tag, so I add .text. This course_name will store the text on each iteration. Now I can write course_price, and this time I go with course.a, because the a tag stores the information about the course price. If I now print the course name, and then I also go ahead and print
the course price, then we will see results like the following. You can see that we have Python for Beginners and then the a tag itself, but in this case we want the text of that a tag as well. So I minimize my terminal (excuse me for that) and ask for the text attribute here as well. Now I run my program, and you can see that we have Python for Beginners and then the text of each of our a tags. Since we reached this stage, it might be a good idea to print a sentence like Python for Beginners costs 20 dollars. The way we can do that is by using the split method to access the last element of that text, because the price is the last word. So we go with split, and since we split on whitespace we don't have to specify anything here, and we want to grab the last element, so we use the -1 index. Now if I run it, you can see that we have the price for each of our courses.

It would be much nicer if we used an f-string to print a dynamic sentence for each of our courses. So we go with print, open an f-string, access the course name, so it will be course_name, then write costs, and then display the course price, so it will be course_price. Now if I run our program, you can see that it displays nice information about each one of the courses. If you think about it, that is quite a nice behavior, because if you scrape a real website like Udemy that keeps updating its courses, then it might be a great idea to launch this program on a schedule, for example every week, so that you stay aware of each of the courses that Udemy has updated on the page that you scrape. So this is quite a nice result that we were able to reach here.

On this one we are going
to scrape real websites with the requests library. I'm going to demonstrate this against a website that lists job advertisements, and I'm going to bring back all the jobs from a specific website whose main skill requirement is the Python programming language. I'm going to write a program that pulls the latest published job advertisements from that website, so it is going to be very interesting. Let's get started.

All right, so one of the first things that we must do is ensure that we have the requests library installed. I go down to my terminal, right in PyCharm, and write pip install requests. The output for me could be different from yours, because you may not have the requests library yet, but since I already have it, I see output like Requirement already satisfied. Now I minimize the terminal and write import requests; make sure that you do that after installing the library. The first thing that I'm going to do here is use the get method of the requests library. What the requests library does behind the scenes is request information from a specific website, like a real person going to a website and requesting some information. So it will be requests.get, since you want to get specific information from a website, and we provide an empty string for now; later on we will fill in this string with the URL that we are going to scrape. I assign this to a new variable and call it html_text, so I make that equal to this entire statement. Now let's go to a web browser and look up a website that includes some job ads. Okay, so this is timesjobs.com, and this website includes job
posts about almost everything, so you can simply go down here and search for a skill that you have, and it will find jobs that require this specific skill. Now, this video was recorded a couple of days before I uploaded it, so if you watch it a couple of months or even a year or two after the publish date, there is a good chance that the HTML elements are going to be quite different. But the main point of this video is to teach you the tools for pulling information from a website just as you want, and then you can apply your own customizations and do a kind of reverse engineering of the code that I write throughout this tutorial.

Great, so let's write python here, so that I receive only job posts about this programming language. You can see the number of jobs found over there, and a lot of jobs are published. My goal in this tutorial (let's get this closed) is to bring back all the jobs that were posted a few days ago. If I zoom in, you can see Posted few days ago for a couple of posts, but further down we have Posted four days ago, which might mean that this job post is not the most recent. So I'm going to bring back all the jobs, and I'm going to condition my program to keep only the elements with the Posted few days ago text.

Let's go back. I copy this URL and paste it into the empty string that we created inside requests.get. Once I have done that, what is stored in this variable is a response object, which prints as the request's status code. If I print it and run this program, we see a result like the following: 200 is the HTTP status code meaning that the request completed successfully. But we don't want the status code, so I'm going to ask for the text only.
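As a rough sketch of the request step described above: the helper name fetch_html, the get parameter, and the 10-second timeout are illustrative assumptions, not part of the video's code, and the search URL is left out because the transcript does not spell it out.

```python
def fetch_html(url, get=None, timeout=10):
    """Return the HTML text of `url`, raising an error if the request failed."""
    if get is None:
        # Imported lazily so the helper can also be tried offline with a stand-in
        # for requests.get (an assumption made for this sketch).
        import requests
        get = requests.get
    response = get(url, timeout=timeout)
    # A successful request has status code 200; raise_for_status() raises
    # an exception for 4xx and 5xx responses instead of failing silently.
    response.raise_for_status()
    # .text is the body of the response decoded to a string, i.e. the page's HTML.
    return response.text


# Hypothetical usage; paste the search URL copied from the browser:
# html_text = fetch_html("<search URL copied from the browser>")
```

Calling fetch_html with the URL copied from the browser would return the page's HTML as one string, ready to hand to Beautiful Soup.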
So I'm going to write .text here. This is what we have to apply in order to get the HTML text of that specific page, and now it makes sense to leave this variable name as it is, because it stores the HTML text. I rerun this program, and we receive a large amount of HTML. Right now it is not really relevant; I just wanted to show you the result, so let's continue from here.

Okay, so as you know, we are going to create a BeautifulSoup instance like we did in the previous episode, and I'm going to provide the html_text variable as the HTML. So it will be soup equals an instance of BeautifulSoup, with html_text as the information that I want to scrape, and we use the same parser as in the previous episode, so it will be lxml. Once I have done that, it makes sense to go back to our page and see how we can grab each of these job boxes from the website. The white boxes are a list of elements that this page provides, and I want to find a way to bring back all the job posts. So it makes sense to pick a certain element inside a post, right-click on it, and click Inspect.

Once I have done that (I'm going to zoom in a little bit), we can see that the h3 with a class is pointing to that text over here; I know that the text is a little bit small, but you can see that it has a gray mark. If I go up here, you can see that the parent elements are opened up as well. If I hover my mouse here, you can see a green background wrapped around the whole post, and if I close that up, you can see a lot of elements whose class is clearfix job-bx and something like that, and the HTML element here is called li. li stands for list item, and you can see that it is inside a ul tag, which stands for unordered list, and it
contains a lot of li tags inside that ul. You can see that once I close that, the entire list of posts is marked with a blue background. So I'm going to search for the li element with that class name. I copy the class name here, go back to PyCharm, and write jobs equals soup.find_all. I search for all the li tags, and as the second argument I pass class_ equals, and inside that string I paste the class name that we copied from the page itself. Once I have done that, we should see all the jobs on that page. This doesn't mean that it brings back all 16,000 jobs, because you can see that this page is paginated; it brings back the results only for the first page, so it is not going to take extremely long. If I go back and print the jobs, let's see the results before we continue, just to make sure that everything is okay. We can see that we receive the results, we see some company names, and I think that everything is fine here.

Now, while developing this scraping project, it makes sense to work with only one job element. So I delete the _all from here, which means that it brings back the first match for the li tag with this string as its class name. Let's change this variable name to job for now, just in order to develop our program slowly, relying on only one job post. Once we've done that, we want to look for the company name of that specific job post. So I go back to the browser, make things bigger, and inspect what is going on here again. Let me zoom that out. Great. Now I'm going to inspect this text over here again.
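To make the difference between find_all and find concrete, here is a minimal sketch on made-up markup. The sample HTML, the company names, and the class string are stand-ins for whatever you copy from the inspector, and html.parser is used only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described above;
# the real results page is far larger.
SAMPLE_HTML = """
<ul>
  <li class="clearfix job-bx"><h3>Acme Corp</h3></li>
  <li class="clearfix job-bx"><h3>Globex</h3></li>
  <li class="other">not a job post</li>
</ul>
"""

# Swapped-in parser: the tutorial uses "lxml"; "html.parser" ships with Python.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns every li whose class attribute is exactly this string.
jobs = soup.find_all("li", class_="clearfix job-bx")

# find returns only the first match, which is handy while developing.
job = soup.find("li", class_="clearfix job-bx")

print(len(jobs))
print(job.h3.text)
```

find_all returns a list-like collection of every match, while find returns just the first matching tag, which is convenient while developing against a single job post.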
And then we can see that it is inside the li tag for sure, but we can also see that it is inside an h3 tag with the class name joblist-comp-name. So I'm going to search for that class, but not in the entire page. Let's go to PyCharm: you want to search for that specific element only inside the job itself, because it doesn't make sense to search for an h3 tag in the entire page again. So you can go with job.find instead of soup.find, because we want to search for that h3 tag only inside our job. If I print the job here, we can see that it includes the HTML code of only one job, and I'm going to search it for this h3 tag.

Let's create a new variable called company_name and use job.find with h3 as the first argument, and this time class_ is equal to whatever this h3 tag has as its class name, which is joblist-comp-name. To debug this and make sure that the results are right, we print the company name, and you can see that we receive this element back. I'm going to use the .text attribute to bring back just the text itself. Once I do that, we see a weird result: there are some extra white spaces, so we want to replace them with nothing. In order to do this, I use the replace method, and this trick removes these unnecessary white spaces. I replace a space with nothing, so as the replacement I just write an empty pair of single quotes. Once I have done that and rerun our program, you can see that the result is quite different: the text is now fully aligned to the left. Let's minimize and continue from here. We're going to zoom out the code a little bit here, just so we can
see the important points, like the replacement, and let's continue from here. It also makes sense to bring back the skill requirements other than the Python programming language, because we already know that this job is for people who are good with Python. So I repeat the same process again and write job.find, and we are looking for an element that includes the text about the skill requirements, so let's search for that. Let's go back to our website and check which HTML element includes the skills. We are talking about this one, so I inspect it, and we can see that this text is inside a span with the class name srp-skills. I copy this class name again, and this time I'm going to search for span elements inside my job post. I go back to PyCharm and write span, which is the HTML tag that we are searching for, and again class_ equals srp-skills.

I want to check the results once again; you always want to quickly print whatever HTML element you pull, to see what other methods you have to apply to clean up your result. So let's run our program again (it makes sense to delete the print of the company name first). You can see that we have a span tag, and inside it a strong tag, which is used to make text bold. I'm just going to guess that writing only .text here will be enough and expect the results to be fine, so let's check that. You can see that the results are quite good: we have Python scripting, and then some more requirements separated by commas, and a lot of white spaces again. So I'm going to apply the same method
of dot replace once again like we did with the company name so let's write here dot replace and i'm going to replace white spaces with nothing so let's re-execute that out and then we can see that the result is quite like we want and now we were also able to grab the skills as well so this is quite nice now if we want to display a nice information about the job until now then we want to go with a nice print message here so let's try to create a nice message so we will use an f string here and we will also use the triple quote method just to allow us to write some text in separated lines as well and i'm going to write here company name like this and then i'm going to write here company name so i'm calling the company name value by writing it inside curly brackets and i'm going to repeat the same process for required skills so it will be required skills and then i'm going to make that to be equal to skills variable and now if i go and execute our program let's see if the results are quite nice yes so we are kind of receiving a nice information about the job info okay so this is quite great now if we go back to here then we want to search for one more element so you remember that i told you that we only want to grab the job post with the text of posted few days ago so we for sure want to write some extra code to apply this functionality so i'm going to go here and i'm going to inspect for that element again and then we can see that it is inside a span once again but i can also see that this job post includes some more span tags so i have to filter out the results again with the class name itself so i'm going to search for that sim posted class name and i'm going to go back to here so we will write this time job published date so it makes sense to delete the job excuse me so it is just going to be published date and i'm going to go here again with job.find and we will search for the span and then this time the class underscore is going to be equal to the text that i
just copied and i'm going to repeat myself with printing the published date but that time let's just avoid printing this print line so i'm just going to comment out those lines and let's see what the published date text is looking like and you can see that we have here something a little bit weird so we have the span here and we have also one more span inside of the text of it so what that means it means that we have to take some different action than what we did previously so this time i want to search for the attribute of span just to get inside that tag over here and then right after it i want to look for the text of that span tag so this will give me the published date of this specific job but i'm not going to include the published date inside my print message because we only want the published date for the functionality to stop our execution if the published date text is not including the word few and i'm going to code this functionality just in a second so you will see what i mean by what i said all right so what i'm going to do here is take a tricky action that is going to bring me all the jobs from the first page so if we paid attention then all the job posts include this class name so what i can do instead of the find is change that to find underscore all and change this variable name to jobs and i know that just now it just raised an error here and i'm going to use here a for loop that is going to iterate over each element and i'm going to write here for job in jobs and then i'm going to create an indentation of the entire code that is right there so the results will be applied for all the jobs that are posted in the first page of the web page that we scrape so once i hit here the colon sign then i'm going to create an indentation for each of our lines like this and then the results are going to be quite the same so let's test that out okay i'm going to uncomment our print line over here and just for comfort reasons i'm also going to print here empty
lines so we can kind of see a division between the different jobs and then i'm going to delete the published date for now so if we execute our program that time then we are going to see nicer results and this is going to contain all the job posts from the page that we scrape against so you can see that we have a nice paragraph for that job post and then we have also another one here and if i keep scrolling up we can see a lot of them in that output so this is quite great so if you remember we wanted to filter out the job posts that are not including the word few inside the published date because what that means it means that this job could be outdated so if i go to our page again then we can see that as i keep scrolling down we have some text like posted six days ago and i wanted to keep only the jobs that are containing the text of posted few days ago so in order to apply this i'm going to change the orders here a little bit okay so i'm going to cut this searching here and i'm going to paste that in as the first line inside my for loop now the reason i'm doing this it is basically because i don't want to continue on scraping for that post if the published date is not matching my condition so it makes a lot of sense to place this code as the first line inside my for loop and then right here i'm going to write a condition that is going to check if the word few is inside that text so it will be if few in published date and again i'm going to create an indentation for the entire code here so you can do that with the shift alt combined and then you can just press tab and all the lines here are being indented so right now if i go ahead and execute our program then we should see the results again like almost the same but we also see here that the f string is not quite nice but i can live with that okay so it is great that we were able to receive the posts only that have been published few days ago now there is no limit for what you can do when it comes to
web scraping and what you can filter in or filter out but basically this program deals with how to grab some job posts with the filters that you want to apply that maybe sometimes may not be available from the website itself so you can write your own filters in your python code while you scrape some information from a specific website so i'm going to do whatever it takes to turn this program into a very useful one and i'm going to do that by applying some special functionalities such as wrapping this entire program in a while loop and executing this project every certain amount of time and also apply some filters to filter out the job posts that are not meeting the skills that i own and also i'm going to throw the results of the different job posts into a new blank file so i can be aware of the posts that are being posted every certain amount of time so let's get started all right then so let's start with a kind reminder of the results that we got until that point so we run our program now and if we show it right here you can see that those lines are not aligned well so i'm going to change that and i'm also going to provide some extra information that will show us the exact link of the specific job that we are iterating on so that way i will have the ability to just click on the link and then see more information about that job so to begin with i will get rid of the formatted string in that case because doing a formatted string with a triple quote might not be a great idea when you execute it with a for loop because as you can see that it also includes the indentations right here so i'm going to delete this entire code here and i'm going to write two more new formatted strings and we will start with company name make that to be equal to company name so make sure to add a colon here so it will be more friendly and then i will write here required skills as well and then we will write here the skills variable now there was one more issue with the result that we
showed a minute ago and that was the blank spaces that are being shown as well so we can get rid of the spaces by a special method that is called strip and it is a special method that you are allowed to use on strings and since the company name and the skills are strings by default i don't have to convert them to a string so i can just call that method like this okay and now i will show the results of something like the following and in a few seconds we will see that this is aligned way better than what it was and i'm also going to add here one more information line that will show the link of the job post so let's do that okay let's go here and write this functionality okay so we had an unordered list that inside of that we had some different html tags that are called li and that stands for list item and they are actually different job posts that are divided into different elements inside an unordered list and then if we hover our mouse you can see that there are different jobs now if i go inside one of them and i go inside a header tag that is actually the first header of the li tag and then i will go inside the h2 here and then you can see that we have a link that could lead us to a page that provides some extra information about that specific job so if i actually go here and click on here you can see that we receive the job description right here so what we have to do in order to access this link in each job post that we are iterating on in the python code is actually going inside a header and then going inside one more tag with a kind of h2 as you saw me doing that and then access that a tag so let's do that okay i'm going to go back to pycharm and apply this functionality so we will go under the skills and then we will write here more info and that will be equal to job dot header because this was the first tag that we want to go inside of it and then we want to go inside the h2 and then inside that h2 we want to go inside the a tag now before we go further let's
test ourselves that we have done a great job so let's print the more info in the following way so it will be more info and then we will call the variable in a formatted string now let's execute our program and then you can see that inside the more info we have the a href which gives us the link about the specific job that we are iterating on so all i have to do here is going back to my more info and then call that href attribute so this time i'm going to do that with a square bracket like in dictionaries and then i'm going to write here href so i will receive the value of that attribute so if i run that one more time then i should see the link only and that is exactly what is happening so the result is quite great and then you can see that this is already better than what we did in the last episode and we will continue from here okay so what i want to do now is giving the opportunity for the user that executes this program to filter out some skill requirement that he does not own so we will use the input function for that and then whatever the input is equal to we will filter out the results from the jobs that we are finding right here okay so let's write this functionality so to apply this i'm going to create a new variable over here and i'm going to call it unfamiliar skill and i'm going to make that to be equal to an input and then i'm going to write here something like this okay so the user could understand that he has to provide some information in order to execute this program and actually it might be a great idea to print some extra information before that input function so it will be print put some skill that you are not familiar with and then right after the unfamiliar skill input i will write here filtering out and we will actually make that a formatted string and then we will write here filtering out and then whatever the unfamiliar skill is equal to now what are we going to do with this unfamiliar skill variable so that is quite easy right we have to search
for a condition that will filter out the job post that is including that word that we are going to provide here as an unfamiliar skill and what we can actually do is search for the unfamiliar skill word inside the skills string so if you remember the skills is a long string that is divided with commas so we can go with a condition like the following so it will be if unfamiliar skill not in the skills that we are grabbing in each job post that we are iterating and now all what we have to do here is creating the indentation for the different print lines okay so now i should see the job posts that are not including the unfamiliar skill that i'm going to provide so just to test that out let's run our program twice okay so in the first we are going to write here linux as a skill that i'm not familiar with and you can see that we don't see anything that is including the keyword of linux over here but let's actually take that to the next level and test that out so we see here a specific job post that is including django so let's say that i am not familiar with django and see next time if i see that job post with this company so let's re-execute our program and that time i will write django and let's see the results so we can see that we don't have any job with django but we do have linux that time so this condition works well and we will continue on to next step from here now what could be an exciting challenge for you guys is to write an algorithm that will accept more than one unfamiliar skill so you want to accept multiple inputs from a user and it might be more challenging but i think you should try to spend some time on something like this because i think this could be an amazing challenge for everyone who is watching this video all right so now we are going to save each job post in a different file so instead of printing this in the terminal then we are going to write this entire information in a separated file and then i will also allow this program to run
every 15 minutes or every 10 minutes up to you and i will show this logic as well so first of all it makes sense to wrap our entire program in a function and i'm going to do that by collecting everything that is kind of pulling the information from the website and i'm going to indent everything one step aside and then i'm going to write here def find jobs okay so that way we have one function that executes our main program and then what i'm going to do here is using the logic of if double underscore name is double underscore main so that way only if this file is run directly then this function will be executed now if you don't know what i said about if double underscore name equals double underscore main then i have a video that explains this condition so you can check that out by the suggested link above so let's write here if double underscore name equals to double underscore main inside a string and then right here while true so i want to run this program forever and then i will call the find jobs and right after it since i don't want this program to be executed like every millisecond then i'm going to write here time dot sleep so time dot sleep allows your program to wait certain amount of time that you decide and you can provide its argument in seconds so i'm going to write here 600 just to make that program to run every 10 minutes but you can notice how we did not import the time library so let's do that by import time okay and then this program should be okay now to make this more dynamic i actually prefer to make some variable here that will be equal to 10 and then i will just make that to be equal to time wait multiplied by 60 and right after it we can provide some extra information excuse me this should be over here and we can write here waiting let's make it formatted then we can write here waiting time wait seconds and let's write three dots here great so this is great so if i'm executing this program i expect
to see this program running every 10 minutes so i'm inside my command line interface and you can see that my directory has been already set to the directory where we worked so i can go with python and then execute the name of the file by calling it so it will be main.py and then once i run that you can see that we receive this output and then i have to provide some information that is going to be filtered out and then let's write here django again and you can see that we receive the results successfully but more important we see that waiting 10 seconds which is not great we have to change that to waiting 10 minutes because we are waiting 10 minutes right but the program works great it was just my mistake by writing here seconds so it should be minutes for sure but i'm not going to wait 10 minutes until this program is running one more time and so i will allow myself to move on to writing this information inside a file so it makes sense to write this kind of information in a separated directory so i will go inside my web scraping tree file i mean folder and then i'm going to create here new directory which is going to be named as posts and then i'm going to write here some extra functionality that will create files i mean text files then and then inside each text file i'm going to write this exact information so you can do that by with open i already showed you how you can do that in the first episode now i know that i don't have any separated tutorial about working with files in python but you want to consider checking out my channel maybe i will upload very soon so you can go here and that time i want to put here information and i will call my posts directory and then inside here i have to provide my file name that i'm going to create now before i move on here i thought about changing my for loop here and use the enumerate function now enumerate function is going to allow us to iterate over the index of the jobs list and also the job content itself and so i have to
provide here one more variable like index so the index is going to be a kind of counter for the job that i'm iterating on and then the job variable will relate to the job beautiful soup object itself and so it makes sense to name our files with the index of the job that i'm iterating on so i will change this into a formatted string and then i will write here index dot txt so it will be something like the following and i expect each of my text files to be named like 0.txt or 1.txt and so on now the second argument will be the mode that you want to use when you create or open a new file and this time i'm going to write here w and that stands for writing inside the file and then i have to use the as statement and i'm going to use the f variable so inside that block i can write to a file with the f variable and i'm going to go inside my with open and i'm going to create indentation of the prints and i'm going to delete this print line here and it makes sense to remove this blank space as well and all i have to do here is changing this print statement to f dot write and then that time i'm not going to print the results in the command line interface instead i'm going to write the information in a new file so i'm going to use the combination of alt shift here and i'm going to change those entire three prints to f dot write okay and then i'm going to open the parentheses so it will be closed by those and then i expect for each job to be written inside a file and once i do that it might be a great idea to print a sentence like file saved and then you can provide the name of the file as an extra information so i will create one more time formatted string and then i will relate to that index variable and now our program is complete so let's check it okay let's go back to our command line interface and let's actually control break this program and let's write cls to clear our terminal and then i'm going to re-execute my program so it will be python main dot py and
then i'm going to execute it so let's see this time i'm going to write django as well and that time i don't expect to see output for the information instead i expect to see this okay so let's see what is inside each of our files so let's see what is inside that posts directory okay so i'm going to go inside my c python put web scraping tree and then the posts directory that we created a few minutes ago and you can see that inside of that we have our text files but if i go here inside let's see if the results are okay okay so i'm not quite satisfied with that because it might be a greater idea to see that like i mean like this okay so you might want to divide that information in separated lines but that is not going to be complex so we just have to go inside our python again and then whenever we write to the file we have to use that convention where you can just jump a line and that will be backslash n so when you provide backslash n inside a string it is just a convention that is going to jump to the next line right after it so it will be backslash n for the first line and then also here and let's run this program one more time so i'm just going to break the program and re-execute it so that time i will write linux and then let's test our results one more time so let's go inside our 19.txt and then you can see that the information is right there just like we expected okay so this is quite great alright guys so i hope you enjoyed this entire series and you can find everything that we have done here by the links in the description of course i will provide extra information in my website about this series so if you like this video consider subscribing and also hit the like button i will see you in my future uploads
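For reference, the steps covered in this part of the walkthrough could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code shown on screen: the hyphenated class names joblist-comp-name, srp-skills and sim-posted are a guess at how the spoken class names are spelled, job_class and fetch_html are hypothetical parameters standing in for the job card class name and the page download (neither is spelled out in this part), and strip is used for the white space as in the later cleanup. Because the date filter runs before enumerate here, the file numbers may differ from the ones seen in the video.

```python
import time
from pathlib import Path


def parse_jobs(html, job_class):
    """Return a list of dicts for the jobs posted "few days ago"."""
    # Third-party import kept local so the helpers below also run
    # on a machine where beautifulsoup4 is not installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    parsed = []
    # job_class is a hypothetical parameter: the class shared by every job <li>.
    for job in soup.find_all("li", class_=job_class):
        # Check the date first so older posts are skipped early.
        # Class names below are assumed spellings of the spoken names.
        published_date = job.find("span", class_="sim-posted").span.text
        if "few" not in published_date:
            continue
        parsed.append({
            "company": job.find("h3", class_="joblist-comp-name").text.strip(),
            "skills": job.find("span", class_="srp-skills").text.strip(),
            "more_info": job.header.h2.a["href"],
        })
    return parsed


def save_jobs(jobs, unfamiliar_skill, out_dir="posts"):
    """Write one text file per job that does not mention unfamiliar_skill."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    saved = []
    for index, job in enumerate(jobs):
        if unfamiliar_skill in job["skills"]:
            continue
        path = out / f"{index}.txt"
        with open(path, "w") as f:
            # "\n" ends each line so the three fields land on separate lines.
            f.write(f"Company Name: {job['company']}\n")
            f.write(f"Required Skills: {job['skills']}\n")
            f.write(f"More Info: {job['more_info']}\n")
        print(f"File saved: {index}")
        saved.append(path)
    return saved


def run_forever(fetch_html, job_class, unfamiliar_skill, time_wait=10):
    """Scrape, save, then sleep time_wait minutes, forever."""
    # fetch_html is a hypothetical callable that returns the page HTML.
    while True:
        save_jobs(parse_jobs(fetch_html(), job_class), unfamiliar_skill)
        print(f"Waiting {time_wait} minutes...")
        time.sleep(time_wait * 60)
```

In a script, run_forever would typically be called under an if __name__ == "__main__" guard, with unfamiliar_skill taken from input, so that the loop only starts when the file is run directly.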
Info
Channel: freeCodeCamp.org
Views: 490,375
Id: XVv6mJpFOb0
Length: 68min 23sec (4103 seconds)
Published: Wed Nov 18 2020