Building a CLI Tool | Scraping & Processing Data with BeautifulSoup, Requests & Pydantic

Captions
This is going to be a quick video where we demonstrate how to scrape data from the web using BeautifulSoup and Requests in Python. We'll model that data in a very simple way with Pydantic and then perform some calculations on it, so we're combining Pydantic, BeautifulSoup and Requests. Specifically, we're going to scrape the episode list from the Talk Python To Me podcast: we'll fetch the data from a table that has a show number, a date, a title and some guests, as well as the URL for each individual episode.

Let's get started. In VS Code I have a couple of imports at the moment: the requests module, and the BeautifulSoup class from bs4. I've also defined two URLs: the URL for the page we want to scrape, and the base URL for the Talk Python website, which we'll use later in the video. To follow along, you should activate a Python virtual environment and install the packages we're going to use: pip install pydantic requests beautifulsoup4. I've already done this, so it won't install anything on my system, but you need to run it to use those packages.

Once they're installed, we can start writing the code to scrape the data. We use requests to send a GET request to the page by calling requests.get and passing in the URL, and we get back a response containing some HTML. To validate that we're getting the correct response, I'm going to copy a title from the page, the one about the Azure data centres, and check whether that string exists in the response. The response objects in requests have a .content property which, as the docs say, gives you the content of the response in bytes; instead I'm going to use the .text property, which gives you the response as Unicode text. We can check this by running the script, which I've called scraper.py: it prints True, meaning the response text contains that title, so we know we're getting the HTML back from the page.

Now we can start processing the content with BeautifulSoup. We take response.text and pass it into the BeautifulSoup constructor, along with a second argument, "html.parser", so the text is parsed as HTML, and we store the result in a variable called soup. I'll get rid of the print statement and execute the code just to make sure everything is working: no errors, no issues.

What we want is each episode from this table. It's a table element in HTML, and if we inspect the DOM we can see that all of the rows with the data we need are direct children of the table body. So we'll start by extracting each of those. Back in VS Code, we get a variable called rows by calling soup.select, which accepts a CSS selector: we select the table body and all of its immediate children that are table rows, which gives us back a Python list with an element for each row in the table. To see if we're getting something back, I'll print the very first row by indexing in at zero. On the terminal we get the HTML content for the first table row, and you can see the title of that particular podcast, Python in Excel, which matches what we see on the web page.

Now that we have the rows, we can start extracting each piece of information from them. We're going to create a for loop and iterate over each row in the table. Inspecting the DOM again, each table row has four table-data elements: the first one is for the show number, the second one is for the date, and so on. So inside the loop over rows, we're going to create another loop over each table-data element and parse out the data we need.
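The selection step just described can be sketched roughly like this. A small inline HTML snippet stands in for response.text from the live page (against the real site you'd fetch it with requests.get first); the dates and guest names here are made-up placeholders:

```python
from bs4 import BeautifulSoup

# Inline stand-in for response.text from the live episode page;
# dates and guest names are illustrative placeholders.
html = """
<table>
  <tbody>
    <tr>
      <td>#446</td><td>2024-02-02</td>
      <td><a href="/episodes/show/446/python-in-excel">Python in Excel</a></td>
      <td>Guest Name</td>
    </tr>
    <tr>
      <td>#445</td><td>2024-01-26</td>
      <td><a href="/episodes/show/445/azure-data-centers">Inside the Azure Data Centers</a></td>
      <td>Another Guest</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "tbody > tr" selects <tr> elements that are *direct* children of <tbody>
rows = soup.select("tbody > tr")
print(len(rows))        # 2
print(rows[0].td.text)  # #446
```

Against the live page, the only difference is building the soup from `requests.get(URL).text` instead of the inline string.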
Let's go back to VS Code and code this up. I'll make the terminal smaller, remove the print statement, and iterate over the rows. For each row, I get back a variable called data by calling a function I'm going to define in a second, extract_episode_data, passing in the row we're iterating over. After we get the data back I'll also print it out, and for now I'll break out of the loop so we only print the first element.

Now let's define that function. It takes the row from the loop, processes it, and extracts each piece of information from the table. I'm going to do this at the top, above all of the other code; you might want to break something like this into its own file, its own module, but I'm defining everything in one file. The function is called extract_episode_data and it takes a row, which is an instance of the Tag object from BeautifulSoup, so let's import that too. The function takes that tag, with all of its children containing the data we need, processes it, and returns a dictionary with the data for that row. That's the function signature; for now I'll just put pass in the body.

Next we're going to create a Pydantic model that defines the structure of the data we get back for each row. Again at the top of the file, from pydantic we import a couple of things: BaseModel, and also AnyHttpUrl, which is better described as a type than an object. We're going to use it as a type hint in the model to represent the URL for each episode.

Let's write the model. It inherits from BaseModel and represents an episode we're pulling from the website. The first field is the show number, of type int; if we go back to the Talk Python website, you can see the show number is an incrementing number that presumably started from zero, as you can see at the bottom, so it's a very simple field, just a number. Then we define a field called date, of type date, which we need to import from Python's datetime module at the top. The next field is for the title, but we also want to extract the link, so we create two fields here: title, which is simply of type str, and the URL for the episode, which is where we use AnyHttpUrl. As VS Code says, this is a type that will accept any HTTP or HTTPS URL. There's one final column in the table, the guest, which is a string again, so I'll add a guest field. That's our Pydantic model with its five fields: for each row we scrape, we want to create a model with those fields, and Pydantic will perform some automatic type conversions for us, as we'll see a little later in the video.
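The model just described might look like this (the demo values at the bottom are illustrative placeholders, not real episode data):

```python
from datetime import date
from pydantic import AnyHttpUrl, BaseModel

class Episode(BaseModel):
    show_number: int  # coerced from the scraped string, e.g. "446" -> 446
    date: date        # coerced from an ISO date string
    title: str
    url: AnyHttpUrl   # rejects anything that isn't a valid http(s) URL
    guest: str

# Pydantic coerces compatible strings to the declared types:
ep = Episode(
    show_number="446",                 # str -> int
    date="2024-02-02",                 # str -> datetime.date (illustrative date)
    title="Python in Excel",
    url="https://talkpython.fm/episodes/show/446/python-in-excel",
    guest="Guest Name",                # placeholder
)
print(ep.show_number, type(ep.show_number))  # 446 <class 'int'>
```

If a field can't be coerced (say, a url value that isn't a valid HTTP URL), Pydantic raises a ValidationError instead of silently accepting it.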
Now that we have this model, let's write the function to extract the data for each row. To start, I set up an empty dictionary called model_data, and we'll gradually add the data we're extracting to it as we iterate over the table-data elements. To get the row data, which is the table-data elements, we use the row that's passed into the function; it's a Tag object in BeautifulSoup, so it has that same select method, and we pass a very simple CSS selector to get the table-data elements that are children of the table row. Just to reiterate what that returns: if we look at the very first row in the table, the table row contains four table-data children, and each one of those lets us extract a piece of the data we want.

Back in VS Code, we get those table-data elements and then create another for loop, this time using Python's built-in enumerate function over the row data we got above. We have four table-data elements to iterate over, and enumerate tells us which index of the iteration we're on.

Let's start with the very first column, the show number; in the DOM, that's the first child of the table row. We check the index, and if i is equal to 0 we know we're extracting the show number, so we add a key called show_number to the model_data dictionary and set it equal to the text within the table-data element we're iterating over. I do want to perform a small transformation here: we want a numerical value, so we need to get rid of the hash symbol. We take the text and call the replace method, which is defined on Python strings, replacing the hash symbol with an empty string. That takes care of the show number.

We then check if the index is 1. I appreciate this is quite a manual way of doing things, but it works for this example because we have a fixed number of columns in the table, and there are only four of them. The second column is the date, and it's very simple to extract: we just need the text from the table-data element, with no transformations, so we add a key called date to model_data and set it equal to the element's text. That's all we need for the date.

Next we check if the index is 2. This is the more complex one, so I'll add a little comment: we want to get the anchor tag with the link to that particular podcast episode, extract its href and prepend the base URL, and also take the text within the anchor tag, which contains the episode's title. Looking at the HTML, what we have in the table-data element for this column is an anchor tag, so we need to find it among the element's children. We create a variable called link and call BeautifulSoup's find method on the table-data element to get the anchor tag. Once we have it, we add a url key to model_data: we take the base URL we defined earlier and append the href that's present on the anchor tag. If we look at the href in the DOM, you can see it starts with a slash; we don't have the domain, just the path on that domain, which is why we need to prepend the base URL. To get the href from the link tag, we use the BeautifulSoup tag's .attrs dictionary, which has a key for every attribute present on the HTML element; we want the href attribute, which gives us back that link, and we append it to the website's base URL. That takes care of the URL. For the title itself, we add another key called title to model_data; that's very simple, we take the anchor tag, which is called link in our program, and just get its text. That should handle the anchor tag.

There's one more column in the table, so we check if the index is equal to 3, which should be the final index, and add a guest key to model_data, set equal to the table-data element's text.
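Putting the steps just described together, the extraction function might be sketched like this. It's tested here against a minimal hand-written row rather than the live page, and the date and guest values are placeholders:

```python
from bs4 import BeautifulSoup, Tag

BASE_URL = "https://talkpython.fm"

def extract_episode_data(row: Tag) -> dict:
    """Turn one <tr> of the episode table into a plain dict."""
    model_data = {}
    row_data = row.select("td")  # the four table-data children
    for i, td in enumerate(row_data):
        if i == 0:
            # "#446" -> "446"; Pydantic will coerce it to an int later
            model_data["show_number"] = td.text.replace("#", "")
        elif i == 1:
            model_data["date"] = td.text
        elif i == 2:
            # The anchor's href is site-relative, so prepend the base URL
            link = td.find("a")
            model_data["url"] = BASE_URL + link.attrs["href"]
            model_data["title"] = link.text
        elif i == 3:
            model_data["guest"] = td.text
    return model_data

# Minimal stand-in row (illustrative values):
row_html = ('<tr><td>#446</td><td>2024-02-02</td>'
            '<td><a href="/episodes/show/446/python-in-excel">Python in Excel</a></td>'
            '<td>Guest Name</td></tr>')
row = BeautifulSoup(row_html, "html.parser").tr
print(extract_episode_data(row))
```

The index-based dispatch is deliberately simple; it relies on the table always having exactly these four columns in this order, as noted above.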
The last thing this function needs to do is return model_data, so after the for loop we just return it. That's our function, extract_episode_data, and further down we're iterating over each row, calling the function for each one, and getting back that dictionary. Let's test it out: I'll expand the terminal at the bottom, clear it, and run the program again. You can see we now have keys for the show number, set to the value 446, the date, the URL, the title of the podcast episode, and the guest's name.

What we're going to do now is convert each of these dictionaries to the Pydantic model we created, which performs some useful type conversions and validation. Here's an example of those conversions: the show number is currently a string in this dictionary, with the number 446 as a string, but if we look at the Pydantic model we defined earlier, the show number should be an integer. When we create the model and pass the dictionary of data in, Pydantic will automatically convert it to an integer, because what the string represents can actually be converted to a number. Very similarly, the date is currently stored in the dictionary as just a string, but when we create the Pydantic model it gets converted to Python's date type. As one last example of validation, the url field uses the AnyHttpUrl type, so when we pass in the URLs from the website, any that don't match a valid URL scheme, that is to say a valid HTTP URL, will be rejected by Pydantic. We get that immediate validation, and if nothing goes wrong we can be sure we have the data we expect in the application.

Back in the code, let's scroll down to where we get the data back from calling that function. We set up an empty list above the for loop called episodes, and we populate it by calling its append method. What we append at each step of the loop is a Pydantic model: we create an Episode and unpack the dictionary returned by the function, passing each of its key-value pairs in as a keyword argument. That creates the model, populates it with the data, and appends it to the episodes list. Let's test that out: I'll remove the break statement and print the very first element in the episodes list. Before, we had a Python dictionary; if we rerun scraper.py, this time we get back something different, the representation of a Pydantic model. If you look at the types, the show number is now an integer, not a string, the date is an instance of Python's date object, and the url field has a different representation as well; it's been validated as a valid URL for every row in the table.

We can actually express this whole piece of code, four lines of it, as a list comprehension instead. I'll go down below it, reset the episodes variable, and set it equal to a list comprehension, commenting out what we had above just to show how we'd do this.
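The append-in-a-loop version and the list-comprehension version both boil down to unpacking each dictionary into the model. A sketch with hand-written dicts standing in for extract_episode_data's output (the values are illustrative placeholders):

```python
from datetime import date
from pydantic import AnyHttpUrl, BaseModel

class Episode(BaseModel):
    show_number: int
    date: date
    title: str
    url: AnyHttpUrl
    guest: str

# Dicts shaped like extract_episode_data's return value (placeholder values)
raw = [
    {"show_number": "446", "date": "2024-02-02", "title": "Python in Excel",
     "url": "https://talkpython.fm/episodes/show/446/python-in-excel",
     "guest": "Guest Name"},
    {"show_number": "445", "date": "2024-01-26", "title": "Inside the Azure Data Centers",
     "url": "https://talkpython.fm/episodes/show/445/azure-data-centers",
     "guest": "Another Guest"},
]

# Loop-and-append version...
episodes = []
for data in raw:
    episodes.append(Episode(**data))  # ** unpacks the dict as keyword arguments

# ...is equivalent to the one-line comprehension
episodes = [Episode(**data) for data in raw]
print(episodes[0].show_number, episodes[0].date)  # 446 2024-02-02
```

Either way, the strings are coerced to int and date on the way in, so any bad row fails loudly at model-creation time rather than later.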
In the list comprehension, we create an Episode by calling the extract_episode_data function and passing in the row, and we do that for each row in the rows we got from the select statement above; don't forget to close the bracket when creating the Episode. Now I'll remove the commented-out code so this looks a bit more coherent. What we're doing is getting the rows back by looking at the table body and taking all of its table-row children, which gives us all of the rows on the page, and then, in the list comprehension (and I can see I've still got a syntax error here to fix), for each one of the rows we pass it into the function we defined and unpack the dictionary we get back to create the Episode object. Let's see if this still works: I'll save the file and rerun the code, and hopefully we get back the same Pydantic model, which you can see below.

At this point we have a list populated with the data for all of the episodes on the Talk Python podcast. We could save this data to a database or to the file system, or do anything we want with it; we could also search through the data or order it in a particular way. All of that is very easy now that we have the data in this format.

Now I'll remove the print statement, and we're going to define a very simple application that takes some text from the user and uses it to search the episode list for any episodes whose name contains that text. To do this, we go to the top and import the pprint function from Python's pprint module. Back down below, we get a search term from the command line using the input function, a built-in in Python that reads a string from standard input and takes a prompt as an argument; we pass the prompt "Enter a search term". Once the user enters a term, we want to check each episode in the list to see whether it contains it. We create a variable called results, and again we use a list comprehension: for each episode e in the episodes list, we add a condition that checks whether the lowercased search term appears in the episode's lowercased title. The title is the field we want to search on; it's one of the fields in the Pydantic model and represents the name of the podcast, which you can see in the third column of the table. We lowercase both sides to make sure there are no case mismatches. Once we've found the results, we add a couple of print statements at the bottom: a message, "Episodes with the term", referencing what the user typed in, and then a pprint call to print any results that contain that text.

Let's save this, expand the terminal, and try it out. Running python scraper.py now gives us the prompt to enter a search term. I'll enter "htmx", and we get back two results: the first contains the title "HTMX for Django Developers (and All of Us)", and the second has the title "HTMX: Clean, Dynamic HTML Pages". So by doing this, we're taking some input from the user.
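The search step is just a filtered comprehension over the models. A self-contained sketch, where a dataclass stands in for the Pydantic model and a hard-coded term stands in for the interactive input call:

```python
from dataclasses import dataclass
from pprint import pprint

@dataclass
class Episode:
    # Stand-in for the Pydantic model; only the title matters for the search
    title: str

episodes = [
    Episode("HTMX for Django Developers (and All of Us)"),
    Episode("HTMX: Clean, Dynamic HTML Pages"),
    Episode("Python in Excel"),
]

search_term = "HtMx"  # in the video this comes from input("Enter a search term: ")

# Lowercase both sides so matching is case-insensitive
results = [e for e in episodes if search_term.lower() in e.title.lower()]

print(f"Episodes with the term '{search_term}':")
pprint(results)
```

Because both the term and the titles are lowercased, "HtMx", "HTMX" and "htmx" all find the same two episodes.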
We're able to search through the list of Pydantic models and extract the episodes that match what the user has entered. Let's try a couple of other searches. I'll look for the term "pandas", a very popular data package in Python, and we get back some titles that contain pandas, for example "Understanding Pandas Visually with Pandas Tutor", and a few more here as well. Let's search for something else: "django", and as you'd expect there are loads of episodes that reference Django. One last one: I'll search for "pytest", a popular Python testing package, and we get back three results, episodes of Talk Python To Me that contain pytest in the title.

That's nearly all for this video. Before we finish, I want to talk about the Pydantic model a little. For the data we have in this small program, it might be considered overkill to create a Pydantic model for such simple data, but I think it gives you a very clear, logical way of looking at it: you create a class with the fields you need, and as I mentioned earlier you get automatic type conversion for things like the date and the show number. These come into the model as strings, but Pydantic is clever enough to perform the coercion to whatever type we define, if it's possible. We can also define validations on the model very simply using types such as AnyHttpUrl, and we can write custom validators as well.

I'm going to push this code to a GitHub repository that will be linked in the description. If you have any suggestions for future content, let us know in the comments. That's all for this video; give it a thumbs up on YouTube and subscribe to the channel if you've not already done so. Thanks again, and we'll see you in the next video.
Info
Channel: BugBytes
Views: 2,208
Id: dYWnS8eRf4M
Length: 20min 59sec (1259 seconds)
Published: Thu Feb 08 2024