Solving real world data science tasks with Python Beautiful Soup! (movie dataset creation)

Hey, how's it going everyone, and welcome back to another video. We've got a fun one in store today: we're going to do another "solving real world data science problems" video. Basically, how this works is that we walk through a data science project (in this video, a web scraping project), and as we go through it I'll present tasks for you to work on independently. You can pause the video, try out the task, and then whenever you're ready, resume the video and see how I would go about solving the problem. You guys seemed to really like the last time I did this, so I figured it was due time that I did another one. The specific project we're working on today is scraping a bunch of Wikipedia pages on Disney movies and building up a data set from that information. We'll be using libraries like Beautiful Soup and requests, and we're also going to throw in some unit testing with pytest, plus a bunch of others. The real goal is to help you learn how to solve real-world problems. One of the biggest questions I get is, "What library should I learn to get a job?" The answer is that you really just need to learn how to solve problems in data science; learning how to do that is what's ultimately going to help you find jobs and continue your growth in data science. Once you've built up this data set, you'll be able to answer all sorts of cool questions about Disney movies, like what the worst Disney movie ever released was, and you'll find out that it was a Jonas Brothers 3D movie experience, rated terribly on Rotten Tomatoes. So you'll be able to do some fun stuff like that. I think I'm going to leave the analysis for a follow-up video, but we're going to have a lot of fun walking through all sorts of tasks just building up this data set. Before we begin, I want to give a quick shout out
to this video's sponsor, and that is DataCamp. Those of you familiar with my channel probably know that I'm not always the best about posting frequently, and you may be looking for more resources to continue learning. I apologize for not always posting, but each time I post I want to make sure I deliver a lot of value. My ultimate goal is to help you become better programmers and better data scientists, and I think DataCamp is a platform that can help you do this. DataCamp offers over 300 courses that combine short expert videos with hands-on exercises. They offer courses on all sorts of Python topics, ranging from the basics to advanced topics like data visualization, probability and statistics, and machine learning. One thing I really like about DataCamp is that the lessons are bite-sized, so you can fit them into a busy schedule. They even offer a mobile app that lets you take your exercises and courses on the go. To get started with DataCamp, I left a link in the description; make sure to click on that. You'll be able to access the first chapter of each course for free, and to get unlimited access to everything DataCamp has, subscriptions start at 25 a month.

All right, to get started, open up your preferred code editor. For this video I recommend a Jupyter notebook, so I'll put some instructions on how to set that up in the description. You can also use Google Colab as a browser-based option. Once you have that open, also open up a web browser. We're going to start off (and I'll have this in the description too) by going to this Wikipedia page that has a list of Walt Disney Pictures films. Ultimately this is what we're going to be scraping when we build up our data set. As we see here, there are all these tables with a bunch of different Disney movies that were made over the years, and what we're going to be trying to do is go to each one of these links, scrape some
information, and then save that. So let's just pick one of these to start with. I'm going to go to one of the newer ones; let's start with, how about, Toy Story 3. Click on that link. Okay, so if we look through this page and try to figure out the best spot to get the information we want to include in our data set, you could scroll all the way through it. But what I would say is that with any Wikipedia page you go to, or pretty much any page, you usually have this info box on the right side, and that has all sorts of useful information, like the director, who it's starring, budget, box office, running time, etc. So what we're going to do is ultimately scrape this info box for all of those pages we just saw. To present the first data science task: for a single page, so let's do this Toy Story 3 page, scrape the info box and store it in a Python dictionary. So scrape all the information you see here and store it in a Python dictionary, and I recommend using the Beautiful Soup library to help you do this. If you're not familiar with Beautiful Soup, I have a full tutorial on it, so be sure to check that out. Feel free to pause the video, try out the task of scraping and storing all this information for Toy Story 3, and then resume when you want to see the solution.

All right, so the first step of this task is just to load in the page. I feel like you can break it up into some simpler tasks, and really the first thing is to import the necessary libraries, so I'm going to write "import necessary libraries." What are we going to be using to scrape a page? Well, first off we'll want to import the Beautiful Soup library, so I'm going to do from bs4 import BeautifulSoup as bs; this is how I like to do it. And again, feel free to watch my web scraping tutorial if you want to see a bit more about
Beautiful Soup before we jump into things in this video. The other library that I think is going to be important is the requests library, so we'll load those two libraries in. Next we'll want to simply load the page, so I'm going to say the next task is to load the web page. How can we do that? It's pretty straightforward: we use requests to load in the content. We can do r = requests.get, and let me just paste in the https link; we're going to use that Toy Story 3 page, like this, and that should load it. Then we'll want to convert it to a Beautiful Soup object, which we can do as follows: soup = bs(r.content). As a reminder, if you're wondering how I figure a lot of these things out, my normal workflow is Google searching things like "how to load and convert a web page with Beautiful Soup." So I always recommend thinking about what you need to do and then using Google to your advantage to figure out the exact syntax. So we have the Beautiful Soup object, but it might also be nice to print out the HTML. To do that we can do contents = soup.prettify(), and then within the Jupyter notebook I can just type in contents, but you could also print out contents here. Let's see what we have. Oh man, so this is all of our HTML. There is a lot here, and it's going to be tough to work right off of this, so let's try to narrow down what we need. Actually, real quick, let's just try to print out contents and see if that looks better. Oh yeah, it does; that's a little more manageable to read. So this is the Wikipedia page, but still, there's a lot, and as we mentioned, the task right now is just to get that info box. So let's go back to our Wikipedia page. When you're trying to get a specific thing on a web page, what you want to do is use your browser to your
advantage. Most browsers support this: right click and inspect on the element that you're curious about. As we see here, if I hover over things, we get this full table, which has the class infobox vevent. So it looks to me, if I'm not mistaken (I'm going to scroll down a bit), yeah, it looks like if we grab this table we're pretty much good to go to get this information. So let's narrow down our HTML content and just get that table. To do that we can use our soup object, which has everything on the web page. We can do soup.find and pass in the class. Because class is a reserved keyword in Python, we have to write class_ in the Beautiful Soup syntax, and we set it equal to, I think it was, "infobox vevent". That should allow us to grab that table. If you wanted to be more particular, you could specify that this is a table element, but I think we'll be good just grabbing this class. I'm going to save that as a variable: info_box = soup.find(class_="infobox vevent"). Now let's print that out real quick. The prettify method helps you with Beautiful Soup by giving you nicely indented HTML that's a little easier to process. Okay, so what do we have in the info box? It does look like we got the right table. The first thing we see is the name; it's going to be very important for us to save the name in our Python dictionary. If we go down, I think this is probably the image that was in the table. Let's see. Okay, down here we've got the director and producer. Looking at the syntax here, we see the items that we really want to grab: first the title, and then I would say all of this stuff here. So what I think is important is to grab the table rows as we try to build this, so let's start doing that.

All right, to grab the table rows, let's build
off of our info box Beautiful Soup object. Let's do a find_all this time, and we're specifically looking for the table row tag, tr. If we do that, we get a list of all the different table rows, so this should help us out a lot. It might be helpful to iterate and see what's in that list, so let's do a for loop. Let's actually save this too, so maybe I'll call it info_rows. For row in info_rows, let's go ahead and print row.prettify() and see what that gives us. Right, so it's just a slightly neater way to look at it, and we have a little bit of separation. In this first one, the information we need is in the table header (th). We don't need the second table row. From the third on, we see that basically how we're going to separate this out is: we get the key for our Python dictionary from the table header and the value from the table data (td), and I think that's the case for all of these. I guess you have some more complicated ones when you get to multiple selections, so we'll have to handle those separately. If it's helpful, I also recommend using the inspect tool a lot down here. As we can see, we can navigate to these table rows, open them up, and see exactly what is what in a nice, simple layout. It's probably an easier view seeing them side by side than just through the Jupyter notebook.

All right, so let's start building our dictionary out. We're going to do this in a separate cell, so maybe close this out too. Okay, so I want to get all these rows and basically save them to a dictionary. I'm going to start out with an empty dictionary, so movie_info equals a blank Python dictionary, and then we'll want to iterate through the rows just like we were before, so we'll say for row in info_rows again. If we remember what was in there, from the first row we just wanted the title; we didn't want all the other stuff, so we want to handle that a bit separately. So what I think is going to be
helpful for us, if there are certain indexes that we need to handle differently, is to know which rows those are. So I'm going to use the enumerate function, which gives us both the index and the row at the same time. All right, so for index, row in enumerate(info_rows). First off, if index == 0, that's the title row, so we'll want to set movie_info['title'] = row.find... let me open this back up so we can see exactly what we need to find in there. We need to find the table header here and then get its text. This once again comes back to the web scraping tutorial I've done before; if you don't understand the syntax I'm walking through right now, you definitely would if you watched that video. All right, so row.find: we want the th, and then we want to get the text there, so I'm just going to do .get_text(), and I think that might be good. Let's see what we get if we do that. To start, I might as well just print movie_info to see if that first index gets handled properly. Look at that, looks pretty good to me: Toy Story 3.
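As a rough sketch of where the code stands at this point: the snippet below runs the same steps (find the info box, collect its rows, pull the title from the first row) against a small made-up HTML string rather than the live Toy Story 3 page. SAMPLE_HTML is an assumption for illustration, not Wikipedia's actual markup; in the video the soup comes from requests.get on the real URL.

```python
from bs4 import BeautifulSoup as bs

# Made-up, trimmed-down stand-in for the Toy Story 3 info box (an assumption).
SAMPLE_HTML = """
<table class="infobox vevent">
  <tr><th colspan="2">Toy Story 3</th></tr>
  <tr><td colspan="2"><img src="poster.jpg"></td></tr>
  <tr><th>Directed by</th><td><a href="/wiki/Lee_Unkrich">Lee Unkrich</a></td></tr>
</table>
"""

# In the video: r = requests.get(<Toy Story 3 URL>); soup = bs(r.content)
soup = bs(SAMPLE_HTML, "html.parser")

# class is a reserved word in Python, so Beautiful Soup uses class_.
info_box = soup.find(class_="infobox vevent")
info_rows = info_box.find_all("tr")

movie_info = {}
for index, row in enumerate(info_rows):
    if index == 0:
        # The first row holds the movie title in its header cell.
        movie_info["title"] = row.find("th").get_text()

print(movie_info)
```

Only the title is handled so far; the remaining rows are dealt with in the next steps of the walkthrough.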
Cool. All right, what else do we need to do? Well, we don't really need the second row, so we can skip that. I'm going to say elif index == 1, we'll just continue, because that was the picture. And else, this is when we actually want to collect the table header and table data and put them into the dictionary. Let's start very simple: I'm going to say content_key = row.find("th").get_text(), and content_value is going to be row.find("td"). Let's just remind ourselves: the table header has the "Directed by" and the table data has the actual name, so that's why I'm naming them content_key and content_value like this. So .get_text() there too, and then in our dictionary we simply do movie_info[content_key] = content_value. Let's see what happens now when we print our dictionary. Oh man, it's a little ugly, but I think it might be good; we'll have to work on this.

All right, let's try to dig into why it's pretty ugly like this. Looking through our dictionary, we see that the first one worked fine, "Produced by" worked fine, the screenplay looks good. "Story by": that's where we get our first ugliness, and it seems like it's because there are three different names here. So let's go back to our web page and see how we should handle it when there are three names. I'll go to "Story by" here. Okay, so what is in that? Let's go to the table data. This time, instead of just text, we have a div and we actually have a list, an unordered list, and in the list elements we get the actual names. So I think we should have an if statement that handles the possibility of lists differently from the more straightforward ones like "Directed by," where the name is directly in the table data. I guess that one's within a link, but you can get
the text much more simply there. So let's add that if statement into our code. All right, I often like writing functions to do stuff like this, to keep things a little more confined and not have too much in a single for loop. So let's build a function called get_content_value that takes in a row. What can we do in get_content_value? Basically, we want to break it into two cases: the row has a list, or it doesn't. So if row.find("li") finds something, we want to return a list, and just bear with me here: [li.get_text() for li in row.find_all("li")]. So li is a list item, and we get the text of each list item, for all list items in that row. That looks good. Then else, I think we can basically do what we were doing before, so we can just return row.find("td").get_text(). Actually, to simplify this a little further, let's say the function takes row_data instead of the entire row, so we don't have to do the find anymore; it becomes row_data.get_text(). That also lets us grab the list items directly from the row data, as opposed to having to first get the row data and then do that. So this should be row_data, and this should be row_data. Now we want content_value to be get_content_value(row.find("td")). Let's see what this does with our movie_info. Ah, not defined, what the heck? Honestly, if I run this again it'll work, because I'm in a Jupyter notebook; it really wants me to have the function defined above my for loop, so I guess I can make the interpreter happy and do that. Oh, look at that, looks a lot better. Toy Story 3, all the same stuff as before, looks good. Now we have a nice "Story by" here; looks like this is all good. The starring looks good, the music is good, looks like the
cinematography is good, "Edited by" is good. Production company: ah, why is there not a space here? So we might have to figure out what's going on with the space here in the name. This looks good, except we have this weird \xa0 character; same thing with budget and box office. So let's make some minor cleanups, and then I think we're good for task number one.

All right, to me it looks like we need to do two main things for cleanup. The first is to replace any \xa0, whatever this character is, with a simple space. That's just a string replacement, so it should be pretty straightforward. The other thing, which might be a little more difficult, is that "Production company" is getting smushed together; the words aren't separated. I don't see that happening anywhere else. Well, let's try to figure out why "Production company" looks a bit off, and do the string replacement. I think we'll start with the string replacement, because that's more of a low-hanging fruit. If you forget how to do string replacements, feel free to do a Google search like "string replacement python replace method." I like finding Stack Overflow posts, usually because I feel like they have the best examples, but maybe we'll just try this GeeksforGeeks article. Okay, so string.replace takes what was originally there and what you want to replace it with. That's very straightforward. So where we do our get_text, we'll want to do a string replacement: get_text().replace, where the old value is "\xa0" and we replace it with just a space. We'll want to do it for this spot too: "\xa0" with just a space. Run that, and let's see if it improves things. Yeah, look at that, I don't see that annoying character anymore, so that's number one solved. The second thing is: why are these things getting squished together? So production
company is the main one we want to look at. I also see it here; it looks like "Motion Pictures" and "Walt Disney Studios" are right next to each other. All right, so which one was getting squished together? Production company. Let's try to find it in the HTML over here with our inspect tool. Production company: what is happening here? I'm going to make this a little bigger. There are all sorts of nice settings in Google Chrome that let you do this under preferences; actually, I think all I have to do is Ctrl and plus inside of this. Yeah, that makes it a little bigger. So for production company, let's drop that down. We see we have a div in here. Okay, so there's "Production" and then "company," and they're separate. Oh, this has a weird space after it too, so it looks like maybe they're getting joined in some way. Weird. Let's look up some documentation real quick. I'm going to search for "get_text method documentation beautiful soup." All right, the Beautiful Soup documentation, and I'm going to look up get_text. Oh, okay, look at this, there are a couple of things here. You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text. That seems helpful too, because I did notice an extra space, so strip=True; let's make sure to do that. And oh wow, you can specify a string to be used to join the bits of text together too. By default I guess it's just no spaces at all, but if we made this a space and set strip=True, that should solve our problems, I think. So let's try that. Okay, so in get_text we want to use a space as the separator, which is what was in the documentation, and set the strip keyword to True. I guess we can do this in every spot to be safe. And I'm going to just say this now: the nature of web scraping projects is that they can get a bit messy, because you have to handle edge cases, and it can
be messy to do that. Using functions is one way to make it a little cleaner, but sometimes it's just going to be a bit messy with web scraping, and that's just life, I would say. Okay, that looks good. Run it again. Nice, this looks good to me: "Production company" looks good, Walt Disney Studios Motion Pictures looks good. Awesome, we've completed task one. Nice work, everyone.

All right, let's begin with task number two. Let's go back to the Wikipedia page; let me close out the inspector real quick. Right now we're on Toy Story 3, but we actually want to be on that list of movies that we showed previously, so we're at the list of Walt Disney Pictures films. This link will be in the description, and I'll also include it in the GitHub repo for this. So if you ever want to start on a certain task, I'll have all the code for each task on GitHub; you can go to github.com disney data science tasks to see all that. All right, so what are we doing in this task? If we look at this list, we see all these tables with Disney movies, and links to each one of these movies that lead to pages similar to the Toy Story 3 page we just scraped. So the task that's being assigned is: just like you did for Toy Story 3, go through every one of the items in these tables and scrape and collect the information that we just got for Toy Story 3.
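Since Task 2 reuses everything from Task 1, here is a minimal sketch of how the finished Task 1 logic might fit together. It runs on a made-up HTML string (SAMPLE_HTML is an assumption, not Wikipedia's real markup), and clean_text is a hypothetical helper introduced here only to avoid repeating the get_text and replace calls; the video applies them inline. The \xa0 character being replaced is a non-breaking space.

```python
from bs4 import BeautifulSoup as bs

# Made-up, trimmed-down info box markup (an assumption, not Wikipedia's HTML).
SAMPLE_HTML = """
<table class="infobox vevent">
  <tr><th colspan="2">Toy Story 3</th></tr>
  <tr><td colspan="2"><img src="poster.jpg"></td></tr>
  <tr><th>Directed by</th><td><a href="#">Lee Unkrich</a></td></tr>
  <tr><th>Story by</th><td><div><ul>
    <li>John Lasseter</li><li>Andrew Stanton</li><li>Lee Unkrich</li>
  </ul></div></td></tr>
  <tr><th><div>Production<br>company</div></th><td>Pixar</td></tr>
  <tr><th>Budget</th><td>$200&nbsp;million</td></tr>
</table>
"""


def clean_text(tag):
    # Hypothetical helper: join text fragments with a space, strip each one,
    # and swap non-breaking spaces (\xa0) for regular spaces.
    return tag.get_text(" ", strip=True).replace("\xa0", " ")


def get_content_value(row_data):
    # Cells containing list items become a list of strings; others a string.
    if row_data.find("li"):
        return [clean_text(li) for li in row_data.find_all("li")]
    return clean_text(row_data)


soup = bs(SAMPLE_HTML, "html.parser")
info_rows = soup.find(class_="infobox vevent").find_all("tr")

movie_info = {}
for index, row in enumerate(info_rows):
    if index == 0:
        movie_info["title"] = clean_text(row.find("th"))
    elif index == 1:
        continue  # the second row only holds the poster image
    else:
        content_key = clean_text(row.find("th"))
        movie_info[content_key] = get_content_value(row.find("td"))

print(movie_info)
```

Passing " " as the separator is what keeps "Production" and "company" from being joined without a space, and strip=True removes the stray whitespace around each fragment.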
So you want to get the info box for all the movies in these tables. Before we get into this task, I want to cover some logistical stuff about web scraping and some rules you want to adhere to as best as possible. Basically, every website that we go to will usually have what's called a robots.txt, which tells us what we're allowed to do on a site as far as scraping goes. Usually for personal purposes you're pretty safe to scrape websites, but if you're ever doing anything more commercial, you'll really want to read robots.txt and make sure you adhere to what it tells you to do. I just looked up the robots.txt for Wikipedia, and basically what it says here is that there are a lot of pages on the site and there are some misbehaved spiders out there that go way too fast; if you're irresponsible, your access to the site may be blocked. So really, Wikipedia is fine with us scraping their site; it's just telling us to be responsible about how we do it. I would say, whenever possible, don't swarm Wikipedia's servers with tons and tons of requests. Try to work slowly, build up to where you need to get to, and limit how many times you make tons and tons of requests. Wikipedia provides a lot of information for us, so just be kind as far as scraping goes. For other sites, if we go to, let's say, eBay's robots.txt, you'll see here: "The use of robots or other automated means to access the eBay site without the express permission of eBay is strictly prohibited. Notwithstanding the foregoing, eBay may permit automated..." etc., etc. So eBay, for example: I was kind of thinking about doing some crawling on eBay, but according to the robots.txt you're not supposed to be scraping eBay. So it's a good guide in general for what you can and can't do as far as scraping goes. Oftentimes, if a site says you can't scrape it,
they might have an API that you can use instead to access information. All right, back to the task. Clarification on this task: the ultimate goal is to have a list of Python dictionaries, where each dictionary represents the info box for a specific movie. Feel free to pause the video, then resume when you want to see the solution.

All right, so the first thing we want to do for this task is load that web page into Beautiful Soup. If we go back up a bit and see how we scraped the previous web page, we can just duplicate that behavior for our new web page. So I'm going to paste this in, and now instead of Toy Story 3 there's a different link we want to use, so let me paste in the list of Walt Disney Pictures films. This is the page we're ultimately scraping. If we want to see what it looks like, we can run this and see it. Once again, printing this out as a whole is a little challenging, so I think it's better to go to the actual page and use the inspect tool to help us out. All right, so I'm going to click on inspect here. Ultimately what I'm looking for is whether there's some common class that all these tables share that I can use to pull the information. Down here (I'm going to scroll this up a bit) we see table class="wikitable sortable jquery-tablesorter". I'm curious, do all these different tables have that? We have it here; if we scroll down, do we have it on the next one? wikitable sortable jquery-tablesorter. It looks like it's there for all of the different tables that we ultimately want to scrape. So I'm going to click on this, copy it, and put it somewhere. Let's say movies = soup.find_all with class_ equal to this, and then see what we get in movies if we print that out. Ah, nothing. What happened?
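One plausible reading of the empty result, sketched below under stated assumptions: the jquery-tablesorter class appears to be added by JavaScript in the browser, so it would not be present in the HTML that requests downloads, and a multi-word class_ string only matches when it equals the class attribute exactly. SAMPLE_HTML is a made-up stand-in for the downloaded page, not Wikipedia's real markup.

```python
from bs4 import BeautifulSoup as bs

# Made-up stand-in for the HTML as downloaded (no JavaScript has run),
# so the tables carry only the "wikitable sortable" classes (an assumption).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><td><i><a href="/wiki/A" title="A">A</a></i></td></tr>
</table>
<table class="wikitable sortable">
  <tr><td><i><a href="/wiki/B" title="B">B</a></i></td></tr>
</table>
"""
soup = bs(SAMPLE_HTML, "html.parser")

# Class string copied from the browser inspector: no exact match here.
as_seen_in_browser = soup.find_all(class_="wikitable sortable jquery-tablesorter")

# CSS selector requiring both classes that are actually in the raw HTML.
by_css_selector = soup.select(".wikitable.sortable")

print(len(as_seen_in_browser), len(by_css_selector))
```

Comparing the inspector view with the raw HTML (for example via "view page source") is a quick way to check whether a class exists before any scripts run.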
Um, I'm wondering if it's because there are spaces in this. What happens if I put dots between these? Hmm, that's also not helping me right now. Let's go back to the page; it should be there. That's the class, right? So what I'm going to say instead is: I'm sure there's some way to grab the class with find_all, but maybe we want to use the select method on soup instead. So I'm going to do soup.select and delete all of this. If you remember from the web scraping tutorial, select gives us a bunch of different options as far as getting information from a page. So I'm going to load in a site on CSS selectors, which has all sorts of useful information; when I'm using the select method of Beautiful Soup, I usually reference this. One thing to note here is that I can match multiple classes by chaining them with dots, so if there are multiple classes like we have, we can probably use that syntax. Let's do that: soup.select, and we're now going to do .wikitable.sortable (actually, I just want to load that page one more time; yeah, .class), okay, so .wikitable.sortable and maybe .jquery-tablesorter. Let's see what we get. Still nothing. I'm going to get rid of the jquery-tablesorter part; I don't know what's going on with that. Oh wow. I don't know why; it must have been the dash or something in jquery-tablesorter that was throwing things off. table class="wikitable sortable", or maybe that class gets populated later. But that looks good.

Okay, let's go back and get a little more precise with what we're trying to get, which ultimately is all these links right here. So let's do some digging into one of these tables. Let's look at the... well, this one's open already, right? Okay, wikitable sortable: we have the table head, and then we get into the table. There are a bunch of rows. Let's get into the table body; a bunch of rows again. Let's open up one of those rows. The first thing is a table data cell within a table
row. We have this... no, we want this one right here. So what's in this? Look at that: in the italics here, what we're really wanting is the reference link to that Wikipedia page, and that's what's right here. We also get the title. So I'd say the two things we really need from each of these items are the reference link, so we can do further scraping on it, and the title. Let's grab both of those things. How can we do that with Beautiful Soup? Well, one thing we could do: we saw that it was an italicized element, so we might be able to specifically grab the italics from this type of table to get what we're looking for. I see it's the only italicized element in this table, so I'm thinking if we grab each italicized element in the tables that have the class wikitable sortable, we're probably good. Let's see what this gives us. Look at that, that looks much better. Just so it's a little less overwhelming, I'm going to show the first ten. Okay, that looks good; this is giving us stuff to work with so we can do the scraping. That's looking pretty good, I would say. Let's now recurse into those reference links.

All right, so how do we get specific properties of a link? Let's just do it for a single example, so let's take movies[0]. We'll want to grab the link element, so that's going to be .a, and then if we want to get either the title or the link (the href), we can put 'href' in brackets here. Let's print that out. Oh no, what happened? movies[0]. Okay, cool, that gives us the link. Similarly, to get the title we just make this 'title'. And just to be clear about how we're getting this, let's do movies[0] again real quick, delete all this other stuff, and see that all of this is contained in the link element; we grabbed the href and we grabbed the title to get those two things. So that's going to be important, and we're
probably going to want to bake this into a function or something, so let's do that in the next step. Let's define a function called get_info_box that takes in a url. Really, what we're going to do in this function is copy the code that we had from task number one. I'm going to copy all of this stuff; I guess specifically I want to copy this stuff to start... let's just copy it all in one go into our get_info_box function. So let me paste that. This get_content_value is a separate function, so I'm going to move it above our new get_info_box function, since we're going to use it in there. So get_info_box, and then this stuff will all be happening inside it. All right, so we need to pass in a url this time. The other aspect of the info box for Toy Story 3 was that we first had to load in the Toy Story 3 web page, which we did right up here, so I guess we also need to grab this stuff too. We'll do both of those things in this function; bear with me as we do some copying and pasting. All right, so that's now in get_info_box, and the last thing we want is to throw this stuff into the get_info_box function as well. This will all make sense in just a second, I think. All right, cool.

So let's start editing some of the contents that we just pasted in. First off, it's no longer hardcoded to be Toy Story 3; it's going to be whatever our url is, so we can say url here instead. You could keep in the comments and whatnot, but to clean it up a bit I might just leave them out. Remove the print statement. Okay, so now we get our info_rows, and we build our movie_info here, and then the last thing we want to do is return our movie_info. Really, I think that's all we have to do to turn this into a function that we can use for each url that we have. All right, so let's run that cell so we have those loaded into memory. The next step will be to take this stuff up here; I'm
going to probably just paste this again basically we want to load in our list of walt disney films load in the way that we're selecting it and iterate through each of our selections and basically run this get info box function on every one and append to a list to get our final kind of output so let's just actually notice that we don't even need this contents in here that's just a pretty print we may be able to move all this together all right so this is getting all of our movies that we're going to want to iterate over and so maybe the next step would be to have a for loop that iterates over each of these movies and i think it's often easier for a lot of these functions and these methods and you know code that we're going to write it's probably easier to do the enumerate method so for index comma movie in enumerate movies let's go ahead and we're gonna want to grab some information so i showed just a little while ago that you could do movie dot a href and that would give us our i'll say relative path and our title will be equal to movie dot a title i'm just going to go ahead and print relative path and title and i'm just going to break out of the loop real quick cool maybe i'll print a new line in between them so it's a little bit more clear cool so yeah this gives us the relative path and the title let's continue and i guess print this all out for each movie that we have i'm going to put a new line in between them let's run this okay looks like it's working cool cool it looks like it's getting it for all the movies oh no all right none type object is not subscriptable so it looks like our kind of technique of how we're getting all of these isn't foolproof so what i recommend in this type of circumstance is you know trying to do what we just did whenever possible so i'm going to surround this with a try except so try this and if it doesn't work let's uh print out the exception so except and i'm
just gonna capture a blanket exception except exception as e and we can actually print out the exception i'm going to remove these print statements real quick and what we could also do is maybe it would be helpful to print out the movie dot get text so just figure out what kind of text we're dealing with when we get these errors all right let's see what happens there all right escape from the dark none type object is not subscriptable the omega connection trail of the panda so what i recommend doing is you know keep track of this list and see what's going on why you're getting errors on these specific movies so let's go back to our list and escape from the dark that's the first one so i'm going to just copy this go back to our list i'm going to close out the inspect tool real quick and i'm just going to do escape i guess i can paste in what i just typed in escape from the dark all right so it's over here on the right i thought that uh only you know the only italicized elements were over here on the left but we see that we have one over here so makes sense that it's giving us an error because this is not a link so it doesn't have any link property so let's see what the other ones might be the omega connection the omega connection okay same issue as the last one so we know kind of two spots where we'll have to fix something trail of the panda none type object is not subscriptable panda uh okay looks like this just doesn't have a link it doesn't have a wikipedia page so you could either decide to maybe investigate this uh movie on your own or you could just skip it i would say either option is acceptable uh expedition china what is that one okay same thing i'll probably just do this for each of them luca and encanto are the other ones luca uh luca also doesn't have a link and encanto also doesn't have a link so i'm going to say what we're going to do is we will figure out how to deal with the one the
italicized elements over here on the right but i'm going to just skip the movies that don't have links it's just kind of too bad if you want to investigate them on your own feel free all right so these aren't huge deals you could honestly leave them being printed out as exceptions if you wanted to but it's a pretty easy solution all of them could be solved by basically they needed to have links and they didn't so all we'd really have to do to not get errors on these items we're basically just skipping over them which is i think an acceptable solution especially since some of the elements were in the wrong column they weren't even movie titles is we would grab the link element inside of an italicized element so that would allow us to do that and then instead of doing dot a now we're actually grabbing the link element same thing here so we do that we see we get no more exceptions and i could print the length of the movies so 435 before if we just did this that'd be 442 so those kind of seven issues we ran into now everything else is there but we kind of avoid those exceptions all right so now that we're getting the title and the relative path let's go ahead and try to get the info box for these things so let's define a list called movie info list brackets and then basically each time we're going to want to do an append to movie info list of the get info box for this relative path url so we probably want to have the full path so the base path is going to be equal to just the wikipedia kind of url so it would be something like this and then we would append on the relative path so basically what we could say here is that our full path would be equal to the base path plus the relative path and then we would want to pass in the full path as the url here and just because i don't want to like make too many requests all in this first go let's go ahead and like break out of this loop this is just kind of for debugging purposes so if index equals
equals 10 we'll break out so this will just limit us from going through every movie to start just to make sure we can debug things and fix things as necessary all right so what does movie info list look like since that ran all right title academy award review of walt disney cartoons that looks good snow white and the seven dwarfs it looks good this all looks pretty good so far let's just grab like the first element cool so one thing that i do notice is that you know not every one of these movies will have every column and that's just life let's just try to collect as much information as we can right now all right given that we had no issues running for the first 10 i think it's probably safe to go ahead and uh run it for all uh values so now we're gonna be basically making like 440 requests uh to get those pages and get the info box and we'll do that it might take a second or two so we want to limit how many times we run this full thing because this is kind of what i would say the wikipedia robots.txt was saying is you know don't go too fast and we'll try to limit how many times we run kind of this for loop that iterates over everything so let's do that and notice we still are printing out our exceptions so we can kind of dive into some of these uh movies and see why they're giving us issues uh after the fact and i would say that this is kind of something that you find in web scraping projects is you want to automate as much as possible but there are going to be weird edge cases where sometimes maybe you have to manually go in and fill out some information another recommendation when you're running a long script like this is i guess there's two recommendations is maybe printing out your index every like ten iterations so like if index mod ten equals equals zero you print out the index another thing is depending on how long your script takes to run it's sometimes good to periodically save your results because if it
completely breaks out you don't want to like spend 20 minutes of your time just to realize that you get an error at 18 minutes all right cool it's done running let's see the length of our movie info list all right 427 so that was out of 435 total things that we had so pretty good job we did miss a couple so it's not perfect but i think we should go ahead and save all of our dictionary data and python dictionaries map pretty well to json so what i recommend is taking all these dictionaries and saving them as a json file feel free to pause the video again if you want to try to do this on your own it's kind of a subtask of the task so i'm going to go ahead and let's just make this markdown cell save slash reload data this is also nice to save and reload it um if you like pause the tutorial and come back to it you can just reload from where you left off so to save the data we'll do import json let's define a function called save data it's going to basically take in whatever title you want to save your file as and some data which is going to be the movie info list and what we're going to do is we're going to open up the file that we named title we're going to write to it and again this is basically anything if you see me typing out a long command like this oftentimes you know i've done a google search beforehand and uh like for example i would look up how to save json data python um i have json data stored in variable data how do i do that and then you know these answers will basically be what i think this answer right here is basically what we're going to be using in our save data function so you could have looked at that a little bit more intensely but encoding equals utf-8 as f and we're going to do json.dump of the data and save it as the file f um i'm going to say ensure ascii equals false and i'm going to just save it with an indent of equal to you can set this to whatever number of spaces you want that's good for save data and now
let's just add another function which we call i guess load data so import json again in case you run these cells independently def load data we just need to pass in the title now and so with open and same thing with this one do a google search on how to load a json file into a python dictionary and you'll get your answer that's you know i don't remember most of this stuff i just know what i want to do and i google search to find the exact syntax on how to do that and ask any experienced programmer they're probably going to tell you the same thing you're going to memorize some things but it's really more important that you understand how to think about it at a high level and then you can just do some searches on the exact syntax for what you're specifically looking for and we're gonna do return json.load f so now we can go ahead and save our data as let's just say disney data.json and the data we're going to use is movie info list and basically if we run this i think we're good you can check whatever folder you saved your jupyter notebook in you can check that and see if it has the file so i'm going to do that real quick so my data is in a disney data science test folder and as we see we have the disney data.json file here i'm going to open this with sublime text so open with sublime text and if i pull that in look at that we have all of our data here so i would say that's good for task number two in task number three we're going to take what we just produced and do some cleaning of this data all right for task number three we're going to be cleaning our data and so that's a little bit ambiguous right now but we'll dive into what we have so far and kind of break out some subtasks for this that you can kind of pause try on your own and resume so first off maybe you stepped away from your machine or something if you wanted to load in the data that we've worked on so far you can run this load data function and then you could just define movie info list equals load data
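The save and reload helpers described above can be sketched roughly like this. It is a minimal sketch and not necessarily the exact code typed in the video: the names `save_data` and `load_data` follow the narration, the indent of 2 is an assumption (the narration leaves the number open), and the sample record and temp-file path are made up purely for illustration.

```python
import json
import os
import tempfile


def save_data(title, data):
    """Write data (e.g. the list of movie dictionaries) to a JSON file."""
    with open(title, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps accented characters readable in the file.
        # indent=2 is an assumed value; any number of spaces works.
        json.dump(data, f, ensure_ascii=False, indent=2)


def load_data(title):
    """Read a JSON file back into Python lists and dictionaries."""
    with open(title, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical record standing in for the scraped movie info list.
sample_movies = [
    {"title": "Example Movie", "Running time": "85 minutes", "Starring": ["Zoë A.", "B. Actor"]}
]

# Hypothetical file location, chosen so the sketch does not touch real data.
path = os.path.join(tempfile.gettempdir(), "disney_data_example.json")
save_data(path, sample_movies)
reloaded = load_data(path)
```

Because the scraped data is only lists, dictionaries, and strings, it should come back from `load_data` equal to what was saved, which is what makes JSON a convenient checkpoint between tasks.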
and this will also be stored on github disneydata.json and so if you're skipping just to this specific step this is how you'd get caught up to speed so now that we have uh that disney data uh actually let's look at the actual json file i think that's gonna be helpful as we break out subtasks so i'm going to write some markdown here subtasks all right so let's just look through our data and see if we can find different things that probably should be cleaned up all looks pretty good here so far one thing we might want to do is like convert dates into actual date time or like python date objects so we could probably use the date time library for that um let's see i'm gonna scroll through okay another thing that i see i guess two different items here um when we're going to do analysis on this data we're ultimately going to want it to be in the right data type so something like running time where it's 83 minutes here we'll probably want to convert that into an integer so we'll probably want to strip off the minutes and just store 83 as the number for budget and box office similar type thing it might be a little bit more complicated here we'll want to convert these monetary values into numerical values that represent the monetary value so like this would be one million four hundred and ninety thousand uh instead of one point four nine million as a string another item i see as part of that is that we probably will want to remove these references from our data so we'll have to figure out how to do that is there anything else we see reference so how can we remove all these we're going to have to also figure out a way to like standardize our data so if we're doing analysis it's a little bit tough if we have like you know a single value for one field and then like a range of values for another we're going to need to figure out you know some sort of way that we're fine kind of just taking uh you know maybe 76.4 and converting that into a number
76.4 million dollars instead of like taking the high end of that range so that will be another thing to kind of standardize our data whoo i see another item here um so for whatever reason if we look down here for this starring i don't know what's going on exactly but i see instead of this being a list like it's supposed to for a lot of the other uh movies it's like if we scroll back to the top we see like directed by it's a bunch of list items for these we need to figure out why exactly this one's just a really long string that will probably keep us busy there might be some other things that we'll stumble upon as we continue to go through but we're going to try to get the data as clean as possible and i would say in general wikipedia doesn't give us the cleanest data so this is really going to test our data cleaning abilities all right so what are some of the items that we just mentioned we wanted to clean up references so that was you know things like the bracket one we wanted to convert numbers into or convert like running time into an integer i would say we would want to convert dates into a date time object let's see if there's other things we wanted to split up the long strings so that was we saw that example right oh this is another one it looks like a bunch of different things all looped into one so we need to figure out why these long strings are happening we'll also want to convert budget and box office to numbers or at least add an additional field that has them in numbers that's a decent amount of cleaning items to do that might kind of clear up most of the issues and i would break this down a little bit further and i would say that some of these tasks are ultimately going to make us have to rerun this full code here i guess another cleaning item is we probably should look at what's going on with these error ones so what i was about to say was that some of these items will have to like completely rerun all these uh get info boxes probably to fix so we 
probably should handle those first so i would think that cleaning up the references we might want to just strip that out of the html before we run everything so that would be one thing so clean up references and then the other item i would say before we rerun everything would probably be to figure out a way in the html to split up the long strings what is going on in those examples so here are two items you can work on independently feel free to pause the video try these items out and then resume when you're ready we'll start working on the other cleaning tasks after that so for these two cleaning items we're not actually going to edit any code right here i think what we'll want to do is go back up to where we defined our get info box method and also i guess where we got the info box for all the movies and probably edit that appropriately to factor in this information so the first thing we're trying to do is remove the references we're trying to remove things like this from our results so to do that i think the best place to start is probably to just look at the html source code so let's go ahead and click on one of the movies doesn't really matter what we click on but i'm going to go with peter pan so over here on the right we have the info box and what we want to do is remove references like this so the best way to go about doing that is right clicking inspecting it and seeing what the html looks like so we see it right here it's highlighted and we see right above that there's a parent element sup which stands for superscript not uh not saying hi to you so if we look at all these different um tags here the references they all have the sup tag the superscript tag so if we remove the superscript tag we're probably good on that cleaning item so how do we go about doing that and i think to make it easier as we test out changes let's insert another cell and here we'll like get the info box for a specific url so if we wanted that peter pan url
paste that in so we see that that stuff is all in there right now but let's remove these references so i'm gonna do a handy dandy quick google search so you might type in something like how to remove script tags from html beautiful soup that will probably help you okay i'm going to click on the first article here removing tags find without fine that doesn't look like what we need yet prettify doesn't look like what we need yet all right i think we're getting to something okay so to remove a tag using beautiful soup there are two options extract and decompose extract will return the tag that has been removed and decompose will destroy it we don't need these tags i would say so i think decompose sounds like it will work so it sounds like we just need to iterate over all of our tags that are of a specific nature and we can just call decompose on those tags all right it looks like it's actually like basically doing our task here so this is very helpful so it looks like you could even do like a find all here this could be very helpful and decompose it so let's try doing that so i'm going to add a method i'm going to just call it clean tags let's say and it will take in some beautiful soup object and we'll just say for tag in soup dot find all of the superscript tag we'll call tag.decompose um and so i'm going to just see if that works i'm going to just throw that into our get info box function so clean tags uh and our soup is defined as soup i think i need to define this above just to make sure it's okay clean tags soup let's see what happens when we return the movie info now oh we need to run this first that's not gonna do anything but if we call get info box right now we see budget it has the two here we don't want that we don't want it here either so let's run it come on look at that it's stripped out nice um so that's good uh one thing that i kind of noticed as we were going about this was that if we look at the wikipedia page for the release date we see that it's
just february 5th 1953 but when we looked at it in the info box we see february 5th 1953 and then you see some other stuff not quite sure what this other stuff is but i'm thinking i kind of want to get rid of it um so let's just look at the html code uh release date release date scroll up scroll up so february 5th 1953 i see a span here and it's in the span so i guess i want to check real quick if um this appears in all of our data if it appears in all of our data we might be able to use this for our dates but if it doesn't we probably can just get rid of it so look at our data and release date i see it here i'm guessing this is it at least uh release date i see it here okay another release date okay i found an example where it doesn't exist so this is our old data by the way too so it still has the references we'll have to regenerate this so because i don't see this in all of it i'm going to say we can go ahead and remove the span tags as well in addition to the superscript tags so let's do that so clean tags so i'm going to say find all now not just the superscript but also the span just so we can get rid of that extra information that it was passing and i think there's other cases i saw too where it passed in weird extra information like this so let's see what happens if we rerun this and we rerun get info box cool now it's just what we expected it to be so that's uh item number one there all right now that we've done that part of the task let's go ahead and do the second part so i'm going to just say that this is done we're gonna have to rerun everything for all of the uh code but it's done enough so let's split up the long strings now and when i say long strings let's look at our data again and find an example of what we mean uh yeah here's an example just with the great locomotive chase uh all these names for whatever reason are on one line curious actually if that was showing up at all with the peter pan movie look at the data there doesn't
look like it shows up there so maybe let's look at this great locomotive chase movie and see if we can find why the starring is not a list like it should be so let's go back to our wikipedia list of movies great locomotive chase where is that the great okay found it okay so we're having issues where the starring wasn't showing up properly so i think what we can do is we're looking at the info box let's do some inspecting so inspect this all right so we have the table row table head starring that looks pretty straightforward what's in the table data all right so it looks like even though it should be a list they didn't actually include a list element in the html instead there's these breaks um okay so let's handle this somehow how can we do that well i would say that we probably want to kind of write another if statement up here so if it doesn't have a list we don't want to just return the get text because that would give us that long string we saw so let's have an elif and hopefully this handles multiple cases we can do some checking afterwards but basically we want the elif i can't pronounce elif condition to be maybe if in the row data you can find a break tag then we'll want it to enter into this one so what can we do to combine all those things if there's a break tag well really i'm curious if there's a way for us to kind of separate on the breaks or just basically get the text that's in different elements all in the same level let's do a google search and i think what we'll want to look for is maybe just let's look at the get text method again the documentation for that it might maybe have something that can help us here so get text right it returns all the text within the document or beneath the tag as a single unicode string all right so what we found last time was like the strip command and also the separator uh but at that point you might want to use the stripped strings generator instead and process the text yourself so
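As a rough illustration of the two cleaning ideas covered so far (dropping the `sup` and `span` tags with `decompose`, and using `stripped_strings` when a cell is separated by break tags), here is a minimal sketch. The HTML fragment is made up rather than copied from Wikipedia, and the helper name `get_content_value` is a hypothetical stand-in for whatever the value-extracting helper is called in the notebook.

```python
from bs4 import BeautifulSoup

# Made-up infobox fragment: a reference in <sup>, extra date text in a <span>,
# and a "Starring" cell that uses <br> instead of a real list.
html = (
    "<table>"
    "<tr><th>Release date</th>"
    "<td>February 5, 1953<span> (1953-02-05)</span><sup>[1]</sup></td></tr>"
    "<tr><th>Starring</th>"
    "<td>Actor One<br/>Actor Two<br/>Actor Three</td></tr>"
    "</table>"
)


def clean_tags(soup):
    """Remove reference superscripts and extra span text in place."""
    for tag in soup.find_all(["sup", "span"]):
        tag.decompose()


def get_content_value(row_data):
    """Return a list for <li> or <br> separated cells, otherwise a string."""
    if row_data.find("li"):
        return [li.get_text(" ", strip=True) for li in row_data.find_all("li")]
    elif row_data.find("br"):
        # stripped_strings yields each text piece separately instead of
        # joining everything into one long string.
        return [text for text in row_data.stripped_strings]
    else:
        return row_data.get_text(" ", strip=True)


soup = BeautifulSoup(html, "html.parser")
clean_tags(soup)

movie_info = {}
for row in soup.find_all("tr"):
    key = row.find("th").get_text(" ", strip=True)
    movie_info[key] = get_content_value(row.find("td"))

# For comparison: joining with a space gives the single long string.
joined = soup.find_all("td")[1].get_text(" ", strip=True)
```

In this sketch `movie_info["Starring"]` comes back as three separate names, while `joined` is the one-line version that showed up in the first pass of the data.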
it looks like we can return the separated elements as a list if we use this stripped strings command so i'm going to say let's go ahead and try doing that so where are we where are we i'm going to paste in that code we just found so text for text in soup dot stripped strings well we don't want to look at soup we want to look at row data i'm curious what this will print out so i'm going to just start with and this was in a list comprehension i'm going to just start with returning this and we're going to need a movie that has an example of the long strings so maybe we pass in the url for that great locomotive chase movie so i'm gonna run that and then instead of passing in peter pan let's pass in the url for the movie that we saw that had the long string issue so that is this and real quick i just want to run it first time around without the code we just added and then we'll run the code that we just added so okay what happens the first time so the first time we see that we have the starring all messed up with everything listed out in one go but let's see what happens if we now add in this line okay oh wow did it already work so it is oh okay i see what we just did uh it looks like it's good um basically before whenever we're doing the you know get text and we're passing in the space here we're joining all the different elements with a space character um and thus getting a single string which is what we wanted when we like separated uh production from company but in this case we don't want to join it and so we're using the stripped strings here to basically keep those things separate and not process them together and join them together so looks like just doing that for the break category it works pretty well i would like to do like another check before i call this good so let's find another example where this happens in our full list of data so i see already you know this one also has the same issue davy crockett and the river
pirates so let's get that url and one thing that maybe would be good to store too i'm not going to add this myself but maybe you would want the url in your json here all right davy crockett and the river pirates okay so paste in that url and i'm going to take this out to start see what it gives us to begin with run this okay yeah we see that same issue where it's a big long string of three things what happens if we now add in our line of code again nice looks pretty good to me i do see that we have some issues here like orchestration got separated from edward h plumb and thomas blackburn got lyrics attached i don't know if it's going to be that easy for us to handle this what you might be able to do is just delete anything that's in parentheses but that's kind of up to you if you want to take it a step further to clean this more it probably would be safe to clean anything that's in parentheses and just remove that all right i think we're pretty close to being able to run all of this again the one thing i would say that we should do is just kind of do some investigating for why these things are failing here and maybe add some additional little checks into our code to handle those so we can start with like zorro the avenger so zorro the avenger why did that not work go to the top oh interesting it looks like it doesn't even have a full info box because this is like a separate thing so because this is an edge case i'd say i'm fine skipping that and i think the sign of zorro is the other issue we had this yeah exactly so those just don't have an info box because they're part of some series i'm fine just skipping those one little indian has none type object has no attribute get text so let's go to one little indian so where would that be happening let's do some investigating so this looks good that's just the title we have the poster we expected that we skipped that over um we have the next oh interesting theatrical film poster oh okay um so it looks like
this one doesn't have a table head same with the poster above so this is the zeroth index first index second index they both don't have a table head like most of these other ones do so what modification we might make to our code is instead of maybe the if index equals equals zero and elif index equals equals one i'm going to get rid of this line and instead let's add a check to see if it has a header because it seems like the issue with both the poster row and the row we just found is that it didn't have a table header so if i do row.find here of table head and then i just do like an if statement if header and only if that is met do we add the content to our movie info dictionary i think that should solve that issue we can test it by running uh the one little indian url in our little cell right here pass that in so run that run this looks like it worked but i think the thing to check is did it not work before so i'm going to just replace it again real quick run this cell and does it not work error so it looks like we just fixed that issue so replace it cool so we've fixed one of the errors i wonder if like the other ones had the same type of deal all right let us quickly look at the true life adventures one so we look at true life adventures a couple things hmm okay interesting okay i think that what's happening is that this is what's getting captured because this is in the same type of table so and this has the exact text true life adventures like we saw if i click on that it's definitely not the same type of a table as the other ones so i'm going to say that we don't really need to worry about this this ultimately should be a bypassed link so it's fine that it fails on this all right and i'll just do one more check i'll just check the nightmare before christmas and see if that looks exactly like the error that we already fixed with the one little indian just throwing that
in here oh look at that inspect table row table row it doesn't have a table head that's right do any of these not have table heads what is this one yeah what the heck is this row there's just like a missing row here that doesn't have anything so that would be solved by the same fix though that we just made for the other one so i'm thinking that the rest are the same as what we solved with one little indian so i think we're good at this point to rerun all the movies and save our new updated cleaned up a bit uh json file so make sure that this is just run properly and then let's go ahead and i guess this time around just to see the progress i will print out so if index mod 10 equals equals zero i'll print out the index just to see how we're progressing we should end up with like 435 i believe around that like 430 or so if it works properly run zero ten no errors yet uh we expected the zorros to still have issues we don't need to solve them because they don't have the same format as the other movies true life adventures shouldn't be included so that's a fine error i think we're done yep it's finished let's see what our total length is now len of movie info list 432 and the only one that it broke on was true life adventures which we knew wasn't actually supposed to be there and then the other ones that weren't included just didn't have the info box so we couldn't actually scrape them but it worked on all the others now so that's good to see um let's go ahead and save our data so run the save data cell and i'm going to just call this disney data cleaned it's not as clean as we will want to kind of have for our final data but it's getting there so run this and then you can go to your jupyter notebook repo or like the folder that it's in so mine is right here and we see we have disney data cleaned we'll open that with sublime text and you should see better data than we had before there's still some things that we can you know improve and we'll do that in the next couple sub
tasks all right so if you stepped away from your computer by any chance you can always reload the cleaned data by using our load data function and loading in disney data cleaned all right so let's think what we've done so far on our cleaning subtasks well we have cleaned up the references and we've also split up the long strings so as our next task uh let's convert running time into an integer so split up long strings is done i'm actually going to just delete this stuff out of the way so to be more clear what we're doing right now let's just do like movie info list let's grab like i don't know tenth from the top item and we see here that we have the running time so what we want to do is because it says 85 minutes what we really want instead of 85 minutes is just the integer number 85 representing that time because that will make it easier for us to ultimately do analysis on this value so feel free to pause the video try to convert all of the running times into a new key value pair which maybe you call running time int and it maps to all these values as an integer and then resume the video whenever you're ready to see how i would solve it alright so the first thing that i recommend doing is let's just see what the format looks like so if it's always like number then followed by the word minutes it's pretty easy to deal with but it's kind of unclear if that's actually the case so what i recommend doing is movie running time for movie in movie info list and we can honestly just run this cell running time oh okay so what we should do because sometimes they might not have a running time we can do movie dot get running time and we'll just have to handle the none case uh differently running time and if it doesn't have running time we'll just say uh not applicable for movie in movie info list okay and i might just print this so it's all uh one line instead of hundreds of lines actually maybe it's probably fine to scroll
through this so we have 41 minutes with like some additional info here 83 minutes 65 min so that's good to know um this is a list that has 60 minutes and i would say in this case let's just grab the first item in the list so it could be a list that's good to know 90 minutes 89 min 80 min and this one's weird but it's a list and once again we can grab the first item that looks fine if you really wanted to it might be helpful you could just like look how many times min occurs in these and see if it occurs in all of them i'm just scrolling through very quickly to just check it looks like basically all of them have minutes or at least min or they're a list that has a couple values a couple not applicables that didn't have the minutes listed so we'll just have to kind of leave those as none in our final solution and yeah some of the not applicables are probably for the new movies that haven't come out yet okay good to know so we can hide this this will help us solve it so how should we go about solving it well i like to write functions when i'm doing tasks like this so let's define a function called minute to integer and it will take in the running time so whatever we're actually getting at this point right from the top something like a running time example that's very simple would be something like you know 85 minutes and we also saw like 85 min as a possible way to do this so what i'm thinking is let's just not even worry about what it says after this we know that the number always comes first so what we really can do if we just want this 85 is just split on this white space right here and just grab the element before the white space so the 85 so what i think is the simplest thing to do is just take running time so let's say value equals running time dot split we're going to split on a space and then we're just going to grab the zeroth index because this would give us a list of 85 comma minutes and if we grab the zeroth element that just gives us the 85 and what we really
want to do is convert this into an int so we'll surround it by int so if we now ran this on a test example we do maybe print and then we want to return value so print minute to integer of 85 minutes and we should get 85 as a number we do that's good but we have more i guess difficult cases so one of the more difficult cases was we saw we could potentially have a list so we could potentially have something that looks like this 85 minutes comma let's say you know 90 minutes maybe there's a longer version in this case i'm saying we should just take the the first entry you could maybe do more complicated stuff if you want to but i'm going to just say that that's how we're going to approach this usually the first value listed is probably a safe value to utilize so in this case if we ran the function we're going to get a break uh you know an error because it's expecting to run split on a string but in this case we have a list so what we're going to do here is add in an if statement so if we can say isinstance running time is a list oh no what did i do we can then basically grab entry equals maybe running time zero so we just get the first item in the list and then we can do the same thing as before so i'm going to just add our original code in an else statement else so entry is running time zero we just basically copy this paste it in and now instead of running time split it's going to be entry dot split we can return the value it's pretty good and honestly we could save ourselves a line instead of doing return value here we can just delete kind of the intermediary step and just do return this return that and same thing here we can just do return of this cool and even if we wanted to we could make this the same as before running time zero dot split and just kind of simplify the line so it really depends on what you want to show you know this might look a little complex at first but yeah it's really personal preference on what looks better so let's run this again we see we
get 85 in the list example so that's great however there's one other edge case we need to deal with and that's if the movie dot get is not there they don't have the the running time property so we could maybe add another if statement here so let's just say if uh running time equals equals uh i keep hitting enter by accident equals equals let's say not applicable then we'll just have this return none so that's like three different cases so we have the first case if we couldn't actually get the running time and the second two cases is we got the running time but we don't know if it's a list or a string if we passed in the not applicable here it'd give us none all right so let's now go ahead and add this to our json how can we do that well we can go ahead and do for movie in movie info list we can add a new key which we'll just call running time int and that will be equal to um movie dot all right i guess we want to use our function minute i'm going to say this is called minutes to integer minutes to integer of movie dot get of the running time and if it doesn't find the running time key then we'll kind of use not applicable as our fallback so basically on this i think we have completed the task as long as everything goes well no what happened oh movie info it should have been movie of running time because we're taking this json and we're adding we're taking this dictionary and we're adding a new key to the dictionary and because dictionaries are mutable this is fine for us to do to actually change up movie info list so run that now if we look up movie info list let's say the negative 10th get hamilton again we see we've added an additional running time value and it gives us the 85 we are looking for from this 85 minutes that's originally there and if we wanted to we can kind of repeat this step again and now just see all of our look at that looks pretty good these are all numbers didn't seem like it errored out on too many there's a
couple nones in there but that looks pretty good all right continuing on let's now go back to our different subtasks we've now you know done this conversion so we can cross it off all right so now we have two additional subtasks left we can either convert dates into date time objects or convert budget and box office to numbers i think this one's pretty similar to the last one we did so let's go ahead and try to do this so just to make sure that the task is clear you know it's very similar to the last one one of the properties of these movies in movie info list we'll just grab like the negative 30th element whichever one it doesn't really matter is we see box office and budget so ideally you know these are string values right now but ideally we could convert them into a number just like we did for running time um so that's the task feel free to pause and attempt it on your own i will note that there's a lot of uh ambiguity in this task so like for example um 120 to 133 million it might not be clear whether you should take 120 million or 133 million or not so you're gonna have to kind of make some decisions on what you think is best like in this case i would say go with 120 million and we'll just say you kind of go with the first value listed but you could maybe take an average if you wanted um also to help you kind of just stay and do this task the same manner that i would um i added some tests to the github repo that you can run to see if you're doing a good job so what i recommend for this task when you're working on it is that there's this file called conversion.py so given a money value if you fill out this function and return it as an integer or float you can use this exact function in another file which is test money conversion which is pytest and you can basically run a bunch of cases and see if your function does the proper thing for each case so this is a very like fun way to work on implementing
some things like working on converting 1.234 million that would equal this number and i can make this a little bigger so you can see so you know 99 million would equal this 3.5 million would equal this so you can use these tests to help you out but you're going to want to fill in so you can download this folder locally and you'll want to try to implement this money conversion function and then you can test things using pytest and i'll show that a bit as we go so going back to our jupyter notebook i think one thing we'll want to look at before we dive really too deep into the task is just to look at what our values look like so we can do movie budget let's say for movie in movie info list and we can print this out oh we have to do the get again okay so it's worth looking at some of these values to see the different types of syntax that we have so you have like i would say a dollar value followed by a quantifier on the dollar values like 1.49 million we also see we just have some straight up dollar values you see that a bunch of these are not applicable so a lot of them don't have a budget listed so we'll just have to work with the budgets that we have you have some weird edge cases here you have like this one that says 60 million norwegian kroner which is around 8.7 million dollars so it would be nice to grab the 8.7 million us dollars from that let's see if there's any other edge cases i see again we have just like running time we have some lists that we'll have to deal with so a lot of different things to consider we'll just do our best job you probably can always make it better but let's just do what we can alright so to you know complete this task we are going to go ahead and fill out that function that i mentioned you can fill in and test out with the pytest tests so let's go back to the github repo you know that folder was in helper and there was all these files here so if you want to if you want to download this there's a couple different
ways you can go about it you can clone this repo you know fork this repo and then clone it locally and i'll have instructions i'll add instructions to the readme on how to do that you can also click this green button up here and download the zip folder of all this so that's another way to get these files locally you'd have to download the zip and then extract the files to wherever you know you want to work on your code so two different ways i've you know done that and let's go ahead and open up a file the file that has this function and i'm also going to open up the uh test cases here so let's start simple and just keep building out things you know try to like hit as many of the cases the test cases as possible first and then you know go from there so to start off let's just focus on this type of case where it's you know a dollar sign followed by some sort of number and if we look back at what types of edge cases we might run into and i think the one big thing to see with all the values basically is that anytime we really want to get the actual value we usually see the dollar sign so that's going to be kind of a key indicator for what we're looking for so in a complex example like 60 million norwegian kroner here with the 8.7 million in parentheses out here we're gonna want to start searching once we see that dollar sign and maybe like kind of strip out most of the other stuff so that's like one thing to keep in mind as we go about doing this so let's think how we can do this and you know this is really striking me as patterns and so whenever we're dealing with patterns i would say that we want to be thinking about regexes so i'm going to import the regex library here as kind of one of the first steps and what does our regex pattern kind of look like that we might be searching for well i think the biggest thing is that we want to see a dollar sign and then followed by that dollar sign we want to capture some sort of number and so in my eyes a number is
you know 12.2 is a number 790,000 you know it could be like 0.57 is a number but i think the biggest thing is we want some digits followed by maybe a comma and then some more digits followed by maybe a decimal point followed by some more digits so let's try to capture that in a regex so that's going to be the first thing we do number equals and just to make it a little bit easier to see what we're doing i'm going to delete most of this stuff and just clear it out of the way just so it's a little bit uh easier to see exactly what we are specifically looking at i'll keep the cases there just to make it i think these are helpful so a number well in regex terms we can use r for kind of raw text and this will format the regex nicely but we want to find digits so to find digits we could do backslash or yeah backslash d and regexes regular expressions are pretty difficult they can be hard to grasp so if you're not familiar with regexes uh you know this part might be a bit confusing i'm going to make a tutorial on regexes at some point but they are a super super powerful tool so i recommend playing around with them and trying to learn them i'll link in the description some resources if i haven't already made my tutorial that will be good for regular expressions to get started but you can do all sorts of really cool pattern matching stuff with them so i really recommend you try to get the hang of them so one thing we can do is just do digits so if i did like backslash d plus that would be any number one or more digits it would match and then we can kind of do stuff like print re dot search and we see we need the patterns the pattern here is number the string that we want to deal with is going to be let's just say like one two three and we can see if this is true or false that this is a number the one two three here so i can run that and we see we get a match here which means it is if i made this like just abc i ran that we'll see we have none so this is a very simple
just searching a very simple pattern which is the the digits plus but we need more than just digits right because if i did 12.2 right now it might actually match the 12. um so let's run that so it does get a match but we can figure out what the match is so it's going to be the first full match it finds with the search method so if i do dot group we can see what the match is and we see it's just 12 and we really wanted it to be 12.2 here so what can we do well we can follow one or more digits by an optional period and a period is a special regex character so if we actually want to specifically mention a dot we have to do backslash period and then we can do zero or more of that followed by let's say some number of digits again so zero or more digits so now if i do 12.2 here we get 12.2 because we've added in the pattern that it might have an optional period followed by some more digits in it so we're getting there with our number and capturing numbers in our pattern so what else would we want to add well what i see here is if we added you know 790,000 here it's not going to know how to deal with the comma so what we also want to add is the possibility that we have a group i'm going to say of comma followed by three digits followed by you know that's the the end of the group so what this is saying is that we have some sort of digits so 790 would be one or more digits and then optionally we could follow it by groups of three uh digits separated by commas and then that could be followed by an optional period followed by some digits and i want to make this zero or more because we don't necessarily need to have these groups so let's see what that matches seven hundred and ninety thousand perfect so i think if we grab like point two five i can show too that that would be the full thing so this allows us to grab numbers so that's a good start so i think the first case we want to handle in our money conversion function is something like 790,000.
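To make the pattern built up so far concrete, here is a minimal sketch reconstructed from the narration. The captions never show the exact string that was typed, so the pattern below and the variable name `number` are an assumption about what is on screen, not a copy of it.

```python
import re

# Assumed reconstruction of the pattern described above:
#   \d+        one or more digits
#   (,\d{3})*  zero or more groups of a comma followed by three digits
#   \.*\d*     an optional period followed by zero or more digits
number = r"\d+(,\d{3})*\.*\d*"

for text in ["123", "abc", "12.2", "790,000", "0.57"]:
    match = re.search(number, text)
    # search() returns None when nothing matches, so guard before calling group()
    print(repr(text), "->", match.group() if match else None)
```

Calling `.group()` on the result is what reveals how much of the string was matched, which is how the narration spots that plain `\d+` only picks up the `12` in `12.2`.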
all right so with grabbing this number i think let's see if we can handle a couple of our initial cases so what we're going to do is say money equals and we'll capture this stuff here or maybe i'll just say like value equals because we're using money as our input variable equals re dot search the number up here and now instead of this string we're going to pass in whatever we pass into our function so money and honestly taking the group of that and we probably just want to return the value here but we we need to make sure we make it a number instead of the string this is going to return so maybe like int or you could do float float might be more appropriate because you can have decimals potentially in your numbers so i'm going to say float of all that and we can return the value and so if we tried doing something like print money conversion of 790,000 and that has our example has dollars see what happens we get an error yay cannot convert string to float oh okay so i grabbed the the commas we'll have to actually strip that out when we return our solution so we'll keep that in mind strip out commas before solution uh but let's say we just had like 790 dollars would that work yeah that gives us 790.0 but let's just handle the 790,000 case real quick so we have an issue because there's a comma in that so i'll say value string equals this stuff right here and then maybe we could say value equals float of the value string dot replace any comma with just an empty string because that would then make 790,000 with a comma just 790000 so return value and you know i kind of like to figure this out and then maybe clean up code a bit and that looks good cool so 790,000 turns into this that's perfect that's what we want so as a starting point we have a very very basic solution this obviously is not going to handle 12.2 million yet if we type that in we can see for ourselves it's probably going to just return us 12.2 so there's more cases we need to handle but let's just show how
we can use those tests to our advantage so i recommend opening up some sort of terminal window and navigating to wherever you have that test money conversion file that looks like this ultimately we want to run this with pytest so what that looks like is if i i'm gonna have to navigate in my terminal window so for me it's in youtube slash code slash data science slash real world scraping and then it's in the helper folder so i need to cd to the helper folder and now we see that the file is in this directory so if i run pytest test money conversion it will run our function on the tests and we see we had four that passed and 11 that failed so let's see if we can find the ones that passed so it looks like four in the middle there passed so let's look at which ones passed so the bottom three failed but oh it looks like all these um four passed so it makes sense these are the ones with the commas particularly in them um one thing that's interesting is that this one failed and ultimately that's because it probably took the 60 because i was just looking for a number instead of taking the 8.7 so we also want to add into our check that it has a dollar sign before it um this one's a list so also have to handle lists so there's really if we look at these tests there's two main major cases that we need to break it up into i'd say that there's this like word syntax it's like it's a value followed by some word that quantifies it and we need to handle that so we'll probably have an if statement to handle those cases and then we just also need to make sure that we only start searching for a number at the the finding of a dollar sign okay so let's start uh kind of making this a little bit more concrete so let's call this um format i'm going to call it the value syntax i'm going to call this format the word syntax so basically we'll want to add a regex to capture both of these cases so the you know the value syntax i think is a little easier to find it's basically look for a
dollar sign and then capture the value with the number as we did before so i'm going to go ahead and just move this down a bit i might abstract this stuff into its own function so i'm going to define a function that's called parse value syntax it's going to take in some sort of string that's close to the value syntax that you know it has like a string like this and it's going to do what we did before now instead of money this is string i think that's all we have to do cool and then now what does money conversion look like well first off we need to figure out if the value syntax exists so we'll do a regex search just like we did really with this one it's gonna be very very similar so regex search this time though instead of just grabbing the number because we saw it failed one of those cases it failed the one with 60 million norwegian kroner i mean obviously i guess it would have failed because of this anyway but we want we don't want it to just pick up the first number it sees we want to make sure it has a dollar sign in front of it so what we can do here is a very similar regex but this time we're searching for the pattern that will be we're going to format this raw it's going to be dollar sign and dollar sign is a special character so it's preceded by a backslash and then we'll put followed by the number that we're looking for and to make sure we can read this in as a variable because i want to basically populate this in but i want to keep it neater so i want to keep this just called number i can make this an f string so you can combine the raw and f string prefixes so if we find a dollar sign and then a number we're going to call that the value syntax and we'll pass in we'll be searching for that in the money string and then we'll get the actual match there by doing dot group so that's value syntax what does word syntax look like well word syntax is something like dollar sign followed by 12.2 followed by the words million if we look at our test cases
we see million is one billion is one i also put down thousand as an option so that's kind of the basics of it i guess there's also the edge case that it's like 3.5 to 4 so that's like a separate case but let's just handle the generic case so i'm going to define another string that i'll call amounts and this will be another regex i'll say that it's if we see thousand if we see million or if we see billion in our string and i see i spelled thousand wrong um we're gonna be searching for that and this will make sense in a second i think um but what does that standard the word syntax look like well the word syntax so maybe i i honestly might pull up i might move this out of this i'll call this value regex equals that and then i'll just replace this with value regex and i'll similarly do word regex i guess because i defined word syntax above it i'm going to define it above it here word regex equals well it needs to be a dollar sign so we can do i'm going to make this the same type of string dollar sign followed by a number again okay that's simple followed by a a space followed by one of the amounts so i guess the space is backslash s followed by one of the amounts and to make this very clear i'm going to surround these with parentheses too just to kind of group them i think otherwise it might match it too quickly i want all these to be together okay word syntax equals that's basically it because it's dollar sign followed by some number followed by some white space followed by some sort of amounts so let's see if things start matching that so just to make our life easier um pass on this real quick print uh re dot search word regex followed by let's say 12.2 million and we just want to see if we get a match here oh wow it worked cool and would it match something like 12.2 to 13 million i don't think it would yeah that gives us an error so it handles the the most basic case which is good um we can start building onto that
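As a rough sketch of where the two patterns stand at this point in the video: the names `number`, `amounts`, `value_re` and `word_re` follow the narration, but the exact strings are reconstructed from speech and should be read as assumptions, and `first_match` is a hypothetical helper added only to keep the demo short.

```python
import re

number = r"\d+(,\d{3})*\.*\d*"          # assumed from the earlier narration
amounts = r"thousand|million|billion"

# Raw f-strings: "\$" is a literal dollar sign, {number} is interpolated.
value_re = rf"\${number}"                # value syntax, e.g. "$790,000"
word_re = rf"\${number}\s({amounts})"    # word syntax, e.g. "$12.2 million"


def first_match(pattern, text):
    """Hypothetical helper: return the matched text, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group() if match else None


print(first_match(word_re, "$12.2 million"))        # the basic word syntax
print(first_match(word_re, "$12.2 to 13 million"))  # a range is not handled yet
print(first_match(value_re, "60 million Norwegian kroner (around $8.7 million)"))
```

The parentheses around `amounts` matter: without them the alternation would apply to the whole pattern, so `million` alone would count as a match. The range case returns no match here, which is why calling `.group()` on it directly raises an error in the video.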
so optionally we saw with our regular expression actually let's let's just get this working and then we can handle some of these edge cases so we'll say word syntax equals re dot search of word regex and then followed by money and i'm actually not going to grab the group yet this is really going to just be whether or not the syntax exists and just to make that clear we'll just do if word syntax print word syntax elif value syntax print value syntax and then also maybe it would be helpful to print out value syntax dot group and print out word syntax dot group because this the dot group is actually showing us what our match is so let's just run a couple examples so print money conversion of 12.2 billion we see that we get a match there i guess instead of i could do a return too but that's why we're seeing a none here if i try let's say 790,000 we get value syntax what if we did something like kind of made it in the middle like 790,000 million this would be weird i don't know why we'd do it but what would this match okay that matches the word syntax that's cool so really i think i would say because the value syntax is a subset of the word syntax i think we want to have this if statement for word syntax before the elif for the value syntax because otherwise if we did something like you know 790 million dollars here it would match just this for the value syntax uh when really we wanted to match all of this so that's cool um all right and let's just now handle our groups properly so with the value syntax we already had this parse value syntax function that should work as before so instead now i can just call parse value syntax of value syntax dot group here to call the code that was working when we originally ran the test cases i'll remove this line and now we'll have to basically write a similar function for word syntax i guess i should have kept these guys with the money conversion function so let's also define um parse word syntax that'll
take in a string and what do we want to do i'll just pass this for now we can come back to it in one sec and i guess ultimately we probably want to return this oh and i guess the only issue here no i guess we're good depends on if we want to replace the commas before or after all right i'm just gonna say return none here okay i'm going to real quick just see if that changes anything ideally we should have the same four tests pass i don't know if any others will pass it might we might get lucky on some because the none is here and i think some of the test cases assert is none all right we've passed five tests now so we look at the tests that were passed same four tests as before so that's good that this if statement's working well but um we failed all the others so i think if we build that word syntax we'll probably be pretty good there so let's do that all right so ultimately we have an easy time getting a number like 12.2 out of the string so really what we need to do is just multiply that 12.2 by whatever the modifier is here so what i'm going to do is write another function i'll just call this def word to value and we'll pass in a word and ultimately we're just going to define a dictionary so i'll call this the value dict which maps each one of these options to its corresponding numerical value so we could do thousand that would map to one thousand million would map to one million and then finally billion would map to one two three one two three one two three if you really wanted to for good measure you could do trillion but no um movies are in the trillions of dollars for budget or trillions of dollars for um box office but if you wanted to be safe you could do that then all we would do here is just return the value dict of whatever word we pass in so if we pass in thousand this function will return us 1000 if we pass in million this function will give us the integer one million etc all right so now let's start filling in parse word syntax well first
off let's just print out what the parsing of the word syntax will do so let's just print out our string that we're working with so if i did parse word syntax and i passed in word syntax dot group and we passed in 790 million let's say what does that give us gives us the string 790 million so now we need to figure out how to really separate you know the 790 from the million part and there's multiple ways you could do this i think you probably could be safe to split it on the space and then just grab what's after on the right as your you know your million thousand billion word and the stuff on the left as your value it's one way to do it we're using a lot of regexes though right now so we might as well maybe just continue that and just use regex to match these things so let's do that so i'm going to say that our first off i guess let's get our value so our value is going to be if we have a string like 790 million we can get a value the same way we did with i would say the parse value stuff down here so if we do re dot search um number string and we do group so now we're just getting the number part of what we have in all this and it's going to be the first match of the number so you'll see why this is important in a sec and for good measure we probably should also just potentially replace the commas out of this in case we have like 1,290 million i don't know why they format it like that but it's good to capture all the edge cases we can so honestly i'm going to just repeat this code down here so i guess we could say value string equals this i'm going to just completely copy and paste oh no all this paste and you know i guess good code practice you'd probably have this also in its own function uh but i'm only gonna do so much for the video okay so the value we get the same way in the word syntax one but now we also want to get the word which is either thousand million or billion and so to do that we could do re dot search the pattern we're looking for ultimately
is amounts and we want to see where that exists in the string and we'll find the first occurrence of that as well we'll do a group here so i'm going to real quick just print word out just so you see what happens i hope that the string that's 790 million will capture amounts and just give us million here so that's my goal that million prints out cool that's good and one thing that makes that really nice is like if if people decided to do 790 millions it would still just grab million because that's what we have in our match and ultimately that's nice because that would now always make sure that if we see million if we see billion if we see thousand that it always still maps exactly to what's in here if we returned thousands and then we try to use this dictionary it wouldn't work because thousands is not the same as thousand so that's kind of one nice thing about doing this regex search method and then the final thing i want to do is just do i guess word value and that's going to be ultimately word to value of the word and then finally our ultimate return value should be the value times the word value so if our string is 790 million or millions however we want to put it 790 gives us our first value here and then millions gets converted into one million here and we ultimately multiply that by the value to get our final answer so save that and run the code it looks like it's good 790 million that's thousands or hundreds thousands millions awesome so let's uh run our tests again and see if we solved a bunch of them so i'm just gonna clear pytest test money conversion dot py come on look at that 12 passed now and three failed which ones did we fail well we failed one that is a list so we just need to handle the list type that's pretty simple we just really have to grab the first item in the list and we'll use that we failed one that says 3.5 to 4 million which makes sense so we'll have to handle that edge case and we also failed a range one like this
which is another edge case of the word syntax so first off let's fix the list example so i would say if isinstance money is a list let's just say that money is actually equal to the first item in the list so money zero and just to show you where this would pop up if we look at the list one down here basically the code we just wrote is saying that if we get this list what we're gonna do is just take the first element in the list and say that that's actually our money string and then everything's the same as before so honestly already we could just with that change we could rerun it i bet you we only have two failures look at that 13 passed cool and again like try to look through all your data and get a feel of like what types of edge cases pop up we might not handle every single one with what we're doing right now but we're trying to handle as much as possible to ultimately build out our our data set and give us as much good information as possible um so one thing that's fun is we actually we're passing this edge case so that's cool so really we just need to now fix these two edge cases so that's pretty simple we'll just build out what matches in our regex for the word regex right here so right now it has to be a number and then immediately the amount but basically what we can do here is optionally maybe have um a dash character which would handle this case right here and if the dash appears there then uh we could have a and the question mark is it exists or it doesn't exist if it was multiple dashes we wouldn't actually match that with the question mark here followed by maybe another number and again we'll use the question mark the number exists or it doesn't and then followed by the space and the amounts as before so let's see what happens if we save that and now run it look at that 14 passed and quick note one assumption we're making here is that if we have a range like this we're just going to take the bottom value we're going to kind of take the lower
limit, and that's why this test case equals that value. This next one is very similar to the last one, but now instead of the dash let's also check for an optional "to". I'm going to do an or (the vertical bar is "or" in regex syntax), then a space character (backslash s is a whitespace character), then "to", then backslash s for another space character. Save that and run our tests. Hopefully they all pass now. Awesome, all the tests pass. As I mentioned, this might not cover every single edge case, and maybe it would be better to take the average of this range. You can build on this, but I think what we did here shows how we can use regexes to match the majority of cases that we'll run into in our data set. Our next step is to take all the code we wrote here, move it to our Jupyter notebook, and actually change our data.

All right, so we want to go back to our Jupyter notebook. We had all our monies here, and now we want to copy in our code from before, paste it in here, and update our notebook. I'll copy all this in in a sec. Actually, one thing I noticed real quick with regards to finding the word is that if I typed in something like "790 Millions" with a capital M, it wouldn't work properly. We'll make a couple more tweaks in a sec, but one thing that's good to do is add a flag. I can do flags=re.I, and that's going to ignore the case of the word million or billion. We'll do it down here in the word syntax search too, so flags=re.I, which is IGNORECASE. If we run it again... and then I guess we also have to lowercase the word before the dictionary lookup. Just a small edge case. All right, let's copy all of this in. I think I see a couple of changes I'm going to make, but let's copy this, open up our Jupyter notebook, and paste it all in. One thing that's annoying is that it's using, I think, spaces here in Jupyter
notebook world, but it's using tabs from Sublime. So yeah, these are tab characters. I wonder if I can switch that real quick. I think if I click Tab Size down here, "Indent Using Spaces"... I just want to do that to mirror my Jupyter notebook. Oh, "Convert Indentation to Spaces", nice. I'm going to paste that in again. Cool, it's all the stuff I want. One thing to note real quick is that we have a bunch of these N/A values here, so what we should add to our function is: if money == "N/A", let's just return None. And I guess if we run into a syntax that doesn't match either of these, it should also return None. Okay, that's probably good. Let's run that real quick. Looks like everything's still working. I'm going to hide all this real quick.

Okay, now we need to actually convert the data like we did for this one, so I'm going to copy this and paste it down here. Instead of running time, let's say budget, and it's a float rather than an int. For movie.get, let's see what the key actually is... I'm going to make this Budget with a capital B, and Box office with a capital B. Okay, so budget. Then we'll also copy this line and do box office float, and this will be just Box office, with "N/A" if it doesn't get that. And now it's not minutes to integer, it's our money conversion function instead: money_conversion and money_conversion. So now we're modifying our movie info list with these two additional fields. Let's run that. Hopefully we get no errors. Now let's see what our movie info list looks like. We'll just look at a single example with a negative index; I'm really just picking random ones when I index this. Look at that, 160 is the... oh no, looks like we grabbed the group wrong, so we need to tweak this a bit. All right, it's 160 here, but we really wanted to grab 160 times a million. Why did that not work? The nice thing is we can just paste this into our function: money_conversion of movie_info_list
negative 40, and then we get the budget from it: 160. Let's add some print statements in here and see if it gets to the word syntax, if that matches. It looks like for some reason it's not matching the word syntax. I guess one of these last changes we made must have affected something. Let's print out the money real quick. That string is right. Hmm, it's number, then dash, then number here. If we passed in a hand-typed "160-255 million", what would happen? Why does it not match... oh, it does now. What the heck? But if I do what we just did, why does that not work? Is this some weird character that I'm not aware of? I'm going to throw this character into the regex string too. Wow, it's like a big dash, not the regular dash we typed. What the heck, that's strange. Man, that's annoying. See, this is the type of thing that you deal with, I guess. Let's try it now. Now it runs. Wow, okay. I don't know what the deal is with that, but hey, you might run into more edge cases, and I'm glad that we figured it out. If you see a lot of Nones where you don't think there should be, I would recommend looking through your data a bit and trying to correct things. That was a doozy just finding that dash, but that's kind of what it takes: trying to break things down and throwing in print statements when you're trying to figure something out, and going from there.

All right, so I think at this point we're pretty good with the money conversion. You can go through your data a bit more and see if there are any edge cases or things that we're missing, but I'm going to continue onwards and clean the last item I have on the list. If we go back to the top of our list, we've now converted the budget and box office numbers to numbers, so I'm going to cross that off. The last thing we have to do is convert the dates into datetime objects, so we'll create another field in our movie info list, and we can go all the way down here to do that: convert dates into datetimes. All right, how do we want to do this? Well, again, as
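Pulling the money-conversion steps from this part together, here is a rough sketch under stated assumptions. The pattern names word_re and value_re are hypothetical, the "big dash" is assumed to be an en dash (U+2013), the missing-value marker is assumed to be the string "N/A", and a range returns its lower bound as described in the video.

```python
import re

# Assumed patterns; illustrative, not necessarily what is shown on screen.
number = r"\d+(?:,\d{3})*(?:\.\d+)?"
amounts = r"thousand|million|billion"
word_to_value = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}

# "$<number>", then optionally a hyphen, an en dash (\u2013) or " to ",
# optionally a second number, then a space and the amount word.
word_re = rf"\${number}(?:-|\u2013|\sto\s)?(?:{number})?\s(?:{amounts})"
value_re = rf"\${number}"


def money_conversion(money):
    """Convert a scraped money field to a float, or None if unparseable."""
    if money == "N/A":
        return None
    if isinstance(money, list):
        money = money[0]          # lists: use the first entry only

    word_match = re.search(word_re, money, flags=re.I)
    if word_match:
        text = word_match.group()
        # The first number in a range is the lower bound.
        value = float(re.search(number, text).group().replace(",", ""))
        word = re.search(amounts, text, flags=re.I).group().lower()
        return value * word_to_value[word]

    value_match = re.search(value_re, money)
    if value_match:
        return float(value_match.group().lstrip("$").replace(",", ""))
    return None                   # neither syntax matched


print(money_conversion("$160\u2013255 million"))
```

If a hand-typed test passes but the scraped string fails, printing repr(money) is a quick way to spot a look-alike character such as the en dash.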
always, it's usually good to just look at a random element to figure out what we're working with. This one has two dates, so we'll probably have to grab the first instance again, but basically it's month followed by day, followed by a comma, followed by year. That seems like what we're dealing with. As we did with the past examples, a good thing to do is print out all of our different options for dates, just so we know exactly which edge cases we're working with. Let's do that right here: movie.get, and now we're doing release date. Okay, so I see a lot of lists and a lot of month day, year; month day, year. That seems like the main format. Are there any other formats in here? I think that's the big question. It looks like maybe we need to replace the \xa0 character. This one is day month year, so that's a little bit different; I don't know if there are many of those cases. The main case is definitely either a string of month day, year or a list of multiple month day, year strings. You can look through that as much as you see fit, but let's start implementing something that will help us out here. The format we're mainly dealing with is like "June 28, 1950", let's say, so that's mainly what we want to convert. When we're working with dates, I think one of the most important things is to import the datetime library, so I'm going to do from datetime import datetime. This really allows us to convert to an object that Python understands as a date. Looking back at our options for dates here, we see that we have a lot of lists and a lot of stuff in parentheses too. I'm thinking that if we clean out the stuff in parentheses, we might be left with a more succinct format. So I think the two steps we're going to do to start are: if it's an instance of list, grab the first item, and then delete anything in parentheses, just so we have a very uniform month day
year format. Let's do that. We're going to say dates equals the movie release date, and I'll do .get of release date, for movie in movie_info_list. Now let's define a function called date_conversion, and it'll take in a date. The first thing we mentioned we might want to do is just like our money conversion, and I think also like our time conversion: if isinstance of date and list, let's just say our date is equal to date[0]. Okay, so that will give us our string no matter what. Now what do we do with that string? Basically we want to make sure that it's just the date stuff here and not the parentheses stuff. So maybe we also define a function that's called clean_date, and it'll be passed a date too. What we're going to do there is take our date, so if we had something like this, what I think would work well to get rid of anything in the parentheses is to split it on the first left parenthesis. So I'm going to do date.split on the left parenthesis, and then we can grab the stuff to the left of that, which is this part right here with an extra whitespace at the end. That's going to be the zeroth index, and then we can strip off the whitespace, so .strip() would allow us to do that. I'm going to just return that. Now in date_conversion we want a clean date, so I'm going to say date string equals clean_date of the date, and then print date string. We can run this for a few: for date in dates, let's try running date_conversion. We see that that's giving us what we're hoping for. We also see we get an "N/A" here, so maybe we also add to our function: if date == "N/A", then we'll return None. So now let's use the datetime library to
convert these into datetime objects. I'm going to search for something like "python converting strings to datetime". Let's see if this gives us something that we want. Okay, so basically what we need to do is pass in our string and specify the format that we expect it to be in. I think this will be easier if we go to the actual documentation for the datetime library, because just having seen it before, I know that they give you a good list of format codes. I'm going to look up strptime, that's what we saw before. Okay, this is helpful: "Return a datetime corresponding to date_string, parsed according to format." So that's what we were just looking at, date string and format, and we can specify the format here. We know it's month followed by day, then a comma, and then the year, and a year with four digits is %Y. So you can describe how your data is coming in and convert it properly using this syntax. I'm going to say format equals %B, that was month, %d, which was day, comma, and then year, so that was percent capital Y. That's the format we're looking for. Then we can do return datetime.strptime of the date string and pass in the format. I guess I need to move this up here. I'm also going to print a new line between these just to see what happens when we do the conversion, and I'll print this. Okay, so does that look right? 1937, May 19th, that looks good. Oh no, it looks good, but it looks like this one is a different syntax, so we don't always have the same syntax. This one is day month year. What we could do is make this formats, make it a list, and just try each format that's in there. So that's the first one. The second one, the one we just errored out on, is day month year, so that would be %d, %B for month, and then %Y for year. So now what we'll do is: for format in formats, we will try
to return this, and if it doesn't work, in the except you can print out the error, or you can just pass if you don't care too much. Then if it doesn't fit either of those formats, we can return None. Let's see what happens now. I'm going to run this. The one that broke was that weird one with the... oh, I scrolled down way too far. We're looking for one with a day at the start. Okay, "26 October 1953". Look, that works well now. Cool. We might see a None in here somewhere. Everything looks pretty dang good in my opinion; honestly, I think we might be good here. Once again, there might be some edge cases. We see we have some Nones because these are just years, and maybe you handle those, maybe you don't, but for the majority this already works pretty well. So I think we can go ahead. I'm going to remove the print statement here and rerun this cell, and then just like we've been doing, for movie in movie_info_list, we will say movie release date... I'm going to say that this is the formatted one, or maybe say datetime just to show it's a datetime object. It's going to equal date_conversion of movie.get of release date, or "N/A". That's probably good. Run that. Now let's just look at a random movie in movie_info_list. Release date: look at that, it's a datetime object. That will be helpful when we want to look at these things in pandas in an analysis step. And January 25th, 1961, it matches. Cool, that looks good. So I think that at this point our data is clean. That was a hefty cleaning process. There's still additional cleaning that could be done, but this is already a pretty long tutorial, and I'm trying to give you a sense for as much as possible while still making it end eventually. So I think that's good for what we need to do. Our next step is going to be attaching IMDb scores, Rotten Tomatoes scores, etc. to
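For reference, the date-cleaning steps from this part can be sketched roughly as follows. This is a sketch under assumptions rather than the exact notebook code: the missing-value marker is assumed to be "N/A", and the two formats are the ones named in the video (month day, year and day month year).

```python
from datetime import datetime


def clean_date(date):
    """Drop anything from the first '(' onwards and trim whitespace."""
    return date.split("(")[0].strip()


def date_conversion(date):
    """Return a datetime for a scraped release date, or None."""
    if isinstance(date, list):
        date = date[0]            # several release dates: keep the first
    if date == "N/A":
        return None

    date_str = clean_date(date)
    # Try each known layout in turn; strptime raises ValueError on mismatch.
    for fmt in ("%B %d, %Y", "%d %B %Y"):
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    return None                   # e.g. a bare year such as "1937"


print(date_conversion("26 October 1953"))
```

Catching only ValueError, rather than using a bare except, keeps unrelated bugs visible while still skipping formats that do not fit.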
our data set.

All right, at this point I think it makes sense to save our data again, so let's go ahead and do that. Let's just make sure that our save_data function is loaded in, and run load_data too just for good measure. Okay, so save the data: we can call save_data now with maybe something like disney_movie_data_cleaned_more.json, I don't know, just some intermediate name to show the different stages, and we want to pass in our movie_info_list. Run that... no. Okay, let's see what our error is: "Object of type datetime is not JSON serializable". So we have an issue with the field we just added. There are multiple ways to save data in Python, and if you want to save pretty much any type of data in Python, you can also use what's called pickle. We can pickle this movie_info_list, and that would be another way for us to save it. You could do another Google search like "save a python object pickle" to figure out how to do this. That doesn't look super straightforward. Ah, that's decent. "Save and load pickle python", that will probably give us enough information. "Using pickle to save a dict" is probably close enough, because we have a list of dictionaries. Okay, this looks pretty straightforward, so we'll use it. We'll basically make new save and load functions using pickle. Let's just paste that in. We want this to be disney_movie_data_cleaned_more.pickle, and we'll move this into a function called save_data_pickle, I guess, and have it take in data. Indent this, indent this, and just call this f. I don't really know what the protocol, pickle.HIGHEST_PROTOCOL, does, so you can probably leave that in or you could remove it; it shouldn't matter too much. Okay, so that's a save_data_pickle method. We have to import pickle. I guess we should also give it a name parameter; we shouldn't define the name in here, so I'll just pass this in as name, and we want to dump
the data. So that's the save function; we can run that. Now let's also do a load function: import pickle, def load_data_pickle, and we'll call this parameter name. Indent, indent. Usually when you have pickle data, the file extension will either be .pickle or sometimes .pkl. Then we really just want to return whatever it loads, so return pickle.load of f. I'll call this f too; it doesn't matter what you name that. And this should be name. Now we can go ahead and save our data by running this cell and calling save_data_pickle: call it disney_movie_data_cleaned_more.pickle, and we want to save the movie_info_list. Run that, and now if we open up the folder where we have this, we see that we have disney_movie_data_cleaned_more.pickle. Just to show you that it works with the load, we could then do a equals load_data_pickle, pass in the exact same name, run that, and then look at a[5]. We see that we get exactly what we wanted, and it was able to save that datetime object, which is cool. That's a good reason to use pickle. On the other hand, if you were to send this data set to someone, you probably wouldn't send them a pickle file unless they're also familiar with Python. So that's the trade-off with a pickle file: you can save pretty much anything, but it's not a human-readable file, and you ultimately have to load it back in to get what you want. I could also do a == movie_info_list just to show that it's the exact same thing as it was before. It's pretty cool.

All right, let's move on to one of the final tasks, which is to connect IMDb scores, and just movie ratings in general, to our data set. In this task, just in case you stepped away and are starting this task now, we can reload our movie info list by doing movie_info_list equals load_data_pickle of disney_movie_data_cleaned_more.pickle,
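The two pickle helpers described here might look roughly like this. It is a sketch: the argument order (name first, then data) follows how the calls are spoken in the video, and the protocol argument is simply left at its default.

```python
import pickle


def save_data_pickle(name, data):
    """Write any picklable Python object to the file called `name`."""
    with open(name, "wb") as f:      # pickle needs binary mode
        pickle.dump(data, f)


def load_data_pickle(name):
    """Read back whatever object was pickled into `name`."""
    with open(name, "rb") as f:
        return pickle.load(f)
```

Unlike json.dump, this round-trips datetime objects unchanged. One caution worth adding: only unpickle files you trust, since loading a pickle can execute arbitrary code.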
and just a reminder, if you look in the GitHub repo these functions are available, or you could just duplicate them. So we use this function. What we want to do now: if we look at one of these movies, let's just index back to, say, negative 60. We have the movie Into the Woods, and what we really want to do is add things like IMDb score, Rotten Tomatoes score, etc. to this dictionary. Feel free to pause the video now and play it when you are ready. The task is still very broad right now, so I'm going to give a couple of hints: I'll real quick dive into a good spot to look for these scores and easily access them. Loading in another web page, let's just look up IMDb scores. On imdb.com I'm going to search for Into the Woods, the 2014 movie, and we see it has a 5.9 IMDb score here. Let me make this a little bigger so you can see everything. So what we really need to know is how we can get that programmatically. One way might be to try to scrape imdb.com and rottentomatoes.com, and I think Metacritic is one of the scores we'll be able to access too. As programmers, we know how to scrape; this whole video we've been scraping. But I also often like to know if there's an API that we can use, so I would look up something like "movie data api" and see what you find. There's The Movie Database API, which looks like it's probably pretty good. I've played around with this other one before, so we're going to use the OMDb API, the Open Movie Database API. What you see if you start using this API is that, I think, you get a thousand free hits on the API a day if you are not a Patreon patron. It's a one dollar a month charge to be a patron, but you don't have to do that if you don't want to; you can just use the thousand free requests a day. So basically you can get an API key by going on the site, going to API Key, getting the free
one, and then filling out the form. It should be pretty quick. When I did it, it was very quick, but I'm not quite sure about the free tier, because I did end up becoming a patron and paying a dollar a month to support this API. So what does the Open Movie Database API look like? We can go to Usage here. Basically it gives us this endpoint to hit, and we need to use our key, but then we can start searching by title. I could type something like Into the Woods, search that, and get a response in JSON. We see that it gives us a bunch of data on that movie, and one of these values, if we go way down here and make it a little bit bigger, is the imdbRating. That's perfect, that's what we just saw with the 5.9, so the API already has that built in. One thing I would also note is that this has a lot of the same fields that we collected during our scraping process. I think one thing that's cool to think about is that if you took what we did a few steps further, you could probably replicate an API like this on your own, and that's something you could maybe make a Patreon for so people can access it. It's kind of cool seeing that what we did is basically what this person does. If you look at his Patreon page, this person offers this API and has 2,286 patrons using it, so if everyone is paying, let's say, the lower minimum of one dollar per month, that's a nice little bit of cash on the side. That's pretty cool. So what we're doing is what this person did, but I would say this person has a more polished API, and they're hosting it on a server, etc. We have more of a static database right now, but that's kind of beside the point. Let's start accessing this API. You have to get an API key using your email, and they'll send you a key. Then we can copy this, go back to our Jupyter notebook, and let's
just paste this in, just so we have what the format looks like for where we'll make requests. All right, so to solidify the task: we want to attach IMDb and Rotten Tomatoes scores, and I'll also add Metascore to that, to our data set, and we'll use this API. I'd say go ahead at this point, pause the video, try to do this on your own, and then resume when you want to see how I would do it.

All right, so when we're working with APIs, the first thing that I think is important is to import the requests library. Ultimately we will have a URL that we'll be making requests to, so I'm going to call this base_url. Then we're going to have to pass in parameters. What are our parameters? They're going to be a dictionary of the different things we want to pass in. If we go back to the API, we see that we have to pass an API key, that's one parameter, and then we also have to pass in the title. It looks like the parameter t is the movie title, so we'll use that when we're passing it in. So the API key is going to be... all right, so this is kind of a security thing. I could type in my key, maybe my key is something like this, but when I'm paying that dollar a month fee for the key, it's not good for me to expose that information. The best technique here is to use environment variables. I can show this a little bit. If you're on Windows, it looks like this: I'm going to type "environment variables" in my search bar, go to "Edit the system environment variables", then "Environment Variables", and here I can set a system variable. I see I have OMDB_API_KEY, and I'm blurring out what the actual key is, but I can set this variable in here. You have to have admin access to do this on your Windows machine. It's a little bit different on a Mac or Linux to set
environment variables, but you can do a quick Google search and find it. Basically I set it here in the system variables, and then I can access that value. So maybe I set it to be something like this; this is not my actual key, but it was something like this, and that's what I put in that environment variable slot. How I reference it in Python is os.environ to get environment variables, and then the name, which I called OMDB_API_KEY. That is going to look up the environment variables and find the actual value that I have there. This is a really good practice, especially for more sensitive API access and things that you really can't have throttled or abused by other people: you don't want to be committing that to your public GitHub repo. This is very true for passwords too. You don't want passwords to be publicly accessible on GitHub or GitLab, so you can use environment variables to hide that information. The other thing we want is the title. How do we get the title? That's going to depend on what we're passing in, so we should probably wrap this in a function. I'm going to call this get_omdb_info and pass in the title. Indent this, and now the title is going to be whatever we pass into the function as one of our parameters. Okay, what do we do next? Well, we need to be able to make the proper request using the requests library. What we might want to do is encode this stuff onto the URL, and there are Python packages that can help us do this. I'm going to use the urllib library, and once again, if you ever forget how to encode something as a URL, do a Google search, figure it out, and then copy and paste your code in. So to encode these parameters, I'm going to say params_encoded equals urllib.parse.urlencode of the parameters. That basically gives us a string that's now packaged up, which we can attach to our base URL to pass this information
properly in URL syntax. So our full URL is ultimately going to be our base URL, and I'm also going to add a question mark here, which signifies that we're starting the parameters, plus params_encoded. Then we need to make a GET request to the OMDb API, and we can do that using requests.get. We pass in the full URL, and then we want to get the JSON response from that. I'm doing a lot of steps in one go to make this a bit quicker, but if you need more information, look up the requests library and the urllib library, and I think this step will become clearer. We can just return this, I would say. Let's see what this gives us: if I do get_omdb_info for the title Into the Woods, let's see if it works just like it did when we were using their website. Look at that, it gives us all that information formatted as JSON, which we can access like a Python dictionary. We can easily get the rating from this, and we can easily get the Metascore from this. The Rotten Tomatoes score is a little bit annoying because it's hidden inside this Ratings list; it doesn't seem to have its own field, so we'll have to do a little bit of extra work to get that. Let's write one more function, get_rotten_tomato_score, and have it take in an omdb_info. Just to clarify what I'm doing: we can easily add Metascore and IMDb rating to our data set as is, but I want to have a function so that when we actually go ahead and do that, we can also add the Rotten Tomatoes score. So ratings equals omdb_info.get of Ratings, with an empty list as the default if it doesn't return anything. Then for rating in ratings, let's just print out the rating; I just want to show you what this is ultimately going to do. If we say info equals this, then get_rotten_tomato_score of info would print out
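A sketch of the request-building steps from this part. Several details are assumptions: the endpoint is taken to be http://www.omdbapi.com/ with query parameters apikey and t, the environment variable is assumed to be named OMDB_API_KEY, and build_omdb_url is a hypothetical helper split out here so the URL can be inspected without a network call.

```python
import os
import urllib.parse

# Assumed endpoint; check the OMDb usage page for the current one.
BASE_URL = "http://www.omdbapi.com/"


def build_omdb_url(title, api_key=None):
    """Return the full request URL for a title lookup."""
    if api_key is None:
        # Read the key from the environment so it never lands in the repo.
        api_key = os.environ["OMDB_API_KEY"]
    parameters = {"apikey": api_key, "t": title}
    params_encoded = urllib.parse.urlencode(parameters)
    return BASE_URL + "?" + params_encoded


def get_omdb_info(title):
    """Fetch the OMDb record for `title` as a dictionary."""
    import requests  # third-party; imported here so the URL helper works without it
    return requests.get(build_omdb_url(title)).json()


print(build_omdb_url("Into the Woods", api_key="demo"))
```

As an alternative to encoding by hand, requests.get also accepts a params dictionary and does the URL encoding itself.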
this stuff. Basically what we want to do is grab the source of each rating, and if it's equal to Rotten Tomatoes, that's the one we want. So I'm going to say if rating["Source"] == "Rotten Tomatoes", then the value we want to return is the rating's Value field. That's how we can get the Rotten Tomatoes score, but we have to iterate over the list to do that, which is why we added this additional logic. If Ratings doesn't exist, it will fall back to this default empty list, so it wouldn't be able to iterate over anything, and we can just have it return None in that case. Cool. Honestly, I think using these two functions we can grab everything we need, so let's go ahead and do that. As a reminder, we're trying to add IMDb score, Metascore, and Rotten Tomatoes score to this dictionary. So let's iterate across the list like we have been: for movie in movie_info_list, we want to grab the title of the movie so we can use it in our search function. Let's see how we get the title... it's just called title in our dictionary, so we can do movie title. Once we have the title, we can get the OMDb info: omdb_info equals get_omdb_info of the title. Once we have the OMDb info, we can start adding things to our dictionary. So for imdb, I'm going to say it equals omdb_info.get of... let's see what it's called; we want to print out an omdb_info so we can check. It's called imdbRating, so we can get imdbRating. I'm using get just because we might run into the case where this score doesn't exist, and we don't want it to fail. If it doesn't have the imdbRating, we'll just say None, and you could maybe do some manual searching to find it. Now the Metascore, so we'll just call that metascore. If we look at our omdb
info, we can get the Metascore, with a capital M, to get this field, and we'll also have that default to None if it's missing. Then finally Rotten Tomatoes: I'm going to write rotten_tomatoes with an underscore so it's a single word, and that's going to be equal to omdb_info.get... actually, sorry, we can't get that directly because it's inside the Ratings list. That's why we wrote the function get_rotten_tomato_score, so it's just going to be get_rotten_tomato_score of the omdb_info. That should be good. I think we can run all this and cross our fingers that things work well. Come on... it might take a little bit, because we're making a lot of API requests, probably something like 400. One thing I might have recommended doing is grabbing the index of the movie you're on and printing that out, because having that information shows how your progress is coming along. Look at that, it ran. So now the moment of truth: in our movie info list, do we have the extra information? Cool. What is this movie? It's The Jungle Book, nice: Rotten Tomatoes score 94, Metascore 77, IMDb 7.4. It looks like it's there. We can do some other checks and grab another one. This one's rated badly: The Nutcracker and the Four Realms, 32, 39, 5.5. This is going to be fun to have when we start doing analysis, but that looks good. One note I will make is that we only grabbed three things from the OMDb database. It provides additional information like genre, so as an extension to this task you could add genre, maybe plot, or awards; you can add all sorts of new stuff, so use this however you see fit. When we actually start doing analysis I might add some of the additional things, but that's out of the scope of this video; I just wanted to mention it real quick. You could also potentially use this as a cross-reference to check whether you have the same exact
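The score-attaching logic from this part, sketched against a hand-written sample response so it runs without an API key. The field names (Ratings, Source, Value, imdbRating, Metascore) follow the OMDb response described in the video; attach_scores is a hypothetical helper, and the sample values are placeholders rather than real scores.

```python
def get_rotten_tomato_score(omdb_info):
    """Pull the Rotten Tomatoes value out of the nested Ratings list."""
    for rating in omdb_info.get("Ratings", []):
        if rating.get("Source") == "Rotten Tomatoes":
            return rating.get("Value")
    return None  # no Ratings list, or no Rotten Tomatoes entry


def attach_scores(movie, omdb_info):
    """Add the three score fields to one movie dictionary (in place)."""
    movie["imdb"] = omdb_info.get("imdbRating")        # None if missing
    movie["metascore"] = omdb_info.get("Metascore")
    movie["rotten_tomatoes"] = get_rotten_tomato_score(omdb_info)
    return movie


# Placeholder response shaped like an OMDb record (values are made up).
sample_info = {
    "Title": "Example Movie",
    "Ratings": [
        {"Source": "Internet Movie Database", "Value": "5.9/10"},
        {"Source": "Rotten Tomatoes", "Value": "70%"},
    ],
    "Metascore": "60",
    "imdbRating": "5.9",
}

print(attach_scores({"title": "Example Movie"}, sample_info))
```

In the real loop, omdb_info would come from get_omdb_info(movie["title"]) for each movie in movie_info_list; wrapping the loop in enumerate and printing the index is one way to get the progress readout suggested above.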
information here as we do in our own database. That looks good. I'm going to go ahead and save this as kind of a final thing: save_data_pickle, I'm going to call this disney_movie_data_final.pickle, and we'll save the movie_info_list. Awesome.

All right, the last task of this video is going to be to save our data in a JSON file as well as a CSV file. With the JSON file, we've kind of done that throughout the video, but now we'll have to fix that issue with the datetime object not being serializable. For the CSV file, we definitely want to convert our list of dictionaries into a CSV so that it's easy for us to analyze this data set in a future video. An optional extension to this is to save your data in an actual database like MongoDB or a SQL database. We're not going to do that in this video, but I definitely recommend it as a good extension. One reason I'm deciding to skip it here is that our data set is not that big, so it's still very feasible for us to keep it in a JSON or CSV file, but for bigger data sets it's definitely a good route to use an online cloud database.

All right, so how can we save this data, first as a JSON file? To remind ourselves, we have the movie_info_list; we'll just grab the 50th element. Really all we need to do is convert this one field back to a string. Because this is a list of dictionaries, and dictionaries are mutable objects, we'll want to copy this list first. To copy it, we can't just assign a new variable to movie_info_list, because that would still have the same pointers to these dictionary objects. How we're going to do this is movie_info_copy equals movie.copy() for movie in movie_info_list, and that should give us a proper copy of everything
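To make the copy step concrete, here is a small sketch with a two-movie stand-in for movie_info_list. The key name "Release date (datetime)" is an assumption, and the month day, year string format mirrors the one used earlier for parsing.

```python
import json
from datetime import datetime

# Stand-in data; the real list comes from the scraping steps.
movie_info_list = [
    {"title": "Example A", "Release date (datetime)": datetime(1961, 1, 25)},
    {"title": "Example B", "Release date (datetime)": None},
]

# Copy each dictionary so edits to the copies leave the originals alone.
# (Plain assignment would only create a second name for the same list.)
movie_info_copy = [movie.copy() for movie in movie_info_list]

for movie in movie_info_copy:
    current_date = movie["Release date (datetime)"]
    # datetime -> string for JSON; None stays None.
    movie["Release date (datetime)"] = (
        current_date.strftime("%B %d, %Y") if current_date else None
    )

json_text = json.dumps(movie_info_copy)   # no serialization error now
print(json_text)
```

dict.copy() is a shallow copy, which is enough here because only a top-level field is reassigned; if nested lists inside each movie were being edited in place, copy.deepcopy would be the safer choice.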
because we're copying each individual dictionary. So now we can see that the movies are still there: if we do movie_info_copy at, let's say, index 20, we see everything. What we want to do is replace this field with an actual string instead of a datetime object, just to save it as JSON. To do that, we could do something like for movie in movie_info_copy. We want to edit the movie's release date field, and first off we should know what our initial date was, so current_date equals the movie's release date. This is either going to be a datetime object or it could be None. So if current_date, meaning it's not None, then what we want to do is take the current date and strftime it, kind of the reverse of what we did before: %B is month, %d is day, comma, %Y is year. Otherwise we'll have that same field be just None, so if it was None before, it stays None. I guess assigning it to None wouldn't change anything, because the only possibilities are that current_date is defined or it's None, so it would stay None, but we can do this for good measure. Okay, run that. Now if we look at movie_info_copy at index 20, we see that this is now a string. That's perfect. And just to check, let's make sure our movie_info_list is still not modified, and we see that it's just fine. That's great.

So now, because we've changed the one problematic field, we can go ahead and save this. I can use the save_data function that we defined a while back. I'll call this disney_data_final.json, and we can save the movie_info_copy into that. We'll probably have to load in our save_data function, so do that real quick, go back, and save. Cool.

All right, now we need to do our CSV, so insert some more cells. How do we convert this to a CSV? Well, usually when we're dealing with CSVs, one of the libraries we love to use is pandas. I'm going to do the import of that: import
pandas as pd. So what can we do with pandas to load our list of dictionaries into a DataFrame? I might just look up "load list of dictionaries to DataFrame pandas" and see if it gives us anything. "Convert list of dictionaries to pandas DataFrame": does this look like what we have? Yeah, it does. "Supposing d is your list of dicts, simply..." and all we have to do is that. That's easy enough. So we can do df equals pd.DataFrame and pass in our movie_info_list. We're going to use movie_info_list because it has the datetime, which will ultimately be helpful when we're doing analysis. And just to check things, we can do df.head() down here. Let's run both those cells, and what does the DataFrame head look like? Oh, look at that. Does it look good? Title... looks like it has everything. We're not going to analyze this right now, but one thing to note is that a lot of these columns might not be necessary when we're looking at it as a CSV, so you can decide to filter this down more. Okay, everything's good. If we wanted to save that, we could just do df.to_csv with disney_movie_data_final.csv. That should be good, and we can see that it saves right here. Cool.

If you wanted, you could start playing around with this DataFrame a bit. Do df.info() to see some info about the columns; we see that this is a datetime object, which is good. One thing I noticed is that IMDb, Metascore, and Rotten Tomatoes are not integers, so one quick thing you could do is convert these to integers, but it's really ready to analyze. If you wanted to see which movies are the longest, you could do something like running_times equals df.sort_values, where we sort on running time, the integer version, and we could have it go ascending equals True, so we see the shortest movies first and then get to the longest. I believe that would do. And if we
printed running_times.head()... oh no, I put this in the list; it should be outside the list. Run that, and you see that we get the shortest movies first, like Roving Mars, Sacred Planet, Saludos Amigos. Those are all pretty short. We could get even more if we wanted: Bambi is pretty short, and it looks like Winnie the Pooh is too. So we see that this is in order of length; we see the integer number here. That looks good. You could do the reverse and find the longest by passing False: Pirates of the Caribbean, I guess, is long, etc. So you can do a bunch with this now that we have everything there. There's some minor cleaning left, and we're going to do a bunch of analysis in a future video, but we've created a data set: we've scraped a bunch of Wikipedia pages, did a bunch of cleaning on what we collected, and used an API to add info to our table. We've done a ton of things, so nice work with all of your efforts in this video.

There's going to be a future video where we analyze all this data. This video took a lot of effort to prep, so if you did enjoy it, it would mean a lot to me if you throw it a big thumbs up. Also, if you haven't already, it would mean a lot if you subscribe to the channel. If you're looking for more exercises to do, check out my other videos, and make sure to check out DataCamp; I have a link in the description. To stay up to date on everything that I'm doing, make sure to follow my socials, Instagram and Twitter. But without further ado, I think that's all we have for this video. Thank you, everyone, for watching. I really hope you enjoyed this one. This was a heck of a project, so I had a lot of fun putting it together, and I hope you had a lot of fun solving the tasks on your own. Until next time, peace out! [Music]
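As a recap of the saving steps walked through in the captions, here is a minimal sketch rather than the exact notebook code from the video. The sample records are made up, and the key names "Release date (datetime)" and "Running time (int)", the body of save_data, and the temporary output folder are all assumptions for illustration. It assumes pandas is installed.

```python
import json
import os
import tempfile
from datetime import datetime

import pandas as pd  # third-party; assumed to be installed

# Assumed key names; the keys in the video's dictionaries may differ.
DATE_KEY = "Release date (datetime)"
RUNTIME_KEY = "Running time (int)"

# Made-up sample records standing in for the scraped movie_info_list.
movie_info_list = [
    {"title": "Movie A", RUNTIME_KEY: 95, DATE_KEY: datetime(2000, 1, 15)},
    {"title": "Movie B", RUNTIME_KEY: 42, DATE_KEY: None},
    {"title": "Movie C", RUNTIME_KEY: 120, DATE_KEY: datetime(2010, 7, 4)},
]


def save_data(path, data):
    """Hypothetical stand-in for the save_data helper mentioned in the video."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


# Copy each dictionary so that editing the copies leaves the originals alone.
# dict.copy() is shallow, which is enough when replacing a top-level field.
movie_info_copy = [movie.copy() for movie in movie_info_list]

# json cannot serialize datetime objects, so store the date as text instead.
for movie in movie_info_copy:
    current_date = movie[DATE_KEY]
    if current_date is not None:
        movie[DATE_KEY] = current_date.strftime("%B %d, %Y")
    else:
        movie[DATE_KEY] = None

# Temporary folder, so the sketch does not clutter the working directory.
out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, "disney_data_final.json")
csv_path = os.path.join(out_dir, "disney_movie_data_final.csv")

save_data(json_path, movie_info_copy)

# Build the DataFrame from the original list, which still holds real datetimes.
df = pd.DataFrame(movie_info_list)
df.to_csv(csv_path)

# Shortest movies first; pass ascending=False to list the longest first.
running_times = df.sort_values([RUNTIME_KEY], ascending=True)
print(running_times[["title", RUNTIME_KEY]].head())
```

Note that ascending=True is a keyword argument of sort_values and sits outside the list of column names, which is the slip corrected in the captions. When the CSV is read back with pd.read_csv, the date column comes back as text unless parse_dates is passed.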
Info
Channel: Keith Galli
Views: 141,393
Keywords: KGMIT, Keith Galli, MIT, Beautifulsoup, beautiful soup, data science, data visualization, data analysis, real world data science, data science project, regular expressions, python, python 3, python programming, pandas library, pytest, web scraping, selenium, bs4, bs, datetime, re, dataset, dataset creation, data cleaning, data exploration, data scientist, machine learning, ai, eda, exploratory data analysis, data engineering, engineering, python project, programming, programming project
Id: Ewgy-G9cmbg
Length: 204min 17sec (12257 seconds)
Published: Thu Oct 01 2020