Pandas & Python for Data Analysis by Example – Full Course for Beginners

Captions
Learn how to use pandas for data analysis and data science in this full course for beginners. You will learn pandas by example from Santiago. He is an experienced developer and he also created one of the most popular data analysis courses on the internet, so you are in good hands. Let's get started.

Hello, my name is Santiago and this is Pandas by Example, a collaboration between freeCodeCamp and DataWars. I'm originally from Argentina, world champions of the 2022 football (or soccer) World Cup. Why am I telling you this? Well, like any other Argentinian, I've watched and played football pretty much my entire life. And you know what I have learned? You don't get better at playing football just by watching games on TV. This seems obvious, right? To become a better footballer, you have to go out and practice. The same applies to data science or any other analytical discipline. Just watching videos doesn't make you an expert; you have to practice and put your skills to the test.

That's exactly what we're trying to do at DataWars. DataWars is a platform 100% focused on practicing data science and applying your skills with real-life projects. There are no videos, just projects to resolve, with bite-sized activities that you can check at each stage of the process. The best part is that it's entirely free, and we'll explain how to sign up in just a bit. This is also the reasoning behind this "by example" series. This video focuses on resolving real projects, brainstorming, and discussing the solutions and the pitfalls along the way. We encourage you to try to solve the projects by yourself first, even pausing the video before we reveal the solutions. We want you to put your skills to the test and challenge yourself.

This video covers the most important aspects of data management with pandas, including data analysis, data cleaning, and data wrangling. Each project focuses on a different topic and the complexity grows from start to finish. We have separated each project as a chapter, so you can jump back and forth. As mentioned before, we encourage you to pause the video and try to resolve each activity by yourself first. You can find a list of all the projects we are resolving, and the instructions to sign up, at datawars.io/freecodecamp. With all that said, let's get started solving projects right now.

In this project, we're going to practice our skills dealing with pandas DataFrames. We're going to do some selection, understand index selection, create a new column, use some statistical summarization methods, and finally do some question answering by building conditionals: selection based on different conditions or queries that we want to write. This is on the simpler side for this series; we're just getting started with DataFrames, doing statistical methods, some selection by the index, creating a column, and writing some conditionals. If you feel comfortable with these topics, just move ahead to other sections of the video, where things get a little more challenging. As usual, I encourage you to pause the video before each activity and try to resolve it by yourself, so you get some of that application of the skills. Worst case, you resume the video and take a look at the way I resolved it. And by the way, there are going to be different ways of resolving these activities.
You might even stumble upon something you do differently from how I do it, and that can be a very rich experience. So let's get started. The dataset we're using contains a big list of English words. The index is the word itself (we'll talk about the index in a second), and there are two columns: the char count (character count) and the value of the word. The char count is easy: it's just the length of the word. The value is computed by adding up the individual values of each character, where A is 1, B is 2, C is 3, et cetera. So for 'aah', it's 1 + 1 plus the value of H; since the total is 10, H must be 8, the eighth letter. This is all given: I have already read the DataFrame, so let's get started with the activities.

The important piece about this DataFrame is that the words themselves make up the index. Whenever we want to access a given word, we can do it by the index, and this is why it's so important to understand how indices work in pandas.

First activity, right off the bat: how many elements does this DataFrame have? There are multiple ways of answering that. The first one is the info method, one of the most common. df.info() gives us an understanding of the index: it tells us the entries go from 'aa' to some word I don't know the meaning of, and that we have a total of 172,821 words. We can also use the shape attribute, which gives us the shape of the DataFrame as if it were a matrix; of course, the values are the same. We plug the number in and see how it goes. There you go, it worked.

Moving forward: what is the value of this word that I'm not even going to try to pronounce? This is a very interesting activity, because we're going to do selection by index. As the index is the word itself, I can use the .loc accessor, which does index selection: I pass whatever index value I want to .loc, and that gives me the result. In this case, it gives me the entire row, but the result comes back as a Series: the projection, to put it that way, of a row, which is a horizontal structure, is a vertical structure, a Series. You can think of it as a transposing operation: we go from something that is a row to something that is a Series. It's nothing special; it just reflects the fact that the row is represented as a Series. And here we can see that the value of this word is 317.

Something also very convenient when doing selection is that .loc accepts two parameters: the index selection and then the columns. Why is that useful? Imagine we have 100 columns; it would be very hard to browse, to scroll to the value we're looking for. Instead, we can pass the column we want as the second argument; in this case we pass 'value', and the value is 317 directly. Let's go ahead and try it, and it worked as expected.
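For reference, here's a minimal sketch of the selections used so far. The file name and the exact column names ('char count', 'value') are assumptions based on the walkthrough, not the actual project files, and 'glowing' is just a word we know is in this dataset:

```python
import pandas as pd

# Hypothetical file name; the word itself is used as the index,
# with 'char count' and 'value' as the two columns.
df = pd.read_csv('words.csv', index_col='word')

df.info()        # index summary, column dtypes, total row count
print(df.shape)  # (rows, columns) tuple, e.g. (172821, 2)

# Selecting a row by its index label returns it as a Series
print(df.loc['glowing'])

# A second argument narrows the selection to a single cell
print(df.loc['glowing', 'value'])
```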
Next: what is the highest possible value of a word? We'll use a couple of methods; let me show you the DataFrame once again. We can take df['value'] and call .max() on it, which gives us the maximum of that column, or we can call df.max() to have the max statistical method computed for the whole DataFrame. Either way, we see that the value is 319. Something else that is pretty common is df.describe() (there we go, autocomplete got it). This method gives you summary statistics of all your numeric columns; in this case we only have two columns and both are numeric, so we're good. It shows, for example, the average value, the average char count, the minimum and maximum char count, the maximum value, et cetera. So we have multiple methods; we submit 319, and that worked.

Now the next one: which of the following words has a char count of 7 and a value of 87? This is pretty interesting, because the way we did selection before was by passing one value of the index; we worked with just one word. But .loc can take a list of index values to retrieve, so instead of passing just one value, I can pass all of them. This is a slightly annoying procedure: I'm going to copy and paste all the candidate words, so feel free to fast-forward. (I'll skip the one that clearly has more than seven characters.) Now you can see that with .loc you can pass multiple indices to get a multi-row selection, or just one to select an individual row. And to answer the activity, the word with a char count of 7 and a value of 87 is glowing, right there. We submit 'glowing', and that works. One comment: as before, the first element, in this case a list, is the list of indices to select, and the second one could be just a column. So here I could say 'value' (oops, I missed a square bracket) and get just the value for each of these rows. Of course, in this activity we needed both the char count and the value, which is why I didn't do it, but it's worth the clarification.

So, what is the highest possible length of a word? We did this before with describe; we do it again and look at char count. The maximum seems to be 28. There you go.
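Here's the list-selection idea as a sketch. Only 'glowing', 'aa', and 'aah' come from the walkthrough; they stand in for the activity's actual candidate list:

```python
# .loc with a list of index labels returns a sub-DataFrame of those rows
candidates = ['glowing', 'aa', 'aah']  # stand-ins for the activity's options
print(df.loc[candidates])

# The second argument can still narrow the result to one column
print(df.loc[candidates, 'value'])
```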
Moving on to the next section: exploring interesting words. We saw in an earlier activity that one word is the only one with a value of 317; now, find the only word with a value of 319. There are multiple ways of solving this. Let's say we do some sorting first; it's not going to be the most effective way, but let's try it. Let me see if I can fit everything in the same frame: df.sort_values(by='value', ascending=False). In this case, as this word was the one with the highest value, just by sorting we were able to reach the result. But of course, that is not always the case: what happens if the activity asks for a value of, I don't know, 18, or even worse, 50? That is clearly in the middle, and I'd have to browse the whole DataFrame to find the correct answer.

What we need instead is some selection. I'm going to write the whole syntax and explain it after: df.loc[df['value'] == 319]. This is the first idea of conditional selection in pandas. We have the word; I plug it in, and the activity passes. Now, how does this work? I assume you have worked with this before; if you haven't, check out some of the simpler projects we have on the platform, which explain these topics with a very thorough, step-by-step process. The idea is that the comparison returns a Boolean array. Building that Boolean array is like saying, for each word: select it or not, indicated by a True or False value. In this case we don't want the 'aa' word, we don't want the 'aah' word, et cetera, but the rows that fulfill or match the condition get a True. It's a long Boolean array, 172,000 elements, with just one True value: the one for this word. Once this Boolean array is resolved, to put it one way, in memory, we pass it to .loc, and that returns the row.

So, to wrap up this idea: we have three ways of using .loc. We can pass just one index value, as we did before; we can pass several values that match the index; or we can pass a full Boolean array indicating which rows we want to select with False or True values.

One tiny comment, a question we receive all the time: is this efficient? We could have resolved this by just iterating, without using extra memory, so to speak, because at the end of the day we create this Boolean array, which sits in memory, and then we do the selection based on it. So why create this intermediate Boolean array when we could just iterate? I could write something like: for each row in df, if the row's value equals 319, print the row (in reality I'd also have to use the index, but this is just a more imperative way of resolving the activity). The DataFrame is already in memory, I create an auxiliary slot just for the current row, and that seems more efficient. But the reality is that these Boolean arrays are actually very, very tiny. They might look like a huge series of string values, but in reality they are just ones and zeros: a very small, one-bit-per-element array, and the index is pretty much the same one we already have. So, just a comment on the side; I don't want to get too technical. The beauty of pandas selection is that .loc works by the index, and again, we can pass just one value, multiple values, or a full Boolean array.
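A sketch of that conditional selection, assuming the same df as above:

```python
# The comparison produces a Boolean Series aligned with the word index:
# True only for rows whose 'value' equals 319
mask = df['value'] == 319

print(mask.sum())    # number of matching rows (1 in the walkthrough)
print(df.loc[mask])  # pass the mask to .loc to retrieve those rows
```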
All right, moving forward: what is the most common value? To answer this, we can refer back to df['value'].describe(). There we go: we get different statistical measures, the mean, standard deviation, minimum, maximum, et cetera. We also have the median, which is 103. But there is one that doesn't show up here: the mode of the given series or array. So we do df['value'].mode(), and we get that 93 is the most common value. Now, on its own this doesn't tell you much: it just says that of all the different values in our DataFrame, 93 shows up the most times. We don't know in which situations, because the char counts might differ; the characters are different, so the values are different. A very common method, something I use all the time, is df['value'].value_counts() (let me show just the first entries). This is very, very helpful: it tells you that a given value appears 1,965 times. So 100, as the activity said, is the second most common value, with 1,921 words matching it; the value 93 is the most common one, with 1,965 words matching it. We can do df.loc[df['value'] == 93] and look at the first ten: all these words have the same value even though they have different char counts, because they have different characters. Let me use sample() to get a random set: right here, for example, 'ingests' has seven characters and a value of 93; again, the sum of the characters comes to 93. Let's sample again, keeping that one there: 'detestable' has ten characters and the same value. So, to compute the mode of a column you can use the mode statistical method, or, more useful to me, value_counts, which gives you not just the most common value but a whole ranking of the most common values in a column, along with how many samples match each one. Oh, we haven't submitted it: 93 is our most common value, and there you go, it worked.

What is the shortest word with the value 274? Let's resolve that now with a little selection first; let me align things so we can zoom in. We start by selecting all the words with a value of 274, there you go. Then we can either sort those rows by char count, in which case the first one is the shortest, or do a little more selection: take just the char count column of that result and get its minimum. So, what's the minimum value? 20. With that, we could also extend our query: df.loc[(df['value'] == 274) & (df['char count'] == 20)], the 20 being the minimum we got before, and there we go, a very quick selection. We could even do it in a dynamic way, plugging the computed minimum into the expression; it's all the same thing. At the end of the day, if we don't have too many elements, sorting the values is an easier way to resolve it. So let's go ahead and do it. There we go.
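Both approaches from this activity, sketched under the same column-name assumptions:

```python
# Ranking values by frequency: mode() gives the winner, value_counts()
# gives the whole ranking with the number of words per value
print(df['value'].mode())                  # 93 in the walkthrough
print(df['value'].value_counts().head())   # value -> number of words

# Shortest word with value 274: filter first, then sort by length
matches = df.loc[df['value'] == 274]
print(matches.sort_values(by='char count').head(1))
print(matches['char count'].min())         # 20 in the walkthrough
```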
By the way, this is something we do in production, in our day-to-day jobs, all the time. Sometimes you will need to build more dynamic solutions, but sometimes you're just doing some analysis, just exploring data, and you can keep it easy with the simpler methods. The important piece is applying the right technique to the right problem.

All right, moving forward; the next activity is interesting. Create a column 'ratio', which represents the value ratio of a word, defined as the value of the word divided by the char count. Stop me if you want to practice this by yourself. Let me show you df.head() first. The way to create it is df['ratio'] = df['value'] / df['char count'], using the column name the activity asks for. Now df.head() again shows the ratio computed right there: the value divided by the char count. Let's check: it worked, so we're good to keep moving forward.

A few more questions. What is the maximum value of ratio? df['ratio'].max() gives 22.5. What word is the one with the highest ratio? Multiple ways to solve it; again, sort_values is a good alternative, sorting by 'ratio'. By default, sort_values sorts in ascending mode, and we actually want it flipped, with the highest at the top, so we pass ascending=False. Now we see that the highest ratio is 22.5, and the word is 'xu'. As we know the highest ratio, we could also have written a query: df.loc[df['ratio'] == df['ratio'].max()], and that also gives 'xu', the same result. So again, two different ways of resolving the same activity. It's important to have all the techniques in your tool belt; which one you use depends on the problem you're solving.

How many words have a ratio of 10? We could try value_counts to get an idea of the values, though this isn't going to work so well. We got lucky here: 10 is a common enough value that there are 2,600 words with it and we can see it. But if the ratio we're looking for were not such a common value, it would probably be buried in this sea of different ratios, because there are actually 1,333 distinct ratios. Since ratio is a continuous variable with many individual points, it's usually not convenient to use value_counts. Instead, what we want is a simple query: df.loc with all the rows that have a ratio of 10; that's the query we want to build. It gives us a bunch of rows, and we can call .shape on the result to get the count, which is the number we plug in.
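The ratio column and the follow-up questions as a sketch:

```python
# Derived column: value per character
df['ratio'] = df['value'] / df['char count']

print(df['ratio'].max())  # 22.5 in the walkthrough

# Word with the highest ratio, two equivalent ways
print(df.sort_values(by='ratio', ascending=False).head(1))
print(df.loc[df['ratio'] == df['ratio'].max()])

# How many words have a ratio of exactly 10
print(df.loc[df['ratio'] == 10].shape[0])
```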
Let's make a quick stop here, a quick aside, to mention the .query method. df.loc is usually the preferred method for selections, because it's the most powerful one, but sometimes you just want to use .query, which is a kind of shorthand: you write the condition in something close to plain English, kind of Python-based. The query method works like this: we can say 'ratio == 10', and it's relatively simple. You can use it whenever you have simple conditions; there it is for you to compare. Personally, I usually prefer the .loc method, but that's because I'm very used to it, and I like thinking in terms of Boolean arrays and combining Boolean arrays and all that. It's up to you to use query or whichever suits you better.

Moving forward: what is the maximum value among all the words with a ratio of 10? Let's keep the query here; these are the results. We can sort those values; let me put the activity in context again: sort_values by 'value', ascending=False, then head(). What is the maximum value? 240, it seems. Actually, for this particular solution I think there's a better method: we take the query result, select the 'value' column, and just call .max(). We get the same result, 240, and that's the answer. The beauty of sort_values is that sometimes you get to see the data. With the analytical resolution you have to trust the number, which is fine; but seeing that the ratio is 10, the char count is 24, and the value is 240 leaves you at peace that the data is okay.

Let's keep moving forward. Of those words with a value of 260, what is the lowest char count found? This is pretty much the same thing. Let's use query now: df.query('value == 260') gives us all the words with a value of 260. Now we sort_values by char count, and the lowest char count found is 17. So here we plug in 17, there you go. And the last activity, based on the previous task: what word is it? It's the first word of that sorted result, and we are good to go. One last, important comment about the query method before we wrap up this particular project.
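The .query equivalents of these activities, sketched with the same assumed column names:

```python
# .query is shorthand for the same Boolean selections
with_ratio_10 = df.query('ratio == 10')
print(with_ratio_10.shape[0])           # how many words matched
print(with_ratio_10['value'].max())     # 240 in the walkthrough

# Lowest char count among the words with a value of 260
print(df.query('value == 260')['char count'].min())  # 17
```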
When your columns have spaces, you need to surround them in backticks. Let's invent a new activity ourselves, very quickly, just writing it on the fly. I'll use describe first, there you go. Say: find all the words with a char count greater than the mean (average) char count. First we compute the mean and store it, something like mean_char_count = df['char count'].mean(), and we print it. Now let me show you how to write the query using this variable. If we write df.query('char count > 9'), that's not going to work, because the column name has a space; you have to surround it with backticks. That's the first thing. The second thing: if you want to reference this variable in here, writing its bare name won't work either; it will just blow up, because by default any name within a query references a column. To reference an external variable, you prefix it with the at symbol, @. So in this case we're saying: give me all the rows in which this column, surrounded by backticks to respect the white space, is greater than this thing, where @mean_char_count references an external variable. pandas goes out, so to speak, to the enclosing (global or local, whatever) scope, finds the value, plugs it in, and runs the query, and here's what we get.

So again, the query method is convenient. Personally, I prefer to use the .loc method; I like to combine it with column selection, and I like to think in terms of Boolean arrays, but it's perfectly fine to use query. Just remember these tricks: sometimes you're going to need backticks for columns with white spaces, and you can reference external variables with the @ symbol. That is it with DataFrames. Let's keep moving forward and solve more projects, applying those data science skills.
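Before moving on, here are those two query tricks together as a minimal sketch, same assumed column names:

```python
# Backticks for column names with spaces, @ for external variables
mean_char_count = df['char count'].mean()
above_average = df.query('`char count` > @mean_char_count')
print(above_average.shape[0])
```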
In this project, we're going to practice how to filter data, sort it, and do some selections and querying: all very useful data analysis techniques. The data we're going to work with is a dataset containing Pokemon information, including their types, their stats (total, attack, defense, and all that), their generation, and their legendary status. Let's go ahead and read the data to get started, and we'll work through the activities one by one. As usual, pause if you want to resolve them by yourself, and then you'll hear my explanation. What I'll try to do in this project is give you all the different possibilities we can think of: solutions with .loc, solutions with .iloc where that applies, and also solutions with the .query method.

The first thing you have in this notebook is a bit of information about the distribution of the Pokemon we have. For example, these are all the Type 1 values, and we also have some analysis of the stats: for the total, we see that there are a few very powerful Pokemons, both in the histogram and in the box plot. All these visualizations are usually pretty useful when you're starting your analysis with a new dataset: a quick look at how your data is distributed in terms of categorical variables, and the ranges of some numeric variables, like the totals here. Legendary status we can also consider a categorical variable.

With that, let's get started with the activities, one by one. How many Pokemons exist with an attack value greater than 150? Let's take a look at the data first, to see what we're dealing with, and then: df.loc[df['Attack'] > 150]. That gives us the full list of Pokemons with that attack, and we could already kind of see from the box plot that there weren't too many with that value. With the query itself, we get a real sense of how many there are: in this case, just three. We could also call .shape to get the count, but we can clearly see that only three samples match the given condition. We can also use the query method, df.query('Attack > 150'), and we get the same result. Let's try it out: there you go, it worked, correct.

Moving forward: select all the Pokemons with a speed of 10 or less. In this case we need to create a sub-DataFrame; it says right here, store your results in slow_pokemons_df. A speed of 10 or less is going to be pretty, pretty slow, so it's going to be a pretty small set of Pokemons. Let's take a look first at how many fulfill that condition: df['Speed'] <= 10, less than or equal to 10. We have only a handful: one, two, three, four, five. And of course we assign that selection to the variable for the final check of the activity. There you go, it worked. And with .query, it's df.query('Speed <= 10'). From now on, I'll try to stick to the .loc method, the one that I prefer, instead of using query, but you can use either; I prefer to work with Boolean arrays.

Remember, this expression is just a Boolean expression: the values that are True are the ones that match the given condition. This is pretty useful if we want to count values, and it's something I could have used in the previous activity without realizing it. Basically, if I do (df['Attack'] > 150).sum(), I get the same result. How does that work? 'Attack' is a series, a collection of integers, and the comparison gives us a Boolean array marking which entries match the condition of being greater than 150. It's going to be False, False, False almost everywhere, with exactly three Trues in this huge array. And as I told you in the previous project with the English words, even though it looks like strings saying False and True, in reality this is a Boolean array: all ones and zeros, a very efficient, very compact representation. We sum it, counting all the ones, and we have three, the answer to the previous question as well.

All right, moving forward: how many Pokemons have an Sp. Def (special defense) value of 25 or less? We do df.loc[df['Sp. Def'] <= 25], because again it's a value of 25 or less, and now we have a bunch. We use the technique from before, .sum() on the condition (let me increase the size here a bit), and we get 17 as a result. I could have done the same thing with .shape: this table has 17 rows. The sum version is a little more compact, let's say easier. So let's try 17 and see if it works. There you go, it worked; we're good to keep making progress.
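A sketch of these first Pokemon activities. The file name and the column names ('Attack', 'Speed', 'Sp. Def') follow the common public Pokemon dataset and are assumptions here; the DataFrame is called pokemon below to keep it separate from the words df above:

```python
import pandas as pd

pokemon = pd.read_csv('pokemon.csv')  # hypothetical file name

# Summing a Boolean mask counts the True entries directly
print((pokemon['Attack'] > 150).sum())   # 3 in the walkthrough
print((pokemon['Sp. Def'] <= 25).sum())  # 17 in the walkthrough

# Sub-DataFrame of the slowest Pokemons, stored as the activity asks
slow_pokemons_df = pokemon.loc[pokemon['Speed'] <= 10]
print(slow_pokemons_df.shape)
```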
Next: select only the Pokemons that are legendary, and store the result in the variable legendary_df. This is a pretty interesting one. That's going to be df.loc with df['Legendary'], but let's take a look first using the info method. The Legendary column itself is a Boolean column, a Boolean series, given by its dtype. That means it is itself a Boolean array, with True and False values. We could ask how many legendary Pokemons we have in this dataset with df['Legendary'].sum(), and that's it, because this is a Boolean array. So remember: when we did df['Attack'] > 150, we built this huge Boolean array of False, True, False, True, and then selected the rows with a True value; we can pretty much do the same thing here. We could write something like df.loc[df['Legendary'] == True] (give me all the legendary Pokemons; what did I do wrong at first? ah, the == True, my bad). But the reality is that this column is already a Boolean array, so let's just get rid of the == True: select all the Pokemons that are legendary, just pass the mask, the Boolean mask, and perform the selection.

If we wanted the opposite, say all the Pokemons that are not legendary, we could of course compare with False and count how many we have. But really, we can just invert the array. If the original was False, False, True, False, negating it with the ~ operator gives True, True, False, True. So I can just apply the negation and get the complementary result. Sorry, I got a little sidetracked; what we want is all the legendary Pokemons. We assign that selection to the variable and show the head, the first five rows: all legendary. And the activity passes. This is why it's so important to understand the column types of your data, because the selections, the Boolean arrays you're going to work with, are not the same depending on the types you have.

Okay, moving forward: find the outlier. We have this distribution of Pokemons, and we have to find the particular one whose defense value is really big while its attack value is pretty low: a ton of defense, very low attack. Let's do df.head() to see the whole thing, then df.sort_values(by='Defense') and show the first few. By default, the sorting method is in ascending mode, so we're starting from the weak end and going in that direction. We want to invert that with ascending=False. Now we have the strongest defender at the top: the one with a defense of 230 and, in this case, an attack of 10. That pretty much gives us the answer right away. Let's actually try it out: yes, this is the Pokemon the activity requests, at least for this value; it passed.

But I want to make a comment here: you can actually sort by multiple criteria. We could sort by defense first and then, as a second criterion, sort by attack: defense in descending mode (ascending False) and attack in ascending mode. The result is the same, because the data makes it pretty clear who this outlier is, but the point is that we can sort by multiple criteria: the sorting starts with the first column, and then ties are sorted by attack in the other direction. We can combine as many criteria as we want.
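Both techniques sketched, with the same column-name assumptions:

```python
# A Boolean column works as a mask on its own; ~ inverts it
legendary_df = pokemon.loc[pokemon['Legendary']]
non_legendary_df = pokemon.loc[~pokemon['Legendary']]

# Multi-criteria sort: Defense descending, ties broken by Attack ascending
outlier = pokemon.sort_values(by=['Defense', 'Attack'],
                              ascending=[False, True]).head(1)
print(outlier)  # Defense 230, Attack 10 in the walkthrough
```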
All right, moving forward, more activities. Now we get into advanced selection with Boolean conditions. How many fire-flying Pokemons are there? We want Type 1 == 'Fire' and Type 2 == 'Flying'. So, pause if you want; we're going to use some Boolean operators with df.loc. The first condition is df['Type 1'] == 'Fire', and the second condition is that Type 2 has to be 'Flying'. But what we want is the combination of both: Type 1 fire and Type 2 flying. The way to do that is to combine them with the ampersand operator, &. I'll break the expression into separate lines and put the operator between them, so you can see both conditions. How many fire-flying Pokemons are there? It seems we only have five. Let's submit the result first, and then we'll explore the resulting DataFrame a little more. Notice that I haven't even looked at a single Pokemon: what I'm doing is just combining conditions and summing Boolean arrays.

This is a pretty interesting thing, because here's what is happening. Say we had only five Pokemons in total. The first array says whether the type is fire; suppose it's True, True, False, True, False: fire, fire, not fire, fire, not fire. The second says whether it's flying: False, True, False, False, False, so there's only one flying Pokemon. In the full dataset, of course, we have a bunch of fire Pokemons and a bunch of flying Pokemons, lots of True values for each condition. But when we use this ampersand, we go one by one, bit by bit, Boolean value by Boolean value, applying the & operation, which is like the 'and' operation in Python. For the first position, True and False is False. For the second, True and True is True. Then False and False is False, True and False is again False, and False and False is False. In an 'and' operation with two operands, op1 and op2, both op1 and op2 must be True for the whole expression to be True; anything else, whether op1 is False, op2 is False, or both are False, gives False. This is just some Boolean arithmetic, and it applies to Python as well.

The interesting thing is that this same operation, comparing bit by bit, value by value, aligned by the index, is what happens in our DataFrame: 47 Pokemons were Type 1 fire and 89 Pokemons were Type 2 flying, and when we combined those with an & operation, the result was only five Pokemons. We can actually visualize those sample points by passing the combined condition to .loc: these are the only five Pokemons that are Type 1 fire and Type 2 flying, because we are using these Boolean operators.
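The combined condition as a sketch:

```python
# Element-wise AND: each condition is wrapped in parentheses and
# combined with &, keeping only the rows where both are True
fire_flying = pokemon.loc[
    (pokemon['Type 1'] == 'Fire')
    & (pokemon['Type 2'] == 'Flying')
]
print(fire_flying.shape[0])  # 5 in the walkthrough
```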
If we wanted to use the query method, we could rewrite this. Remember that if you have white spaces in your column names, you have to surround them with backticks: `Type 1` == 'Fire' and `Type 2` == 'Flying'. (I made a mistake at first; 'Flying' should be a string. There we go, same final result.) This is probably easier to type. We need the backticks on the columns that have a white space, and inside the query we can just use Python strings; make sure the quotes you use for the values match, or rather don't clash with, the quotes surrounding the whole expression. And for the Boolean operators inside a query, we use the regular Python ones: not the ampersand, just plain 'and'. When we use the .loc form, we must use the bitwise operators: & for and, | for or, and ~ for not. And let me check: yes, we can write a plain 'not' inside a query as well.

Okay, moving forward, activity number seven: how many poison Pokemons are there across both types? As you can see right here, a Pokemon has two types, let's say a primary type and a secondary type, and the type poison can appear in either one. I can actually show you with value_counts: there are 28 Pokemons that have poison as Type 1, and, where is it, 31 Pokemons that have poison as the secondary type. So we basically want to find how many poison Pokemons there are in general, in either type. Let's start with the query method so you can visualize it first: `Type 1` == 'Poison' or `Type 2` == 'Poison'. There you go. How many do we have? Let's copy this and call .shape to get the final result. 28, it seems... wrong answer. Let me check that in a second; oh no, wait, it's just a typo here. 59, let's try 59. There you go, it worked; it was just a typo. Now let's think about how the same selection works with the .loc method.
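Both forms sketched: the query version of the AND, and the OR across the two type columns:

```python
# query form: backticks for spaced names, plain 'and'/'or' keywords
fire_flying = pokemon.query("`Type 1` == 'Fire' and `Type 2` == 'Flying'")

# OR with |: poison in either type slot counts
poison = pokemon.loc[
    (pokemon['Type 1'] == 'Poison')
    | (pokemon['Type 2'] == 'Poison')
]
print(poison.shape[0])  # 59 in the walkthrough (28 + 31)
```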
It's pretty much the same as before: df['Type 1'] == 'Poison' gives us a Boolean array, and we can combine it with the same expression for Type 2. But here what we want is an OR operation. Say we have the poison masks for Type 1 and Type 2 (let me see if I didn't delete our previous arrays; we'll use them again). Since this is an OR expression, we care about either of them: if a Pokemon is poison here or there, it doesn't matter which, we want to retain that row. If it's True in either position, the whole thing should come out True; that's the result of True or False, for example, or of False or True. Let me go back to our previous notes and pick up where we left off with the 'and' operator: with two operands, op1 and op2, op1 and op2 is True only if both operands are True; in any other case, everything is False. We can apply very similar reasoning to the 'or' operation, just flipped: op1 or op2 is False only if both operands are False; anything else is True. That's the idea behind Boolean algebra: you can flip things, True with False, and the rules carry over. True or False, False or True, True or True: all of these are True. The only way to get a False result back is for both operands to be False. Back to our logic: True or False is True, True or True is True, and, aligned, False or False is False. So those two positions are False: that Pokemon was fire-flying, so it's not poison. Another was poison-rock, another flying-poison (I'm just making up types), so those come out True. So what we're doing with .loc is generating the first Boolean array and combining it with the second Boolean array using the | operation. Conceptually, it works exactly this way, and if we wrap the whole expression in parentheses and call .sum(), we get the same count out. Remember, these are just Boolean arrays.

Okay, let's keep moving forward. What Pokemon of Type 1 ice has the strongest defense? You'll notice that I usually start my activities with df.head(), because I need to take a look at the structure of my data. I'll start with df.loc[df['Type 1'] == 'Ice']: these are all the Pokemons with Type 1 ice. Now I can keep going: take that result's Defense column and call .max(). The maximum defense of Type 1 ice Pokemons is 184. I can extend this by saying: give me the Pokemons that are Type 1 ice AND (here I use an & operation, because I want both things to be true) whose Defense equals the 184 maximum I found before. And if you let me put everything together, that gives me exactly one Pokemon. So far so good; let's submit it, and it worked, that was the correct answer.

But now let me show you something a little easier. In these cases, where we're looking for the one with the maximum value (or the few, if there are several), we can just sort. We sort_values by Defense; the standard sort shows the weakest Pokemons first, because it's in ascending mode. I want descending, so ascending=False, and now the most powerful defender, of type ice, is the first row: the same result as before.
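The two ways of finding the strongest-defense ice Pokemon, sketched:

```python
# Filter first, then either compute the max and filter again...
ice = pokemon.loc[pokemon['Type 1'] == 'Ice']
max_defense = ice['Defense'].max()            # 184 in the walkthrough
print(ice.loc[ice['Defense'] == max_defense])

# ...or simply sort descending and read the top row
print(ice.sort_values(by='Defense', ascending=False).head(1))
```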
Sorting like that is more of a visualization: I'm doing the analysis with my own eyes and just reading the first value, which is not so programmatic. To make it a little more programmatic, I could do something like .iloc[0] on the sorted result and take just the name of the first Pokemon. But again, both approaches are pretty much the same thing as far as the purpose they serve.

Okay, keep moving forward. What's the most common Type 1 among the legendary Pokemons? What we can do is first select all the Pokemons that are legendary, as we did before. That gives us all the legendary Pokemons. Now we take the Type 1 column of that selection and call value_counts, and we get that Psychic is the most common type. Let's submit that and see if it works. There you go. You can also chain .plot(kind='bar') to get a little bit of a visualization, but either way, the result is correct. The important thing here, in terms of filtering, is that we filtered by the legendary status with just the Boolean array, then passed the Type 1 column to get only that column, and then got a summarization using value_counts.

Okay: what's the most powerful Pokemon, by total, from the first three generations, of type water? df.head() as usual, so we take a look at the structure of our data. 'Total' is the representation of power here. Let's go step by step with the different conditions. Filtering the Pokemons of type water is simple: df.loc[df['Type 1'] == 'Water']. There you go. But now we have to combine it with the condition of being from any of the first three generations. How many generations do we have? Let's do value_counts on Generation. Actually, what I want is to not sort those counts, so we can visualize them in a quick bar chart. Let me check the syntax of value_counts: what parameters does it accept? I'm closer to finding it in the pandas documentation than by asking the AI here. There you go: sort=False is the parameter I was looking for. So value_counts(sort=False).plot(kind='bar'): these are the generations we have, one, two, three, four, five, six. This isn't really useful for the exercise; I just wanted you to take a look at the generations. The first three generations are, of course, generations one, two, and three.

Continuing with our exercise: how can we get the Pokemons of the first three generations? Let me actually type it outside first. We could write something like Generation equals one, or two, or three: each comparison wrapped in parentheses and combined with the | operator. There we go.
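Sketches of the two steps just discussed:

```python
# Most common Type 1 among legendary Pokemons
print(pokemon.loc[pokemon['Legendary'], 'Type 1'].value_counts().head(1))

# First three generations via chained ORs, each comparison parenthesized
first_gens = (
    (pokemon['Generation'] == 1)
    | (pokemon['Generation'] == 2)
    | (pokemon['Generation'] == 3)
)
print(first_gens.sum())
```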
But there is a much easier method: .isin. We do df['Generation'].isin([1, 2, 3]), passing a collection of one, two, or three, and we get the same result. It's basically the same expression as before: this value can take any of these forms, and that is good for us. It's the same as writing out the full OR. So, putting everything together (let me zoom in and break this into several lines): Type 1 is water AND the generation is in (1, 2, 3). There we go. And I already forgot what the question was: the most powerful Pokemon by total. So we sort_values by 'Total'; as usual, let me show what it returns in ascending mode first. We actually want descending, so ascending=False, and we get Kyogre (my English is bad, and my Pokemon is worse): the most powerful Pokemon by far in terms of total. We submit it and it works. A quick summary: the first part is the usual idea of combining conditions with Boolean arrays and sorting values to get the answer. The interesting part is that we're using the .isin method: it lets you pass several values, and it's basically a simple way to write an OR statement. Say we want all the Pokemons whose Type 1 is either fire or water: we can say .isin(['Fire', 'Water']). Any Pokemons that are fire or water match; the value basically has to be in that subset.

Okay, what's next? What's the most powerful dragon from the last two generations? This is getting interesting. Find the most powerful, by total, that is of type dragon, either Type 1 or Type 2, from the last two generations, and enter its name below. So it has to be a dragon in either type, and from the last two generations. We saw before (I already deleted it, but basically Generation value_counts) that five and six are the last two generations. Let's write the whole thing now, broken into lines: df['Type 1'] == 'Dragon', or df['Type 2'] == 'Dragon', and df['Generation'].isin({5, 6}) for the last two generations. A quick note: I'm using a set here. You can pass a tuple, a list, whatever; I just pass an inline set because it's the most correct way in terms of performance. I don't think it makes enough of a difference here, but it's the correct way of doing it.

Anyway, in terms of solving the activity, let's just run this thing: we have to find the most powerful Pokemon in this set of Type 1 dragon, or Type 2 dragon, and generation five or six. But you're going to find something interesting. Let me show you the results first so you can see what we get. Type 1 dragon, Type 2 dragon: that looks good, the first condition seems to be matching. But what you immediately see is that the generation is incorrect: we get Pokemons from generations one, three, four. And that is because of how we're combining the conditions: it's being read as either this, OR (this thing AND this thing).
The way the operators work, the & was applied first: it did the AND of the last two conditions, and then the OR with the first one. But that is not what we're looking for. We want the two type conditions performed first: either Type 1 dragon or Type 2 dragon, and then the generation condition. To do that, we need to surround those two operations with parentheses; the order of the operations matters here. First resolve "either type is dragon", then AND it with "generation five or six". I don't know if this syntax works... there we go, it worked. Now we can see dragon in Type 1 or Type 2, and the generation is only five and six: this is the expression we wanted. The only thing left is sort_values by 'Total', ascending=False, and we get that this Pokemon, Type 1 dragon, Type 2 electric (that doesn't matter), generation five, is the most powerful one. Let's try the answer right there, and it works.

Okay, moving forward, and this is important. I don't think we have another activity in this project that hinges on precedence, on the order of operators, but it's an important concept. Just remember: parentheses don't hurt. It's cheap to put parentheses, so if you have these complex expressions, just make sure you surround the ones you want to run first with parentheses. You can indent, you can do all sorts of things, as long as it looks better to you; in some cases the parentheses aren't strictly necessary. Anyway, just keep in mind that the precedence of the operations is important.

By the way, we can solve the previous one with the query method as well; sorry for going back, and if you're pausing and resuming, you're hating me at this moment. We say, and we need parentheses here: `Type 1` == 'Dragon' or `Type 2` == 'Dragon'. And... can we use 'in' here? We can: 'and Generation in [5, 6]'. Then, oh, we have to sort the values, and we get the same result. Let me break it into multiple lines; it's difficult to read otherwise. What you can see is that we're producing pretty much the same expression, and the order of operations is important here too: Type 1 dragon or Type 2 dragon first, then Generation in five or six. If we remove the parentheses, you can see our Pokemons are of generations three, one: clearly not working, so we have to go back and put the parentheses in for the correct order.

Now let's move forward to the next one: select the most powerful fire-type Pokemons, meaning all the Pokemons with an attack value above 100 and Type 1 equal to fire. We'll use a query: Attack > 100 (because it says above) and `Type 1` == 'Fire'. There we go. We run it, assign it to our variable, visualize it, and check the activity: it worked, we're correct.
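Looking back at the dragon selection, here are both forms side by side, with the parentheses that fix the precedence problem:

```python
# .loc form: parentheses force the OR to resolve before the &
dragons = pokemon.loc[
    ((pokemon['Type 1'] == 'Dragon') | (pokemon['Type 2'] == 'Dragon'))
    & pokemon['Generation'].isin({5, 6})
]
print(dragons.sort_values(by='Total', ascending=False).head(1))

# query form: same grouping, with the 'in' keyword
dragons_q = pokemon.query(
    "(`Type 1` == 'Dragon' or `Type 2` == 'Dragon') and Generation in [5, 6]"
)
```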
Next one: select all water/flying type Pokemons, that is, Type 1 water and Type 2 flying. We do pretty much the same thing as before: df.query with `Type 1` == 'Water' and `Type 2` == 'Flying'. We look at the data we get back, water and flying for the two types, and check the activity. There we go.

And next (well, not finally, there are two or three more): select specific columns of legendary Pokemons of type fire. This is interesting; pause now if you want to solve it by yourself. It says: perform a selection in your DataFrame of all legendary Pokemons that are of Type 1 fire, but select only the columns Name, Attack, and Generation. So we do df.loc, and there's a good reason for it: df['Type 1'] == 'Fire' and df['Legendary'] (I switched the order of the conditions here, but that's fine). That's the row selection, but we don't want all the columns, just a few, so we add the column list: ['Name', 'Attack', 'Generation']. Let me break this across lines so you can read it: df.loc accepts conditions for the index and then the columns. And let's actually add more parentheses; we don't need them, but they make it clearer what we're trying to do. There you go: the row condition, then the columns, and the final result is Type 1 fire and legendary, with the columns Name, Attack, and Generation. Let's give it a try, and that worked as expected.

Select slow and fast Pokemons. This is getting interesting: now we have a visualization to work with. Take a look at the distribution of Pokemon speed in the histogram included in the notebook. The red lines separate the slowest bottom 5% and the fastest top 5% of Pokemons: basically, we draw a histogram and plot two red lines at the 0.05 and 0.95 quantiles, the 5th and 95th percentiles. What we want is to select the Pokemons that are either above the 95th percentile or below the 5th: the two extremes of the distribution, let's say. We have the cutoff values, by the way; let's actually store them: bottom_5 is the 0.05 quantile, top_5 is the 0.95 quantile, and we can show both. So we know that the cutoff for a very slow Pokemon is a speed of 25, and the cutoff to be considered a very fast Pokemon is 110. First, with the .loc method: df['Speed'] < bottom_5 (very slow, below the bottom 5%), OR, and here's another one, df['Speed'] > top_5 (very fast). We put the whole thing into the requested slow/fast variable; this is our table. Let's try it out first to see if it works. There we go, it works. But there is another way, using the .query method, and you know this already: Speed less than, and to reference an external variable, as we've seen, we use the @ symbol: Speed < @bottom_5 or Speed > @top_5; we don't need parentheses here. And we get the same result: 60 rows, 13 columns. You can trust me.
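A sketch of the column selection and the quantile cutoffs, same assumed dataset:

```python
# Row condition and column list in a single .loc call
fire_legendaries = pokemon.loc[
    (pokemon['Type 1'] == 'Fire') & pokemon['Legendary'],
    ['Name', 'Attack', 'Generation']
]

# Speed cutoffs at the 5th and 95th percentiles
bottom_5 = pokemon['Speed'].quantile(0.05)  # 25 in the walkthrough
top_5 = pokemon['Speed'].quantile(0.95)     # 110 in the walkthrough
slow_fast = pokemon.loc[
    (pokemon['Speed'] < bottom_5) | (pokemon['Speed'] > top_5)
]
print(slow_fast.shape)  # (60, 13) in the walkthrough
```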
We need the @ on top_5 as well. There we go, and we have the same result: 60 rows, 13 columns, the same result, you can trust me. The important point here is that we are using an external variable and referencing it with the @ symbol, to reference something outside of this particular query. Speed references the column; @top_5 references an external variable we defined before. All right, and I think we have the last one, and this is a very interesting one as well. It says: take a look at the scatter plot correlating defense to attack. What's the name of the Pokemon indicated by the arrow in the image below? So who's this guy right here? It's a Pokemon that has a strong defense and a strong attack. So what we can do is sort by both values. First, it's legendary; and second, it has a very strong stat for attack and a very strong defense. So let's take a look. The first thing we're going to do is use .loc to filter all the legendary Pokemons, and then sort_values by defense and attack, by both those criteria. Or actually, let's stack it so we look at it in a more vertical way: sort by this one and then by this one, and of course ascending=False. And here we have a Pokemon with an attack of 116, so it's above this line, and a defense of 110, so it's kind of right here. In reality, we're looking for this Pokemon that is very close to 150. Okay, so let's keep moving forward. We have this other Pokemon with an attack of 150, so again, right here it matches, and a defense of 140. So this one looks a lot better. Let's try it out and see if it works. We have our value, and there you go, all correct, all complete. Basically, what we're doing here is combining visualizations with analytical power: we're filtering, we're sorting, and we're putting everything together to answer the question. So this was the whole practice about filtering and sorting Pokemons, a very interesting one. Again, try to solve it by yourself if you have the chance, and there are a few more projects related to filtering and sorting, all yours. Let's keep solving more projects. In this project, we're going to be working with a pretty interesting problem, which is the birthday paradox. The birthday paradox actually comes from the birthday problem, which is a more general problem that answers the question: if you put n people in a room, what is the probability that two people share a birthday? So for example, I have 70 people in a room; what is the probability that any pair of people within that room of 70 share a birthday? There is a formula to calculate that. What the birthday paradox tells you is: what is the number of people I need to put in a room for that probability to exceed 50%, and the counterintuitive fact that only 23 people are needed for that probability to be above 50%. So again, to recap: only 23 people in a room are needed for the probability that any pair within that room shares a birthday to exceed 50%. And this is a pretty counterintuitive fact, at least to me. The first time I got asked this question, my reasoning was: well, what is the number of people I need to be 100% sure that two people share a birthday? And the answer is, of course, 366.
There are 365 days in a year (ignoring leap years). So for me to be 100% sure, probability equals one, that two people share a birthday, I need to put in at least 366 people; that's going to be 100% sure. So when I got asked, well, how many people do you need for that probability to be 50%, my first reaction was something like, I don't know, 100, 150, half of 365. And then when I got the answer and calculated it myself, it was like, obviously: only 23 people are needed for that probability to reach 50%. And we will actually understand why in a second. So basically, this project deals with the birthday paradox and with us calculating the birthday problem, the probability of two people sharing a birthday in a group of n people. And then we're going to apply that to NBA teams. This is going to be pretty interesting because of the teams we have here; we can actually run df['Team'].value_counts(). We're going to have teams with 27 players, then 24, 22. So we're going to see if this probability stands: if the probability of two people sharing a birthday in a team of 24 people is greater than 50%, are we going to find matches here, or here, or here, et cetera? That's the problem we're going to deal with. But before we jump to the data, we're going to start by calculating the probability and understanding the intuition behind it. The way we calculate this probability, again, with this formula, is: what is the probability of two people sharing a birthday in a group of n people? So we swap in n here; we say we put 10 people in a room, 20 people, 50 people, and we get back the probability of at least two of them sharing a birthday. And the interesting part of this formula is the combinations piece right here. This is when this number, only 23 people needed to reach 50%, starts to make sense. Once you take a look at this combinations formula: combinations, just a quick recap, is, if you have five people, A, B, C, D, E, say Rob, John, Mary, Susan, Violet, the ways you can pair them are A with B, A with C, A with D, A with E, B with C, B with D, B with E, C with D, C with E, and finally D with E. Those are all the combinations you get. So if you start thinking in pairs of people, this number grows pretty quickly. If one more person, F, joins the room, I add five more combinations: A and F, B and F, C and F, D and F, and E and F. So with one more person added to the room, the number of combinations grows by the previous number of people in the room. I add one more person, G, and I'm going to pair G with F, E, D, C, B, A, so it grows by six. I add one more person, it grows by seven. It keeps growing and growing. That's what made this problem click for me, understanding why such a small number of people, only 23, is enough for the probability to exceed 50%. We can actually calculate the combinations piece. So let's get rid of F; F is no longer invited to our party, so we have A, B, C, D, E. We can calculate the result of the combinations of n taken by two, and this is a pretty simple formula.
It's n! divided by k! times (n - k)!; that's n choose k, and in our case it's n taken by two. We can very quickly hack this formula together. We define a function, call it ncr, of n taken by k, and the way it works is you can just do math.factorial of n divided by... and if you let me, I'm going to take a little shortcut here; this is the beauty of Python, I can just assign a function to a variable. So: factorial of n, divided by factorial of k times factorial of n minus k, and we return this whole thing. We get that the combinations of five people taken by two, ncr(5, 2), is 10. And we said that if we add a sixth person, this thing should grow by five, right? And indeed, with six people we now have 15. If we add another person, it should grow by all the previous people, so by six more: with seven people we should get 21. There you go. So we're calculating this thing on the fly; that's the way it works. So now let's get to the activities. In this case, the project is asking us to calculate the probability when n is equal to 10. We can actually take a look; there is a table here, so we can kind of cheat: that is the value. But we can calculate it ourselves, and to do so, what I'm going to do is something like one minus 364 divided by 365, raised to the number of pairs, using the function we wrote before: 10 taken by two, ncr(10, 2). And we're going to use parentheses here just in case. Our probability is 0.11614; the table shows roughly 0.117, and I don't know exactly what rounding they did, but it's pretty much the same value. Let's see if this is okay. It is okay. It now asks for n equal to 15, but I'm going to go ahead, because it's also asking us to implement a function: given a number of people, calculate the probability. So we can do it generically, and the way I'm going to do it is to return this same expression, swapping in just one parameter, which is number_of_people. There we go. And now birthday_probability(15) gives roughly 0.25, so we enter 0.25, and it works as expected. And we can check the birthday_probability function activity as well. It worked, so we're good to go. Again, this was a quick recap; the point of this project is not to be extremely mathy, it was just a quick recap of the birthday problem. Now we're going to apply it to our data set to see how many people within a team share a birthday, if any. It's going to be a pretty interesting one. So the first thing we're asked to do (I'm going to delete this solution) is: we need to extract, let me show you the data frame again, the birthday from the birth date column. The birth date is the actual date: day, month, year. We want to extract only the birthday. For two people to share a birthday, they just need to match on month and day. It's like: hey, my birthday is September 29, yours is September 29, we share a birthday, but potentially in different years; that doesn't matter. So we need to extract the month and the day from this column, and to do so, we can use the strftime function, string-format-time.
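Putting the pieces from that walkthrough together, here's a minimal sketch of the two functions (using the pairwise approximation from the project, 1 - (364/365)^C(n,2), rather than the exact product formula):

    import math

    factorial = math.factorial  # assigning the function to a shorter name

    def ncr(n, k):
        # n! / (k! * (n - k)!): number of ways to take k out of n
        return factorial(n) // (factorial(k) * factorial(n - k))

    def birthday_probability(number_of_people):
        # Each pair "misses" a shared birthday with probability 364/365
        return 1 - (364 / 365) ** ncr(number_of_people, 2)

    print(ncr(5, 2))   # 10
    print(ncr(6, 2))   # 15 -> grew by 5
    print(ncr(7, 2))   # 21 -> grew by 6
    print(round(birthday_probability(10), 4))  # ~0.1161
    print(round(birthday_probability(23), 3))  # just above 0.5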
strftime is actually part of Python itself; in pandas it's available on the birth date column through the .dt special accessor. Don't worry too much about it; the only thing we have to do is go ahead and Google the syntax to get this format as expected. The way it works is that this function receives a format string. So for example, we can get the year and the day right there; and we can put whatever we want in between, we could even put 'hello world'. It just formats the string, replacing the special values, which start with the percent symbol. And the question now is: how can we get only month and day? That's what the activity is asking for, the birthday. We can actually go ahead and ask our assistant: how can we format the column birth date in the format month-day, without a space? Let me be precise here. There you go, and we get pretty much this format; pretty interesting, because it's actually the correct answer. Let's try it out. I'm going to copy this thing, and this is the correct format, as you can see. If you want, you can Google it: search for strftime pandas, you'll find this method, and from there the Python string format documentation. There you can validate that %d gives you the day of the month with leading zeros, which, I can't find it... but trust me, there you go: day of the month, zero-padded. And then we use %m, which is the zero-padded month. You can also get the year with four digits, the day of the week, AM/PM for times; you can do any formatting you want. In this case, we only need the month and day, so this is the format we're looking for. Close this one, assign this value here, and check to see if it works. There you go, it works. Now, we know already what we want to answer: how many players, or which players, within a given team (this is important) share a birthday. So we need to separate all the players within a team and find the ones that share a given birthday. We could potentially think about this problem with different resolutions: for example, finding duplicates within a team, or doing group-by operations. But those, at least given the projects we're solving now, are a little bit more advanced. Basically, what I want to communicate with this particular project is that sometimes we can find clever solutions that give us pretty much the same results. The one we're going to use here is combinations, combinatorics, to solve the same problem. And here we have an example: there is a built-in combinations function; we didn't actually have to write the one we did before, but we did it for the sake of the example. And here, the way to think about it, forget for a second about the NBA players: the way to find these matching people is, first, we use this combinations function. What it does is lay out the pairs within a given collection, the same thing we did with our people before. So combinations, right here, of this thing (remove F and G, there you go) is going to put A with B, A with C, A with D, A with E, B with C, et cetera.
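As a sketch, the birthday extraction boils down to this; the column names here are assumptions, not necessarily the exact ones in the project:

    import pandas as pd

    # Make sure the column is a datetime first (hypothetical column names)
    df['Birth Date'] = pd.to_datetime(df['Birth Date'])

    # %m = zero-padded month, %d = zero-padded day of the month,
    # so a September 29th birth date becomes the string '0929'
    df['birthday'] = df['Birth Date'].dt.strftime('%m%d')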
And it's doing the same thing here with our names: John with Mary, John with Rob, Mary with Rob. We can add someone else, we can add Susan and Violet, and then we get John-Mary, John-Rob, John-Susan, John-Violet, Mary-Rob, Mary-Susan, Mary-Violet, Rob-Susan, Rob-Violet, Susan-Violet; those are all the combinations. And then we have all the birthdays, so let's add two more; let's add July 20th, and let's add September 20th, so we have another repetition. There you go: it lays out the combinations of birthdays in exactly the same order. And then what we can do is put these things together with a little bit of pandas: we create two separate data frames, aligning the pairs of people and the pairs of birthdays, and then combine them with a simple concat operation. Oh, and I just realized I have overwritten our original data frame; that doesn't matter, we'll have to recreate it. And now we have all these combinations of people with their birthdays, and we can just ask which of these pairs are matching. We get that rows one and six are matching: row one is John and Rob, because they are both March 5th, and row six is Mary and Violet, because they're both September 20th, right? And we can apply the same reasoning, the same solution, to our NBA data. What we're doing here is: no group-by operations, no algorithmic nested for loops, which could potentially work. Instead, we're being smart and using combinations to reshape our data. We go from, let me do a little bit of a drawing here if you allow me, a small data frame with one, two, three, four, five people, and we reshape it: we create this long format with all the different pair combinations. Now we don't need to algorithmically iterate over this thing again and again; we can just answer the question by finding the matches. Which combinations here are true? In this case it's rows one and, no, I'm not good at drawing, one and six. What I'm saying again, and we're going to move forward in just one second, is that sometimes you need to work smart: understand your data, reshape it, be creative in the way you work with it to solve the problem. And this is going to be a lot more scalable than thinking about it algorithmically. Because trust me, when you start working with big pieces of data, these more declarative solutions are going to scale better. We cannot easily parallelize a nested loop, but we could potentially break this reshaped data apart: imagine you're working with millions of records, you could put chunks of this thing onto different servers and check them in parallel; that would be completely possible. But anyway, I'm getting ahead of myself, and we're going to work now on the activities we're asked to solve in this particular project. So, as I told you, I stepped over this data frame; I'm going to recreate it. Where is our data frame? Right here, and we're going to recreate our birthday column. There you go. Straight to the activities: how many pairs of players share a birthday for the Atlanta Hawks? Again, we want to use the same solution as before. To do so, the first thing we have to do is get only the players from the Atlanta Hawks, because we don't want to compare every player with every player; that's going to be a huge number, right?
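Here's that toy version as a runnable sketch, before we apply it to the Hawks (names and dates invented for illustration):

    from itertools import combinations
    import pandas as pd

    people = ['John', 'Mary', 'Rob', 'Susan', 'Violet']
    birthdays = ['03-05', '09-20', '03-05', '07-20', '09-20']

    # Pairs of people, and pairs of birthdays in the same positional order
    names = pd.DataFrame(list(combinations(people, 2)),
                         columns=['person_1', 'person_2'])
    dates = pd.DataFrame(list(combinations(birthdays, 2)),
                         columns=['birthday_1', 'birthday_2'])

    # Align them side by side and flag the matching pairs
    pairs = pd.concat([names, dates], axis=1)
    print(pairs.loc[pairs['birthday_1'] == pairs['birthday_2']])
    # -> rows 1 and 6: John/Rob (03-05) and Mary/Violet (09-20)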
So we want the players of the Atlanta Hawks in a variable: team_df is going to be df.loc, df at team equals, let me copy the name directly from here. And team_df.shape tells us we have 22 players. Now, what we want to do is pretty much the same thing we did before. I'm going to copy this thing and paste it right here, and instead of combinations of the names variable we defined, I can just do combinations of team_df at player. Here I'm going to name the columns player_1 and player_2. There we go, and I have the correct result. One interesting comment here, completely on the side. We were QAing this project with someone who resolves the project to make sure the quality is good and there are no errors. And this person was finding an error, saying: I have too many names, and I don't know what's going on. It was a very imperceptible problem: the person was using df instead of team_df. And he said, is this possible? I have 150,000 rows. This is an interesting point, because you have to understand the ballpark of the data you're working with. We were working with the Atlanta Hawks, and we needed to create combinations of players; we have 22 people in the Atlanta Hawks, so how many pairs can you create? It's important to have a quick ballpark estimation of the data you're working with and what the result should be. Because honestly, I don't know off the top of my head that it's exactly 231, but I do know that 150,000 is way too much. So, understanding the data you're working with quickly hints that there is a mistake somewhere: if you forgot you were filtering by the Atlanta Hawks and wanted pairs of only those 22 players, but instead created combinations of all 551 players, that number grew by a lot. We actually have our formula: ncr of 551 taken by two is that same number, 151,525, while ncr of 22, the 22 players in the Atlanta Hawks, gives us 231 rows. That makes a lot more sense. So anyway, just a side comment about understanding the data you're working with. Let's go back and rerun this thing, the names_df; we have the 231 rows. I'm going to do the same thing for the birthdays, naming the columns birthday_1 and birthday_2. And finally (by the way, they're aligned, 231 and 231 rows), we're going to combine them with the same concat operation we did before. I don't want to overwrite the names, so I'm going to call it check_df, the concat of the names and the birthdays. Now let's take a look at our data frame: we have each combination of players and the birthday of each of them. Now we can ask the question: how many pairs of players share a birthday for the Atlanta Hawks? We check where birthday_1 equals birthday_2, and we sum the whole thing. Seems like there are two pairs. Then we can do check_df.loc with that condition; there you go, we have these two pairs. Let me see if all the variables are correct, because I might have done something wrong here; quick recheck: team_df is that, we have birthday, check_df, and we have two pairs of players. Let's go ahead and try the answer of two. Wrong answer. Try it again.
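For reference, the per-team pipeline we're aiming for looks roughly like this sketch ('Team' and 'Player' are assumed column names; the wrong answer above turns out to be a stale variable, as we'll see next):

    from itertools import combinations
    import pandas as pd

    # Only the 22 Hawks players -> C(22, 2) = 231 candidate pairs
    team_df = df.loc[df['Team'] == 'Atlanta Hawks']

    names_df = pd.DataFrame(
        list(combinations(team_df['Player'], 2)),
        columns=['player_1', 'player_2'])
    birthdays_df = pd.DataFrame(
        list(combinations(team_df['birthday'], 2)),
        columns=['birthday_1', 'birthday_2'])

    check_df = pd.concat([names_df, birthdays_df], axis=1)

    shared = check_df.loc[check_df['birthday_1'] == check_df['birthday_2']]
    print(len(shared))  # pairs of players sharing a birthday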
Let's go ahead and recheck our thing. I want to make sure I'm using the right team, and then we have the player and the... oh. This is a very unlucky mistake, for sure. What I'm checking right here is birthday, but I'm using birthdays; check the dates we have at this particular point. This is just bad naming on my side: I used the old data frame from the toy example, and its dates are in the long format, September 20th and so on, while I recreated the real column for the entire data frame. So, my mistake; let's delete birthday, use the right variable name now, and recheck it. It seems that we have three now, and effectively we do have three pairs. Let's try it now and see if it works. Now it works. Again, the previous mistake was just bad variable naming: I should have used something like testing_birthday, or proof, whatever, for the sample, instead of reusing the real variable names. But we have fixed it. How many pairs of players share a birthday in the Cleveland Cavaliers? This is pretty much the same thing we did before; we just change the team. We have 22 people, so it's the same probability as before. We run it, and we have only one pair of players that share a birthday, and it's June 26th. So just one pair, and that works. And finally, for the Dallas Mavericks, we're asked who shares a birthday with J.J. Barea. What we're going to do is rerun our code; we have 24 players. What is the probability here? ncr of 24 taken by two gives 276 combinations. No, we actually want the birthday problem: there we have a 0.53 probability of two people sharing a birthday. Let's go ahead and do it. And we have one shared birthday, and it is J.J. Barea and Deron Williams, on June 26th again. That is the final result. So anyway, a few things we saw here. I think it's pretty interesting how we solved the project without any algorithmic, imperative, nested-for-loop solutions. We used a data-driven approach, thinking about how we can expand the data, a kind of map-reduce sort of solution that I think scales pretty well. Also, I made a few mistakes that I think were interesting to see live. And there's the fact that we combined a conceptual challenge with real data: we validated the birthday paradox, finding pairs of players sharing a birthday for at least three teams right here, Atlanta Hawks, Cleveland Cavaliers, and Dallas Mavericks. So that is pretty interesting: something that is statistically expected, where the probability is high enough, also shows up in real-world data. So anyway, take a look at the project, try to resolve it by yourself, and let's keep exploring more problems together. For this next project, we're going to be combining our data cleaning abilities with our string handling abilities. These two skills together are extremely powerful when it comes to handling data coming from the real world. So far we've done a few projects where the data was pretty much perfect; we didn't have to do much handling or munging of it. But that is not the real world.
Usually when you're collecting data from different sources, pulling it from different APIs or databases, or it's even human generated, we're going to be dealing with a lot of strings. And as those come from different sources and humans make mistakes, we are going to have differences between these data points. When strings are involved, humans are involved, and multiple sources are involved, the result is pretty much always a problem: we will always face some issues in the data we're working with. So it's very important to have sharp data cleaning and string handling skills, and that's what we're going to practice in this particular project. Let me tell you what we're working with. We have two data sets; they're explained here, but I'm going to just go ahead and show you the first few rows of both. There you go. What you'll find is that these two data frames come from different sources, but they have pretty much the same information: company names. What changes is the way these company names were entered. The companies themselves are the same, but the strings differ. In this case, for example, this company from the first data frame is pretty much the same as this company in the second, but here it has a full name and here it has just a shorthand. It's like saying Apple: Apple is not the real name, the real name is Apple Inc., with the dot; that's the real string. As a human, it's pretty easy for me to understand that this name refers to the same company; it's pretty basic. But the question is: how can we make that programmatic? How can we build this sort of artificial intelligence to find the companies that match, so we can match the strings and continue our analysis across these two data sources? The answer is: no artificial intelligence yet. We're going to use a pretty simple, deterministic heuristic, which is the Levenshtein distance. It's a simple formula that gives you a number based on how different two strings are. And the good news is that this distance is already implemented for us in a package that is already installed in our lab, imported as fuzz (on PyPI it's thefuzz, formerly known as fuzzywuzzy). fuzz has a couple of very simple methods; we can take a look right here: ratio and partial_ratio, which give you a kind of similarity, or a distance if you want, between two strings. In this case, I think it's easier to think about it in terms of similarity. So these two strings are pretty similar; the only thing that changes is an exclamation mark, so the similarity is pretty high, 97. We're going to use this library to compute this distance. So let's move forward with the first couple of activities. The first thing we have to understand is: how are we going to match these companies? We could do something like a nested for loop. Let's say we have three companies, A, B and C, on one side, and A Inc., B Bros. and C on the other. How can we match them? Of course, we have more than that; we can actually take a look at the shapes: 266 rows for the first one and 368 for the second. But the way we could do it is with a kind of nested for loop: for c1 in companies_1, and for c2 in companies_2.
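Before continuing with the loop idea, here's roughly what those similarity calls look like; exact scores can vary slightly, and the import path assumes the thefuzz package:

    from thefuzz import fuzz

    # Plain ratio compares the two full strings, character by character
    print(fuzz.ratio('Apple Inc.', 'Apple Inc'))      # high, around 95

    # Partial ratio scores the best-matching substring instead
    print(fuzz.partial_ratio('Apple', 'Apple Inc.'))  # 100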
And inside, we can compute something like, let's just use some quick pseudo-code here: similarity of c1 and c2; if sim is greater than 90, we store it, we make something like similar_companies, and we do similar_companies.append of (c1, c2). But this algorithmic approach doesn't let us visualize the data and do a little bit of analysis of how the data is actually distributed under the hood. We're just doing a hard cut at 90, without the ability to see whether maybe there was an 89 that was a fairly good match. And don't worry about the number itself; whether it's an 89 or an 85, the point is that this imperative approach is not very analytical. It's not easy to see why or how things are working behind the scenes. So the alternative we're proposing for this particular project is to create the combinations of the columns and do pretty much the same process: match every company with every company. A with A Inc., B Bros. and C; B with A Inc., B Bros. and C; C with A Inc., B Bros. and C. And to do that, there is a very simple method, which is itertools.product. So we have imported itertools, and we're going to do, let me remove this thing, companies_1 and companies_2, plurals: itertools.product of cs1 and cs2. And we need to wrap this in a list to visualize it. There you go: we have the combinations of everything with everything, which again is what we wanted. So now, say we used an imperative approach again: for c1, c2 in this thing, we could compute the similarity of these two companies, pretty much the same as before. But the real gain is that we went from a nested, computationally intensive for loop to something declarative, expressive, a data-driven approach. We're generating the data, and now we can do different things with it. If we put this into a data frame, we could add a third column that is the similarity: A, A Inc., and the similarity column. And now we can start doing some analysis: we can filter all the pairs above a given similarity, do some plotting for companies around a certain similarity, visualize, explore, analyze the data in a more declarative, analytical way than a rudimentary, naive, imperative iteration. So let's get rid of all these things and approach the first activity, which is asking us to build that data frame: a data frame containing the product of all the companies from CSV 1 with all the companies from CSV 2, and also compute the ratio for all of them. Actually, we're not asked for the ratio right here; the screenshot is incorrect. We're just going to compute the product and create the large data frame. So what we're doing here is df1 at client, dot values; that's our first list. And remember, we have itertools.product; here's the syntax: collection one, collection two.
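A tiny illustration of that syntax, with made-up company lists:

    import itertools

    cs1 = ['A', 'B', 'C']
    cs2 = ['A Inc.', 'B Bros.', 'C']

    # Every element of cs1 paired with every element of cs2: 3 x 3 = 9 pairs
    pairs = list(itertools.product(cs1, cs2))
    print(pairs[:3])  # [('A', 'A Inc.'), ('A', 'B Bros.'), ('A', 'C')]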
So for collection one we pass the column from df1, and for collection two we pass df2's firm name, dot values. And we can pass this whole thing to a data frame: pd.DataFrame of this whole thing, break it into a different row, and we can set the column names: columns is going to be CSV 1 and CSV 2. And I have a non-matching parenthesis, because I need to close that one. And we have an error... no, firm name values, name, there we go. Sorry, we had a few mistakes; this is pretty much the regular process of data analysis, or data processing. And now what we see, let's actually take a look at df.shape, is that we have created this huge data frame, with the same expected number of rows as the one right here, around 97,000 rows, which is the product of df1's shape and df2's shape. So let's give it a try and see if it passes. There you go, it worked; so far, so good. We have our product of the different company names, so we can do a little bit of matching. Now we're going to apply the Levenshtein distance to compute the similarity between the companies. Of course, the similarity between these two is going to be pretty low: the distance is going to be pretty high, the similarity pretty low. And the similarity between these two, Adobe for both of them, is going to be pretty good. We're going to use fuzz.partial_ratio; partial ratio because the company names have white spaces, so it will cope with the words being in different orders, and that might be helpful. All right, so we're going to compute the ratio score, and let me jump to the activity first; it's the second activity, asking us to compute the ratio score we have right there, the ratio score of these two companies. How can we do that? It's pretty simple, as we already have the values. So we have df, and we can just use the values here. We can do something like: score equals fuzz.partial_ratio of c1 and c2, for c1, c2 in df.values. It's going to take a little moment. There you go. We can look at the first 10 scores, and now we can assign that to a ratio_score column and take a look at the different ratios. And now we have that Adobe Systems obviously has a 100% similarity, while something like Adobe against Advanced Micro Devices Inc has a much lower one; the only similar part is probably the Inc and some individual characters. All right, so far so good. The first part of this process was creating a declarative, descriptive approach by transforming our data and working on a different foundation. It's like we moved the floor: we started with two different data frames, and now we have a single one with this very useful column holding the similarity ratio for each pair of companies. So now let's move forward with a few more activities, and this is pretty much the whole data analysis process; I'm going to create a division here and we can get started. How many rows have a ratio score of 90 or more? That's going to be df.loc, df at ratio score greater than or equal to 90. Let's take a look at that; I made a mistake; there you go. And we can now check how many we have: 106. There you go.
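The shape of what we just built, as a sketch ('Client' and 'Firm Name' are assumed column names for the two CSVs):

    import itertools
    import pandas as pd
    from thefuzz import fuzz

    # 266 x 368 = 97,888 candidate pairs
    df = pd.DataFrame(
        list(itertools.product(df1['Client'].values, df2['Firm Name'].values)),
        columns=['CSV1', 'CSV2'])

    # One similarity score per pair
    df['ratio_score'] = [
        fuzz.partial_ratio(c1, c2) for c1, c2 in df[['CSV1', 'CSV2']].values]

    # How many pairs score 90 or more?
    print(len(df.loc[df['ratio_score'] >= 90]))  # 106 in this dataset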
What is the corresponding company in CSV 2 to AECOM in CSV 1? This is the one we pretty much saw at the beginning, this one right here. But let's go ahead and give it a try. We could do df.loc, or let's actually use queries to change things up a bit. I'm going to do: CSV 1 equals this one, and ratio score equals... what? Or greater than what? Let's say greater than 80, to take a look. There you go. These two have a 100 ratio score, and that's because, again, we're using the partial ratio. If we were using the plain ratio, just fuzz.ratio, this would be lower; it's actually 30. And that's because, in reality, there is a lot of distance between these two strings; there's not much character-level similarity. To the human eye there is a lot of similarity, because we can zoom in and focus on just these two words; but if you think about it, this whole suffix makes the second string completely different. That's why we use the partial ratio, which gives a 100% score, because it does the same thing we do with our eyes: it focuses attention on the individual words and gives you the best combination of individual-word matches. So anyway, let's try it; this should work as expected. Basically, what we're trying to understand here is what the cut-off value should be that lets us decide, programmatically, in an automated way, whether one company name matches another; like, where's the cut-off? So the fifth activity: the CSV 1 company is Starbucks; what's the corresponding company in CSV 2? We do the same thing, keeping the ratio cut-off at 80, and we get that it's Starbucks Corporation. Let's enter it right here; it works. And again, the ratio score was 100. So by doing some analysis, we start understanding how these different ratios are distributed, something I'm going to dig into at the end as an extra. Activity number six: is there a matching company for Pinnacle West Capital Corporation? Let's copy this thing once again and put that in. And there isn't; these are definitely not related. But what we see right here, just because I arbitrarily chose 80 as the cut-off ratio score, is that there are two companies that have nothing to do with Pinnacle West Capital Corporation, yet have a ratio score of 83. The reason is that they contain the word Corporation. So we have to be very careful. I'm going to answer no here; let's see... that works, and let's move forward. Activity seven: how many matching companies are there for County of LA, blah blah blah? We put this thing in, keep the cut-off at 80, and take a look to keep building our understanding. There is a very clear difference here: we have two companies with a value of 82, and two companies above 90, one with 100 and one with 95. So I'd be tempted to say two here; there are two companies that match the name County of Los Angeles, blah blah blah. But we should take a look at the full names. We can do that with pandas options; maybe I can ask our AI assistant, because I don't remember the exact option: it's display.max_colwidth, set to None. There you go, that was the one. If you ever have a doubt, just ask the assistant; go for it, don't worry about Googling or asking the AI assistants, because that's pretty useful.
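As a sketch, that inspection step looks like this (assuming the ratio_score column we created above):

    # Show full strings instead of truncating long cells with '...'
    pd.options.display.max_colwidth = None

    # Candidate matches for one CSV1 company above the chosen cut-off
    df.query('CSV1 == "AECOM" and ratio_score > 80')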
And now we can take a look at them, and we see County of Los Angeles, compensation program, twice, and then we have City of Los Angeles. So how many matching companies are there in CSV 2? We're going to say two. There you go; they do seem to be the same company. Or, well, they're not companies, they're entities, to put it that way. So, the final activity: is there a matching company for The Queen's Health Systems? We run pretty much the same operation, and we find The Queen's Health System, differing just in the apostrophe and the S. So, is there a match? We're going to say yes. And by doing this analysis... I think we're missing one activity. Oh, let's check this activity; there you go, we have no more activities to complete, the whole project passed. But by performing this analysis, first we did a little bit of data wrangling, a little bit of massaging to get to the ratio score, and now we're close to being able to generate the final CSV, if you want: the final representation of the matching companies. We don't have an activity for that in this particular project on purpose, because there is still a data cleaning process that's kind of annoying and could take a while. But we can start taking a look: we can say, for example, df.loc, df at ratio score greater than 90, and look at how many pairs match that query. We can go one step further; when you're doing data cleaning, something very useful, at least to me, is to start visualizing your data. So I'm going to take ratio_score and plot kind='hist'. I can see that once a ratio surpasses 90, most of the samples are between 98, 99 and 100. So it'll be interesting to see what companies sit in between. Let's also take a look at a box plot, kind='box', and we have pretty much the same picture: everything is concentrated at 100, and then there are a few samples at 96, 95, 93. Let's take a look at those. We'll do df.query, let's use query now, where ratio_score is greater than 90 and less than 97, let's say. Now we can take a look at what's going on in here, and this, again, is the data analysis process. In this case, we have some that seem very compelling: Idaho Power Co is the same as Idaho Power Company; Jack in the Box Inc matches Jack in the Box Inc with 95. Let's sort values by ratio score, so we start with the most likely ones. The Queen's Health System, pretty much the same thing. Jack in the Box, the same. County of Los Angeles and Seattle, which we saw before. Sacramento City, Sacramento County: of course, we have to decide whether the city and the county represent the same entity for us or not. But nevertheless, Idaho Power Company is the same, Safeway Inc is the same, and these two are probably the same. But now, even with this one, we're getting into terrain where it's a little bit less clear what's going on, and I'm not so sure. And this is because I don't know the business we're running: whether this company or institution is the same as this one right here, potentially, right? This is where you have to start asking. So in this case we have Contra Costa County and Sonoma County, and these are different.
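That exploration, roughly, in code:

    # Distribution of the strong scores
    df['ratio_score'].plot(kind='hist')
    df['ratio_score'].plot(kind='box')

    # The ambiguous middle ground, most likely matches first
    df.query('ratio_score > 90 and ratio_score < 97') \
      .sort_values(by='ratio_score', ascending=False)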
Again, I don't know the business, but I'll bet that those two, Contra Costa County and Sonoma County, are not the same. The same thing with Contra Costa and Marin County: I bet those two are definitely not the same. There is a bit of a middle ground between, let's say, this one and these two, in terms of whether a city and its county are the same institution, and then there is no doubt about the last two. So basically, this illustrates the real process of data cleaning. First of all, if we had done this process with the nested for loop I deleted, remember, for company one in companies one, for company two in companies two, doing it manually, it would have been very hard to visualize these relationships. What we did right here, by first reshaping our data and getting a clear picture of all the data we had, was create a more descriptive representation. We calculated this ratio score column, we started doing some analysis, we answered some basic questions like: can we find a company here, or there? Can we find the cut-off value? You know I just chose 80, but we could have used 75 and seen how that went. Then we took a look at the distribution of the likely matches, and we found this middle ground, where we'd probably need to go back to whoever is running the business, or just Google it, and start understanding the data we're dealing with. Because data cleaning, at the end of the day, can require a ton of domain knowledge. I can't just clean data with my data cleaning techniques without understanding the data I'm cleaning. I might be a great data scientist, but if I'm cleaning biological data or data from a physics lab, I will have no idea what data I'm working with, right? And that's a very common case: say you're working with measurements from instruments, and you have something like a column T with a value of 0.0091, and it's like, is this fine or not? And maybe the physics person comes and says, hey, this is completely wrong, you're misreading the column; the T value should be in the order of hundreds, not something very small. So I need to understand the data; and to understand the data, to understand the domain, to be able to ask questions, to Google things, to go to the archives of the company, I first need to put the data in a good representation that allows me to do that. That's what we did with the representation we created. So anyway, I have talked a lot in this project, because for data cleaning this is the most important thing: being able to reproduce your experiments, visualize, and understand what's going on. Take a look at the project, try to resolve it by yourself, or keep working on the other capstone projects we have for data cleaning; it's a very fun activity pretty much all the time. This next project is focused on data cleaning, and some data analysis after the data is cleaned. It's a very interesting one, because the data we're using comes from scraping: someone just scraped the HTML out of Google's Play Store. And as you might know, whenever you're scraping data, you should be worried about issues in the resulting data, because web pages are not extremely structured and they change a lot.
So a scraping job like this takes time: we have something like 10,000 apps that were scraped from the Play Store, and that takes time. It's not that you hit a button and it completes in a second; it might actually complete in a matter of weeks, because Google bans the IP that's doing the scraping. Honestly, I think this sort of scraping isn't even allowed, but someone has done it. And again, scraping takes time, and time means things might change: you might scrape the same app two times, you might get duplicates; the results might not be ideal, to put it one way. So the way we're going to approach this is by first understanding the data we're dealing with. We're going to do df.info() and kind of brace for impact, because we know things might be wrong. The first thing: we have 10,841 entries, the total number in the index, and we quickly see that rating has far fewer non-null values, so there are null values there. We also see something like reviews right here, which should be numeric. Rating is actually numeric, so that's a good thing, even though it has null values; but reviews, right here, is not numeric: it's an object, a string. So what's going on there? That's the question we're asking ourselves. Let's get started with the activities; the first says: which of the following columns have null values? We quickly saw that rating is one of those. We can use the missingno library, right here, which gives you a quick visualization. As usual, I like to start my analysis with a quick visualization: when I'm trying to understand data, a quick histogram, a boxplot, or one of these missing-values visualizations is recommended; the second stage is an analytical review of what's going on, et cetera. So let's do it right here: missingno is already installed in the lab and imported as msno. The bar and matrix charts are the most common ones; I'm going to do a bar of the whole data frame. It shows very quickly that rating has a good chunk of missing values right there. What else? Well, there are a few more, but it's hard to see with the bars; one missing value here and there, and that's hard to read because the bars don't help. So now we switch to something analytical. We kind of saw it already, because we see counts like 10,840 when we know the full index has 10,841. The analytical way is to just do isna, then sum; that gives us, really quickly, the columns that have null values. And we can sort it with sort_values, ascending=False. There you go. So we can answer: rating, of course, both version columns, type and content rating; those are the columns with null values, and the activity is checked. So now, the second activity, and again, as usual, I encourage you to pause right now and try to solve it by yourself, because it's an interesting one. It's a three-part activity. First: remove the invalid values from rating, if any; just set them as NaNs. Okay, invalid values: let's take a look at rating.
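The missing-values exploration from this step, as a sketch:

    import missingno as msno

    msno.bar(df)     # one bar per column; shorter bar = more nulls
    msno.matrix(df)  # row-by-row view of where the gaps are

    # The analytical version: null count per column, largest first
    df.isna().sum().sort_values(ascending=False)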
As usual, we can start with a quick visualization: df at rating, plot kind='hist', real quick. And it doesn't look too bad; it might be okay. But we already see that there are some values above five, which is, of course, not a valid rating. And this is my usual comment: you have to understand the data, you have to understand the domain you're working with to do correct data science, because we know that the rating of an app goes from zero to five, or actually from one to five. So anything above five or below one is invalid, right? And it's a number, so the column is okay in terms of type; check that first. For example, we saw already that reviews wasn't: reviews is an object. It seems to be numeric, but when we check the analytical version with info, we quickly see that it's not parsed as a number, so it's raising suspicion; we'll take a look at that later. But here in rating, we understand our data, and we know that the maximum a rating can be is five, and this goes above five, so that's clearly invalid. Let's turn this thing into a box plot, with vert=False. Now we can clearly see that there is a bunch of samples that are completely invalid. Let's take it to the analytical version: rating, describe. There we go: the maximum value is 19, definitely invalid. All right, let's select them: df.loc, df at rating greater than five, and there's a bunch of apps here that we need to clean; the activity says to set them as NaN. So here we do: these apps, at rating, equals np.nan. That's it. But then it says: fill the null values in rating, so the ones we just made plus the other ones, because there were like a thousand apps without a rating, using the mean. So basically: df at rating, dot mean, we're going to use this value; then df at rating dot fillna with this value, inplace equals true. And that's pretty much it. We can't check that yet, because it also says: clean any other non-numerical columns by just dropping the values. This is understandable, because we don't have many: we have 10,841 apps in total, but only a handful of rows, like 13, have either a missing version, a missing type, or a missing content rating. So it's not that bad; we don't lose many apps, we can just get rid of them. That's basically what this part of the activity is saying. So we do df.dropna, inplace=True. That's it, and that works. All right, moving on to the next one: reviews. We already talked about reviews; we saw it was parsed, let me bring this back here, as an object. Pandas will try to parse a column as a number; that's by definition. If it can't, because some values are not numbers, there are invalid characters, strings, it just defaults back to object. And that's exactly what happened here. But again, the question is: when we take a look, df at reviews, dot head, they look like numbers, but the column is an object. So let's check it out and see what happened.
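Here's the rating cleanup in one place, as a sketch (the video uses the inplace form of fillna; the assignment form below is equivalent):

    import numpy as np

    # Ratings are bounded at 5, so anything above that is invalid
    df.loc[df['Rating'] > 5, 'Rating'] = np.nan

    # Fill both the original nulls and the ones we just created
    df['Rating'] = df['Rating'].fillna(df['Rating'].mean())

    # Only a handful of rows have nulls elsewhere, so drop them
    df.dropna(inplace=True)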
We can double check by doing pd.to_numeric. And we know this is going to fail: somehow pandas was not able to parse this as a number, so there's no way this is going to work. It didn't, because there are values in here that are invalid. Somewhere in this list of 10,000 apps there are values that are not fully numeric, so let's explore. A good way to do that: I have my AI assistant here, which I asked earlier, when I was checking this project, what the syntax of pd.to_numeric is. It tells me there's an optional parameter, errors equals 'coerce', right here. Instead of raising an error with a loud red message, it turns the invalid values into NaNs. So that means I can create a new column, reviews numeric, that is the result of that, and then say: df.loc, show me the rows where this new column is NaN. Because wherever there's a missing value in this new column, it's because there was a value that pandas couldn't parse; and when it couldn't parse it, it just assigned NaN. That's exactly what's happening right here: reviews numeric is NaN, NaN, NaN. We can take a look at the actual reviews, and we see there is this M value here, standing for millions: 2M for two million, 1M for one million, and so on. We want to turn these into numbers. To do so, I could have just fixed the individual rows by index: df.loc of this row, set reviews to two million, that one to one million. But of course, that's not very scalable, so I'm going to do something else: make it automatic, to put it that way; completely generic, which is what we're looking for. Basically: find me all the reviews that contain an M; those are the invalid values. What are those reviews now? There we go. And now what we can do is .str.replace, replacing M with nothing, so it removes the M. I'm doing this step by step, by the way; I could have done it in one line, but I'm showing you the step-by-step resolution. Now we can turn this thing into a number, pd.to_numeric of this thing. There we go. But we saw it was two million, so we need to multiply back by one million. We have the value. But we want to assign this back into the reviews column, which is of string type, so it's usually recommended to turn the whole thing back with astype(str); we want to make it a string. Because now we can take these values, the new reviews... there, let me break this into a new line... did it work? Let me put it right here. There. These new reviews, we basically want to assign them back again, to which one? This. No, this. I'm already making a mess of myself. So basically, once I compute these values, I want to assign them over the previous ones. Shall we do the whole thing? Let's do everything in just one line, as I promised. Let's see if it works. And now we can look again at those rows, these apps, the whole row, and I can see that the reviews have been correctly cleaned.
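The one-liner spirit of that fix, as a sketch (assuming a 'Reviews' column and 'M'-suffixed values like '2M'):

    import pandas as pd

    # Which values can't be parsed? errors='coerce' turns them into NaN
    df.loc[pd.to_numeric(df['Reviews'], errors='coerce').isna(), 'Reviews']

    # Strip the 'M', scale by one million, and store back as strings
    m_mask = df['Reviews'].str.contains('M', na=False)
    df.loc[m_mask, 'Reviews'] = (
        pd.to_numeric(df.loc[m_mask, 'Reviews'].str.replace('M', '', regex=False))
        * 1_000_000
    ).astype(int).astype(str)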
But so far this is just a string, right? We have just fixed the three occurrences, though it could have been a thousand occurrences, because this method is generic; it works for pretty much anything. Now we're going to turn the whole column into a number. We do df at reviews... let's check first with to_numeric. There you go, it worked; it didn't raise an error, which is good. We can now assign that back to reviews, and check our activity, finally, and it worked. Okay, moving forward to the next one: how many duplicated apps are there? In this case, it's asking us for the total number of duplicated apps. Let me do a quick recap of the duplicated method: df.loc, df dot duplicated, subset app, then sort_values by app. And there you go, we have all these apps; oh, here is the occurrence I was looking for, 8 Ball Pool. So here's what it shows by default: if you have, for example, Twitter, Twitter, and you have Facebook, Facebook, what this method marks as duplicated is only this second one, and only this second one. That happens because, for pandas, the first occurrence is like the real one, and the second one is the duplicate. The same if you have multiple occurrences: if you have five, one is good and four are duplicates. That's what pandas does by default. In this case, the activity is asking for a different behavior: it's saying that if there are two occurrences of Twitter, that should count as two; we want to mark Twitter as being duplicated two times. To do so, we need to pass a parameter, which is keep... sorry, I already made a mess; keep=False, and that should go inside the parentheses. There you go, and now it's a little more interesting, because we have every occurrence of each duplicated app. There is one more thing to comment on here. Let's say you have an app with a number of reviews of 19,100 in one row, and 19,100 in the other as well; those two rows are completely duplicated for pandas. But if instead the values differ, say one has around 10,891 reviews and the other 11,000, then those are two completely different rows to pandas, because a row only counts as duplicated when the whole thing is the same. If we just do df.duplicated with keep=False and no subset, pandas reports fewer duplicates, only the fully identical rows. For us to count duplicated apps, we need the subset app part. So let's do that; copy this thing, run it, and now we can count: 1,979 apps. Because again, some rows are completely identical, and some rows differ in some other column, but the app is the same. Let's actually try to visualize that. I can select those rows and sort them... it took me a while, and it's not this thing... there you go, that works. So basically, these are the apps that are duplicated, but with something different between the rows. For example, between these two rows, the difference is the reviews: one has slightly fewer reviews than the other.
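A sketch of the duplicate counting, assuming the 'App' column name:

    # Default: the first occurrence is 'the real one', the rest are duplicates
    df.duplicated(subset='App').sum()

    # keep=False marks every occurrence of a repeated app instead
    df.duplicated(subset='App', keep=False).sum()  # 1,979 in this dataset

    # The duplicated rows themselves, grouped so repeats sit together
    df.loc[df.duplicated(subset='App', keep=False)].sort_values(by='App')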
So the second row has more reviews than the first. And again, remember where the data is coming from: a scraping process. There's a chance this app was scraped two times, at two different points in time, and of course we end up with duplicated rows carrying different data. Anyway, let's answer the activity: how many duplicated apps, counting every occurrence, do we have? 1,979. Let's check if that works. And it does. Next activity: drop duplicated apps, keeping only the ones with the greatest number of reviews. This is a very interesting one. You will need to sort the data frame by App and Reviews, and that will change the order of the data frame; that is fine. Let me copy this once again, but sorting now by App and by Reviews. So what happens here is that the scraping process potentially scraped the same app two times, or more than two times: four, six, seven times. My reasoning, at least, is that the scraping took something like an entire week, and somehow the same app was captured multiple times. But there's one important hint, at least to me: the number of reviews of an app pretty much only goes up, right? You will seldom delete a review you've written for an app; I don't know if that's even possible. So what the activity is really asking, with drop duplicates, is: which row is the freshest copy from the scraping process, the one that gets closest to the real number today? And that is potentially the one with the greatest number of reviews, the one with the most reviews. Again, the reasoning is that one copy was scraped on Monday, the next on Tuesday, then Wednesday, Saturday, Sunday; that's why all these rows of the same app have different review counts. And the one at the bottom, because we sorted by reviews, is potentially the freshest copy of the app from the scraping process. So we can sort the data frame in that way. But first, let's keep a copy. This is a good technique: as we're making several modifications to the data frame, it's good practice to keep a copy in memory, so you can always go back to it. The way I do it is df_copy5 = df.copy(); I put a number in the name, since this is activity number five, or I name it something like df_copy_before_removing. And I usually comment the line out immediately, because then, if I execute the cell again, nothing happens and I don't hurt my copy. If we make a mistake, we can always roll back with df = df_copy5.copy() and have the data back in this previous state. But anyway, let's keep moving forward with the activity. We want to sort the values, so df.sort_values(by=['App', 'Reviews'], inplace=True), and now we're ready for df.drop_duplicates. There you go. The interesting part is that we need subset=['App'], and there is one more parameter in drop_duplicates. I don't know why the documentation pop-up isn't working, so: what's the syntax of df.drop_duplicates?
Quickly, a trick you might not know: help(df.drop_duplicates) right there gives you the same documentation. But let's wait for the assistant. There you go: there is this parameter, keep, which defaults to 'first'. The way pandas works, it finds the duplicates (8 Ball Pool again), marks one occurrence as good, and everything else as a duplicate; the question is which one gets marked as good. Since we sorted the data frame so that the row at the bottom is the one with the greatest number of reviews, we want keep='last': remove all the others and keep only that copy. Okay, that's the syntax; let me show it to you for a second. Right here, keep='last', that's what we want. So let me close this, and we run drop_duplicates(subset=['App'], keep='last', inplace=True). We're changing the data frame again, and we can now check the activity. And it didn't work. Strange. We could go back to the copy and try once again... oh, potentially the activity checker is tripping over the extra column I created. That might be it. Let me just try this out: drop the reviews_numeric column and check the activity once again. There you go, it worked. My mistake was creating a new column, which I wasn't supposed to do. Okay, moving forward: format the Category column. This is an interesting one. Let's take a look at Category first with value_counts. There we go: they're all uppercase, we can see that, and underscores are used instead of white spaces; we can check that as well. We have to change that: drop the uppercase, use white spaces, and capitalize, so 'AUTO_AND_VEHICLES' in that format becomes 'Auto and vehicles' in this one. We'll do this very quickly: df['Category'] = df['Category'].str.replace('_', ' '). That's the first replacement we do. The second one is capitalizing the whole thing with .str.capitalize(). Now we can check again: where is Auto and vehicles... there it is, exactly as expected. Let's check the activity, and it works. Okay, activity number seven: clean and convert the Installs column to numeric type. Let me get rid of this. There we go: clean and convert the Installs column to numeric; some values have a '+' modifier, so remove that string and honor the original number. Okay. As usual, I don't want to create a new column, so I'm going to use df.loc with pd.to_numeric(df['Installs'], errors='coerce').isna(), the rows that cannot be parsed as a number, as the selection, and take head() real quick. The Installs values look like '500,000' with a comma separator and '1,000,000+' with both the plus and the comma separator. So what we can do, and I think the whole thing fits in just one line, is df['Installs'].str.replace, replacing the plus symbol with nothing, removing it.
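Stepping back for a second, the snapshot, the de-duplication, and the category cleanup we just finished look like this as a sketch, assuming the 'App', 'Reviews' and 'Category' column names:

    # df_copy5 = df.copy()   # snapshot first; commented out so re-running the cell is harmless

    df.sort_values(by=['App', 'Reviews'], inplace=True)
    df.drop_duplicates(subset=['App'], keep='last', inplace=True)  # 'last' = most reviews

    # Activity 6: 'AUTO_AND_VEHICLES' -> 'Auto and vehicles'
    df['Category'] = df['Category'].str.replace('_', ' ').str.capitalize()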
We just chain the whole thing, chain the commands, and also replace the comma with nothing, to strip the separator. And when I run pd.to_numeric on the whole thing, there you go, it's not failing, so this is working. Finally, we assign the result to Installs. Let me see if there's anything else... no, that is perfectly fine. There's no 'M' here, because I only removed the plus symbol and the comma and the numeric conversion went through, so it seems I'm safe. We assign it to Installs, check the activity, and it works. Activity number eight: clean and convert the Size column to numeric, representing bytes. The Size column is of type object and some values end in 'M' or 'k', which is megabytes or kilobytes. Let's take a look at our data frame once again. There you go: Size is 3.6M, 9.1M, 203k. We want to turn these into bytes, so 9.8M would become the corresponding number of bytes. That is good. Some other values should be removed or replaced as in previous tasks, like the one with the plus. So we start with df.loc and pd.to_numeric... actually, I think this is a good time to keep a copy: df_copy8 = df.copy(). Oh, my bad, the notebook didn't save; reloading, reloading... where were we? Size. df_copy8 = df.copy(), and we comment that out, as usual. The plan is df['Size'].str.replace for the 'M' and for the 'k', but we don't want to do that blindly right now, because we need to scale to the real numbers first. For now, if I run pd.to_numeric on the column as it is, what happens? Unable to parse string 'Varies with device'. That's because some values are completely invalid; there is no way to infer a number from them, and the instructions say to replace those with zero. So we can start right there: select df.loc where df['Size'] equals 'Varies with device'. What's the shape, how many? Okay, a ton. There's a ton of these sorts of apps, and we want the Size column to be zero for that condition. I was about to assign a literal zero converted with astype(str), because, remember, we want to keep the values as strings. This might be fine; let me try it out, and if it doesn't work, we have the copy. df['Size'] is still an object, that's good, and it seems to be working. Let's also check df.loc with df['Size'].str.contains(',')... no, we made a mistake. Let me go back to the copy, which is exactly why keeping one is a good thing: df = df_copy8.copy(). There you go. Because we might not be able to assign a plain series that way, what I'm going to do instead is just more string handling, which is the usual way of doing this process: df['Size'] = df['Size'].str.replace('Varies with device', '0'), swapping the whole invalid value for a '0' string. And there we go, it works; the copy is saving us. And there's nothing with a comma, that is good.
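Here is a sketch of both steps, assuming the 'Installs' and 'Size' column names and the literal 'Varies with device' value; note regex=False so the '+' is treated as a plain character rather than a regex:

    # Installs: strip the '+' and the thousands separator, then convert in one go
    df['Installs'] = pd.to_numeric(
        df['Installs'].str.replace('+', '', regex=False).str.replace(',', '')
    )

    # Size: neutralize the completely invalid values first, keeping everything a string
    # df_copy8 = df.copy()
    df['Size'] = df['Size'].str.replace('Varies with device', '0')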
It seems we can convert the column once we fix the issue with the 'M' and the 'k', so let's go ahead and do it; we've done this process already. We start with the kilobytes: df.loc with df['Size'].str.contains('k'), lowercase k. Really quickly, let me check that. Now we take those sizes (not reviews, Size is what we're looking for) and convert: .str.replace('k', '') to drop the suffix, then pd.to_numeric. There you go. But these are kilobytes, and the activity is asking for bytes (sorry, I said megabytes), so we have to multiply the whole thing by 1,024. Now, as usual, we take these values and assign them back to the same selection. This is going to be a long line, and the line length isn't great, but there we go. So what we're doing is: select all the rows that have a 'k', remove the 'k', turn the value into a number, multiply it by 1,024 because it's going to be bytes, and at the end, as usual, keep it as a string. We do the same thing now with megabytes: replace every 'k' in the code with 'M', and of course here it's 1024 times 1024 for megabytes. Finally, pd.to_numeric(df['Size']): it works. Let me show you the sizes now. This one was 3.6 something; we have the copy here to compare, and Size was '3.6M'. And 3.6 times 1024 times 1024 is pretty much the value we have right here, so it's good. But the column is still an object, because we used strings for everything, which is the right approach: keep strings, keep working with strings until you're sure you want to convert into a number. We can do that conversion right now, and check the activity. This was a long one, it took a little bit of wrangling, but we have successfully cleaned it. All right, moving forward: clean and convert the Price column to numeric. The values are represented with a dollar symbol. Let's check a few prices with df.loc where df['Price'] != '0' (that zero is a string). There you go, '$1.49' seems to be good. Let's check if any value has a comma or anything like that: nothing. So let's just try it out. What if we run pd.to_numeric on df['Price'].str.replace, replacing the dollar symbol with nothing? Oh no, there are invalid values, because there's a bunch of rows where the price is literally 'Free'. We can actually check: df['Price'] equals 'Free'. Yeah, there are a bunch of those. So df['Price'] = df['Price'].str.replace('Free', '0'), replacing with a string representing zero, and now it seems to be working. Let's do that and check the activity: it works. Now that you have cleaned the Price column, let's create another one, Distribution: a column containing 'Free' or 'Paid' values depending on the app's price.
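Before the Distribution column, here's the Size scaling and the Price cleanup, sketched under the same column-name assumptions (na=False is defensive, in case of missing values):

    # Kilobyte rows: drop the suffix, convert, scale to bytes, store back as a string
    is_k = df['Size'].str.contains('k', na=False)
    df.loc[is_k, 'Size'] = (
        pd.to_numeric(df.loc[is_k, 'Size'].str.replace('k', '')) * 1024
    ).astype(str)

    # Megabyte rows: same idea, one more factor of 1024
    is_m = df['Size'].str.contains('M', na=False)
    df.loc[is_m, 'Size'] = (
        pd.to_numeric(df.loc[is_m, 'Size'].str.replace('M', '')) * 1024 * 1024
    ).astype(str)

    df['Size'] = pd.to_numeric(df['Size'])   # finally safe to convert

    # Price: 'Free' -> '0', strip the dollar sign, convert
    df['Price'] = pd.to_numeric(
        df['Price'].str.replace('Free', '0').str.replace('$', '', regex=False)
    )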
So this is a good one. We could iterate over each price, over the whole Price column, and just assign one value or the other. We could use an apply function, something like df['Price'].apply with a lambda taking a price p. Let me check it as we write it... and it didn't work, actually, which is a good thing, because I had it backwards: it's 'Paid' if the price is greater than zero, 'Free' otherwise. And that's exactly why I don't love the apply approach. Also, at this stage, if you're still learning data cleaning, you might not know that method yet. There is a simpler way that is just leaning on how pandas works: initialize everything first. Let's see if this works. We say df['Distribution'] = 'Free', as if every app were free; let's start that way. We check the activity: it doesn't pass, as expected. But then we say df.loc, for all the apps with a price greater than zero, set the Distribution column to 'Paid'. Now we check the activity, and it works. The way this method works (and to me it's the most convenient, so let me get rid of the apply version for a second) is: if you have a few rules, start with the one that covers most of the rows, say the 70% of apps that are free, and then work your way through the other conditions, overwriting the rest. It's a very common technique. All right, now that we have finished with data cleaning, we can do a quick analysis to answer all these questions. Which app has the most reviews? We can do df.sort_values(by='Reviews', ascending=False).head(). By the way, as usual, this is a good time to pause the video and try to solve it by yourself; I'm going to move quickly through these activities, because this is analysis of the kind we've already done, though I think it's still interesting. Another way that works: df.loc[df['Reviews'] == df['Reviews'].max()]. That is Facebook. Good. Which category has the most uploaded apps? That's df['Category'].value_counts(): Family is the category with the most apps, it seems. To which category does the most expensive app belong? df.sort_values(by='Price', ascending=False).head(): 'I'm Rich Trump Edition', Lifestyle. What's the price, $400 for an app? Okay. And it works again; it's a Lifestyle application. What's the name of the most expensive game? Find the most expensive app in the Game category and enter its name. Basically, we do the same thing as here, but first we filter with df.loc, querying where the category is 'Game'. And the most expensive game seems to be 'The World Ends With You'. There you go. Which is the most popular finance app? We first filter by finance, so category equals 'Finance', and then, since it asks for the one with the most installs, we sort by Installs.
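As an aside, the Distribution technique from a moment ago, both approaches side by side, sketched with the 'Price' and 'Distribution' names assumed:

    # Start with the value that covers most rows, then overwrite the exceptions
    df['Distribution'] = 'Free'
    df.loc[df['Price'] > 0, 'Distribution'] = 'Paid'

    # The apply-based alternative (easy to get backwards, as happened in the video)
    df['Distribution'] = df['Price'].apply(lambda p: 'Paid' if p > 0 else 'Free')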
And the most popular finance app, which is kind of obvious although I hadn't thought about it, is Google Pay. That makes total sense, and that activity also passes. Which teen game has the most reviews? So, which app from the Game category is rated 'Teen' in content rating? df['Content Rating'].value_counts(): we hadn't looked at the content ratings yet. Teen, okay. So I filter where category equals 'Game' and content rating equals 'Teen', and then find the one with the greatest number of reviews: sort_values(by='Reviews', ascending=False), first five. Asphalt 8: Airborne seems to be the most reviewed of all the Teen-rated games. Let's take a look... yep, that works as expected. Which paid game has the most reviews? We could use our Distribution column, filtering where it's 'Paid', and otherwise it's the same thing: Game category, price greater than zero, sort by Reviews descending. Hitman Sniper? No, we probably need to combine the price condition properly... what paid app from the Game category has the greatest number of reviews? The checker wants Clash of Clans. Let me try that, because I think I found an error: yeah, this activity should say what free game has the most reviews. That needs fixing; ideally, by the time you're watching this, it has already been fixed. We can actually report an issue right here, 'report an issue', for that previous one: the activity asks for paid apps, but the solution is for free apps. Submitting that, just in case. Finally: how many terabytes were transferred, overall, by the most popular Lifestyle app? This is a very interesting piece. The app that produced the greatest amount of bytes transferred; enter the answer in terabytes as a whole number, rounding down to the nearest integer. For example, if you find the total transferred to be 780 point something, just enter 780. The most popular app: df.query, where category equals 'Lifestyle' (let me check the spelling of the value... 'Lifestyle', there you go), then sort_values(by='Installs', ascending=False), because the most popular app is the one that was installed the greatest number of times. Then .iloc[0]. I'm going to show you that first: Tinder was the app with the most installs, and with .iloc[0] I get that first row into an app variable. This app was installed this many times, and Size is the size in bytes, so the total is app['Installs'] times app['Size']. But this is in bytes, and we have to divide until we reach terabytes: bytes to kilobytes, to megabytes, to gigabytes, to terabytes. There you go. So let me show you: this many bytes, this many terabytes transferred. And what comes after terabytes? We can ask the assistant in the meantime, because this is a big number. But the answer in terabytes, entering the whole number rounding down, is 6484. There we go, and we can check the activity real quick.
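The whole computation as a sketch, assuming the cleaned numeric 'Installs' and 'Size' columns from the previous activities:

    # Most-installed Lifestyle app, then total bytes moved, scaled down to terabytes
    app = (
        df.query("Category == 'Lifestyle'")
          .sort_values(by='Installs', ascending=False)
          .iloc[0]
    )
    terabytes = app['Installs'] * app['Size'] / 1024 / 1024 / 1024 / 1024
    int(terabytes)   # round down to the whole number the checker expects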
I think it's petabytes that follows terabytes, but the assistant broke. Anyway, the app that produced the biggest amount of data transferred was Tinder, with some 6,000 terabytes, which I think is around 6.3 petabytes; I'm not sure. This was a great project, and we did a lot. Remember, we started with the whole data cleaning process which, as usual, took a lot of detective work: we had to find out why some of these columns were incorrectly parsed as objects (suspicious), and we went about removing things, converting to numeric, iterating. There are two or three very important ideas I'm trying to transfer to you here. The first one is that detective work: trying to convert things to numeric, finding where the issues are, removing things, doing this iterative work. The second one: if you're modifying a data frame, always keep copies; we were able to fix an issue precisely because we had kept one. And the third one: keep everything as a string until you're safe to turn it back into a number. The same thing applies if you're converting something to a date: keep working with strings, strings, strings, until you're ready to convert. The whole idea of thinking about where the data came from was also very interesting: it was scraped, so when we found the duplicated data, we assumed the row with the most reviews was the freshest copy. Doing that kind of reasoning on top of your data is basically the data science process, so if you're comfortable with all this, congratulations, because you're pretty, pretty close to becoming a data scientist. Now, for this next project, we're going to combine our data wrangling and data cleaning skills to finally perform some data analysis using group by operations. It's a pretty interesting one. The data we're going to be analyzing is information about Premier League matches. The Premier League is the top-tier, top-flight football (soccer) league in England, potentially in the world, and the dataset we have has information about matches from multiple seasons, starting with the 2006-2007 season. It's a very, very interesting dataset, because we have the teams that played, the goals scored, and the final results. So anyway, we're going to go one by one through the activities. As usual, pause if you want to solve them yourself, then resume and watch me perform the different activities, borrowing some ideas from my solution if you like. So let's get started: replace invalid values in the season column. Before we can replace invalid values, we have to identify them. To do so, we take a look at the seasons with value_counts. There you go: I see 2006-2007, 2007-2008, 2008-2009, 2009-2010, et cetera. But then we have this question mark here. This is one of the first examples of an invalid value in a column. In this case, it's a string column, and the value is invalid because a question mark is not within the format of a season. There is no enforcement in these strings, they're just strings, but we can see very clearly that there is a clear pattern, a format: an integer, a dash, an integer. And of course, a question mark doesn't match that.
So we have to find those values and replace them with 'Unknown season'. To do that (pause now if you want), we select df.loc where df['season'] equals '?'. There we go, all these rows, and we assign 'Unknown season' to the season column for that selection. Now they should be gone; let's check the activity to see if it works. It does, so we can keep moving forward. Identify invalid values in goals scored. This is a pretty interesting one as well, because it still involves a bit of data cleaning, but conceptually the decision tree I'm running in my mind is different. To identify invalid values in goals, I take a look first, again, at the info() method, and you're going to see that both home_goals and away_goals are integer columns. That is correct; those are supposed to be integers. So how could we have an invalid value here? It would be different if, for example, we saw an object dtype, because that would mean pandas found something it couldn't parse as an integer and fell back to strings. But in this case they're all integers, and the type is valid. Given that we've identified the value type, what's the next decision we have to make? Whether the range of potential values is met. A few other projects deal with similar situations: rainfall, for example, can only take values in a certain range; electric power in a house will be between, I don't know, 0 and 110 or 115. Every quantity has a range of valid values, and that gives us a second criterion for defining the validity of a column. So we'll use a little bit of plotting to aid the process of identifying invalid values: df['home_goals'].plot(kind='hist') first, a histogram. We can quickly see that there are some invalid values, because they are below zero, and in a football match the minimum goal count is zero. We don't see anything weird on the right-hand side, no extreme goal counts. If we saw something like 20, 25 or 30, that would be a little more difficult, because it's valid in terms of range, but it's an outlier. Has there possibly been a Premier League football match with 20 goals scored? I mean, it's possible, but highly unlikely, and that's the third step, where we'd have to start digging a little more to understand the value. Thankfully, that doesn't happen in this case. I'm going to turn this into a box plot so we can visualize it, and flip it horizontal with vert=False. There we go, and we can see pretty much the same thing: most of the goals are concentrated here, and we have a few samples below zero that are invalid. I'm going to add away_goals now. Visualizing away goals, we can also see the zero line and a bunch of invalid values right there. Interestingly, the away goals are all shifted to the left; that means home teams tend to score more goals, which makes sense if you watch football, or soccer.
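A quick sketch of both steps; I'm assuming the columns are named season, home_goals and away_goals, as we've been calling them in this project:

    # Replace the '?' seasons with the required placeholder
    df.loc[df['season'] == '?', 'season'] = 'Unknown season'

    # Eyeball the valid range: negative goal counts stick out immediately
    df['home_goals'].plot(kind='hist')
    df[['home_goals', 'away_goals']].plot(kind='box', vert=False)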
So we have pretty much understood the validity of the column: we know it's an integer, we know it has a range, because we know the sport, and we have identified that there are values outside that range. Now we have to take a look at them. What we do here is take df['home_goals'] < 0 and sum it: 34 home goals are invalid, and 39 away goals are invalid. Sometimes people have trouble with that boolean-sum trick. If you want to think about it in a different way, we can do something like df.loc, where home_goals is less than zero or away_goals is less than zero, and take a look at the offending rows. You can count them with shape (only 34 for home goals here) or with sum, and since the result of this expression is a data frame, we can chain pretty much everything together. So, to answer: 34 invalid home goals and 39 invalid away goals. Let's check the activity, and it works. Now, moving on to the third activity, it's pretty much the same thing we did before: df.loc, where home_goals is less than zero, set home_goals to zero; then we do the same thing with away_goals. We check the activity, and it works. Identify and clean invalid values in the result column. This is a very interesting one, because we keep touring the different data cleaning scenarios and techniques we have to approach. First we had simple strings, where we had to find a pattern: year dash year, where the second year is the first plus one, four-digit integers. A free-form string with a clear underlying pattern, and the invalid value was a pretty clear question mark that broke it. We then jumped to an integer, which was the correct type of value for goals, but with a given range that defined validity, and of course there were values outside that range, making them invalid. Finally, we have the one we're dealing with here: the result column. The result column is interesting because (value_counts... it took a little time, I got auto-completed) the idea is that result is a categorical variable. The values it can take are restricted: it's either 'H' for a home victory, 'A' for an away victory, or 'D' for a draw. And that's it. Anything outside that set is considered invalid. These are usually the simplest invalid values to identify, so to speak, because we can quickly see whether a value falls outside the small subset of possible values; these are the only three possible values. That is different, again, from what we saw before with the seasons, where we had to apply a pattern: that was a free-form string with a pattern underneath, which can take any number of values (1950-1951, for example), so it's a little more difficult. In this case, it's pretty obvious that everything with a question mark is an invalid result. Now, how can we clean that?
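We'll get to that in a second; first, the counting and the clamping from this activity, sketched with the same assumed column names:

    (df['home_goals'] < 0).sum()   # 34 invalid values
    (df['away_goals'] < 0).sum()   # 39 invalid values

    # Clamp the out-of-range values to zero
    df.loc[df['home_goals'] < 0, 'home_goals'] = 0
    df.loc[df['away_goals'] < 0, 'away_goals'] = 0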
Well, we can already calculate this result ourselves: the result column is a calculated column, because you can get it by computing the result from the goals. One-one is a draw: if the home and away values are the same, if there is an equal number of goals, that's a draw. If the home goals are greater than the away goals, that's a home victory, and if the away goals are greater than the home goals, that is going to be an away victory. So fixing it is fairly easy. We can do df.loc, where df['home_goals'] is greater than df['away_goals']. There you go. If we run value_counts on the result column for that selection, all these results should be valid, right? They should all be 'H'. But no: we not only have this question mark, we also have other invalid values in there. Since we can calculate the result, this is a pretty interesting activity: we just step over the value for that selection, because we know it has to be a home victory. And we can do the same thing for pretty much every other possibility: if away_goals is greater than home_goals, that's going to be an 'A', and finally, if home_goals equals away_goals, that's going to be a 'D'. Pretty much the same thing three times. And now we can count again, and we have only the possible values. Let's give it a try, and it works as expected. All right, moving on to the analysis piece, where we start doing some interesting analysis with group by activities and all that. What's the average number of goals per match? We don't have that value, goals per match, but we can easily calculate it on the fly: df['home_goals'] + df['away_goals'], and (sorry, let me zoom out) .mean() gives us the average very quickly. The activity asks us to calculate the average number and enter it with two decimals: that's 2.66, if I'm not mistaken. And the activity passes. Next one: create a new column, total_goals. So now it's pretty much asking us to store that sum: df['total_goals'] = the same expression. There we go, and we can check the activity real quick. It passed, as expected. Now, calculate the average goals per season, sorted by season, and store it in a goals_per_season variable. The way it works (let me zoom in) is df.groupby('season'), and from that grouping we take total_goals and get the mean. There we go. We need this sorted by the index, which is the season name, so that's sort_index(), which happens to give us the same result; we were lucky with the order. We can store that in the goals_per_season variable. Let's check the activity and see if it passes. There you go, it passed; we are good to keep making progress. What's the biggest goal difference in a match? This is an interesting one as well. There are two possible directions, a home win or an away win, and we have the difference of those two. So we have something like df['home_goals'] minus df['away_goals'], and we can take the max. We could do the same thing with away goals minus home goals. So this max is the biggest goal difference for a home win.
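Before we finish the goal-difference question, a quick recap in code of the last few activities:

    # Recompute the categorical result straight from the goals
    df.loc[df['home_goals'] > df['away_goals'], 'result'] = 'H'
    df.loc[df['away_goals'] > df['home_goals'], 'result'] = 'A'
    df.loc[df['home_goals'] == df['away_goals'], 'result'] = 'D'

    # The derived column and the grouped average
    df['total_goals'] = df['home_goals'] + df['away_goals']
    goals_per_season = df.groupby('season')['total_goals'].mean().sort_index()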
And the other direction is the biggest goal difference for an away win. We could also have used the same expression and taken both max and min: the min is minus six, because in that case it was an away win, the away side won, and that gives us a negative result. But if I don't care about the sign, because what I care about is the magnitude of the value, eight or six, then what I can do is just take the absolute value with .abs(). There we go, and now we can get the max value. Or, to also bring up the match itself, we can do sort_values(ascending=False).head() and see the rows. Either way, we arrive at the same result using different techniques. The most interesting piece for me, at least here, is understanding the abs method: six or minus six is the same to us in this case, because we care about the magnitude, the goal difference; we don't care whether it was the away team or the home team that won. The absolute value gives us that magnitude. Let's answer with eight anyway; it was pretty obvious it was a home win. If you know a little bit about football, or soccer, that kind of scoreline is game over. Which team has the most away wins? This is a very interesting one as well. We could start with df.loc, where df['result'] equals 'A': all the away victories. Now we can group by away_team, because what the question asks is which team has the most away wins, and in an away win the winner is, by definition, the away team. Then we take .size(), and we can sort with sort_values (by nothing, because this is a series) with ascending=False, take head(), and we have it: the team with the most away wins is Chelsea. Let's give it a try first, and then I'm going to show you a different way to calculate the same thing. There you go, it works. The different way is actually the solution proposed here, so I'm just going to go ahead and reveal it, copy and paste it so we don't waste too much time, and explain it right here. Basically, what is happening is: we group by away_team, and when we do this grouping, we have, for each away team, all of its results. For Liverpool, we'd have all of Liverpool's away results, and that includes draws, home victories, and away victories. Let me show you that group: df.loc, where away_team is 'Liverpool', as they say. There we go. So when we group by away_team, picture the full data frame: it has Liverpool here, and Liverpool here, and here, and here; it has Manchester City (oh, I picked green, why?), Manchester City, Manchester City, Manchester City; and another Premier League team, say Wolves: Wolves, Wolves, Wolves, Wolves. What the group by operation does (let me use black) is identify the rows whose grouping values are the same and pre-package them. I don't know if you've heard the term, but it's called split, apply, combine: it creates tiny groups with the different samples that meet each grouping criterion.
So it's going to put all the Liverpool matches here, all the Manchester City matches here, and all the Wolves matches here, and give you the option to work with each group: get the away goals and do a sum, get the results and take the size, any operation on these groups. Then it will assemble them back into a table: Liverpool, whatever; Manchester City, whatever; Wolves, whatever; Manchester United, whatever. It combines all the little pieces it created. But before the final operation, we have the chance to run a custom function, and that's the apply right there. We can pass a custom function that is going to run on each individual group. The function receives all the rows (so here it would get a draw, an away win, a home win, an away win, et cetera) and performs the operation we want, which is counting the ones that are an away win. We select only the rows that are an 'A', do a sum, and return that value: here it's going to be two, there just one, there zero. Finally, pandas by itself takes care of the combination at the end: Liverpool two, Manchester City one, Wolves zero. And that's the way this function works. Let's clear this whole thing and run it, and we get pretty much the same result as before: Chelsea, Manchester United, Arsenal, et cetera. So the way to think about it is: you can filter your data before grouping, or you can perform the operation after you have grouped the data, on the individual groups. Okay, let's keep moving forward, because we have just three more activities to go. Which team has the most goals scored at home? This is a pretty interesting one as well. Take the time now to pause the video and solve it by yourself (I selected Chelsea there by mistake, I don't know why; again, take the chance to pause and do it yourself). Basically, it's asking which team has the most goals scored at home, so it's the home team: we group by home_team, take home_goals, get the sum, then of course sort_values(ascending=False) and head(). There we go, and we have that Manchester City seems to be the team with the most home goals scored. So let's try that out, and that is correct. A pretty dangerous team in attack, Manchester City, for sure; this data is outdated, but especially now. Okay, second to last: which team received the least amount of goals while playing at home? This is also a very interesting one, because if I were naive, I'd do something like: group by home_team, take away_goals (the goals the home side received), sum, sort_values, head. There you go: Charlton Athletic received only 20 goals, right? But the reality is that, if we check df.loc where home_team equals 'Charlton Athletic', they only played 19 games at home.
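Before tackling that ratio, here's the stretch we just covered, sketched with the assumed team columns:

    # Goal difference: sign doesn't matter, only the magnitude
    (df['home_goals'] - df['away_goals']).abs().max()   # 8 in this dataset

    # Away wins: filter first, then group and count...
    away_wins = (
        df.loc[df['result'] == 'A']
          .groupby('away_team')
          .size()
          .sort_values(ascending=False)
    )

    # ...or group first, then count inside each group with a custom function
    df.groupby('away_team')['result'].apply(lambda results: (results == 'A').sum())

    # Most goals scored at home works the same way
    df.groupby('home_team')['home_goals'].sum().sort_values(ascending=False).head()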
So it's interesting, because what this activity is really asking for is the ratio of goals received over home games. And if you have not paused yet, I think this is a very, very good moment to pause and try to resolve it by yourself. But anyway, let's just do it. We'll keep the same group by operation, but we'll try to get two things out of it: the total goals they received and the total matches they played. So we'll take away_goals, and we'll also take home_team itself; two different operations. For that, we're going to use the agg method with a dictionary: for home_team, perform the size operation; for away_goals, perform the sum. Let's take a quick look at what we have. There we go. Now, sort_values by... home_team. I did something wrong here, my bad. Oh, because both columns still carry the original names. So let's rename: rename(columns=...), calling home_team total_games and away_goals goals_received. There you go. Let me break this into a couple of lines really quickly, and ascending=False... there you go. Liverpool, Tottenham: these are the usual suspects, right? The teams that have always played at the top of the Premier League. What we can do now is sort by two things: by total_games, and also by goals_received. But we want to flip the sorting between them: we want the teams that played a ton of games, so ascending False there, and among them the ones that received few goals, so ascending True there. And now we get Manchester United: the team that played the most games with the least goals received. Let's give this result a name, something like home_games_per_team, and take head. We can go one step further and compute a proper ratio, goals_received_per_match, which is going to be the result of home_games_per_team['goals_received'] divided by home_games_per_team['total_games']. Let me break this into multiple lines so it's a little more readable. And now we do pretty much the same thing as before, but sort_values by goals_received_per_match and head: I get the same result, Manchester United with the best ratio in terms of goals received per match. Let's actually try it out, because I haven't; let's see if it's the team the activity is waiting for. That is correct. Basically, what I'm saying here is that, for the same grouping operation, we computed two different things: total games and total goals received (that was away_goals, because the home team receives the away team's goals). If you allow me a very quick drawing: we have the big chunk of teams, then we did this separation into small groups per home team, and then, for home_team, the aggregation gave us the size, while for away_goals it gave us the sum, summing all the away goals. And we did these two operations on all the individual chunks to compute the final result.
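The whole thing as a sketch; note the agg dictionary keys are the original column names, including the grouping column itself for the size:

    # One grouping, two aggregations, then an analytical ratio
    home_games_per_team = (
        df.groupby('home_team')
          .agg({'home_team': 'size', 'away_goals': 'sum'})
          .rename(columns={'home_team': 'total_games', 'away_goals': 'goals_received'})
    )
    home_games_per_team['goals_received_per_match'] = (
        home_games_per_team['goals_received'] / home_games_per_team['total_games']
    )
    home_games_per_team.sort_values(by='goals_received_per_match').head()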
We quickly identified Manchester United as the team meeting the requirement of the activity, but we went one step further to actually compute this goals_received_per_match value. Why is that? Because when you want to make this analytical, and potentially automated, you need an analytical measure. What else is there? I thought there was a proposed solution with everything in a one-liner using an apply operation; it's potentially doable if you want to give it a try, but I think readability matters in this case. And finally: which team scored the most goals playing as a visitor, away from home? This is going to be pretty much the same thing as before: group by away_team, and we take... I already forgot the name of the column; it's away_goals. Then sum, get a preview, and now sort_values with ascending=False. And it seems to be Arsenal. Let's give it a try, and it passed; we completed our project. So this was pretty much it. We started with a little bit of data cleaning and data wrangling, which was pretty interesting, because we had a good variety of different issues with our data. We had the invalid seasons, a string with an invalid pattern; then we moved to the invalid count of goals, an integer out of range; then we cleaned the result column, a categorical variable whose values should belong to a small subset, and we had a way to clean it by just calculating it from something else in the data frame, which is a nice trick. And then we did a bunch of group by activities, all very interesting. So replay it if you want; otherwise, we have a bunch of other projects dealing with data wrangling, group by, and all these operations. This next one is a very interesting project that combines data wrangling skills, which involve merging data frames and creating new columns, with some data cleaning skills, because it's going to require you to take a look at the data and do some cleaning to make those merges work, and finally some analysis, some question answering, with group by operations and transform operations. So it's a very, very interesting project for sure. As usual, I encourage you to pause the video first, try to resolve it by yourself, and then take a look at the solutions, because maybe I'm choosing a different way of doing what you're doing, and you can see something different. With all that said, let's just get started. I have the first few cells executed already, as you can see, and let's take a look at the data we're going to be working with. In this case, we're working with NBA basketball information. Basically, we have the 2017 season stats in one data frame, which involves the usual per-player information aggregated over the whole season, and then we have a players_df data frame that includes the personal information of the players: their name, their weight, birth date, college, et cetera. The important piece here is that the players data frame contains information about all the players in the recorded history of the NBA, from the 1950s on, while the information in the stats data frame covers only the 2017 season. So we're going to start by doing a merge.
But of course, it's going to be an interesting one, because we have to merge in a specific way. The first activity (and here, again, I encourage you to just pause) involves merging the data from the 2017 season with the players, performing a left join. Basically, we want to enrich our season's data frame with the personal information of each player: bring that data into our existing data frame for this particular season. Okay, so as you might remember, there are different types of merge: we can do an inner, we can do a left or right outer, we can do a full outer. In this case, by performing a left outer join, we keep all the information from the 2017 data frame and bring in only the matches from the players_df data frame. So let's go ahead and do that. Again, this is your opportunity to pause the video. To do the merge, we start with season_2017_df and call merge with players_df, passing how='left', because that's basically what's asked; the reasoning, again, is that we want to keep all the information from the 2017 side. Then we pass the columns to merge on: on the left side, 'Player', and on the right side, 'name'. It's the same information, but in two differently named columns. If you remember, we have the on-based attributes to refer to individual columns, and we also have the index variants if you're merging by index; in this case, we're just using the regular columns. We assign all of this to a data frame, and let's check the activity to see if it passes. We can take a look at df. It's going to be a long data frame, so I can show you something like df.head(1) and transpose it, so we can see all the columns of the first row right here. Let me do a little bit of drawing, if I'm allowed. Selecting red: basically, we have all the season rows making a long data frame, and in blue, the players being merged in. The resulting merge (let's use another color, say purple), since it's a left outer join and this is the left side, brings the right-hand data frame over and matches it against whatever key we have in common. If a player appeared twice in the 2017 season, for whatever reason, say a player played for two different teams in the same season, we would actually duplicate the right-hand information into both rows. And that's perfectly fine, because what we're doing is complementing each row with the personal information of the player. With that said, let's take a look at the merged data. It's a large data frame, but what's going to give us the key to whether the merge was successful? The important piece is the name column, because some columns are going to be nullable. For example, college is nullable, but a null college doesn't indicate that the merge wasn't successful; it just means college was missing. The name column, though, is not nullable.
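By the way, the merge itself, as a minimal sketch; the frame names season_2017_df and players_df and the 'Player'/'name' columns follow this project's setup:

    # Left outer join: keep every 2017 row, bring the personal info across
    df = season_2017_df.merge(
        players_df,
        how='left',
        left_on='Player',   # column in the season stats
        right_on='name',    # column in the players frame
    )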
So that means that if we find a row in the df variable that is null in name, the merge for that row was not successful. So, are there any misses in the data frame? That's the second activity. We can quickly take a look with df['name'].isna(), and here we can do, for example, .any(): this is True, so we can answer that real quick. And the next one is going to ask us how many, so we can do a .sum() right there and find there are four missing, four players we basically couldn't match. So there are four rows (and I think this will be a little clearer once we visualize it) in the seasons data frame that we couldn't match to a valid player. Which is strange, right? Because we know that the players_df data frame contains information for all the players in the NBA. So how is it possible we couldn't match four players? Let's actually take a look at those people; that's the next activity right here, extracting the names. Before we do so, I just want to show you who these missing players are. Let's take a look at the rows in df that have a null value in name: those are these four players right here. We're going to take a look at each of them in a second and see why we had that mismatch. The activity is asking us to basically find those four people and put them in a list, so that's what I'm going to do: take that selection, pull the 'Player' column, take .values, and finally wrap it in list(). Now player_misses contains the list of players that are missing; we're going to check the activity, and it passed correctly. This is a very interesting activity, and I think it's the most representative of the data cleaning process in general, for any data scientist, data analyst, or data engineer: actually fixing these missing-value issues. There are different strategies. Sometimes you can just delete the missing rows, because if you have millions of rows and there are a few that are null, you can just delete them. Sometimes you can replace them with something else. And sometimes you have to actually fix them: you have to figure out what happened, and just fix it. The way we're going to do that is some detective work, and this is just perfectly normal in the whole data engineering, data science, data analysis process: taking a look at your data and trying to understand, with domain knowledge (and this is a very important point, with domain knowledge), what went wrong. And domain knowledge is very important because, although we will actually perform this cleaning and fix these values, we need to understand a little bit of what happened in the basketball season to do it. The same thing happens, and I think I have already used this example in another project, if I'm working as a data scientist, for example, in a biology lab or in a nuclear physics lab: there are going to be a ton of data cleaning tasks that I will just not be able to perform, not because I don't know the techniques (I'm very good with pandas), but because I need to understand the domain that I'm working in to be able to perform some operations and understand my data.
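In code, the checks and the extraction from this activity:

    df['name'].isna().any()   # True -> at least one row didn't match
    df['name'].isna().sum()   # 4 misses in this dataset

    player_misses = list(df.loc[df['name'].isna(), 'Player'].values)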
So, for example, if you have no idea about basketball and you see a value for points of 100,000, maybe that value looks good to you, but of course it's pretty hard to make a hundred thousand points in a season in the NBA. By knowing the domain, you immediately place the value within a plausible range of possibilities, so your data cleaning process is a lot easier and more precise. The same thing is going to happen here. And actually, let's take a look at the activity: it tells us, with confidence, that the players actually are in players_df. So there is something wrong with players_df, the right-hand data frame. players_df.head(): there is something wrong with the names here that somehow didn't match what we had in the stats of the 2017 season, and the objective is to modify players_df so its names match whatever we had in the 2017 season. That's correct. So let's first identify where these players, present in the 2017 season but apparently not in players_df, actually are. Basically, we can make a quick drawing here: in the 2017 data we have this guy, Luc Mbah a Moute, who is apparently not in players_df. The activity tells us for sure that he is there; we have to trust that for now. But under what name? Basically, we have to find the match. There are multiple ways to do it, and this is, again, very detective work, but let's keep it simple. Maybe this player has a middle name that is part of the full name in players_df, but in the season data he wasn't listed with it. So we can just go ahead and do something like players_df.loc, with players_df['name'].str.contains and the same last name: is there any player in players_df that contains this last name? And we find the given player. What happened in this case is that the player either added a new surname, or always had the same surname, but in the 2017 season he was not listed with it. So we are seeing the first incompatibility, let's say, and we can say: in players_df, this player should be renamed to the name he had in the 2017 season. And we can keep working with this detective work. Maybe you know what happened with, or maybe you don't know this, Muhammad Ali: he was born Cassius Clay and, after converting to Islam, changed his name to Muhammad Ali. The same thing can happen to anybody; we can change our names. And that's what happened here as well. This next person now goes by a different name. For example, Sheldon McClellan, which seems to be the original name: I can just go to Wikipedia, born Sheldon McClellan, here he is, right? So the guy either changed his name or just made it shorter; we don't know, and to be honest, we don't care. We have just found another player under a different name. So let's search for 'Mac' here... no, 'Sheldon Mac'. Where is Sheldon Mac? Right here.
So, where is Sheldon Mac? Right here: in players_df we have Sheldon Mac, and we want that player rewound, so to speak, to whatever his name was in the 2017 season. Who else? James Michael: let's search for Michael, there he is, James Michael McAdoo, and check for changes; there's a new last name. So again, in players_df this person is James Michael McAdoo, and we want him renamed to the 2017 name. And at the end we have Metta World Peace, and here I'll do another trick, take a look at this: we first lowercase the names, and then search for 'metta' with a double t. There we go: it seems 'Metta World Peace' was the name in 2017, but now it's something different, so we're going to rewind again, if you allow me the expression, to the old 2017 name. We can store all of this in a names mapping, so we keep the names as a reference; and of course I'm missing the commas, as usual.

So how are we going to apply this? We find a player, in this case by the full name in players_df, select just the name column, and assign the 2017 name to it. There you go. If we redo the merge really quickly (I'll zoom out) and look for the players that don't match, you can see there are just three now, because we fixed the first one. So this works. We started with four, but we can "automate", quote unquote, the process. Let me re-read players, because I had modified it; let's do the merge again and go back to the state where those four players are missing, including Luc Mbah a Moute. Now we do the same process, but automated with a for loop, so we don't have to type the fixes one by one: for new_name, name_2017 in names_mapping.items(), find the person with the new name and replace it with the 2017 name. And of course I'm missing a column reference there; there you go. We check the activity to see if we did everything correctly, and it seems it worked. We perform the same merge again and, borrowing the earlier cell, check for misses: no misses, it worked as expected.

Okay, remove unnecessary columns. This is, again, data wrangling, very boring: we just have to get rid of a few columns. As I showed you with the first row transposed, there are a ton of columns we're not going to use, so we drop them: df.drop is the method, with columns set to the columns to drop and inplace=True, because we want to modify the data frame. We check the activity, and it works.
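Put together, the automated rename plus the column cleanup look roughly like this; the mapping entries and the columns_to_drop list are illustrative placeholders for what we built above:

```python
# name as it appears in players_df -> name used in the 2017 season stats
# (entries are illustrative; the real ones come from the detective work)
names_mapping = {
    'Luc Richard Mbah a Moute': 'Luc Mbah a Moute',
    'Metta Sandiford-Artest': 'Metta World Peace',
    # ...plus the other two players identified above
}

for new_name, name_2017 in names_mapping.items():
    # find the row with the current name and rewind it to the 2017 name
    players_df.loc[players_df['name'] == new_name, 'name'] = name_2017

# Drop the columns we won't use (columns_to_drop is a list built earlier)
df.drop(columns=columns_to_drop, inplace=True)
```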
Rename teams to their full names. This is a very interesting one, because it has a very simple solution, but one that's usually not expected; I see some very crazy solutions for this sort of renaming activity. The team of each player is a three-letter acronym, and we want to resolve it back to the full name of the team. To do so, we can use the replace method: df['Tm'].replace(team_mapping), using the mapping that was given to us. I'll execute it, and it didn't work at first, because the column is actually named Tm, not team. So we're replacing these acronyms one by one: whenever it finds OKC, it replaces it with Oklahoma City Thunder; when it finds DAL, it looks up Dallas in the mapping, wherever that entry is. It's just a mapping, applied value by value. And we can store the result in a column, team: df['team'], the full name, is the Tm column with the replace mapping applied. Now we do df[['Player', 'Tm', 'team']] and take the first 10 rows: IND is Indiana Pacers, SAS is San Antonio Spurs, MIN is Minnesota Timberwolves, so this looks okay; we'll see what's going on with this TOT value in a second, but for now this is good. We check the activity to see if we did the right thing, and there you go, it's working.

Convert birth date to a datetime object. This is another interesting one: the birth date of each player is a string, and we're going to convert it to a datetime. It's very simple in pandas: we do pd.to_datetime and pass the birth date column directly. But of course, this is an immutable operation that just returns a new series; what we want is to step over whatever value we had before. And this is a change we cannot undo so easily, so only step over the previous values when you're sure; if not, you can create a new column instead. In this case we're safe, so we convert the column in place of the old one, check the activity, and that works as well.

Going back to the TOT values I showed you earlier, let me borrow that cell: delete all players from the TOT team. First, let's look at duplicated players: df.loc[df['Player'].duplicated(keep=False)], keeping only these columns, and sort_values by Player; it's a long expression, but don't worry, it's pretty simple. There you go. Some players switch teams within a given season, right? There's a transfer window and they can be traded, so you see the same player twice, like this guy right here who played for both Cleveland and Dallas in the same season; he was transferred in the middle of the season. And the data set we have includes a kind of aggregation row, the TOT row, for totals: the totals of that person across the whole season, so many field goals, so many three-pointers, so many free throws, etc., just a season aggregate for that particular player. We need to get rid of those rows.
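The three pieces from this stretch, sketched out (team_mapping comes with the project; the column names Tm, Player, and birth_date are as used here):

```python
import pandas as pd

# Expand the three-letter acronyms into full team names,
# e.g. 'OKC' -> 'Oklahoma City Thunder'
df['team'] = df['Tm'].replace(team_mapping)

# Parse the birth-date strings into real datetimes,
# stepping over the old string column
df['birth_date'] = pd.to_datetime(df['birth_date'])

# Spot players appearing more than once (mid-season transfers + TOT rows)
df.loc[df['Player'].duplicated(keep=False)].sort_values(by='Player')
```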
So how are we going to get rid of these values? Could we reuse the duplicated condition from above? No, we can't: not every duplicated row is a TOT row. First, let's keep a copy, because we're going to be breaking things: df_copy = df.copy(). This is a very common technique when you're modifying something. As usual, I encourage you not to modify things, but if you're certain you want to delete something while you're still in the process of understanding and analyzing, you can make a quick copy of your data, provided it's not large enough to fill your whole memory, of course, because you're literally copying the values: if your data frame takes one gigabyte of memory, once you duplicate it you have two gigabytes allocated. As long as that's feasible, keep a copy; I usually make copy1, copy2, copy3, and so on, and that's pretty much the data wrangling process, until you figure things out and you're sure; then you remove all the copies and automate the work. For now we keep a copy in df_copy in case anything goes wrong, and I'll comment the line out so we don't accidentally re-execute it.

Back to the activity: we have to get rid of the players with a TOT team, df.loc[df['team'] == 'TOT'], all of these rows. There are multiple ways. The easiest is probably to keep everything else: df = df.loc[df['team'] != 'TOT'], that is, all the players that don't have TOT. Let's try it out... except I'm doing something wrong... there, it worked. Now let's bring back the old data frame from df_copy, so we have it as it was before, and I'll delete that cell so I don't accidentally execute it again. There's another way, which is to find all the rows we want to remove, get their indices, and use the drop method with inplace=True. To show it here I'd have to reset all the activities, but trust me, that works as well.

All right, we have pretty much the whole process done: we merged the data frames, found the mismatches where player names had changed, rolled them back to their 2017 names with some investigation, some detective work, merged the data, got rid of columns, created readable team names, and did a bit more cleaning, removing the TOT rows and all that. Now we can actually start with the analysis, and this is going to be very interesting. What's the team with the most players in the league? We have information about all the players in the 2017 season, and we know some players switched teams during the season. Which team had the most players registered? That's pretty simple, because we can just do a value_counts and a head, and it tells us the New Orleans Pelicans were the team with the most registered players, 27 in total. And that's correct.
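Sketched out, the safety copy and the two removal options look like this (only one of the two options is needed):

```python
# Safety copy before destructive changes
df_copy = df.copy()

# Option 1: boolean filtering, keep every row that is not a TOT total
df = df.loc[df['team'] != 'TOT']

# Option 2 (equivalent): drop the TOT rows by index, in place
# tot_index = df.loc[df['team'] == 'TOT'].index
# df.drop(tot_index, inplace=True)

# Team with the most registered players in the season
df['team'].value_counts().head()
```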
What's the team with the lowest field goals? Let's take a look at the data again so you understand what we're trying to do. Each row is a player in the 2017 season, merged with personal information we don't need in this activity; what matters is that we have the field goals each player scored in that season. We want to aggregate all the field goals of a given team, across all its players, and find the team with the lowest total. To do so, we group by team, take the field goal column, and sum it. Of course, the raw output is a mess, so we chain sort_values, by nothing, because it's just a series, ascending, and then a head method. We find that the Dallas Mavericks were the team that scored the fewest field goals, to put it one way, in the 2017 season, just by aggregating all its players' stats. Let's answer it, and it's marked correct.

What is the team with the best field goal percentage? This is a very interesting one, because field goal percentage is defined as field goals divided by field goals attempted, and that's data we actually have. Let me show you the values again: this player has 134 field goals over 341 attempts, so a bit under 50 percent. You tried 341 times to score and made 134, so your accuracy, to put it that way, was 134 / 341 ≈ 0.39, or 39 percent. That's the value we want to compute, but per team, so we have to aggregate all the field goals and all the attempts of each team and then take the ratio. We start pretty much like before: df.groupby('team'), but now we take two columns, field goals and field goal attempts, and sum both. The method doesn't exist at first because I mistyped groupby; there we go. We store this intermediate data frame in a variable, say fg_per_team, and preview it. Now we compute a new column, the field goal percentage, equal to fg_per_team's FG column divided by FGA. Then we sort_values by that percentage and take the head; since we want the best, we need ascending=False. The best field goal percentage belongs to the Golden State Warriors. Let's check the activity first. And if you look up the 2017 NBA season: playoffs, Western Conference champions, Finals champions: Golden State Warriors. So it's no coincidence that the team with the best percentage won the championship. Anyway, just out of curiosity, let's look at the other end, the lowest: the Memphis Grizzlies and the Dallas Mavericks were the teams with the least accuracy, to put it that way.
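As a quick sketch, assuming Basketball-Reference-style column names FG and FGA:

```python
# Total field goals per team, lowest first
df.groupby('team')['FG'].sum().sort_values().head()

# Field goal percentage per team: aggregate makes and attempts, then divide
fg_per_team = df.groupby('team')[['FG', 'FGA']].sum()
fg_per_team['FG%'] = fg_per_team['FG'] / fg_per_team['FGA']
fg_per_team.sort_values(by='FG%', ascending=False).head()
```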
Okay, so what's the difference between the best and worst three-point shooters by position? It is known that shooting guards are the best three-point shooters by efficacy or accuracy. The question is: what's the difference in accuracy compared with the worst three-point shooting position? Note: use the position from the POS column. So we group by position now, compute both three-pointers made and three-pointers attempted, and sum both; it's pretty much the same thing we did in the previous activity, but by position. Let's store it in a variable, say pos_3p_accuracy. Now we create the new column, call it the three-point percentage: three-pointers made divided by 3PA, the attempts. Then we sort pos_3p_accuracy by that percentage with ascending=False: the best position is shooting guard, as expected. Again, this is just basketball domain knowledge: shooting guards are usually the ones shooting threes with the best accuracy. But what the activity actually asks is the difference in accuracy between shooting guards and the worst position in the data frame, which we found was PF, the power forward position. What's the difference? We could subtract the two values by hand, or we can take the max of the percentage column minus the min, which gives pretty much the same value without hard-coding anything. The difference is 0.024, so in percentage terms that's 2.4%, this value right here. A tiny difference between the best position and the worst, but of course, with so many attempts across a league season, it potentially adds up.
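A sketch of the whole computation, assuming the columns are named Pos, 3P, and 3PA:

```python
# Three-point accuracy per position
pos_3p_accuracy = df.groupby('Pos')[['3P', '3PA']].sum()
pos_3p_accuracy['3P%'] = pos_3p_accuracy['3P'] / pos_3p_accuracy['3PA']
pos_3p_accuracy.sort_values(by='3P%', ascending=False)

# Gap between the best and worst shooting positions (~0.024 here)
pos_3p_accuracy['3P%'].max() - pos_3p_accuracy['3P%'].min()
```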
All right, the next one is a very interesting activity; I seriously encourage you to pause here and try to resolve number 15 by yourself. It's a very interesting one, and it might involve a ton of googling, because it's not trivial. So this is your chance: pause the video and try to solve it yourself. I'll give you one second... and now we'll resume, and I'll solve it live right here so you can see how I approach the problem.

Let's take a stab at it. The activity asks us to find the best scorer per team: we want to generate a table with the name of the player, the team, the position, and the total points, for the top scorer, in terms of total points, of each team. So we can start with something like df.loc[df['team'] == 'Oklahoma City Thunder'] and take the max of the points column: that's the max points scored by anyone in Oklahoma. Then we find the player by selecting the rows where PTS equals that value, and we find it was Russell Westbrook. We can put everything in a one-liner: first find the max points in a team, with the team filter on df['Tm']... I don't remember Cleveland's code, so let's do Boston, which is probably BOS; let's just try it out: BOS, Boston, there we go. We filter by team, find the max points scored in that team, and select the best scorer of Boston. I want you to understand the problem first, because this is not the solution we're going to use: it involves a ton of manual work per team. We could iterate, of course: get all the teams with a .unique() on the team column, loop over them, replace the hard-coded team in our filter, and print the player, the team, and the points, PTS; my variable names are terrible, sorry. And I'm doing something wrong here, "not enough values to unpack"; let me print the thing and break... okay, I'm definitely doing something wrong, which I'll figure out, but the point stands: we can solve it manually. Let's just print it: Russell Westbrook for Oklahoma with this amount of points, Harrison Barnes for the Dallas Mavericks with this amount of points, let's see Houston... and Nikola Jokic for the Denver Nuggets. We could solve it manually, one team at a time.
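For reference, here's a cleaned-up sketch of that manual loop, assuming Tm, PTS, and Player columns:

```python
# Manual approach: one filter per team, just to understand the problem
for team in df['Tm'].unique():
    team_df = df.loc[df['Tm'] == team]
    top_pts = team_df['PTS'].max()
    player = team_df.loc[team_df['PTS'] == top_pts, 'Player'].values[0]
    print(player, team, top_pts)
```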
But the solution I want to show you, and it took me some time, involves a group by operation, to put it one way. I'm sorry, I've been drawing for you a lot lately, but I'll show you a drawing to explain the solution we're going to use. Basically, we have our full data frame; I'll use black for the data frame (there's actually a way I can turn this thing into a whiteboard, there you go). Our data frame contains information about different players; let's use red for Chicago, blue for Oklahoma, and green for Boston (I need a darker green). The players of these teams are intertwined throughout the whole data frame. Say this row right here is the best scorer of one team; for Chicago we have this player, that player, and that player, and one of them is the best scorer of the team; the same for the Oklahoma players, and for every other team. That's basically what the previous, manual solution did: we filtered, creating an intermediate data frame containing only the players of a given team, Oklahoma in this case; we found that the best score was this value right here; and we made a second filter saying: find the player in that team with that amount of points. So we ended up with a single player.

What we want now is, of course, a different solution. Using a group by style operation, we want pandas to focus on different groups of players: all the blue players, all the green players, all the red players, perform an operation group by group, and then expand the result back onto the whole group. It's pretty similar: pandas focuses on the blue players from Oklahoma, does the group by operation, the split piece, and finds something, in our case the maximum of points, say these maximum points. But instead of the all-manual step from before, we want it to take that value, call it V, very important, and replicate it back to every member of the group: V here, and here, V, V, V, and V. And the same for every other group: for Chicago, another sub data frame, player, player, player, all handled automatically by pandas; find that group's value, V again, and replicate it across the group. The value we're after is the maximum of PTS, the points column. So each player keeps their own score, say 900 points or 300 points, and next to it there's a new value, the top of that group given the group by operation, say 1500. The player with 300 points gets 1500 next to it; a player with 30 points who barely played also gets 1500; and of course the top scorer has 1500 in points and 1500 in the new value. Then all we have to do is find, per team, the player whose points equal that top value.

All right, let's get rid of the drawing and finally resolve the task, which is relatively simple; it was a huge explanation, but trust me, it's simple. We group by team, performing exactly the operation from the drawing for each team. We do df.groupby on the team, take PTS, the column we want to compute, and apply the transform method, passing 'max'. That gives us one value per row, which is exactly what I described: pandas first broke the players into groups based on their team, found the value we were looking for, the max of PTS, assigned that value to each member of the group, and reassembled the same data frame. So let's hide the drawing and do: df['best_score_per_team'] equals that expression; this is our V value, though of course I want a better name than V. Now we look at, say, df['team'] == 'Oklahoma City Thunder', select Player, team, and PTS, sort values by PTS with ascending=False, and take the first rows. And here you have it for OKC... oh, of course, I'm forgetting the most important column, the one we just computed.
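A sketch of the transform step and the inspection, with the new column included (best_score_per_team is how I'd spell the column name used here):

```python
# For every row, the maximum PTS within that player's team,
# broadcast back onto every member of the group
df['best_score_per_team'] = df.groupby('Tm')['PTS'].transform('max')

# Inspect one team: every Thunder player carries the same team maximum
df.loc[df['team'] == 'Oklahoma City Thunder',
       ['Player', 'team', 'PTS', 'best_score_per_team']] \
  .sort_values(by='PTS', ascending=False).head()
```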
With that column in the selection, I find that all the players in Oklahoma City now carry this new column, best_score_per_team, equal to the maximum value of PTS in their group. This is a very, very interesting piece of pandas functionality, the transform method. It's like a group by operation, split, compute, apply, but then it reassembles the whole thing; you don't end up with separate groups. A regular group by collapses to one row per group: OKC, Chicago, Boston, each with its result; transform instead goes back to the original shape of the data, so we can use these transformed columns to compare an individual against their group. We can compare with Boston now, or do the same for Chicago: in Boston the best score was almost 2200, in Chicago it was 1800, and you can see each player's PTS value alongside the best score in their team.

Finally, going back to what the activity asks, we need to find the best scorer per team: that's all the players whose PTS equals best_score_per_team. So df.loc[df['PTS'] == df['best_score_per_team']], selecting the columns Player, team, position, and PTS; let's do that real quick. It doesn't look like the expected table yet, because the description says it should be sorted by PTS in descending order, so we add sort_values (let me break the line before it starts showing that annoying scroll): by PTS, ascending=False, first few rows: Russell Westbrook, James Harden... same results. We finally store this in a variable and print the whole thing, and now the table looks pretty much the same as the expected one, even the ordering. Let's check the activity, just as a sanity check at this point, because this was a pretty long development... and it worked as expected.
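The final selection, sketched:

```python
# The best scorer of each team: rows where a player's points
# equal their team's maximum
best_scorers = df.loc[df['PTS'] == df['best_score_per_team'],
                      ['Player', 'team', 'Pos', 'PTS']]
best_scorers.sort_values(by='PTS', ascending=False).head()
```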
Okay, we have the final activity; let's clean up the data frame and the notebook a little bit first. Which team has the youngest squad by average player age? This is a very interesting activity, because there are two ways of solving it. The first one relies on the birth date column we converted to a datetime earlier, the one that originally came from players_df. It's very easily solved, and I'm giving you the answer straight away: df.groupby by team, take the birth date column, and .mean(); of course I made another typo in groupby; then we just sort_values. There you go. The interesting piece here, and I don't know if you noticed it right away, is that this operation, which seems like a purely numerical one, computing an average or a mean, was applied to a datetime column. That's pretty interesting, because datetimes are not numbers per se, they're points in time; but basically, pandas places each birth date on a timeline and computes the mean there. So which is the youngest team? It's the team with the greatest, to put it one way, the latest, mean birth date: 1992-03-14 is the latest mean birth date here.

If we want this as an age, we can convert: put it in days, or actually in years, there you go. And of course this is the 2017 season, so we could anchor the calculation at, say, 2018-01-01 instead of today; same idea, just a different timestamp. The Portland Trail Blazers have the youngest squad by average: the average age was 25 years, not days, because we divided by 365. So 25 years was the average age of the Portland Trail Blazers. There are two concepts I want to separate here. The second one I mentioned is that finding the youngest team means finding the greatest, most recent, mean birth date: I was born in 1987 and my brother in 1992; he has a newer, more recent birth date, and that makes him younger; mine is an older, lower value, 1987 is lower than 1992, and that makes me older and him younger. That was the quick second point. The first and most important one is that the birth date column is a datetime, and pandas correctly computed a mean for it. Let's check the activity: that works.

What's the alternative? I think we have it listed in the solution. Say the datetime mean didn't exist; we'd need to transform the date into a number, an age in days. Let me show you the result first: pd.Timestamp.now() minus the birth date column (or actually the other way around) gives you a Timedelta; that's the name of the value, a pandas Timedelta. I'm googling it real quick: a Timedelta is just a relative duration of time. From a timedelta series we can get, to be very precise, .dt.total_seconds(), or something like .dt.days. What I'm using here is the .dt accessor: if a pandas series, a column in a data frame, is a datetime or a timedelta, it has a special accessor, .dt, that lets you perform datetime calculations on it, similar to df['team'].str, the .str accessor for string columns; we have the .dt accessor for datetime and timedelta columns or series. Anyway, we assign this to an age-in-days column and display the player, the birth date, and the age in days: someone born on 1993-08-01, August 1st, is 10,893 days old. We've turned the datetime into a numeric column, and now we can compute the average age: group by team, take age in days, mean, sort_values; and I did something wrong with groupby again, I don't know why I have so many issues with it today. Now notice that we have to read the result in the opposite direction: the lowest value is the youngest one. Before, the most recent mean date gave us the youngest team; now the team with the lowest mean age in days is, of course, the youngest team.
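Both approaches, sketched side by side (assuming the datetime column is birth_date, the full-name column is team, and a recent pandas version that can average datetimes):

```python
import pandas as pd

# Approach 1: pandas averages the datetimes directly; the latest
# (greatest) mean birth date marks the youngest squad
df.groupby('team')['birth_date'].mean().sort_values(ascending=False).head()

# Approach 2: turn the dates into a numeric age first
df['age_in_days'] = (pd.Timestamp.now() - df['birth_date']).dt.days

# Now the *lowest* mean age marks the youngest squad
df.groupby('team')['age_in_days'].mean().sort_values().head()
```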
And that's exactly what we're computing right here. Let me clear everything so we can compare the two, the birth date means on one side and the ages in days on the other: we have the same results. Let me sort the first one with ascending=False so the orderings line up: Portland and Portland, Toronto and Toronto, Boston and Boston, Orlando and Orlando. The same answer from both approaches. I think the interesting takeaway here is how pandas implements all these useful datetime modifiers, methods, and even operations like the mean; that's the most interesting piece so far.

So, wrapping it up: this project was extremely interesting, because we did pretty much the full cycle. We got the data and merged it; we identified invalid values, using detective work to figure out what had happened and rewinding the names in time, with some googling along the way; we did some cleaning and transformed data types like the birth date; then we did all the munging of the data, deleting the TOT totals and understanding what we were working with; and finally we did the analysis. My favorite activity is by far number 15, the one with the transform method, an extremely useful method; if you want, replay it and try to solve it by yourself. I think this project represents very well the whole process of data science, data wrangling, data munging, data engineering: everything you're going to find in a regular project dealing with data.