Hey everyone, my name's Bradon! We're going to go over how to iterate (or loop) over a pandas
DataFrame. There are many ways we can loop over a DataFrame; we'll first go over the most
intuitive way of doing this, although it isn't as efficient as some of the other methods
available to us. These additional methods don't add much complexity, but they can give us a
huge boost in performance! So stick around to see how you can implement them in your own
code. To get started, we'll
import pandas as pd and we'll also import numpy. I've already written out a little bit of
code to create a dataframe. What this code is doing is giving us ten thousand rows and five
columns of random floats between zero and one. We're naming our columns "A", "B", "C",
"D", and "E". Let's use df.describe() to check out our DataFrame in a
little more detail. We see that the mean is pretty close to 0.5, which is
what we'd expect; the minimum is very close to zero and the maximum is very
close to one, with 10,000 rows for each column. So to get started, we'll use for i, row in
df.iterrows(). This is the first method available to us for iterating over all the rows in a
DataFrame. What this does is return the index, that's i, and the row of data, that's row, and for
demonstration purposes I've already written out a little bit of code. This code will print out the
first three iterations and it will show us i (the index) and then the returned row. And then if we
want to specify just a single value from a column, we'll do that here: we say row
and then the column we're calling. So here we see, for the first iteration, the index
is zero, then one, then two. It prints out the entire row's data and then the value for
that row and that column that we specified. To show this a little more clearly, we
can compare it to our dataframe's head. You can see for this last one, index 2,
that the data matches up: 0.47 and so forth. So most of the time
we don't want to just print out the values; we want to actually do something with them. So we
can create a function, we'll call it iterrow_example, and we'll pass it the dataframe and the column.
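As a sketch of what's on screen so far, the setup plus that function might look like this (the exact DataFrame construction is my assumption based on the description):

```python
import numpy as np
import pandas as pd

# 10,000 rows and five columns of random floats between 0 and 1
# (my guess at the on-screen construction; any equivalent works).
df = pd.DataFrame(np.random.rand(10_000, 5), columns=["A", "B", "C", "D", "E"])
print(df.describe())  # mean ~0.5, min ~0, max ~1, count 10000 per column

def iterrow_example(df, col):
    # Walk the frame row by row; .at[i, col] writes one scalar in place.
    for i, row in df.iterrows():
        if row[col] < 0.5:
            df.at[i, col] = 0
        else:
            df.at[i, col] = 1
```

Calling iterrow_example(df, "A") rewrites column "A" in place to zeros and ones.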
And we'll say for i, row in df.iterrows() and store the value: we'll take the row
and the column, and if that value is less than 0.5, we'll change it
to equal zero. What I've done here is use the .at accessor on our dataframe and
said: at this index from df.iterrows() and at the column that we specify (so it might be
index 1 and column "B", which would give us this value), set it to 0 if it's less than
0.5, else df.at[i, col] = 1. So if it's less than 0.5, we'll say it's 0; if it's
greater than or equal to 0.5, we'll say it's 1. We'll
use a magic command called %timeit. What this does is show us the amount of
time it takes to run that line of code. So we're going to call the function we
just made, pass it df, our dataframe, and say we want these changes to
be made on column "A", and let's run that. So it took 528 milliseconds to run this command.
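Note that %timeit is an IPython/Jupyter magic, so it only works in a notebook or IPython shell (there it's just `%timeit iterrow_example(df, "A")`). Outside a notebook, the standard-library timeit module does the same job; here's a rough sketch, with a smaller frame than the video's so it finishes quickly:

```python
import timeit

import numpy as np
import pandas as pd

def iterrow_example(df, col):
    for i, row in df.iterrows():
        df.at[i, col] = 0 if row[col] < 0.5 else 1

# Smaller than the video's 10,000 rows so the timing loop stays quick.
df = pd.DataFrame(np.random.rand(1_000, 5), columns=["A", "B", "C", "D", "E"])

# number=3 runs the call three times and returns total elapsed seconds.
elapsed = timeit.timeit(lambda: iterrow_example(df, "A"), number=3)
print(f"{elapsed / 3:.4f} s per call")
```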
Now that seems fast, but I know we can do much better. In fact, we can copy the code
above, make a few little changes, and make it run even faster. So let's
call this one iloc_example, and we'll use the .iloc accessor. Once again, we'll pass in the
dataframe and the column. We're not going to use iterrows anymore; instead we'll loop
over df.index, which means we no longer have access to a row object.
What we can do here is say df[col].iloc[i], giving it the index position.
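Putting those changes together, the rewritten function might look like this (it assumes the default RangeIndex, where the label i doubles as the integer position that .iloc expects):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10_000, 5), columns=["A", "B", "C", "D", "E"])

def iloc_example(df, col):
    # Loop over index labels; with a default RangeIndex the label i
    # is also the integer position, so df[col].iloc[i] reads the value.
    for i in df.index:
        if df[col].iloc[i] < 0.5:
            df.at[i, col] = 0
        else:
            df.at[i, col] = 1
```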
So what this is doing is for the column that we passed in, it's finding the value at index
location i. And then all of this can be the same. So with those few changes let's see what
kind of performance boost we can get. We'll pass in our dataframe, and this time we'll
say column "B". We'll run that... and I forgot to run the cell above, so we'll do that real
quick and run it again. So we get 170 milliseconds! With just those few little changes we get a
three-times speed increase, which is pretty good, but we can do even better! This time what we'll
do is use the apply method. We just need one line of code here: df["C"] is equal to
df["C"].apply(lambda x: 0 if x < 0.5 else 1). Lambda functions
can be a little confusing at first; I'll put a link in the description below to an
explanation of lambda functions, and I'll create a video on them sometime
in the near future. So let's run this... and we have 3.24 milliseconds, that's a
52-times speed increase from our previous record! Can we get any faster? Yes we can!
Instead of using the apply method, we can change a few little things and use
the np.where() function. So we'll say df["D"] is equal to np.where(), and what we do
is pass in the condition we want: in this case, where column "D" of the
dataframe is less than 0.5. Then we give it the value we want if this condition is
satisfied, which would be zero, and then the value we want
otherwise, which would be one. Let's run this, and we get 207 microseconds, that's
15 times faster than our previous record! Nice, and get this: with just a little
tweak, we can get even faster. So we'll copy the code from above and add
.values. What this does is essentially turn the Series into a NumPy array, stripping out all
the additional overhead that a pandas Series carries. So if we run that, we get 92 microseconds!
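For reference, here's a sketch of the three fast variants from this section side by side, one column each as in the video:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10_000, 5), columns=["A", "B", "C", "D", "E"])

# .apply: still element-by-element, but driven by optimized pandas code.
df["C"] = df["C"].apply(lambda x: 0 if x < 0.5 else 1)

# np.where: one vectorized pass over the whole Series.
df["D"] = np.where(df["D"] < 0.5, 0, 1)

# np.where on the raw ndarray: .values strips the Series overhead.
df["E"] = np.where(df["E"].values < 0.5, 0, 1)
```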
That's amazing! All we had to do was add .values and we got over two times faster performance. So
if we compare it to our original iterrows example, that's over 5,700 times faster. If you have
a small data set that you're working with, it probably doesn't matter that much, however
the larger your data set, the more beneficial these minor tweaks will be. The reason we
can get such massive gains in performance is, number one, by using methods that have been
optimized for speed, like .apply or np.where, and number two, through vectorization. In the final
two methods we pass in the entire dataframe Series, or in the .values case an array of values, and
the operation is done on that entire set of data at once, rather than looping through and
performing the operation on each value one at a time, which is much slower. So
the moral of the story is: use optimized methods and vectorization where possible. Plus, once you
get used to the syntax, it's much easier than writing out all of the code above. So not only is
it faster performance-wise, but it's also cleaner. If you have questions on how you can apply
these methods to your specific situation, you can leave a comment down below and I'll get back to you as soon as I
can. And I'll also provide a link to an annotated Jupyter notebook going over these methods in
the description below. Thanks for watching :)