Hey everyone, my name's Bradon! We're going to go over how to iterate (or loop) over a pandas
DataFrame. There are many ways we can loop over a DataFrame; we'll first go over the most
intuitive way of doing this, although it isn't as efficient as some of the other methods
available to us. These additional methods don't add much complexity, but they can give us a
huge boost in performance! So stick around to see how you can implement them in your own
code. To get started, we'll
import pandas as pd and we'll also import numpy. I've already written out a little bit of
code to create a dataframe. What this code is doing is giving us ten thousand rows and five
columns of random floats between zero and one. We're naming our columns "A", "B", "C",
"D", and "E". Let's use df.describe() to check out our DataFrame in a
little more detail. We see that the mean is pretty close to 0.5, which is
what we'd expect; the minimum is very close to zero and the maximum is very
close to one, with 10,000 rows for each column. So to get started, we'll use for i, row in
df.iterrows(). This is the first method available to us for iterating over all the rows in a
DataFrame. What this does is return the index, that's i, and the row of data, that's row, and for
demonstration purposes I've already written out a little bit of code. This code will print out the
first three iterations and it will show us i (the index) and then the returned row. And then if we
want to specify just a single value from a column, we'll do that here: we say row
and then the column we're calling. So here we see, for the first iteration, the index
is zero, then one, then two. It prints out the entire row's data and then the value for
that row and that column that we specified. To show this a little more clearly, we
can compare it to our dataframe's head. You can see for this last one, index 2,
that the data matches up: 0.47 and so forth. So most of the time
we don't want to just print out the values; we want to actually do something with them. So we
can create a function, we'll call it iterrow_example, and we'll pass it the dataframe and the column.
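As a sketch of what's on screen so far, the setup plus that function might look like this (the exact DataFrame construction is my assumption based on the description):

```python
import numpy as np
import pandas as pd

# 10,000 rows and five columns of random floats between 0 and 1
# (my guess at the on-screen construction; any equivalent works).
df = pd.DataFrame(np.random.rand(10_000, 5), columns=["A", "B", "C", "D", "E"])
print(df.describe())  # mean ~0.5, min ~0, max ~1, count 10000 per column

def iterrow_example(df, col):
    # Walk the frame row by row; .at[i, col] writes one scalar in place.
    for i, row in df.iterrows():
        if row[col] < 0.5:
            df.at[i, col] = 0
        else:
            df.at[i, col] = 1
```

Calling iterrow_example(df, "A") rewrites column "A" in place to zeros and ones.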
And we'll say for i, row in df.iterrows() and store the value: we'll take the row
and the column, and if that value is less than 0.5, we'll change it
to equal zero. What I've done here is use the .at accessor on our dataframe and
said: at this index from df.iterrows() and at the column that we specify (so it might be
index 1 and column "B", which would give us this value), set it to 0 if it's less than
0.5, else df.at[i, col] = 1. So if it's less than 0.5, we'll say it's 0; if it's
greater than or equal to 0.5, we'll say it's 1. We'll
use a magic command called %timeit. What this does is show us the amount of
time it takes to run that line of code. So we're going to call the function we
just made, pass it df, our dataframe, and say we want these changes to
be made on column "A", and let's run that. So it took 528 milliseconds to run this command.
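Note that %timeit is an IPython/Jupyter magic, so it only works in a notebook or IPython shell (there it's just `%timeit iterrow_example(df, "A")`). Outside a notebook, the standard-library timeit module does the same job; here's a rough sketch, with a smaller frame than the video's so it finishes quickly:

```python
import timeit

import numpy as np
import pandas as pd

def iterrow_example(df, col):
    for i, row in df.iterrows():
        df.at[i, col] = 0 if row[col] < 0.5 else 1

# Smaller than the video's 10,000 rows so the timing loop stays quick.
df = pd.DataFrame(np.random.rand(1_000, 5), columns=["A", "B", "C", "D", "E"])

# number=3 runs the call three times and returns total elapsed seconds.
elapsed = timeit.timeit(lambda: iterrow_example(df, "A"), number=3)
print(f"{elapsed / 3:.4f} s per call")
```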
Now that seems fast, but I know we can do much better. In fact, we can copy the code
above, make a few little changes, and make it run even faster. So let's
call this one iloc_example, and we'll use the .iloc accessor. Once again, we'll pass in the
dataframe and the column. We're not going to use iterrows anymore; instead we'll loop
over df.index, which means we no longer have access to a row object.
What we can do here is say df[col].iloc[i], giving it the index position.
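Putting those changes together, the rewritten function might look like this (it assumes the default RangeIndex, where the label i doubles as the integer position that .iloc expects):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10_000, 5), columns=["A", "B", "C", "D", "E"])

def iloc_example(df, col):
    # Loop over index labels; with a default RangeIndex the label i
    # is also the integer position, so df[col].iloc[i] reads the value.
    for i in df.index:
        if df[col].iloc[i] < 0.5:
            df.at[i, col] = 0
        else:
            df.at[i, col] = 1
```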
So what this is doing is for the column that we passed in, it's finding the value at index
location i. And then all of this can be the same. So with those few changes let's see what
kind of performance boost we can get. We'll pass in our dataframe, and this time we'll
say column "B". We'll run that... and I forgot to run the cell above, so we'll do that real
quick and run it again. So we get 170 milliseconds! With just those few little changes we get a
three-times speed increase, which is pretty good, but we can do even better! This time what we'll
do is use the apply method. We just need one line of code here: df["C"] is equal to
df["C"].apply(lambda x: 0 if x < 0.5 else 1). Lambda functions
can be a little confusing at first; I'll put a link in the description below to an
explanation of lambda functions, and I'll create a video on them sometime
in the near future. So let's run this... and we have 3.24 milliseconds, that's a
52-times speed increase from our previous record! Can we get any faster? Yes we can!
Instead of using the apply method, we can change a few little things and use
the np.where() function. So we'll say df["D"] is equal to np.where(), and what we do
is pass in the condition we want: in this case, where column "D" of the
dataframe is less than 0.5. Then we give it the value we want if this condition is
satisfied, which would be zero, and then the value we want
otherwise, which would be one. Let's run this, and we get 207 microseconds, that's
15 times faster than our previous record! Nice, and get this: with just a little
tweak, we can get even faster. So we'll copy the code from above and add
.values. What this does is essentially turn the Series into a NumPy array, stripping out all
the additional overhead that a pandas Series carries. So if we run that, we get 92 microseconds!
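For reference, here's a sketch of the three fast variants from this section side by side, one column each as in the video:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10_000, 5), columns=["A", "B", "C", "D", "E"])

# .apply: still element-by-element, but driven by optimized pandas code.
df["C"] = df["C"].apply(lambda x: 0 if x < 0.5 else 1)

# np.where: one vectorized pass over the whole Series.
df["D"] = np.where(df["D"] < 0.5, 0, 1)

# np.where on the raw ndarray: .values strips the Series overhead.
df["E"] = np.where(df["E"].values < 0.5, 0, 1)
```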
That's amazing! All we had to do was add .values and we got over two times faster performance. So
if we compare it to our original iterrows example, that's over 5,700 times faster. If you have
a small data set that you're working with, it probably doesn't matter that much, however
the larger your data set, the more beneficial these minor tweaks will be. The reason we
can get such massive gains in performance is, number one, by using methods that have been
optimized for speed, like .apply or np.where, and number two, through vectorization. In the final
two methods we pass in the entire dataframe Series, or in the .values case an array of values, and
the operation is done on that entire set of data at once, rather than looping through and
performing the operation on each value one at a time, which is much slower. So
the moral of the story is: use optimized methods and vectorization where possible. Plus, once you
get used to the syntax, it's much easier than writing out all of the code above. So not only is
it faster performance-wise, but it's also cleaner. If you have questions on how you can apply
these methods to your specific situation, you can leave a comment down below and I'll get back to you as soon as I
can. And I'll also provide a link to an annotated Jupyter notebook going over these methods in
the description below. Thanks for watching :)