Loop / Iterate over pandas DataFrame (2020)

Captions
Hey everyone, my name's Bradon! We're going to go over how to iterate (or loop) over a pandas DataFrame. There are many ways we can loop over a DataFrame; we'll first go over the most intuitive way of doing this. However, this way is not as efficient as some other methods available to us. These additional methods don't add much complexity, but they can give us a HUGE BOOST in performance! So stick around to see how you can implement these methods in your own code.

To get started, we'll import pandas as pd and we'll also import numpy. I've already written a little bit of code to create a DataFrame. What this code does is give us ten thousand rows and five columns of random floats between zero and one. We name our columns "A", "B", "C", "D", and "E". Let's use df.describe to check out our DataFrame in a little more detail. We see that the mean is pretty close to 0.5, which is what we would expect. The minimum is very close to zero and the maximum is very close to one, with 10,000 rows for each column.

To get started, we'll use for i, row in df.iterrows(). This is the first method available to us for iterating over all the rows in a DataFrame. What it does is return the index, that's i, and the row of data, that's row. For demonstration purposes I've already written a little bit of code that prints out the first three iterations, showing us i (the index) and then the returned row. And if we want just a single value from a column, we can say row and then the column we are calling. So in the first iteration the index is zero, then one, then two. It prints out the entire row's data and then the value for the row and column we specified. To show this a little more clearly, we can compare it to our DataFrame head.
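The setup described above can be sketched like this (the variable names and column lookup are assumptions based on the narration, not the video's exact notebook):

```python
import numpy as np
import pandas as pd

# 10,000 rows and five columns of random floats between 0 and 1,
# with columns named "A" through "E".
df = pd.DataFrame(np.random.rand(10_000, 5), columns=list("ABCDE"))

print(df.describe())  # mean should be close to 0.5; min near 0, max near 1

# Print the first three iterations: the index i and the returned row.
for i, row in df.iterrows():
    if i == 3:
        break
    print(i)          # the index
    print(row)        # the entire row as a Series
    print(row["A"])   # a single value from one column
```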
So you can see for this last one here, index 2: if we look at index 2, this data matches up, a 0.47 and so forth. Most of the time we don't want to just print out the values, though; we want to actually do something with them. So we can create a function, we'll call it iterrow_example, and we'll pass it the DataFrame and the column. We'll say for i, row in df.iterrows() and store the value from that row and column. If the value is less than 0.5, we want to change it to equal zero. What I've done here is use the .at accessor on our DataFrame and said: at index i (whatever the index is in df.iterrows) and at the column we specify, so it might be index 1 and column "B", set the value. We want it to equal 0 if it's less than 0.5; else df.at[i, col] is equal to 1. So if it's less than 0.5, we set it to 0; if it's greater than or equal to 0.5, we set it to 1. We'll use a magic command called %timeit, which shows us the amount of time it takes to run a line of code. We'll call the function we just made, pass it df, our DataFrame, say we want these changes made on column "A", and run that.

So it took 528 milliseconds to run this command. Now that seems fast, however I know we can do much better. In fact, we can copy the code above, make a few little changes, and have it run even faster. Let's call this iloc_example and use the .iloc accessor. Once again, we pass in the DataFrame and the column. We're not going to use iterrows anymore; instead we'll loop over df.index. We don't have access to row anymore, because we only have the index. What we can do is say df[col].iloc and give the index position. So for the column we passed in, it finds the value at index location i.
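The two loop-based functions described above could look roughly like this (a sketch assuming the default integer index, so label-based .at and position-based .iloc line up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10_000, 5), columns=list("ABCDE"))

def iterrow_example(df, col):
    # Loop row by row with iterrows; write 0 or 1 back with the .at accessor.
    for i, row in df.iterrows():
        if row[col] < 0.5:
            df.at[i, col] = 0
        else:
            df.at[i, col] = 1

def iloc_example(df, col):
    # Same logic, but look each value up by position with .iloc
    # instead of materializing every row with iterrows.
    for i in df.index:
        if df[col].iloc[i] < 0.5:
            df.at[i, col] = 0
        else:
            df.at[i, col] = 1

# In a Jupyter cell you would time these with the %timeit magic, e.g.:
# %timeit iterrow_example(df, "A")
# %timeit iloc_example(df, "B")
iterrow_example(df, "A")
iloc_example(df, "B")
```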
And then all the rest can stay the same. With those few changes, let's see what kind of performance boost we can get. We pass in our DataFrame, and this time we say column "B". We'll run that... I forgot to run the cell above, so we'll do that real quick and run it again. We get 170 milliseconds! With just those few little changes we get a three times speed increase. That's pretty good, but we can do even better! This time we'll use the apply method. We need just one line of code here: df column "C" is equal to df column "C" .apply, lambda x equals 0 if x is less than 0.5, else 1. Lambda functions can be a little confusing at first; I'll put a link in the description below to an explanation of lambda functions, and I'll create a video on them sometime in the near future. Let's run this... and we get 3.24 milliseconds, a 52 times speed increase over our previous record!

Can we get any faster? Yes we can! Instead of using the apply method, we can change up a few little things and use np.where(). We say df "D" is equal to np.where(), and we pass in the condition we want: where the DataFrame column "D" is less than 0.5. Then we give it the value we want if the condition is satisfied, which is zero, and the value we want otherwise, which is one. Let's run this, and we get 207 microseconds, 15 times faster than our previous record! Nice, and get this... with just a little tweak, we can get even faster. We copy the code from above and add .values. What this does is essentially turn the column into a NumPy array; it strips out all the additional overhead that a pandas Series has. If we run that we get 92 microseconds! That's amazing!
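The three vectorized variants narrated above can be sketched in a few lines (column choices follow the video; the rest is a plausible reconstruction):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10_000, 5), columns=list("ABCDE"))

# apply with a lambda: 0 if the value is below 0.5, else 1.
df["C"] = df["C"].apply(lambda x: 0 if x < 0.5 else 1)

# np.where: condition, value if true, value if false - fully vectorized.
df["D"] = np.where(df["D"] < 0.5, 0, 1)

# Same thing, but .values hands np.where a raw NumPy array,
# stripping the overhead a pandas Series carries.
df["E"] = np.where(df["E"].values < 0.5, 0, 1)
```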
All we had to do is add .values and we get over two times faster performance. If we compare it to our original iterrows example, that's over 5,700 times faster. If you have a small dataset, it probably doesn't matter that much; however, the larger your dataset, the more beneficial these minor tweaks will be. The reason we can get such massive gains in performance is, number one, by using methods that have been optimized for speed, like .apply or np.where, and number two, through vectorization. In the final two methods we pass in the entire DataFrame column as a Series, or in the last case an array of values, and the operation is performed on that entire set of data at once, rather than, as in the other cases, looping through and operating on each single value, which is much slower. So the moral of the story is: use optimized methods and vectorization where possible. Plus, once you get used to the syntax, it's much easier than writing out all of the looping code above. So not only is it faster performance-wise, it also makes for cleaner code. If you have questions on how you can apply these methods to your specific situation, you can leave a comment down below and I'll get back to you as soon as I can. I'll also provide a link to an annotated Jupyter notebook going over these methods in the description below. Thanks for watching :)
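A rough way to see the loop-versus-vectorized gap for yourself is to time both styles directly (the 528 ms and 92 µs figures are the video's; exact numbers will vary by machine, and this harness is an assumption, not the video's notebook):

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10_000, 5), columns=list("ABCDE"))

def loop_version(s):
    # Threshold one value at a time - the slow, element-wise style.
    out = s.copy()
    for i in range(len(s)):
        out.iloc[i] = 0 if s.iloc[i] < 0.5 else 1
    return out

def vectorized_version(s):
    # Threshold the whole array in one vectorized call.
    return np.where(s.values < 0.5, 0, 1)

start = time.perf_counter()
loop_result = loop_version(df["A"])
loop_time = time.perf_counter() - start

start = time.perf_counter()
vec_result = vectorized_version(df["A"])
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.6f}s")
```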
Info
Channel: Chart Explorers
Views: 21,592
Rating: 4.9666667 out of 5
Keywords: iterate over df, loop over rows, iterate over rows, assign value to pandas row, improve pandas performance, iterate over dataframe python, python, pandas, data analysis, data science, iterrows, np.where, lambda, iloc, pandas looping, pandas iteration, how to loop over rows in pandas, optimized, vectorization, python pandas iterate, python pandas change rows, python pandas row values, pandas dataframe iterate, iterate over rows in a pandas dataframe, iterate over dataframe pandas
Id: CG3EV7UBELA
Length: 11min 5sec (665 seconds)
Published: Fri Sep 11 2020