- Yooo! What's up? This is Patrick from Guy in a Cube and in this video we're going to talk about lists and looping
through those lists in PySpark. Except it took a little bit
for me to figure this out. Had to tinker a bit. So, short part of the long
story is I did a video on converting CSV files to Parquet files 'cause someone asked me how to do that. And they came back and said, "Well I have several of these things. I need to loop through 'em instead of doing it one at a time." I was like, "Of course you can do this." So what you need to
do... Wait, wait, wait. Enough of all this talk. You know what we like
to do. Let's do what? Let's head over to my laptop. So here's my list. And you just wrap it in square brackets and you provide a comma-delimited list of items wrapped in double quotes. And then what you can do is say for, and so we'll say tablename
in this particular list. Look at this, see how it didn't indent. If I put the colon there
and hit Return it indents, because you need to indent. And then I'm just going
to say print tablename. And then I'm going to run this. And when I run this
what you're going to see is it gives me a list. And you may be thinking, Patrick, well I'm going to schedule
this notebook in a pipeline. I don't want to have to go
and change it and republish it every time I add items to the list or remove items from the list. I want to drive this
from an external source. Well you can do that. Let me show you. I'm going to just replace
this list with a path. I'm going to read it in from a CSV file, so I made an external file and it's just a CSV file with a list of the CSV files I want to convert. You can do this in other ways. I'm just using a CSV file. And then I add .collect() to the end 'cause it allows me
to iterate over this thing. And so then all I need to do
is say for tablename in this and then I should be able to output it. I need to change one thing. I need to tell it the ordinal position. It's zero-based, so I need to tell it that it's the first column in the list. So now if I run this, boom,
boom, boom. Same results. All you Python experts are going to be going, "You could do it this way, Patrick. You can do it this
way." Of course you can. It's like T-SQL, just like DAX. There's millions of ways to do it. I'm on my journey, I'm on
my path to learning this. So, you know a more efficient
way, you know what to do. Post it in the comments below. So now I have it. So I'm reading through it. So now what I want to do is, just so I don't have to retype
this over and over again, I'm going to tuck this in a variable and I'm going to say tablename[0]. That number that I'm
wrapping in those brackets, it's the ordinal position of
the column, zero-based. So if I had multiple
columns in that CSV file, the first column would be zero, the second column would be one,
and on and on and so forth. So here's the location and
then what I need to do is, I need to read the csvfile in. So you can see I have it, but the problem is I have it hardcoded and I need to parameterize it. I want to make it dynamic. Just go ahead and add an f in front of the string. If you haven't watched the video I did on parameterizing a notebook, this is a very similar thing. And then here, what you would do is put the squiggly braces and
drop tablename right there. So now, oh, and you want to make sure you spell csvfile right. And so now through each loop it's going to pull in the file from that corresponding location and I just noticed
that I put tablename, but I really need location
'cause that's the variable that I'm tucking that tablename in. Then after I do that, then all I'm going to do is
paste this in right here. So I'm saying csvfile.write.mode,
overwrite it, right? The mode that I'm going
to use is overwrite. If it already exists, overwrite it. Option, 'cause the header's
there, header equals True. And then here's the path to the file. And let me replace this with location. Now if I run this, we'll cross
our fingers, cross our toes, there's locations, ethnicity, gender, level, and major. So it's done. I head over here to this
container, it's empty. I'm going to go ahead and choose Refresh and then you'll see, right, Parquet files for each
one of those locations. What do you think? You got questions, you got comments? Would you like to see how
I scheduled this notebook to run in the pipeline? I'd love to know. You know what to do. Post it in the comments below. And as always, thanks for watching. We'll see you in the next video.
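
For anyone following along without the screen, here is a rough sketch of the steps from the walkthrough: loop a hardcoded list, swap the list for one read from a CSV file with .collect(), pick the first column by its zero-based position, build the path with an f-string, and write each file out as Parquet in overwrite mode. This is not the exact notebook from the video. The folder paths and the function names (read_table_names, convert_all) are made up for illustration, the table names are borrowed from the output mentioned above, and header=True on the data files is an assumption. It also assumes you pass in an existing Spark session, like the spark variable most notebooks give you.

```python
# Hypothetical names and folders, used only for illustration.
TABLE_NAMES = ["ethnicity", "gender", "level", "major"]
SOURCE_DIR = "/data/csv"        # assumed folder holding <name>.csv files
TARGET_DIR = "/data/parquet"    # assumed folder for the Parquet output

# Step 1: a plain Python list and a for loop (note the colon and the indent).
for tablename in TABLE_NAMES:
    print(tablename)


def read_table_names(spark, list_path, column=0):
    """Read the driving CSV file and return one column as a Python list."""
    # collect() brings the rows back to the driver so a for loop can walk them.
    rows = spark.read.csv(list_path).collect()
    # Each row can be indexed by ordinal position; 0 is the first column.
    return [row[column] for row in rows]


def convert_all(spark, table_names, source_dir=SOURCE_DIR, target_dir=TARGET_DIR):
    """Convert <source_dir>/<name>.csv to Parquet under <target_dir>/<name>."""
    written = []
    for tablename in table_names:
        location = tablename  # tucked in a variable so it is typed only once
        # The f prefix lets {location} be dropped straight into the path.
        csvfile = spark.read.csv(f"{source_dir}/{location}.csv", header=True)
        target = f"{target_dir}/{location}"
        # mode("overwrite") replaces the output if it already exists.
        csvfile.write.mode("overwrite").parquet(target)
        written.append(target)
    return written
```

In a notebook, something like convert_all(spark, read_table_names(spark, "/data/tables.csv")) would tie the two pieces together, where "/data/tables.csv" is again a placeholder path. Returning plain strings from read_table_names keeps the loop body the same whether the names come from the hardcoded list or from the external file.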