Loop through a list using pySpark for your Azure Synapse Pipelines

Video Statistics and Information

Captions

- Yooo! What's up? This is Patrick from Guy in a Cube and in this video we're going to talk about lists and looping those lists in pySpark. Except it took a little bit for me to figure this out. Had to tinker a bit. So, short part of the long story is I did a video on converting CSV files to Parquet files 'cause someone asked me how to do that. And they came back and said, "Well I have several of these things. I need to loop through 'em instead of doing it one at a time." I was like, "Of course you can do this." So what you need to do... Wait, wait, wait. Enough of all this talk. You know what we like to do. Let's do what? Let's head over to my laptop.

So here's my list. And you just wrap it in brackets and you provide a comma-delimited list of items wrapped in double quotes. And then what you can do is say for, and so we'll say tablename in this particular list. Look at this, see how it didn't indent. If I put the colon there and hit Return it indents, because you need to indent. And then I'm just going to say print tablename. And then I'm going to run this. And when I run this what you're going to see is it gives me a list.

And you may be thinking, Patrick, well I'm going to schedule this notebook in a pipeline. I don't want to have to go and change it and republish it every time I add items to the list or remove items from the list. I want to drive this from an external source. Well you can do that. Let me show you. I'm going to just replace this list with a path. I'm going to read it in from a CSV file so I made an external file and it's just a CSV file with a list of the CSV files I want to convert. You can do this in other different ways. I'm just using a CSV file. And then I add .collect to the end 'cause it allowed me to iterate this thing. And so then all I need to do is say for tablename in this and then I should be able to output it. I need to change one thing. I need to tell it the ordinal position.
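A minimal sketch of the two loops described so far, not the exact code shown on screen. The table names are the ones printed later in the demo. The helper names `first_column` and `read_table_names`, the use of `spark.read.csv`, and the `header=True` setting (which assumes the list file starts with a header row) are assumptions made for illustration. `spark` is the session object that a Synapse notebook normally provides.

```python
# Hard-coded list: square brackets, comma-delimited, each item in double quotes.
table_names = ["locations", "ethnicity", "gender", "level", "major"]

# The colon opens the loop body, which must be indented.
for tablename in table_names:
    print(tablename)


def first_column(rows):
    """Return the value at ordinal position 0 of every row."""
    return [row[0] for row in rows]


def read_table_names(spark, list_path):
    """Read the external CSV list and return its first column as a Python list.

    Assumption: the list file has a header row, hence header=True.
    """
    # .collect() brings the rows back to the driver as a list that can be iterated.
    rows = spark.read.csv(list_path, header=True).collect()
    return first_column(rows)
```

Each collected row behaves like a tuple, which is why the loop variable needs an index such as `[0]` to get at the name itself.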
It's zero-based, so I need to tell it it's the first column in the list. So now if I run this, boom, boom, boom. Same results. All you Python experts are going to be going, "You could do it this way, Patrick. You can do it this way." Of course you can. It's like T-SQL, just like DAX. There's millions of ways to do it. I'm on my journey, I'm on my path to learning this. So, you know a more efficient way, you know what to do. Post it in the comments below.

So now I have it. So I'm reading through it. So now what I want to do is, just so I don't have to retype this over and over again, I'm going to tuck this in a variable and I'm going to say tablename zero. That number that I'm wrapping in those brackets, it's the ordinal position of the column, zero-based. So if I had multiple columns in that CSV file, the first column would be zero, the second column would be one, and on and on and so forth.

So here's the location and then what I need to do is, I need to read the csvfile in. So you can see I have it, but the problem is I have it hardcoded and I need to parameterize it. I want to make it dynamic. Just go ahead and add an F. If you haven't watched the video I did on parameterizing a notebook, this is a very similar thing. And then here, what you would do is put the squiggly braces and drop tablename right there. So now, oh, and you want to make sure you spell csvfile right. And so now through each loop it's going to pull in the file from that corresponding location and I just noticed that I put tablename, but I really need location 'cause that's the variable that I'm tucking that tablename in.

Then after I do that, then all I'm going to do is paste this in right here. So I'm saying csvfile.write.mode, overwrite, right? The mode that I'm going to use is overwrite. If it already exists, overwrite it. Option 'cause the header there, header equals True. And then here's the path to the file. And let me replace this with location.
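The loop body described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook from the video: `build_path`, `convert_csv_to_parquet`, and the two root-folder parameters are hypothetical names, the source files are assumed to be called `<location>.csv`, and the header option is applied on the CSV read here.

```python
def build_path(root, name, extension=""):
    """Build a path with an f-string; the braces are replaced by the variable values."""
    return f"{root}/{name}{extension}"


def convert_csv_to_parquet(spark, rows, csv_root, parquet_root):
    """Convert every CSV named in rows to Parquet and return the names processed.

    rows is the collected list of rows; csv_root and parquet_root are
    hypothetical folder paths supplied by the caller.
    """
    converted = []
    for tablename in rows:
        # Ordinal position 0 is the first column of the list file.
        location = tablename[0]
        # Assumption: each source file is named <location>.csv and has a header row.
        csvfile = spark.read.csv(build_path(csv_root, location, ".csv"), header=True)
        # "overwrite" replaces any output that already exists at the target path.
        csvfile.write.mode("overwrite").parquet(build_path(parquet_root, location))
        converted.append(location)
    return converted
```

In a notebook this would be called with the built-in `spark` session, the rows returned by `.collect()`, and the storage paths for the source and target folders.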
Now if I run this, we'll cross our fingers, cross our toes, there's locations, ethnicity, gender, level, and major. So it's done. I head over here to this container, it's empty. I'm going to go ahead and choose Refresh and then you'll see, right, parquet files for each one of those locations. What do you think? You got questions, you got comments? Would you like to see how I scheduled this notebook to run in the pipeline? I'd love to know. You know what to do. Post it in the comments below. And always, thanks for watching. We'll see you in the next video.

Info

Channel: Guy in a Cube

Views: 6,860

Keywords: apache pyspark, apache spark python, apache spark with python, azure synapse, azure synapse analytics, azure synapse notebook, azure synapse notebook read from data lake, azure synapse pipeline, azure synapse pyspark, azure synapse pyspark example, azure synapse tutorial, big data, introduction to pyspark, pyspark, what is pyspark

Id: ldTeS-yxpSE

Length: 3min 54sec (234 seconds)

Published: Tue Apr 18 2023