10. withColumn() in PySpark | Add new column or Change existing column data or type in DataFrame

Captions
Hello all, welcome to the WafaStudies YouTube channel. This is part 10 in my PySpark playlist. In this video we are going to discuss the usages of the withColumn() function in PySpark. Generally, once you load data into a DataFrame, you may have a requirement to add a new column to it, or to alter the values or the data type of an existing column. In such scenarios you have to use the withColumn() function, and I will explain all these cases practically.

So withColumn() will update an existing column, whether that is a data type change or a value change, or it will add a new column altogether. Since it transforms the data, it is called a transformation function. There are a lot of such functions available in PySpark that help you transform the data in a DataFrame, and we will discuss them one by one in upcoming videos. One important point: whenever you transform the data in a DataFrame, you are not altering that DataFrame; you actually create a new DataFrame out of it. Let me show you that practically.

Let me go to the Databricks workspace. Under Compute you can see one cluster already running. Under Workspace, then Users, then my name, let's create a new notebook and name it something like "withColumn notebook". The default language is Python, the cluster is selected, and hitting the Create button creates and opens the notebook.

First, let's create a DataFrame. If you have seen my PySpark videos from the start, you know how to create a DataFrame by manually entering some values, so I am doing the same thing here (and please watch my videos in sequence so you get the most out of them). I create a data variable holding a list of tuples; each tuple acts as one row, for example (1, "Mahir", "3000") with an id, a name, and a salary, and similarly a second row (2, "Wafa", "4000"). Next I create a columns variable, also a list, holding the column names: id, name, and salary. Now I use the spark keyword, which gives you the SparkSession object as shown in previous videos, and on top of it call the createDataFrame() function. Hitting Ctrl+Space brings up the IntelliSense: there is a parameter called data, to which I supply my data variable, and another parameter called schema, to which I supply my columns. This code creates a DataFrame, which I store in a variable called df. On df I can use the show() function, as shown in my previous video; let me hit Shift+Enter to execute the cell and confirm the DataFrame was created correctly. After a few seconds the data comes up. Along with show(), a DataFrame also has a printSchema() function that prints its schema. Executing that, the output shows the id column inferred as long, the name column as string, and the salary column as string as well.
How are these data types coming? The createDataFrame() function automatically infers the column data types from the values we supply: by default any integer is taken as a long data type, and anything in single or double quotes is taken as a string. Now let's assume I want the salary column to be an integer rather than a string. How do we do that? As I said, we have to use the withColumn() function. On the DataFrame, typing df. and then Ctrl+Space shows a function called withColumn; with it we can change a column's data type, update the values in a column, or create a new column altogether.

Let me remove that for a moment and use the help() function to see the documentation of withColumn. When I hit Shift+Enter it says it is a method inside the DataFrame class, which is why we were able to call it on the DataFrame object. It takes a colName parameter and a col parameter. If a column already exists with the supplied name, that existing column will be changed; if not, a new column will be added with that name. The col parameter carries the change we want: whether we want to convert the data type, update the values in a column, or add a new column, that information is passed here. The documentation says colName must be a string, col must be an object of the Column class, and the function returns a new DataFrame. Scrolling down to the examples, they take the age column and add 2 to it: if the DataFrame already has an age column, the existing column is updated with those changes, otherwise a new column is created with that name and the result lands inside it.

Let's not worry about that example and take ours instead: I want to convert the salary column's data type to integer, because it is coming as string. So I call df.withColumn(); for the colName parameter I supply "salary", since that is the column I am acting on. For the col parameter I have to supply a Column from the DataFrame. You could use df.columnName, but there is actually a function for this: from pyspark.sql.functions I import the col() function. col("salary") points to a Column object for that particular column, which is exactly the Column class type the second parameter expects. On top of that column I use the cast() function to convert the data type, casting it to integer. Remember, this whole expression is not going to change the data or the column type in the original DataFrame; it creates a new DataFrame altogether with those changes.
That new DataFrame I save into a variable, maybe df1 (you could also save it back into df itself, up to you; let me save it into a different object altogether). On df1 I can call show(), and let me use printSchema() too. Hitting Shift+Enter, df1.show() displays the same data with no value changes, because we only changed the data type, and printSchema() now shows the salary column as an integer type. How is this magic happening? Because of the withColumn() function.

Second, let's assume I want to update the values in an existing column; let's take the same salary column. This time I don't want to change the data type, I want to update the values, and for changing the data in an existing column you also use withColumn(). Which column do I want to update? The salary column, so I supply "salary" as the column name; then for the change, I take col("salary") and multiply it by 2. Now for every row the salary value is multiplied by 2, and with those changes a new DataFrame is created, which I save as df2 this time. Then df2.show(): hitting Shift+Enter and observing the values, 3000 became 6000 and 4000 became 8000. Using the withColumn() function we were able to update the values in an existing column.

So df2 is created and up to here everything is fine. Now I want to add a brand-new column to this DataFrame, say a country column. To do that, it's the same thing: call withColumn() and supply the column name.
This time I supply "country". Whatever column name you supply, withColumn() first checks whether a column with that name already exists in the DataFrame: if yes, the changes happen on that column itself, or else it creates a new column altogether. Since there is no column called country, a new column will be created. In that column I want the value to be "India", but you cannot simply supply the value like that. Why? Because, as I said, the second parameter must be of the Column class type; col() gives you a Column, but here I would be using a plain string, so it won't work. So how do you pass a hard-coded value as a Column type? For that, inside pyspark.sql.functions there is a lit() function, short for literal, which gives you the output as a Column type. Let me import lit and supply "India" to it: the value is returned as an entire column, under the name country. This code again creates a new DataFrame, which I save as df3, and then df3.show(). Hitting Shift+Enter and observing closely, we got our country column as well, which means we were able to add a new column.

To recap: first I pointed at a column and updated its data type; then I pointed at a column and updated its values; and then I created a new column altogether with some new value. Not only that, we can also create a new column from an existing column's values. On df3, let's say I want a new column called copied salary holding the same salary values.
To do that, on df3 I again use withColumn(), creating a new column called copied salary, and for its data I want the salary column, which means I am duplicating the existing column's data. So it's the same thing again: use the col() function, take whatever is inside the salary column, and it simply gives that result back. Let me save the result as a fourth DataFrame and show it. Hitting Shift+Enter, see what happens: a new column is created, and the data in this new column comes from the existing column.

Let me go back to the presentation. To summarize the code: here I am changing a data type, here I am updating a column using its existing values, here I am creating a new column altogether with a constant value, and here again a new column built from existing values. That's it for this video. The withColumn() function is very useful; you will often get a requirement for it in real-time scenarios, so I strongly encourage you to study it in detail and try to get the most out of it. If it doesn't click, watch it multiple times so that it makes sense to you. Thank you for watching; please subscribe to my channel and press the bell icon to get a notification whenever I add videos. Thank you so much.
Info
Channel: WafaStudies
Views: 24,465
Keywords: PySpark for beginners, PySpark Playlist, PySpark Videos, Learn PySpark, PySpark for data engineers, dataengineers PySpark, PySpark in Azure Synapse Analytics, PySpark in Azure databricks, Understand PySpark, What is PySpark, PySpark in simple explaination, PySpark Overview, synapse pyspark, spark, withColumn() usage in pyspark, pyspark example withColumn(), add new column to dataframe, change existing column data, change existing column datatype in dataframe, spark dataframe
Id: RgGT7LfHBQs
Length: 15min 42sec (942 seconds)
Published: Sun Oct 30 2022