41. Convert RDD to Dataframe in PySpark | Azure Databricks #spark #pyspark #azuresynapse #databricks

Captions
Hi friends, welcome to the WafaStudies YouTube channel. This is part 41 in the PySpark playlist. In this video we are going to discuss how to convert an RDD object into a DataFrame in PySpark. First we will discuss, at a high level, what an RDD actually is, and then how to convert an RDD to a DataFrame.

So, what is an RDD? There are a lot of concepts behind it, and I am not going into detail about all of them; I will simply show you what it looks like technically and how to convert an RDD object into a DataFrame. RDD stands for Resilient Distributed Dataset. Simply picture an RDD as something like a list in Python. I hope everyone knows what a list is: a collection of items, where an item can be a string, an integer, or even an object. If you don't know what a list in Python is, please watch my Python playlist, where I have covered lists. So an RDD is like a Python list: just a collection of objects.

A DataFrame is not like that. If you have seen all my videos by this time, you know that a DataFrame is a tabular representation: named columns grouped together, and when you use the show() function it actually prints like a table. An RDD, in contrast, is a collection of objects, and it has several advantages. First, it is immutable, just like a DataFrame: if you have seen the entire playlist, you know that whenever you apply a transformation on top of a DataFrame, it does not change the original; it creates a new DataFrame. That is the immutable nature: you cannot change one once it is created; you can only create a new one with the changes. Second, it processes everything in memory. Say you have 10 objects in an RDD: the objects will be partitioned and distributed to the individual nodes of the cluster to process them, and then finally grouped together to produce the data for you. Third, it is fault tolerant: if any particular node fails, the RDD object will be recovered. It has a lot of other properties too; I don't want to bore you with theory, so you can check online to understand more about RDDs.

In this video we will see how to create an RDD object and then how to convert it to a DataFrame using the toDF() and createDataFrame() functions. Let's create an RDD object first. To create one, you use the SparkContext class, which has a function called parallelize() that creates an RDD from a list. Let me show you practically. This is my browser, in which I have opened my Databricks workspace. Let me create a new notebook; I will name it something like "rdd notebook", Python is the default language, and this is the cluster. Once I hit the Create button, it creates the notebook for me. Let me close this pop-up. Here, let's create an object called data as a list of objects; the objects will be tuples. In each tuple, let's assume the first element is an ID and the second is a name. Similarly, let's add another tuple to the list, this time with ID 2. So now we have a list.
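A minimal sketch of the steps narrated here, assuming a PySpark environment (in a Databricks notebook `spark` is predefined, so the session-builder line is only needed when running standalone); the tuple values are illustrative placeholders:

```python
from pyspark.sql import SparkSession

# Only needed outside Databricks; in a notebook, `spark` already exists.
spark = SparkSession.builder.appName("rdd-demo").getOrCreate()

# A plain Python list of tuples: (id, name) pairs.
data = [(1, "Maheer"), (2, "Wafa")]
print(type(data))     # <class 'list'>

# parallelize() partitions the list across the cluster as an RDD.
rdd = spark.sparkContext.parallelize(data)
print(type(rdd))      # <class 'pyspark.rdd.RDD'>

# collect() gathers the partitioned data back to the driver as a list.
print(rdd.collect())  # [(1, 'Maheer'), (2, 'Wafa')]
```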
To confirm, let's print type(data). When I hit Shift+Enter to execute this code, you can see it is a list. Now we will convert this list into an RDD object. To do that, on top of the SparkSession object there is an attribute called sparkContext, which has the parallelize() function. We pass the list to this function to create an RDD out of it. Let's have a variable called rdd to store the result, and let's print its type. When I hit Shift+Enter to execute this code, you can see the statement prints that it is an RDD-type object. So now the RDD object is created. If I want to print this RDD, I should use the collect() function to collect the data from the different nodes, or partitions, and present it. When I hit Shift+Enter, if you closely observe the result, it gives me back the list, which is the collection of objects. So this is an RDD. Once the RDD is available you can use the map() function, and there are a lot of other things that we will discuss in the next video. For now, you got the idea: we created an RDD object using the sparkContext attribute and the parallelize() function.

Now, let's assume in your project you have an RDD object and you want to convert it into a DataFrame. Is that possible? Yes, it is. On top of the RDD object there is a function called toDF(); using this function I can create a DataFrame out of the RDD. So this code will create a DataFrame; let's save it in a variable and then show it. When I hit Shift+Enter to execute this code, if you observe closely, you will see a DataFrame created, but there are no column names specified; by default it creates columns like _1, _2, because no schema was specified. Now, suppose I want to specify the column names as well. Is that possible? Yes, it is. If you look at the toDF() function, I can pass a schema too: there is a schema parameter, and for this schema I can pass a list of column names, like id and name. Now if I hit Shift+Enter, I should get the column names as id and name. I hope you got it: the list I am passing provides the column names for the DataFrame created from the RDD object.

Not only that, there is also the spark.createDataFrame() function; if you have seen my previous videos, you know this one. To this function I can also pass my RDD object, along with the schema, to get the column names. This will also create a DataFrame object. Let me store the DataFrame as df1 and finally print df1 using the show() function. Once this command's execution completes, you can see an id column and a name column with the two rows.
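A sketch of the two conversion routes just described, continuing from the `rdd` created above (the column names follow the narration; the output values are illustrative):

```python
# Route 1: RDD.toDF(). Without a schema, columns default to _1, _2, ...
df = rdd.toDF()
df.show()

# Passing a list of column names as the schema gives proper headers.
df = rdd.toDF(schema=["id", "name"])
df.show()

# Route 2: spark.createDataFrame() takes the RDD plus the same column list.
df1 = spark.createDataFrame(rdd, schema=["id", "name"])
df1.show()
```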
So let's go back to the presentation. That's it for this video; I hope you got a high-level idea of what an RDD object is and how to create a DataFrame out of it. Thank you for watching. Please subscribe to my channel and press the bell icon to get notifications whenever I add new videos. Thank you so much.
Info
Channel: WafaStudies
Views: 9,845
Keywords: PySpark for beginners, PySpark Playlist, PySpark Videos, Learn PySpark, PySpark for data engineers, dataengineers PySpark, PySpark in Azure Synapse Analytics, PySpark in Azure databricks, Understand PySpark, What is PySpark, PySpark in simple explaination, PySpark Overview, synapse pyspark, spark, pyspark, azure databricks, rdd, convert rdd to dataframe in pyspark, what is rdd in pyspark, how to convert rdd to dataframe, toDF(), spark.createDataFrame()
Id: 7R_-_K7HxZw
Length: 7min 50sec (470 seconds)
Published: Thu Dec 29 2022