6. Write DataFrame into json file using PySpark | Azure Databricks | Azure Synapse

Video Statistics and Information

Captions
Hi friends, welcome to the WafaStudies YouTube channel. This is part 6 of the PySpark playlist. In this video, let's understand how to write a DataFrame into a JSON file using PySpark. Please watch my previous videos so that you get more out of this one. In the earlier videos of the PySpark playlist we covered, step by step, what PySpark is and how to read data from and write data to CSV files, and in the last video we discussed how to read a JSON file.

Now suppose we have data in a DataFrame, either from reading some file or from hard-coded values, and we want to store that data as a JSON file. Is that possible, and if so, how do we do it with PySpark? That is what this video covers. In addition, there are several save modes available when you write the data: you can append the data, overwrite it, or raise an error if data is already there. We will discuss those modes as well.

To write data to a JSON file there is the DataFrameWriter object. If you have seen my video on writing data to a CSV file, this is almost identical: the same DataFrameWriter object has a json() method, and with that method you can write the DataFrame's contents to a JSON file. Let me show this practically.

In the browser I already have Databricks open with a running cluster. Let me go to Workspace, then Users, navigate to my user account, and create a new notebook named write_df_to_json, since we are writing a DataFrame to JSON. The notebook's default language is Python, and it is attached to my cluster.

First let's create a DataFrame. In my second video I explained how to create a DataFrame from hard-coded data, and I will apply the same approach here. I create a variable named data, which is a list of two tuples: the first tuple is (1, 'Mahir') and the second is (2, 'Wafa'). Then I create another variable named schema, a list of column names: ['id', 'name']. So the DataFrame will have an id column with the values 1 and 2, and a name column with the values Mahir and Wafa.

On the SparkSession object, spark, there is a createDataFrame() function. Pressing Ctrl+Space shows IntelliSense for its parameters: for the data parameter I supply my data variable, and for the schema parameter I supply my schema variable. This code creates a DataFrame, which I save as df. Then I pass df to the display() function and hit Shift+Enter to run the cell. You can see the DataFrame was created: it has two rows, with the columns id and name.
In this case I created the DataFrame with hard-coded values, but in real projects you would typically get the DataFrame by reading some file. Now I want to store this DataFrame's data as a JSON file. How do I do that?

df is my DataFrame. On the df object, if I press Ctrl+Space, I can see an attribute called write. This gives you the DataFrameWriter object. Let me show that practically: I call the help() function on df.write and hit Shift+Enter; help() shows you the documentation of almost anything in Python. If you read the output, write is an instance of the DataFrameWriter class. If I scroll down, inside that class there is a csv() function that saves the contents to a CSV file (we saw that in an earlier video), and further down there is also a json() function.

That json() function is what actually writes the data out as JSON. If I scroll down to it, you can see it saves the content of the DataFrame in JSON format, and the documentation lists every parameter you can supply and what type of value each one accepts. Pay attention to the mode parameter of the json() function: it accepts the values append, overwrite, ignore, and error. append adds the new data to what is already there; overwrite replaces the existing data; ignore means that if the JSON output path already exists, nothing is written and the operation is silently skipped; and error means that if the path already exists, the write fails with a "path already exists" error. error is the default: if you don't supply anything for the mode parameter, that is the behavior you get. Let me show all of this practically; I'll copy these mode values into a notepad for now, because I will be using them with the json() function shortly.

Now let me remove the help() code. On the DataFrame there is the write attribute, which gives the DataFrameWriter object, and on that there is the json() function. The first thing I have to supply to json() is the path where I want to store the data. Let's store it in the Databricks File System (DBFS). In real projects you would store the data in something like Azure Data Lake Storage Gen2 or another storage service, for which you need to mount the storage or set up access; I covered all of that in the Synapse and Databricks playlists, so you can refer to those. For now I will store the data in DBFS, under /FileStore.
Under /FileStore I want to create a folder called json_data, and inside it I want my data, say under emps. Let me go back to the Data tab, then DBFS, and copy the /FileStore path; what I want is /FileStore/json_data/emps. But first, what happens if I pass a path ending in emps.json? You might think PySpark will create a JSON file with that name and store the data in it, but it will not: PySpark actually creates a folder with that name and stores your data inside it as part files. I explained this in the video on writing data to CSV, and the same thing happens with JSON. If I hit Shift+Enter and let the command run, then go back to Data, DBFS, /FileStore, there is json_data, and under it emps.json was created as a folder, not a file. Inside it there are three part files. Why three? Because Spark divides any work you give it into separate partitions and processes them in parallel, and each part file contains the data for one partition; when you query on top of this folder, the entire data set is returned to you. If I hover over each file, the tooltip shows that all three have the .json extension. So whatever name you give in the path, PySpark stores it as a folder. Now you may be wondering: what if I don't want part files, and instead want an actual single emps.json file?
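Each part file that PySpark writes uses the JSON Lines layout: one JSON object per line, one object per row. A small sketch with the Python standard library shows what the contents of such a part file look like (the two rows here mirror the example DataFrame):

```python
import json

# Typical contents of a part-*.json file for the example DataFrame:
part_file_text = '{"id":1,"name":"Mahir"}\n{"id":2,"name":"Wafa"}\n'

# Each non-empty line is an independent JSON object (one row)
rows = [json.loads(line) for line in part_file_text.splitlines() if line]
print(rows)
```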
Getting a single file like emps.json is possible, but you cannot do it directly in PySpark; there are workarounds to produce that single file, and we will discuss them in an upcoming video. For now, remember that PySpark always creates a folder, not a file, and inside that folder the data is stored as part files.

Let me go back to my notebook and this time write the data into a proper folder: under json_data, an emps folder. Shift+Enter, and once the command finishes, the data has been written. If I go to Data, DBFS, /FileStore/json_data/emps, the part files are there, and they contain the data.

Let's query it to confirm the data is actually there. As we discussed in the last video, spark.read has a json() function to which you can supply the path of your file; if I supply this folder's path, it reads all the JSON files under the folder and returns the data. Since that code produces a DataFrame, I pass the whole expression to the display() function. Shift+Enter: the emps folder, with its JSON part files, contains exactly the two rows I wrote, so the data was saved correctly.

Now, what happens if I try to write the data once again? If I execute the write cell a second time, it fails saying the data already exists. As I said, that is the error mode at work, because it is the default.
The json() function's mode parameter controls this. If I supply mode as error, it behaves exactly the same as before: Shift+Enter gives the same error, because the data is already there. If instead I use mode as ignore and hit Shift+Enter, the write is simply skipped: the data is not written again, because data already exists at that path, and the write operation is ignored without any error. If I run the read cell, we still see only the original two rows.

But what if I use append? The path already contains two rows; if I write again with append mode, it adds these two rows as well, so there will be four rows in total. Shift+Enter, the cell executes, and when I run the read cell again, I now see all four rows.

Finally, what about overwrite? Let me go back to the notepad and confirm the spelling: overwrite, that is what we have to use. Right now the path contains four rows, and the DataFrame contains two rows. If I write with overwrite mode, the existing contents are replaced, so at the end the path contains only the two rows I am writing now. Shift+Enter, execution completes, and when I run the read cell I see only two rows: the four rows were deleted and these two rows took their place. That is overwrite: overwriting the existing content.

So that is how you can write data into a JSON file using the json() function, and we have discussed all the save modes as well. I hope you got an idea of how to write data into JSON files using the json() function in PySpark. Thank you for watching this video; please subscribe to my channel and press the bell icon to get notifications whenever I add new videos. Thank you so much.
Info
Channel: WafaStudies
Views: 14,564
Keywords: PySpark for beginners, PySpark Playlist, PySpark Videos, Learn PySpark, PySpark for data engineers, dataengineers PySpark, PySpark in Azure Synapse Analytics, PySpark in Azure databricks, Understand PySpark, What is PySpark, PySpark in simple explaination, PySpark Overview, write dataframe in to json using PySpark, write data to json in databricks, pyspark writing dataframe data to json file, df.write.json() in pyspark, dataframe.write.json() in pyspark, pyspark write to json
Id: U0iwA473r1c
Length: 13min 53sec (833 seconds)
Published: Thu Oct 13 2022