9. read json file in pyspark | read nested json file in pyspark | read multiline json file

Captions
Hello, welcome to SS UNITECH, Sushil this side. This is a continuation of the PySpark tutorial. In this video we are going to see how to read data from single-line JSON, multi-line JSON, and complex-type JSON files. Before going to the browser, let's see how many formats a JSON file can come in. In total we have four: first the single-line JSON file, second the multi-line JSON file, third the single-line complex JSON file, which contains nested values, and fourth the multi-line complex (nested) JSON file. Let me quickly go into the files and look at each format.

First is the single-line JSON file. Here all the values of a record sit on a single line: the id, name, age, and department name are all on one line, the second record is on the next line, and so on. Next is the multi-line JSON file. The data is the same, but each record is spread across several lines: id comes on one line, then name, age, and department, each on its own line. Next is the nested single-line JSON file. Each record is on a single line, but it contains nested values. For example, we have name, age, and address, so three values in the outer JSON, but under address we again have two values, one for the city and one for the state. So this is single-line JSON with complex data. The last one is the nested multi-line file. Here each value is on its own line, and the address is again split across lines, with city and state on two different lines. So we have four different formats in total, and we will read data from all four.

Let me go to the browser; I have already placed files in all four formats here. In Azure Databricks, in this notebook, we can see the cluster is up and running. First we need to read data from the single-line JSON file, which is very straightforward. We use spark.read, then specify the option. In the option we set single line to true, which means we are fetching data from a single-line JSON file. Next we call json and pass the path. We have already created the mount point /mnt/input, which points to the input location, and under input we can see the employee single-line JSON file. We copy that name into the path, assign the result to a data frame, df_single_line, and display it. Let me execute it. We can see the data directly, so there is no big challenge here.

Next, for reading the multi-line JSON file, we make a small change to this code: instead of single line we use the multi-line option. Let me execute it. It says this location is not available, so let me check the file name. It is employees ml, so we have to use that. Let me execute it again. Now we can see the data from the multi-line JSON file, so we can fetch all the data just by changing the option to multi-line instead of single line. What if we do not specify multi-line? Let me remove the option and execute it. It causes a problem and says this is not allowed. So we have to specify the multi-line option, because by default the file is read as single line.

Now the next one. We need to be more careful with the complex single-line JSON file, because here we have to specify the schema before fetching the data: the file has nested values. The address has two more columns, city and state, so this is a little more complex. We have to declare two schemas. The first is for the address, which has only two columns, city and state. The second is for the whole JSON, and there the data type of the address column is the address schema we just declared. Let me do it so it is easy to understand.

To declare a schema we import from pyspark.sql.types: StructType, then StructField, then IntegerType for the integer, then StringType for the string. I guess these four types will be enough. First the address schema. To specify a schema we use StructType, and inside its brackets we add the columns using StructField. StructField takes three parameters. The first is the name of the column. The second is the data type, which is string here, so we pass StringType. The last one says whether the value is nullable, and I am going to set it to false. I typed name here, but this one is actually for city, so let me rename it to city, and then add state the same way. That completes the address schema.

Now the schema for the customer. Again we use StructType, and inside it we add a StructField for each required column: name, age, and address. Name is StringType with the last parameter set to false. The second column is age, and instead of string it is IntegerType. The last one is address, and the data type of this column must be the address schema we created. This is the only thing you have to remember: if you have complex JSON, first declare the schema of the inner JSON, and in the outer JSON use that schema as the data type of the complex column.

Now let me read the data into a data frame for the nested single-line file. We use spark.read, then the option, which we mark as single line true, and we also need to specify the schema, which is the customer schema. So remember, we add the schema as well as the option. Then we call json with the path: the input location, and the file name is the customer nested single-line JSON file. Let me copy it into the path and use the display command to see the data. Let me execute it. We are getting null for age and address, because the column names I typed do not match the names in the file: age should start with a lowercase letter, and in address a d is missing. That is why we see null there. Let me fix them and re-execute. Now it shows all the data: the name is Sushil, the age is 30, and under address we can see the city and state. So we can successfully read the complex JSON.

Let me quickly do the last one, the multi-line complex JSON file. We use the same query, but in the option we set multi-line to true, and we change single line to multi line in the file name and in the data frame name. That's it. Let me execute it. Everything else remains the same, and we can see the data from the multi-line complex JSON file.

Let me recap what we have covered. First, for the single-line JSON file there is nothing special to do. You can either specify the schema or skip it. If you specify the schema, the data types follow your schema; otherwise they will be string. Next, for the multi-line JSON file, we simply mark the option as multi-line instead of single line. For complex JSON, we first declare the schema of the inner JSON value, which here is the address, then the schema of the outer JSON, using the inner schema as the data type of the address column. The rest remains the same, and we pass that schema when reading the data. If we look at the column data types, we see the ones we declared: name is string, age is integer, and city and state are string. For the last format it is the same again, but in the option we mark the JSON as multi-line.

If you have multiple columns with complex JSON values, then just as we added a schema for the address, you can add a schema for each such column and use it for that column when creating the final schema. This is how we can read JSON files. I hope you have understood how to read data from all these JSON formats. Thank you so much for watching this video. If you liked it, please subscribe to our channel to get many more videos. See you in the next video.
Info
Channel: SS UNITECH
Views: 2,200
Keywords: PySpark for beginners, PySpark Playlist, PySpark Videos, Learn PySpark, PySpark for data engineers, dataengineers PySpark, PySpark in Azure Synapse Analytics, PySpark in Azure databricks, Understand PySpark, PySpark in simple explaination, PySpark Overview, create dataframe from json file using pyspark, pyspark read json file in to dataframe, pyspark reading multiple json files in dataframe, read single line json file into dataframe using pyspark, multiline json pyspark, pyspark
Id: dOkPf_zVqaw
Length: 14min 46sec (886 seconds)
Published: Sun Apr 30 2023