HOW TO PARSE DIFFERENT TYPES OF NESTED JSON USING PYTHON | DATA FRAME | TRICKS

Captions
Welcome everybody, you're watching Mr Fugu Data Science. I decided to compress a few of my nested JSON video examples into one video, showing you different approaches you can take to solving these problems. I would like to say thank you to the viewers who have given me feedback recently for future video topics, as well as to the new subscribers; I greatly appreciate it.
Now just import pandas, json, and datetime. We have four examples that I'd like to go through. Okay, let's zoom in a little bit so we get a better idea. In the first example, the outer layer of our JSON object is entities, where the values are further dictionaries and some are lists of dictionaries. The second example is also entities, but with nested urls, where we will use recursion. The third example is candidates, where each candidate has a first name, last name, their skills (which are a list), the state, specialty, etc. But notice that inside of our candidate we have another, separate dictionary, which is human resources related. We'll figure out how to combine both of these. The last example takes the same information we had above, but you don't see the key-value pairs for your first name, last name, skills, etc. This one is a little deceiving: you have a tuple with two different dictionaries hidden inside of it, and you still have your list of skills. We're going to look at how to solve it. These won't be shown in any particular order from first example to last.
One of the techniques you could use for solving these problems is called json_normalize, which takes semi-structured data and basically flattens it out. So let's look at our first example, entities, which has our hashtags, symbols, etc. from a Twitter example. Let me zoom out a little bit.
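As a minimal sketch of what json_normalize does, assuming a small made-up record (the field names below are illustrative and are not the data shown in the video):

```python
import pandas as pd

# Hypothetical semi-structured record; the keys are assumptions for illustration.
record = {
    "id": 1,
    "user": {"name": "fugu", "location": {"city": "Austin", "state": "TX"}},
}

# json_normalize walks the nested dictionaries and joins the keys with "."
# so every leaf value ends up in its own column.
flat = pd.json_normalize(record)

print(flat.columns.tolist())
print(flat)
```

Each nested level becomes part of the column name, for example user.location.city, which is why the later examples spend time cleaning up column names.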
Now let's look at what this evaluates to if we just use the regular DataFrame without any processing. Okay, this looks like an issue, right? You have your outermost part, which is your entities, and inside of that is what looks like a nested table, in a sense, with levels. We have to figure out how to approach this. Let's do a little bit of formatting here. Instead of setting it up like this, what if we called in the entities directly? How's that going to print out? Uh oh, we got this: all arrays must be the same length.
Well, let's handle this differently. Let's create a dictionary with a comprehension over each key and value, creating a new pandas Series from each value. We iterate over the key and value of our example one data, calling in the entities so we can go inside of it, and I need to call items on it. Let's see what this looks like. Okay, so we have our one row; it looks like we're making some progress, but we need to expand on this further. So how are we going to deal with this? Let's try something else: let's flatten this out with json_normalize, but first convert our data into a JSON string. Let's take it in, convert it to JSON, and there we go. Now we've split it up and gotten within our nested data.
Let's take this apart and figure out what's going on, starting with the first part. We took the data frame and converted it to JSON, oriented in the records format. Then we loaded that JSON string back in here, and then we used json_normalize so we could flatten it out.
Let's look at our next example, which is example three in the data we're using but in reality the second example we're parsing. We have our candidates, which have nested values that also have nested key-value pairs.
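The entities walkthrough above can be sketched roughly as follows. The data here is a made-up stand-in for the Twitter example, and the exact calls typed in the video may differ; passing dtype="object" to pd.Series is an addition to keep the empty list from triggering a dtype warning on some pandas versions.

```python
import json
import pandas as pd

# Hypothetical stand-in for the tweet "entities" object described above.
ex1 = {
    "entities": {
        "hashtags": [{"text": "python", "indices": [0, 6]}],
        "symbols": [],
        "urls": [{"url": "https://example.com", "indices": [7, 30]}],
    }
}

# Passing the ragged lists straight to DataFrame fails because the
# lists have different lengths.
try:
    pd.DataFrame(ex1["entities"])
    ragged_ok = True
except ValueError as err:
    ragged_ok = False
    print("DataFrame failed:", err)

# Wrapping each value in a Series lets pandas align on the index and
# pad the shorter lists with NaN.
df = pd.DataFrame(
    {key: pd.Series(value, dtype="object") for key, value in ex1["entities"].items()}
)

# Round-trip through a records-oriented JSON string, then flatten the
# nested dictionaries into dotted column names.
flat = pd.json_normalize(json.loads(df.to_json(orient="records")))
print(flat)
```

The round trip through to_json and json.loads turns the data frame into a plain list of dictionaries, which is the shape json_normalize expects.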
If we print this off first as a data frame, we'll see that we end up with two columns of dictionaries. And you say: well, this is interesting, how am I supposed to deal with this? Let's do that json_normalize once again and see what happens here. Let's take in the human resources related column. Okay, so it looks like we can work with this. How should we manipulate these data? We're going to do the same thing for the other column, except here this is going to be our df3 and this instead is going to be the candidate. Okay, but it looks a little odd right here, if you notice, so we obviously need to do a little bit of formatting. Let's change this up and reset the index with the column level equal to zero. But that doesn't look like it did very much, now did it? So let's check this out. All right, there we go; we just needed to transpose that. I'll call this df3 candidate as the variable name, and then we need to put this all together. So how are we going to do that? Let's concatenate what we just created with the json_normalize we did above on df3 for the human resources related column. We need to do this along the columns, with df3 candidate, and there we go. Now we've got it put together.
But we could also do this a lot more easily, for both the HR and the candidate. Instead of going through that rigmarole, we can just use json_normalize on the data directly, because of the way these data are formatted.
What does pandas explode do? If you have a situation where a column is a list of values, such as this skills column, you can expand it. Notice that all your rows are duplicated, and all it's doing is expanding the skills column into one row per skill. If that's something you need to do, you can.
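A rough sketch of the candidate and HR steps above, assuming records shaped the way the captions describe them. The key names (candidate, hr_related) and all values are hypothetical, and the calls are one plausible reading of what was typed on screen:

```python
import pandas as pd

# Hypothetical records: each entry holds a candidate dictionary and a
# separate HR dictionary (names and values are made up).
candidates = [
    {
        "candidate": {"first_name": "Ada", "last_name": "Lee",
                      "skills": ["python", "sql"], "state": "CA"},
        "hr_related": {"hire_date": "2020-01-15", "salary": 90000},
    },
    {
        "candidate": {"first_name": "Bo", "last_name": "Kim",
                      "skills": ["excel"], "state": "NY"},
        "hr_related": {"hire_date": "2019-07-04", "salary": 70000},
    },
]

df3 = pd.DataFrame(candidates)  # two columns, each holding dictionaries

# Long way: normalize each column of dictionaries, then join along the columns.
candidate_df = pd.json_normalize(df3["candidate"].tolist())
hr_df = pd.json_normalize(df3["hr_related"].tolist())
combined = pd.concat([candidate_df, hr_df], axis=1)

# Short way: normalize the original records in one call; the column names
# get a prefix such as "candidate." or "hr_related.".
flat = pd.json_normalize(candidates)

# explode repeats each row once per element of the list in "skills".
exploded = combined.explode("skills")
print(exploded)
```

Note that explode keeps the original index, so the repeated rows share an index value unless you reset it.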
Using explode: if I had a list of dictionaries, it would remove the list, but I would still have a column of dictionaries, so be aware of that. In some circumstances it's good to use explode to get rid of the list of dictionaries and then figure out how to deal with the dictionaries accordingly.
We can also create dummy variables, and let me explain what's going on with this. If we have the skills column, which is a list, we can expand it into multiple columns and use dummy variables of ones and zeros to represent whether each skill was included or not. So let's take that into consideration, starting with json_normalize, which is so fun. Taking json.loads, we're going to take the last thing that we used, convert it to JSON, and as always orient it towards records, at least in this example. Here we go. This looks somewhat promising, except for the column names, which are kind of a nuisance. Let's get the column names corrected. We're going to change all of the columns to first name, last name, skill, state, experience, specialty, relocation. So let's look at what our df looks like now. Okay, here's our data. But it looks kind of funny, so let's transpose these data. That looks a little more appropriate. Let's change all the column names for these data and look at that. Now let's expand on these data by converting the skills into dummy variables as added columns. Okay, so that's looking promising. But now we need to combine these data, so we're just going to use concat again. Let's take our dummy variables and join both of these data frames along the columns. Now we've expanded everything. So if you run into a situation like this, you're covered.
Now let's do some recursion. If we had to keep going further into this and make sure we get all of the keys, that would be a scenario where you would use this function. What I'm doing with this function is looking to see if we have a dictionary.
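The dummy-variable expansion of the skills column described above can be sketched like this. It uses str.join followed by str.get_dummies, which is one common way to do it and not necessarily the approach used in the video; the data is made up.

```python
import pandas as pd

# Hypothetical candidates with a list-valued "skills" column.
df = pd.DataFrame(
    {
        "first_name": ["Ada", "Bo"],
        "skills": [["python", "sql"], ["sql", "excel"]],
    }
)

# Join each list into a "|"-separated string, then let get_dummies split
# on "|" and build one 0/1 column per distinct skill.
dummies = df["skills"].str.join("|").str.get_dummies(sep="|")

# Attach the indicator columns to the rest of the data along the columns.
expanded = pd.concat([df.drop(columns="skills"), dummies], axis=1)
print(expanded)
```

Each row now has a one where the candidate listed that skill and a zero otherwise.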
If we have a dictionary, we're going to iterate inside of it and make a recursive call to see if we need to keep moving further inside to find more keys. We're going to separate the keys at each level with dot notation, so you can see that you're going inside of it. Then, if any of our values is a list, we're going to iterate inside of that and make a recursive call to see what's inside the list. Finally, we append everything that we found, for formatting. And this is what we end up with: here are all of our keys, and here are our values.
We can create our data frame from this example. We're going to say it's from a dictionary, and we're going to call the function that we just created. But we're going to orient this along the columns this time. Let's see what this looks like. Ah, we've got to do a transpose. Okay, that's looking pretty good. But what else do we need to do to make it look appropriate? Well, our column names are in the first row, so let's deal with that. The columns will equal the data frame's first row. Okay, now we just need to drop that row, so we call on the index of the first row. All right, now let's change the column names, which are becoming quite a nuisance for us. Let's just call on those columns, rename them with the current names that we just made, and then drop... There we go, but we need to drop that first row like we did before. There we go. Now we've got everything expanded out with the proper names, without all the dot notation, if that's what you wanted.
Let's look at this next example, which is a little misleading: it doesn't have the key-value pairs you've seen above for the first and last name and the skills. This is a little tricky and not straightforward, so pay attention. You still have this dictionary that you need to deal with. You can keep it or change it, depending on what you want to do with it.
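The recursive idea described above can be sketched as follows. This is not the function shown in the video: the name flatten, its parameters, and the choice to use list positions as key parts are assumptions made for illustration, and the data is made up.

```python
import pandas as pd


def flatten(obj, parent_key="", sep="."):
    """Recursively collect leaf values under dotted keys (hypothetical helper)."""
    items = {}
    if isinstance(obj, dict):
        # Go one level deeper for every key in the dictionary.
        for key, value in obj.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
            items.update(flatten(value, new_key, sep))
    elif isinstance(obj, list):
        # Use the position in the list as the next part of the key.
        for position, value in enumerate(obj):
            new_key = f"{parent_key}{sep}{position}" if parent_key else str(position)
            items.update(flatten(value, new_key, sep))
    else:
        # A leaf value: store it under the key path built so far.
        # Empty dictionaries and lists produce no entries at all.
        items[parent_key] = obj
    return items


# Hypothetical nested "entities" object with urls inside urls.
nested = {
    "entities": {
        "url": {"urls": [{"url": "https://example.com", "indices": [0, 23]}]},
        "description": {"urls": []},
    }
}

flat = flatten(nested)
df = pd.DataFrame([flat])  # one row, one column per dotted key
print(df.T)                # transposed view: one key per line
```

Building the data frame from a one-element list of the flattened dictionary gives a single row directly, which avoids the transpose and header clean-up steps at the cost of keeping the dotted names.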
Let's take our df, and I'm going to say df4 to list and create a data frame from this. I'm converting these values to a list so we can manipulate them, and you're going to see it's nothing too special. Now let's do some json_normalize: take in what we did before, convert it like usual, and orient it so we have records. Let's look at that. Now that we've got this situated, it's looking a little promising. We can combine these together along the columns, change the column names, and drop the column that had these mixed together. This looks quite good.
Now, if we're going to do some simple iterating, let's figure out how we can deal with this. Let's create a blank list, go through example number two, and append to it. There we are. Now this looks all right, but there's a problem here. We have this second row that's not needed: the hire date and the salary and healthcare are all part of this one entry, except when you evaluate this it's two separate dictionaries. So you need to do some further manipulation. We need to separate this into the first and second components and concatenate them so we get this aligned properly. Take this first one. Let's look at this first. Okay, we have our first part, which is the candidates, which we're going to put in. The second part that we're going to place will be the first parameter of t, and we need to do this along axis one, which is your columns as usual. And now we've spread everything out properly, so the salary and the hire date are taken into consideration for this one entry, because it's for one person. Now, if we were to iterate through this differently, and instead of doing it like this we did this, you end up only taking the keys and not going inside of the nested portion here, okay.
We can also parse the dates. I combined all of that prior work into one data frame, which you see now, and we could parse by date. We just need to call this in.
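The tuple problem described above can be sketched as follows, assuming each entry is a tuple holding a candidate dictionary and an HR dictionary. The data and the t[0] and t[1] indexing are assumptions for illustration, not a transcription of the code in the video.

```python
import pandas as pd

# Hypothetical entries: each tuple hides two dictionaries that belong
# to the same person.
records = [
    ({"first_name": "Ada", "skills": ["python", "sql"]},
     {"hire_date": "2020-01-15", "salary": 90000}),
    ({"first_name": "Bo", "skills": ["excel"]},
     {"hire_date": "2019-07-04", "salary": 70000}),
]

# Naive iteration: appending every dictionary on its own gives two rows
# per person, with NaN where the other dictionary's keys would be.
rows = []
for t in records:
    for part in t:
        rows.append(part)
naive = pd.DataFrame(rows)

# Splitting the tuple into its first and second components and joining
# them along axis=1 (the columns) keeps each person on a single row.
first = pd.DataFrame([t[0] for t in records])
second = pd.DataFrame([t[1] for t in records])
combined = pd.concat([first, second], axis=1)
print(combined)
```

Both component frames get the same default index, so concat lines them up row by row.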
I'm going to take the hire date, fix it, and change that column into datetime formatting. I need to give it a format where we're taking in the year, day, and month, using this notation. Let's see how that looks. Oh, that's not going to show anything. From there we can take our hire date and start parsing through it. Let's just look through this. It looks the same, right? Now let's parse through this and take the year. You're going to use something different here: dt.year. Let's see what we get back. And there's your hire year. We can cut and paste this a couple more times to find different things that we're looking at. We could look at the month, and not just the month number: we could actually get the month name. We can cut and paste the same thing again and get the day of the week. It depends on what your circumstances are, but this is how you could parse this, as far as possible considering the circumstances that we have.
I hope these few examples give you some confidence in parsing JSON or nested JSON data. As always, I would like to say: I hope this brought utility to someone. Please like, share, and subscribe. And if you subscribe, turn on that notification bell. I'll see you in the next video. Bye.
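The date handling described in the captions can be sketched like this. The column name, the sample dates, and the format string are assumptions; the format must match however the dates are actually written in your data.

```python
import pandas as pd

# Hypothetical hire dates stored as strings.
df = pd.DataFrame({"hire_date": ["2020-01-15", "2019-07-04"]})

# Convert the strings to datetime values; the format string is an
# assumption matching the made-up data above.
df["hire_date"] = pd.to_datetime(df["hire_date"], format="%Y-%m-%d")

# The .dt accessor exposes the parts of each date.
df["hire_year"] = df["hire_date"].dt.year
df["hire_month"] = df["hire_date"].dt.month_name()
df["hire_weekday"] = df["hire_date"].dt.day_name()
print(df)
```

The same pattern works for other attributes of the .dt accessor, such as dt.month or dt.day.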
Info
Channel: Mr Fugu Data Science
Views: 20,730
Keywords: mr fugu, mr fugu data science, Mr Fugu, Mr Fugu Data Science, parse json, nested json, parse nested json, nested json and python, how to parse nested json, how to parse json, json to data frame
Id: LYh8ih2X5Oo
Length: 15min 6sec (906 seconds)
Published: Tue Aug 18 2020