PARSE LIST OF LISTS OF DICTIONARIES USING PYTHON | JSON | NESTED LISTS

Video Statistics and Information

Captions
Welcome everybody, you're watching Mr Fugu Data Science. Today we're parsing a list of lists of dictionaries, and we'll solve one problem multiple ways. I got this question from a recent viewer and decided to make a video to help them. We need a few imports: json, pandas, and from collections, defaultdict and ChainMap. The dataset comes from a previous example where I parsed the New York Times API. In this raw data, each value (each row, if we put it into a DataFrame) is a list of dictionaries. To get a clear picture of what these data look like, take the first entry: this list of dictionaries has a few different fields, but one of the values contains yet another list of dictionaries, so we'll need to parse inside the media field further because of the nesting. There we'll notice three urls, formats, heights, and widths within the media-metadata, for each of our 20 entries. If we took one specific entry and threw it into a DataFrame, it would look something like this. But what we'd really like is all of these in one single row: essentially a list of all the urls, a list of all the formats, the heights, and the widths for each entry, because this came from one entry, not three. So what I'd like is to create this url column where we have a list of the three urls. Assume this is the second entry with its three urls, and we put both onto one DataFrame as two rows (I just cut and pasted the same exact data to illustrate what I'm trying to achieve later). If you didn't want it this way and wanted to expand it instead, you could always use pandas explode; just remember it gives you more rows. Let's get into our first approach: ChainMap from collections.
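The shape of the data described above can be sketched with a hypothetical miniature (column and field names such as "media" and "media-metadata" follow the NYT API convention mentioned in the video; the values here are made up):

```python
import pandas as pd

# Hypothetical miniature of the NYT-style data: a "media" column holding a
# list of dictionaries per row, each with a nested "media-metadata" list.
df = pd.DataFrame({
    "headline": ["story A", "story B"],
    "media": [
        [{"type": "image", "subtype": "photo",
          "media-metadata": [
              {"url": "a1.jpg", "format": "thumb",  "height": 75,  "width": 75},
              {"url": "a2.jpg", "format": "medium", "height": 150, "width": 150},
              {"url": "a3.jpg", "format": "large",  "height": 440, "width": 293},
          ]}],
        [],  # some rows are completely empty lists
    ],
})

# Goal: keep one row per story, collecting the three urls (and formats,
# heights, widths) into a single list rather than exploding into more rows.
first = df["media"][0][0]
urls = [m["url"] for m in first["media-metadata"]]
print(urls)  # ['a1.jpg', 'a2.jpg', 'a3.jpg']
```

If you preferred extra rows instead of lists, `df.explode("media")` would give you that trade-off.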
Think of ChainMap as taking multiple dictionaries and condensing them into one; the official documentation has a few useful examples. Let's call this multiple_dict_to_single, store results in a list, and iterate over the range of the length of the DataFrame we'd like to parse, taking the media column. I store each result as a dict by calling ChainMap, unpacking all of the row's dictionaries as its arguments, and then you'll see something kind of funky: I'm using the iterator, but I'm also slicing through all rows and columns to the end. Then I call my good old trusty list, append each result, and let's see what this looks like. Now, this gets us most of the way: it spreads everything out from the media column. Mind you, we're going to concatenate at the end and put everything together, because this is just the one column we're parsing, and notice we still have the media-metadata to take care of later. There are also rows that are completely null, which we'll handle as well. We can do this another way with a list comprehension. Let's call it merge; I found this simple little function on Stack Overflow. It iterates inside the list of dictionaries and returns one dictionary with everything, via a list comprehension over the items of each dictionary in the list. Then I'll call this function with an iterator to build a list, iterating through the DataFrame's media column as usual; I make a temp variable for each piece and append everything. Put the output into a DataFrame, and there we are: the same exact thing, done a different way.
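The two versions described above can be sketched as follows (a minimal reconstruction, assuming a toy "media" column; one subtlety worth noting is that ChainMap lets the *first* dictionary win on duplicate keys, while the comprehension lets the *last* one overwrite, which only matters if keys repeat across the list):

```python
from collections import ChainMap

import pandas as pd

df = pd.DataFrame({"media": [
    [{"type": "image", "caption": "first"}],
    [{"type": "image", "caption": "second"}],
    [],  # a completely null row
]})

# Version 1: ChainMap condenses each row's list of dicts into one mapping.
multiple_dict_to_single = []
for i in range(len(df)):
    d = dict(ChainMap(*df["media"].iloc[i]))
    multiple_dict_to_single.append(d)
chain_df = pd.DataFrame(multiple_dict_to_single)

# Version 2: the same merge as a small helper with a dict comprehension.
def merge(list_of_dicts):
    return {k: v for d in list_of_dicts for k, v in d.items()}

merged = pd.DataFrame([merge(row) for row in df["media"]])
print(chain_df.equals(merged))  # True
```

The empty third row comes out as all-NaN in both frames, which is the null case the transcript mentions handling later.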
What you could do is check how fast each of these operations is versus how much memory it uses with a profiler, and see which is more beneficial, if you're curious or so inclined. Now let's do what I'd call a super crazy version, but don't get scared, it won't bite. Walking through it: I create a defaultdict and an empty list, and iterate through this specific column as we've been doing. What do I check? Whether the length of each row is greater than zero. What's going on here? Each list can be empty or filled: of the lists we're iterating through, some have something and some don't. If a row has something, we iterate inside it, taking our iterator through each individual dictionary inside the list of lists. Each dictionary gets converted into tuples using items(), which will be useful for creating our defaultdict; I leave a note in the code so it's clear we're converting a dictionary to tuples and iterating. We append the index value together with each tuple j, and you'll see why this is important in a second: we're keeping a frame of reference by pairing the index with the tuple. For rows that are not a filled list, we need a way to take care of them, so I create a tuple by zipping together the keys of the DataFrame's first element with placeholder values. What I'm trying to do here is: anything that didn't fall into the first branch needs a tuple with the same keys that are inside our dictionaries, with the placeholder values repeated to match the length of the keys, so everything lines up.
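A sketch of the index-plus-tuple bookkeeping just described, again with hypothetical data (the "NaN" string is an arbitrary placeholder standing in for whatever the video used):

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({"media": [
    [{"type": "image", "caption": "first"}],
    [],                                      # empty row needs a filler
    [{"type": "image", "caption": "third"}],
]})

pairs = []
for i, row in enumerate(df["media"]):
    if len(row) > 0:
        for entry in row:              # each entry is a dict
            for kv in entry.items():   # converting a dict to (key, value) tuples
                pairs.append((i, kv))
    else:
        # Filler branch: reuse the keys of the first entry, zipped against
        # placeholder values, so every index contributes the same keys.
        keys = tuple(df["media"][0][0].keys())
        for kv in zip(keys, ["NaN"] * len(keys)):
            pairs.append((i, kv))

# Build the defaultdict: first element of each tuple is the key,
# second is the value; the index kept things aligned along the way.
dd = defaultdict(list)
for _, (k, v) in pairs:
    dd[k].append(v)

new_df = pd.DataFrame(dd)
```

Without the else branch, the empty row would silently contribute nothing and the columns would come out misaligned, which is exactly the issue the transcript warns about.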
And I want to mention something very specific here: if you had dictionaries with mismatching keys, this would not be what you'd want to do. Since I know it works for this specific example, I'm using it; if it didn't, you'd just do it a little differently. Okay, so what's going on with this? For the first index value we have the type, the subtype, the caption, copyright, approved, and metadata, and then the same for all the rest. Now let's skip down to entry number seven: this is what I wanted to match up, because when I create the defaultdict, my key is the type and I want each value to line up properly, and if I didn't have that else statement we would have had an issue. So let's finish this up and create our defaultdict, where the first entry of each tuple is the key and the second entry is our value. Throw this into a DataFrame and see what that looks like. (But I made a mistake; there we go.) This looks like we're on the right track, and we get the same exact thing once again. Let's scroll down: same result. That's perfect. Let's call this new_df and clean it up a little so it's not taking up my whole page. We did pretty well here, but I want to note something once again: if you have lists of dictionaries of varying lengths, you need to do one more step, all right? Don't forget that. You would basically find the unique keys, then you could perform a set operation if you wanted, so you'd have all of the keys, and then use an if/else: if a key is there, put its value; if not, put a placeholder. So that's what was going on with creating our tuple so it cooperated with the formatting. The innermost list of lists, the media-metadata, is very interesting; it threw me off and took me some time, which is why this video is delayed.
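The extra step for mismatching keys hinted at above might look like this (a sketch; "missing" is an arbitrary placeholder, and the set union is the "set operator" the transcript mentions):

```python
# Dictionaries with varying keys: build the union of all keys first,
# then fill gaps explicitly so every normalized dict has the same shape.
rows = [
    {"type": "image", "caption": "a"},
    {"type": "video"},                  # no caption key at all
]

all_keys = set().union(*(d.keys() for d in rows))
normalized = [{k: d.get(k, "missing") for k in sorted(all_keys)} for d in rows]
print(normalized[1])  # {'caption': 'missing', 'type': 'video'}
```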
So basically what I'm doing is creating a function with my defaultdict once again, iterating through my DataFrame column of choice and checking whether a value is a string: if we don't have a list of dictionaries, it's a string value saying "nope" or whatever I declared, or some empty value. Otherwise, we iterate through our dictionary, turn it into tuples again, use our defaultdict, and return it. Then I iterate inside the column, do the same type of comparison, and call the function that has our defaultdict. I used an else statement because I wanted to keep the formatting: entries number 7, 13, and 16, or something like that, do not have a list; they are strings. We take care of that by creating zipped tuples for our keys and values, and I chuck that all together. Then I create a DataFrame and expand it. For each entry, remember there were three urls, three formats, heights, and widths; those last three columns always travel together: this format has this height and this particular width for its thumbnail, that format has its own height and width, and so on. It was difficult to distinguish what's going on with these, but you can see that each one is different. So I know the formatting is there, and I have this set up as lists of strings, so if you decide later to expand this out and flatten it, you easily have that option. Scrolling back up, I want to mention that I printed these out so I could actually do a comparison: if you see 10/3, 10/3, 10/29, 10/29, and you look at them, they're all different because they have different separators for these images, or whatever they are. It looks like I did the formatting for everything properly. Then we concatenate and put everything together: the original data, the new DataFrame with the media, and the expanded url information.
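The nested media-metadata step described above can be sketched like this (a reconstruction under stated assumptions: the column values, the "nope" placeholder string, and the key names are all hypothetical stand-ins for what the notebook actually contains):

```python
from collections import defaultdict

import pandas as pd

# The "media-metadata" column now holds either a list of dicts (the
# thumbnails) or a plain placeholder string for the missing entries.
col = [
    [{"url": "a1.jpg", "format": "thumb",  "height": 75,  "width": 75},
     {"url": "a2.jpg", "format": "medium", "height": 150, "width": 150}],
    "nope",   # some entries are strings rather than lists
]

keys = ("url", "format", "height", "width")

def collect(entry):
    """Collapse one list of dicts into {key: [value, value, ...]}."""
    dd = defaultdict(list)
    for d in entry:
        for k, v in d.items():   # dict -> tuples, fed into the defaultdict
            dd[k].append(v)
    return dict(dd)

out = []
for entry in col:
    if isinstance(entry, str):
        # Keep the formatting: same keys, placeholder lists of strings.
        out.append(dict(zip(keys, [[entry]] * len(keys))))
    else:
        out.append(collect(entry))

expanded = pd.DataFrame(out)
```

Each cell ends up a list, so a later `expanded.explode(...)` or further flattening remains an easy option, as the transcript notes.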
Now, we end up with the same number of rows, so that checks out, but instead you have 32 columns because it's all expanded out now. All right, so scrolling back to the top: here's everything that worked out from our nested data that I wanted to retain as a list (I kept this just as a reference), then these short columns right here came from the original media, and the rest are just the original data. That wasn't too bad of a video. Last thoughts: if you wanted to speed this up, consider using list comprehensions versus loops. Understand that there are times when index values actually do matter and you'd like to preserve the order, so you need to adjust your code and remind yourself of that. And finally, people talk about writing elegant code all the time, but that just comes with experience; do realize that elegant code may not always be the most efficient, and may not always solve the problem faster. So don't get hung up on thinking your code has to look the best. Just leave docstrings and notes so people understand what's going on. As always, thank you for watching this video; I hope I brought utility to someone. Please like, share, and subscribe, and if you subscribe, turn on that notification bell. I'll see you in the next video. Bye.
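The final assembly step, concatenating the three pieces side by side, can be sketched as follows (hypothetical miniature frames standing in for the real 20-row, 32-column result):

```python
import pandas as pd

# Three pieces to stitch together: the original data, the flattened media
# columns, and the expanded url/format/height/width lists.
original = pd.DataFrame({"headline": ["story A", "story B"]})
media_df = pd.DataFrame({"type": ["image", "image"]})
urls_df  = pd.DataFrame({"url": [["a1", "a2", "a3"], ["b1", "b2", "b3"]]})

# axis=1 keeps the row count unchanged and only adds columns.
combined = pd.concat([original, media_df, urls_df], axis=1)
print(combined.shape)  # (2, 3)
```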
Info
Channel: Mr Fugu Data Science
Views: 7,352
Keywords: mr fugu, mr fugu data science, Mr Fugu, Mr Fugu Data Science, parse json, parse lists of list, parse lists of dictionaries, parse list of json, parse lists of list json, python lists of lists
Id: KTJ3AQfRpN4
Length: 13min 52sec (832 seconds)
Published: Mon Sep 07 2020