Welcome everybody, you're watching Mr. Fugu Data Science. Today we're parsing a list of lists of dictionaries, and we'll solve one problem multiple ways. I got this question from a recent viewer and decided to make a video to try to help them. We need to do a few imports: let's import json and pandas, and from collections let's import defaultdict and ChainMap.
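For reference, the setup cell looks something like this (the `pd` alias is just the usual convention):

```python
import json
import pandas as pd
from collections import defaultdict, ChainMap
```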
The data set we're using today comes from a previous example I did parsing the New York Times API. This is some raw data where each value, or each row if we put this into a DataFrame, is a list of dictionaries. To get a clear picture of what these data look like, we'll take the first entry, and notice that in this list of dictionaries we have a few different options. But we also notice that inside of this, one of our values holds another list of dictionaries, so we'll need to parse inside of the media further due to the nesting. There we'll notice that within the media metadata there are three URLs, formats, heights, and widths for each entry, and we have 20 entries, okay.
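To make that nesting concrete, a single row looks roughly like the sketch below. The field names follow what comes up later in the video (type, subtype, caption, copyright, media metadata with url/format/height/width); the values and the exact key spelling "media-metadata" are placeholders here, not the real API response:

```python
# Illustrative shape only -- placeholder values, not the actual NYT data
row = [
    {
        "type": "image",
        "subtype": "photo",
        "caption": "...",
        "copyright": "...",
        "media-metadata": [                      # the nested list of dictionaries
            {"url": "https://example/a.jpg", "format": "thumbnail", "height": 75,  "width": 75},
            {"url": "https://example/b.jpg", "format": "medium210", "height": 140, "width": 210},
            {"url": "https://example/c.jpg", "format": "medium440", "height": 293, "width": 440},
        ],
    }
]
```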
If we took, for instance, one specific entry and threw it into a DataFrame, it would look something like this. But really, what we would like is to have all of these in one single row. So you would essentially like one list of all the URLs, a list of all the formats, the heights, and the widths for each entry, because this came from one entry, not three, okay. So essentially what I would like is to create this url column where we have a list of the three URLs. Assume that this is the second entry and these are its three URLs, and we're putting this into one DataFrame as two rows. I just cut and pasted this; it's the exact same data, just to illustrate what I'm trying to achieve later. If you did not want it this way and wanted to expand it instead, you could always use pandas explode, as shown below. Just remember it gives you more rows.
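A quick, hypothetical example of that explode behavior (the frame and column names here are made up for illustration):

```python
import pandas as pd

# Each cell in "url" holds a list of three URLs for that entry
df_lists = pd.DataFrame({
    "entry": [1, 2],
    "url": [["a.jpg", "b.jpg", "c.jpg"], ["d.jpg", "e.jpg", "f.jpg"]],
})

# explode() turns each list element into its own row -- more rows, same columns
print(df_lists.explode("url"))
```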
Let's get into our first bit of coding: ChainMap from collections. Think of it as taking multiple dictionaries and basically condensing them into one; there are a few useful examples of this in the official documentation. Let's just call this multiple_dict_to_single. Store this as a list, and let's iterate over the range of the length of the DataFrame we would like to parse, taking the column of media, where I'm going to store this as a dictionary and call in ChainMap, passing all of the arguments for the row that I would like to parse. And then you're going to see something kind of funky: I'm using the iterator, but then I'm also slicing through everything to the end, all right. Then I'm going to call in my good old trusty list and append, a little old friend, and let's see what this looks like.
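A minimal sketch of that ChainMap loop, assuming the raw frame is called `df` and the column is `media` (names taken loosely from the video, so treat this as an approximation):

```python
from collections import ChainMap
import pandas as pd

multiple_dict_to_single = []                      # will hold one flattened dict per row
for i in range(len(df)):                          # iterate over every row of the frame
    # df['media'][i] is a list of dictionaries; * unpacks them into ChainMap,
    # which merges them into one mapping (earlier dicts win on duplicate keys)
    merged = dict(ChainMap(*df['media'][i]))
    multiple_dict_to_single.append(merged)

media_df = pd.DataFrame(multiple_dict_to_single)  # one row per original entry
```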
Now, this is going to get us most of the way: it spreads everything out from the media column. Mind you, we're going to concatenate at the end and put everything back together, because this is just one column that we're parsing. And if you notice, we still have the media metadata that we need to take care of later. We can do this another way, and notice there are also rows that are completely null, which we'll take care of. Let's do a comprehension next.
Let's just call this merge; I found this simple little function online on Stack Overflow. We iterate inside the list and return a dictionary with everything, using a comprehension over the items inside our list of dictionaries. Now, I'll just call the result func_with_iterator and make it a list, because we're going to iterate over the DataFrame as usual for the media column. Then let's make a temp variable, iterate through each piece, and append everything. Call that out, put it into a DataFrame, and there we are: we get the exact same thing, done a different way, okay.
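A sketch of that second approach, under the same assumptions about `df` and the `media` column (the function body is the common Stack Overflow style merge; the exact one used in the video may differ slightly):

```python
import pandas as pd

def merge(list_of_dicts):
    """Merge a list of dictionaries into one dict via a comprehension."""
    return {k: v for d in list_of_dicts for k, v in d.items()}

func_with_iterator = []
for i in range(len(df)):              # iterate over the frame as before
    temp = merge(df['media'][i])      # flatten this row's list of dicts
    func_with_iterator.append(temp)

pd.DataFrame(func_with_iterator)      # same result as the ChainMap version
```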
What you could do is check how fast each of these operations is, versus how much memory it uses, with a profiler, and see which of the two is more beneficial, if you're curious or so inclined.
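For example, in a Jupyter notebook you could compare the two with the built-in timing magics (shown as comments since they're notebook-only syntax); for memory, the memory_profiler package is one option:

```python
# In a notebook cell (magics, not plain Python):
# %timeit [dict(ChainMap(*row)) for row in df['media']]
# %timeit [merge(row) for row in df['media']]
#
# For memory, after `pip install memory_profiler` and `%load_ext memory_profiler`:
# %memit [merge(row) for row in df['media']]
```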
Now let's do what I'd call a super crazy version. But don't get scared, it won't bite. Let's walk through it: I need to create a defaultdict, I'm going to create an empty list, and I'm going to iterate, like we've been doing, through this specific column. And what do I want to do? I'm going to check whether the length of this is greater than zero. What's going on here? Well, basically, each list can be empty or filled; if you look at the lists we're iterating through, some will have something and some won't. So if it has something, then we iterate inside of it, taking our iterator to go through each individual one, going inside a list of lists. We do this because each dictionary is now going to be converted into tuples using items(), which is useful for creating our defaultdict. I'll just leave a note here so we understand that we're converting a dictionary to tuples and iterating. We would like to append the index value as well as j, which is going to be our tuple, and you'll see why this is important in a second: basically we're trying to keep the frame of reference by pairing the index with the tuple. Because if a row is not like this, if its list is empty, we need to figure out a way to take care of that. So, and I'll make a note of this in a second, I'm going to create a tuple, zipping together the keys taken from the first element of the DataFrame column.
What I'm trying to do here is: for anything that didn't fall into the first branch, I need to create a tuple that has the same keys that are inside our dictionary, and then the values are going to be a placeholder multiplied by the length of the keys, so everything matches up. And I want to mention something very specific here: if you had dictionaries with mismatching keys, this would not be what you would want to do right here. Since I know that for this specific example it does work, I'm using it; if it did not, you would just have to do it a little bit differently.
Okay, so what's going on with this? We have our first index value, and we have the type, the subtype, the caption, the copyright, approved, the metadata, right, and it does the same thing for all of these. Now, let's skip down to number seven: this is what I wanted to match up, because when I create the defaultdict, here's my key, such as the type, and I want each value to line up properly; if I didn't do this else statement, we would have had an issue. So let's finish this up and create our defaultdict, where the first entry in the tuple is the key and the second entry is our value, throw this into a DataFrame, and see what that looks like.
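Putting that walkthrough together, a rough sketch could look like the following. It assumes `df['media']` is the column, that every non-empty dictionary shares the same key set (as noted above), and that the first row is one of the filled ones; the placeholder value for empty rows is my own choice here:

```python
from collections import defaultdict
import pandas as pd

pairs = []                                        # (row index, (key, value)) pairs
for i in range(len(df)):
    row = df['media'][i]                          # a list of dictionaries (possibly empty)
    if len(row) > 0:
        for d in row:                             # go inside the list of dicts
            for j in d.items():                   # dict -> (key, value) tuples
                pairs.append((i, j))              # keep the index as a frame of reference
    else:
        # Empty row: pair the same keys with a placeholder so columns still line up.
        # Assumes row 0 is non-empty and all dicts share one key set.
        keys = list(df['media'][0][0].keys())
        for j in zip(keys, [None] * len(keys)):
            pairs.append((i, j))

collected = defaultdict(list)
for i, (key, value) in pairs:                     # first tuple entry = key, second = value
    collected[key].append(value)

new_df = pd.DataFrame(collected)
```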
I made a small mistake at first, but there we go. This looks like we're on the right track, and we got the exact same thing once again. Let's scroll down; we got the same thing, that's perfect. Let's just call this new_df and clean this up a little so it's not ticking up on my page. So we did pretty well here. But I want to note something once again: if you have a list of dictionaries of varying lengths, you need to do one more step, all right. Don't forget that.
What you would consider doing is something like this: basically find the unique keys, which you could do with a set operation, so that you have all of the keys, and then do an if/else that says if the key is there, put its value, and if not, put something else.
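A hypothetical sketch of that extra step for mismatched keys (not shown in the video, just one way to do the set-of-keys idea):

```python
from collections import defaultdict
import pandas as pd

all_keys = set()
for row in df['media']:
    for d in row:
        all_keys.update(d.keys())          # union of every key we've seen

collected = defaultdict(list)
for row in df['media']:
    merged = {}
    for d in row:
        merged.update(d)                   # flatten this row's dicts
    for key in sorted(all_keys):
        # if the key is there, put its value; if not, fill with None
        collected[key].append(merged.get(key))

pd.DataFrame(collected)
```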
So that's basically what was going on with creating that tuple: it was there so the empty rows cooperated with the formatting and the types lined up. Now, the innermost list of lists, the media metadata, is very interesting. This threw me off and took me some time, and that's why this video is delayed.
So basically, what I'm doing is creating a function where I have my defaultdict once again, and I'm iterating through my DataFrame column of choice, checking whether it's a string: basically, if we don't have a list of dictionaries, then it's going to be a string value saying "nope" or whatever I declared, or some empty value or whatever. Otherwise we iterate through our dictionary, turn it into tuples again, use our defaultdict, and return that. Then I iterate inside the column, do the same type of comparison, and call in our function that has our defaultdict. I used an else statement because I wanted to keep the formatting, since entries number 7, 13, and 16, or something like that, do not have a list; they are strings. We take care of those by creating zipped tuples of our keys and values and chucking that together. Then I create a DataFrame and expand it.
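A loose sketch of that last piece, assuming the flattened media frame from before is `new_df`, the nested column is named 'media-metadata', and the keys are the four we saw earlier (url, format, height, width); the exact names may differ from the video:

```python
from collections import defaultdict
import pandas as pd

def metadata_to_columns(list_of_dicts):
    """Turn one media-metadata list of dicts into {key: [value, value, ...]}."""
    collected = defaultdict(list)
    for d in list_of_dicts:
        for key, value in d.items():      # dict -> (key, value) tuples again
            collected[key].append(value)
    return dict(collected)

expanded = []
for entry in new_df['media-metadata']:
    if isinstance(entry, str):
        # String placeholder rows ("nope", empty, etc.): zip the known keys
        # with the placeholder so the formatting stays consistent.
        keys = ['url', 'format', 'height', 'width']
        expanded.append(dict(zip(keys, [entry] * len(keys))))
    else:
        expanded.append(metadata_to_columns(entry))

url_info = pd.DataFrame(expanded)          # lists of three URLs, formats, heights, widths
```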
And so for each entry in this, remember, there were three URLs, three formats, heights, and widths. These last three columns always go together: you have this height and this particular width for the thumbnail, you have this format with this height and this width, etc. It was difficult to distinguish what's going on with these, but you can see that each one of them is different, all right. So I know the formatting is there, and I have this set up as a list of strings, so if you later decide to expand this out and flatten it, you have that option easily, all right. Let's just scroll back up, because I want to mention this: I decided to print these out so I can actually do a comparison. If you look at 10/3, 10/3, 10/29, 10/29, they're all different, because they have this different separator for these images or whatever they are, right. It looks like I did the formatting for everything properly.
Then we concatenate and put everything together: the original data, the new DataFrame with the media, and the expanded URL information. We end up with the same number of rows, so that checks out, but now you have 32 columns because it's expanded out.
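The final stitch could look like this; the frame names here follow the sketches above rather than the video's exact variables:

```python
import pandas as pd

# original frame + flattened media columns + expanded media-metadata columns
final_df = pd.concat([df, new_df, url_info], axis=1)
print(final_df.shape)      # same row count as the original, but many more columns
```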
All right, so let's scroll back to the top and look: here's everything that worked out from our nested stuff that I wanted to retain as a list, and I kept this just as a reference. Then these short columns right here came from the original media, all right, and the rest of them are just the original data. That wasn't too bad of a video. For last thoughts, I'd just like to say:
if you wanted to speed this up, consider using list comprehensions versus loops. Understand that there are times when index values actually do matter and you would like to preserve the order, so you need to adjust your code and remind yourself of that. And finally, people could write elegant code all the time, but it just comes with experience. Do realize that you may see elegant code, but it may not always be the most efficient and may not always solve the problem faster, okay. So don't get hung up on thinking your code has to look the best; just leave docstrings and notes so people understand what's going on. As always, thank you for watching this video. I hope I brought utility to someone. Please like, share, and subscribe, and if you subscribe, turn on that notification bell. I'll see you in the next video. Bye.