Welcome everybody, you're watching Mr
Fugu Data Science. I decided to compress a few of my nested
JSON video examples into one video showing you different
approaches you can take to solving these problems.
I would like to say thank you to the viewers who have given me some feedback
recently for future video topics; I greatly appreciate it, as well as the new
subscribers. Now just import pandas, json,
and datetime. We have four examples that I'd like to go through. Okay, let's zoom
in a little bit so we get a better idea. We have this first example,
where the outer layer of our JSON object
is entities, and the values are further dictionaries,
some of which are lists of dictionaries. The second example is entities too,
but it's nested URLs that we have, where we will use
recursion. The third example will be candidates, where each candidate
has a first name, last name, their skills (which are a list),
the state, specialty, etc. But notice this:
inside of our candidate we have another separate dictionary, which is human
resources related. We'll figure out how to combine both of these.
The last example will be taking the same information we had above, but
you don't see your key-value pairs for your first name, your last name, your
skills, etc. So this is a little deceiving, but we're going to look at how to
solve this where you have a tuple which has two different
dictionaries hidden inside of it. You still have
your list of skills. Now, these won't be shown in any particular order from
first example to last. One of the techniques you could utilize
for solving these problems is called json_normalize, where you can take some
kind of semi-structured data and basically flatten it out. So let's look
at our first example, for entities, which has our hashtags,
symbols, etc. from a Twitter example. Let me zoom out a
little bit. Now let's look at what this evaluates to
if we just use the regular DataFrame and don't do
any processing. Okay, this looks
kind of like an issue, right? If you notice, you have your outermost
part, which is your entities, and then inside of that is what looks like
a nested table, in a sense, where it has
levels. We have to figure out how to approach this.
Let's do a little bit of formatting here. Instead of setting it up like this,
what if we did this another way and called the entities? How's that going to
print out? Uh oh, we got this: all arrays must be the same length.
Well, let's handle this differently.
Let's create a dictionary with a comprehension that takes the key and
your value, where each value becomes
a new pandas Series. We need to iterate over our key and
our value in example one's data, and I'm going to
call in the entities so we can go inside of it. But I need to
convert this to items. Let's see what this looks like.
Okay, so we have our one row; it looks like we're making some progress.
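As a rough sketch of that step: the entities dictionary below is made up for illustration (the field names and values are assumptions, not the data shown on screen), but it reproduces the length error and the Series-based workaround.

```python
import pandas as pd

# Hypothetical stand-in for the tweet data; the real example has more fields.
data = {
    "entities": {
        "hashtags": [{"text": "pandas", "indices": [0, 7]}],
        "symbols": [],
        "urls": [{"url": "https://example.com", "indices": [8, 31]}],
    }
}

# Passing the lists straight to DataFrame fails because their lengths differ.
try:
    pd.DataFrame(data["entities"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Wrapping each value in a Series lets pandas pad the shorter ones with NaN.
df_entities = pd.DataFrame(
    {key: pd.Series(value, dtype="object") for key, value in data["entities"].items()}
)
print(df_entities)
```

Each cell still holds a whole dictionary at this point, which is why more flattening is needed.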
But we need to further expand on this. So how are we going to deal with
this? Let's try something else: let's flatten this out with
json_normalize, but first convert our data into a JSON string. Let's take it in,
convert it to JSON, and there we go. Now we've split it up
and got within our nested data. Let's take this apart and figure out what's
going on. Let's take this first part and see what happened. You took this,
converted it to JSON, and oriented it to be in the records
format. Then we read that JSON string back in
here with json.loads, and then we used normalize so we can flatten it out.
Let's look at our next example, which would be example three for the data that
we're using, but is in reality the second example of parsing our data. So we have
our candidates, which have nested values, which also have
nested key-value pairs. If we print this off first
as a DataFrame, we'll see that we end up with two
columns of dictionaries. And you say: well, this is interesting.
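The round trip described a moment ago for the entities example (DataFrame to a JSON string in records orientation, back through json.loads, then json_normalize) could be sketched like this. The data is a made-up stand-in, and the same pattern gets reused on the candidates data.

```python
import json
import pandas as pd

# Hypothetical entities data, already wrapped into a one-row DataFrame.
entities = {
    "hashtags": [{"text": "pandas", "indices": [0, 7]}],
    "symbols": [],
    "urls": [{"url": "https://example.com", "indices": [8, 31]}],
}
df_entities = pd.DataFrame(
    {key: pd.Series(value, dtype="object") for key, value in entities.items()}
)

# Step 1: serialize the frame as a JSON string, one object per row.
as_json_string = df_entities.to_json(orient="records")

# Step 2: parse that string back into plain Python lists and dicts.
as_records = json.loads(as_json_string)

# Step 3: flatten the nested dicts into dotted column names.
flat = pd.json_normalize(as_records)
print(flat.columns.tolist())
```

The nested keys come out as columns such as hashtags.text, while list values like the indices stay as lists inside a single cell.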
How am I supposed to actually deal with this? Let's do that
json_normalize once again and see what happens here. Let's take in,
I don't know, the human resources related part. Okay, so it looks like
we could play with this. How should we manipulate these data? We're going to
do the same thing as we did for this, except here this is going to be our
df3 and this instead is going to be the candidate. Okay, but it looks a little
odd right here, if you notice. So we obviously need to
do a little bit of formatting. Let's change this up
and do a reset of the index, and then we're going to set the
level equal to zero, okay. But that doesn't look like it did
very much, now did it? So let's check this out, all right.
There we go. We just needed to transpose that. I'll just call this
df3 candidate as the variable name, and
then we need to put this all together. So how
are we going to do that? Let's concatenate
and take what we just created as well as the json_normalize that we did above.
So let's just do that real quick for the df3
human resources related part. But we need to do this along the columns,
with df3 candidate, and there we go. Now we've got it put together. But we
could also do this a lot more easily. So let's do this:
we'll do it for the HR and we could do it for the candidate,
instead of going through the rigmarole, as one other way of doing this.
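Here is a minimal sketch of both routes, assuming made-up candidate records. The key names, the values and the normalize_column helper are all hypothetical, and the index reset and transpose from the walkthrough are left out.

```python
import json
import pandas as pd

# Hypothetical records shaped like the candidates example.
records = [
    {
        "candidate": {"first": "Ana", "last": "Lopez", "skills": ["python", "sql"]},
        "HR related": {"salary": 90000, "hire_date": "2019-15-03"},
    },
    {
        "candidate": {"first": "Ben", "last": "Kim", "skills": ["r"]},
        "HR related": {"salary": 85000, "hire_date": "2020-01-07"},
    },
]

# A plain DataFrame gives two columns whose cells are dictionaries.
df3 = pd.DataFrame(records)

def normalize_column(series):
    """Round-trip one column of dicts through JSON and flatten it."""
    return pd.json_normalize(json.loads(series.to_json(orient="records")))

# Long way: flatten each column separately, then join along the columns.
df3_candidate = normalize_column(df3["candidate"])
df3_hr = normalize_column(df3["HR related"])
long_way = pd.concat([df3_candidate, df3_hr], axis=1)

# Short way: json_normalize can flatten the original records in one call.
short_way = pd.json_normalize(records)
print(short_way.columns.tolist())
```

The short way keeps the parent key as a prefix (candidate.first, HR related.salary), while the long way gives bare column names.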
We can just use json_normalize directly because of the formatting that we have
for these data, if we recognize the formatting. What does
pandas explode do? If you have a situation where your column
is a list of values, such as this skills column, you can
expand it. If you notice, all your rows are
duplicated, and all it's doing is expanding the column of your skills.
If that's something that you need to do, you can.
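A small sketch of explode on a list column, using a made-up two-person frame:

```python
import pandas as pd

# Hypothetical frame where the skills column holds a list per row.
df = pd.DataFrame(
    {
        "first": ["Ana", "Ben"],
        "skills": [["python", "sql"], ["r"]],
    }
)

# explode gives each list element its own row and repeats the other columns.
exploded = df.explode("skills")
print(exploded)
```

The original index is repeated as well, so Ana's two rows both keep index 0 unless you reset it.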
Using explode, if I had a list of dictionaries it would remove
the list, but I would still have a column of dictionaries;
be aware of that. In some circumstances it's good to use explode to get rid of
the list of dictionaries and then figure out how to deal with that accordingly.
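To illustrate that caveat, here is a sketch with a hypothetical jobs column holding a list of dictionaries. Following explode with json_normalize is one possible way to deal with the leftover dictionaries, not necessarily the one used on screen.

```python
import pandas as pd

# Hypothetical frame: one person with a list of dictionaries in one cell.
df = pd.DataFrame(
    {
        "name": ["Ana"],
        "jobs": [[{"title": "analyst", "years": 2}, {"title": "engineer", "years": 3}]],
    }
)

# explode removes the list, but each cell is still a dictionary.
exploded = df.explode("jobs").reset_index(drop=True)

# One way to finish the job: flatten the dictionaries and join them back.
job_columns = pd.json_normalize(exploded["jobs"].tolist())
flat = pd.concat([exploded.drop(columns="jobs"), job_columns], axis=1)
print(flat)
```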
We can create dummy variables, and let me explain what's going on with this.
If we have the column of skills, which is a list, we can just expand that into
multiple columns and use dummy variables of ones and zeros to represent
whether each skill was included or not. So let's take that into consideration
by doing json_normalize, which is so fun. Taking json.loads,
we're going to take the last thing that we used,
convert that to JSON, and then as always we're going to orient it towards
records, at least in this example. Here we go. This is looking somewhat
promising, except for the column names, which are kind of a nuisance.
Let's get the column names corrected. We're going to change all of the
columns to first name, last name, skill, state,
experience, specialty, relocation. So let's look at
what our df looks like now. Okay, here's our data. But it looks kind
of funny, so let's transpose these data. That looks a little
more appropriate. Let's change all the column names for
these data. Let's look at that. Now, let's
expand on these data, where we're going to convert this into
dummy variables for added columns. Okay, so that's looking
promising. But now we need to combine these data.
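The dummy-variable expansion, together with the concat step that comes next, might look like the sketch below. It assumes a made-up frame and uses explode plus get_dummies, which is one common way to do this and may differ from the exact calls typed in the video.

```python
import pandas as pd

# Hypothetical frame with a list-valued skills column.
df = pd.DataFrame(
    {
        "first": ["Ana", "Ben"],
        "skills": [["python", "sql"], ["r", "sql"]],
    }
)

# One row per skill, then one 0/1 column per distinct skill.
one_hot = pd.get_dummies(df["skills"].explode(), dtype=int)

# Collapse back to one row per person: 1 if the person has the skill at all.
dummies = one_hot.groupby(level=0).max()

# Join the dummy columns onto the rest of the data along the columns.
combined = pd.concat([df.drop(columns="skills"), dummies], axis=1)
print(combined)
```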
So we're just going to use concat again. Let's take our dummy variables, and
we're going to join both of these DataFrames along the
columns. Now we've expanded everything. So if you
needed a situation like this, you're covered. Now let's do some
recursion. For instance, if we had to keep going further in
and make sure we get all of the keys, that would be a scenario where you would
use this function. What I'm doing with this function is
looking to see if we have a dictionary. If we have a dictionary, we're going to
iterate inside of that. We're going to do a recursive call to
see if we need to keep moving further inside to find more keys.
And we're going to separate each of these keys with this dot
notation, so you can understand that you're going inside of it.
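A recursive flattener along these lines might look like the sketch below. The function name flatten_json, its parameters and the sample data are all assumptions, and it already includes the list handling that the narration gets to next.

```python
def flatten_json(obj, parent_key="", sep="."):
    """Return a flat dict whose keys are dotted paths into obj."""
    items = {}
    if isinstance(obj, dict):
        # Dictionaries: recurse into each value, extending the dotted path.
        for key, value in obj.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
            items.update(flatten_json(value, new_key, sep))
    elif isinstance(obj, list):
        # Lists: recurse into each element, using its position in the path.
        for position, value in enumerate(obj):
            new_key = f"{parent_key}{sep}{position}" if parent_key else str(position)
            items.update(flatten_json(value, new_key, sep))
    else:
        # Anything else is a leaf value, stored under the path built so far.
        items[parent_key] = obj
    return items


# Hypothetical nested URLs, loosely modeled on the second example.
nested = {"entities": {"url": {"urls": [{"url": "https://example.com", "indices": [0, 23]}]}}}
flat = flatten_json(nested)
print(flat)
```

In this sketch empty dictionaries and empty lists produce no keys at all, which may or may not be what you want.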
Then, if we're dealing with any list for our values, we're going to
iterate inside of that and run a recursive call
to see what's inside of that list. And then finally we're going to
append everything that we found, for formatting. And this is what we end up
with: here are all of our keys, here are our values. We can create our
DataFrame from this example. We're going to say it's from a
dictionary, and we're going to call in the function that we just
created. But we're going to orient this along the columns this time.
Let's see what this looks like. Ah, we've got to do a
transpose. Okay, that's looking pretty good. But what else do we need to do with
this to make it look appropriate? Well, these are our column names, which are
in the first row. So let's deal with that: the
columns will equal the DataFrame's first row.
Okay, now we just need to drop that row. So we're going to call on the index of
that, which is the first row. All right, now let's change the
column names, which are becoming quite a nuisance
for us. So let's just call in those columns.
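The clean-up steps around this point (build a frame from the flattened keys and values, transpose, promote the first row to the header, drop that row, then shorten the dotted names) could be sketched as follows. The flattened dictionary is made up, and building the frame from key and value pairs is an assumption about how the on-screen frame was constructed.

```python
import pandas as pd

# Hypothetical output of a recursive flattener: dotted path -> value.
flat = {
    "entities.url.display_url": "example.com",
    "entities.url.expanded_url": "https://example.com",
    "entities.description.text": "hello",
}

# Keys and values as two columns, then transposed so keys sit in the first row.
df = pd.DataFrame(list(flat.items())).T

# Promote the first row to the header, then drop it from the data.
df.columns = df.iloc[0]
df = df.drop(index=0).reset_index(drop=True)

# Keep only the last part of each dotted name.
df.columns = [name.split(".")[-1] for name in df.columns]
print(df)
```

Shortening names this way only works when the last parts are unique; otherwise you would end up with duplicate column names.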
Rename them with our current names that we just made,
and then let's drop... There we go, but we need to drop that
first row like we did before. There we go. Now we've got everything
expanded out with the proper names, without all the
dot notation, if that's what you wanted. Let's look at this example,
which is a little misleading: it doesn't have your key-value pairs
like you've seen for the first and last name and the skills above.
But this is a little tricky and not straightforward,
so pay attention. You still have this dictionary that you need to deal with.
You can keep this or change this, depending on what you want to do with it.
Let's take our df, and I'm going to say
df4 to list, and create a DataFrame from this. I'm converting these values to a
list so we can manipulate this, and you're
going to see it's nothing too special. Now let's do some json_normalize:
let's take in what we did before, convert it like usual, and orient it so we
have this as records. Let's look at that. Now we've got this
situated, and it's looking a little promising. We can
combine these together along the columns, change the column
names, and drop the
column that had these mixed together. This looks quite good.
Now, if we're going to do some simple iterating, let's figure out how we can
deal with this. Let's create an empty list,
go through that example number two, and append to it.
There we are. Now, this looks all right, but there's a problem here. We have
this second row that's not needed, and this hire date and the salary
and healthcare are all part of this one entry. Except, when you look at
evaluating this, it's two separate dictionaries.
So you need to do some further manipulation. We need to separate this into
the first and second components and
concatenate them so we can get this aligned properly.
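Here is a sketch of that split-and-concatenate idea, assuming each record is a tuple of two dictionaries. The data and key names are made up, and the real fourth example may be laid out differently.

```python
import pandas as pd

# Hypothetical records: each person is a tuple of (candidate dict, HR dict).
records = [
    (
        {"first": "Ana", "last": "Lopez", "skills": ["python", "sql"]},
        {"hire_date": "2019-15-03", "salary": 90000},
    ),
    (
        {"first": "Ben", "last": "Kim", "skills": ["r"]},
        {"hire_date": "2020-01-07", "salary": 85000},
    ),
]

# Appending every dictionary in turn gives two rows per person, half empty.
naive_rows = []
for pair in records:
    for part in pair:
        naive_rows.append(part)
naive = pd.DataFrame(naive_rows)

# Splitting into first and second components keeps one row per person.
candidates = pd.DataFrame([pair[0] for pair in records])
hr = pd.DataFrame([pair[1] for pair in records])
combined = pd.concat([candidates, hr], axis=1)
print(combined)
```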
Take this first one. Let's look at this first.
Okay, we have our first part, which is the candidates,
which we're going to put in. The second part that we're going to place
will be the first parameter of t, and we need to do this along the axis
of one, which is your columns, as usual. And now we've spread everything out
properly. So the salary and the hire date are
taken into consideration for this one entry because it's for one
person. Now, if we were to iterate through this
differently, and instead of doing it like this we did this, you'd end up
only taking the keys and not going inside of the nested portion
here, okay. We can also parse the dates of this. I combined all of that prior into one DataFrame, which you see now.
And we could parse by date. We just need to call this in. I'm going to
take the hire date, fix it, and change this into
datetime formatting for that column. And I need to format this where
we're taking in the year, day, and month, using this notation.
Let's see how that looks. Oh, that's not going to show anything.
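A minimal sketch of that conversion, assuming the hire dates are strings in year-day-month order separated by dashes. The column name, the separator and the sample values are assumptions.

```python
import pandas as pd

# Hypothetical hire dates written as year-day-month strings.
df = pd.DataFrame({"hire_date": ["2019-15-03", "2020-01-07"]})

# The format string tells pandas the middle part is the day, not the month.
df["hire_date"] = pd.to_datetime(df["hire_date"], format="%Y-%d-%m")
print(df.dtypes)
```

Assigning the result back to the column produces no output by itself, so you have to print the frame or its dtypes to see the change.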
From there we can take our hire date and start parsing through it. Let's just
look through this. It looks the same, right? Now let's
parse through this and take the year. And you're going to
use something different here: you're going to use dt.year,
and let's see what we end up getting back. And there's your hire year. We can
take this and cut and paste it a couple more times
and find different things that we're looking for. We could look at the month.
But not just the month: we could actually get a month name. We can cut and paste the same thing again.
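The dt accessor calls mentioned here, plus the day-of-week lookup that comes up next, could be sketched like this on made-up dates. The new column names are hypothetical.

```python
import pandas as pd

# Hypothetical hire dates, parsed from year-day-month strings.
hire = pd.to_datetime(pd.Series(["2019-15-03", "2020-01-07"]), format="%Y-%d-%m")

# Each dt accessor pulls one component out of the datetime column.
parts = pd.DataFrame(
    {
        "hire_year": hire.dt.year,
        "hire_month": hire.dt.month,
        "hire_month_name": hire.dt.month_name(),
        "hire_weekday": hire.dt.day_name(),
    }
)
print(parts)
```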
We can get the day of the week. It depends on what your circumstances
are. But this is how you could parse this, considering the circumstances that we have. I hope
these few examples give you some confidence in parsing
JSON or nested JSON data. As always, I would like to say:
I hope this brought utility to someone. Please like, share, and subscribe.
And if you subscribe, turn on that notification bell. I'll see you in the
next video. Bye.