Economic Data Analysis Project with Python Pandas - Data scraping, cleaning and exploration!

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
in this video we're going to explore some real world economic data using python and pandas we're also going to pull down that data directly from fred which is a free resource using their python api i'm really excited about today's video i think that you'll learn a lot about python pandas plotting how to use web apis for pulling down data and exploring it within python this is an incredibly rich data source and there's so much that you can do with it hopefully this video will inspire you to start your own project using the same data source now everything i'm going to be doing is in a kaggle notebook which means that you can just copy the notebook and not have to set up anything yourself all you need is a kaggle account my name is rob i make videos about data science coding in python and machine learning if you like this video please consider subscribing so you'll get an update next time i release a video okay let's start looking at some economic data okay so here we are in a kaggle notebook like i said this is something that you can copy and walk through the code yourself very easily now kaggle does have a lot of data sets already on the website that you can take advantage of but i want to show us how we can use the fred api to pull down a lot of this data because it's a really powerful tool that you can use in the future and have up-to-date data sets that are already available on the fred website and we're going to show you how to do that but first we need to make some imports to use and of course we're going to import pandas as pd we're going to import numpy as np and for visualizations we're going to import map matplotlib pi plot as plt and import plotly express as px we're also gonna import oh let's do some styles so plot style use and this is going to give us a default plotting style for all of our plots we're going to use 5 38 it will look nice for this data set um and then i want to actually import the fred api uh python package but it's not available by default in the kaggle notebooks but that's not a problem we can just do a cell with an exclamation point and pip install it from here so we're going to pip install the fred api and if i normally ran this it'll output all the results from the install but i kind of want to suppress that so i'm going to output this to dev null which will just um make it so we don't have to see all the outputs of the pip install but while that's installing it's actually loading up our notebook we can see it's run i'm gonna also import from fred so from this fred api i'm gonna import fred and what else should i do i'll do this pet set option max columns is 500 and this will allow us in pandas to see if we have a really wide pandas data frame we'll be able to see up to 500 columns instead of them hiding those midi middle ones by default and then i'm also going to add a color palette so color palettes gonna be uh something that i just pull from uh matte plot lib the nice thing about this color palette is it'll just be a list of colors that we can use to change the colors in our plots so they all don't look the same color and this line of code i use a lot because it just pulls the default colors from whatever theme we're using in matplotlib let's go ahead and run that now let me just show you before we get too far that this is the website fred you can see that they have a bunch of different data sources it's all time series data financial and economic data that is readily available and they have their own api so if i search for fred api we can see here that their api has a lot of details and if you want to request the api you need to get an api key on their website you can request it here it's pretty easy to do and it's basically just a code they're giving you an example here of abcde but it's a string of numbers and letters that you will be unique to you and then you can use it to pull the data directly into pandas and instead of having to scrape the website or download csvs and stuff like that which we want to avoid so this step you'll have to do on your own but it's pretty painless to do and once you do it you'll have basically a string that you can paste in here and you will say fred key will be this and i'll say put your key here but because i have my own key i'm gonna actually pull that in from kaggle's secrets i've actually saved um here in this add-on i've saved a secret my fred api key as a secret and i can use this line of code to load that fred api key quickly and in a way that no one else can see it because i don't want anyone else to use my api key and hit fred's website over and over again okay so what are we gonna do first we're gonna create create the fred object and this will allow us to pull in all the data so we create the thread object by using this thread that we imported and you can see if i do shift tab here inside of this fret object it'll tell me the doc documents for it and we basically just need to provide it this api key so i'm going to explicitly say our api key is this fred key again this will be the key that you get when you apply for it and we're going to call this our fred object and this is just the main object that we'll be using to doing all our searching and pulling down of data from the fred website and all right so that now has it create created we're going to search fred uh search for economic data so two of the main things that fred allows us to do is one search and number two is pull down the actual data the search is the same as if you were on the website and you were to search in that search bar it'll give you the results but in a pandas data frame and the pulling of data is similar to if you had clicked on download csv on the pr fred website but we're doing this all through python so it's really nice we don't have to mess with any of the uh web interfaces so let's go ahead and search and you can see here there's actually a search by release search by category but we're just going to do a general search and then we're going to search for s p because i know that s p 500 is on fred and this will give us an example of what the results look like all right so the result is a pandas data frame that shows us the series ids that are available on the freb website that we can pull from the id name some of the information about when the dates data starts and ends the title of it and then also the frequency that the data is gathered there's also it some information about if the data is seasonally adjusted or not so some of the data sets they'll adjust it for the season seasonality but this is just uh not seasonally adjusted the raw smp numbers and then it the notes tell us a more descriptive detail of what we're searching for now there's also this popularity ranking on fred so if i make this result smp data frame smp search this is our search results data frame and i want to actually sort this as it comes in oh another thing here is this s p search shape you'll see is a thousand rows of results the reason why it's exactly a thousand is because the fred search has a limit on it and this is a limit just to make sure we don't pull way too much data that we don't expect to from fred and the limit defaults to one thousand so you could change that limit if you wanted to you can also uh say change the sort order so let's do the sort order by the popularity this will give us popularity so let's do order by the popularity so now we'll look and we'll see that our search results are going to be sorted sorted by the popularity of the results for everything with smp and uh yeah so now we have the the this data frame we have all the siri different series ids that we might want to pull from all right so i'll leave the head of this there and next thing we'll do is actually pull down some data data pull raw data so let's try to actually pull in the raw data for one of these series let's just pull this id smp which is actually the smp index value coming in daily and we can do that using the spread object that we've created by using git series now there's a few different ways that we can pull in these series but we're going to only focus on the main git series for now if there were different versions of the series then you might want to look back at a historic version but we're just going to pull in this we know it's called s p 500 and then we could give it an observation start time and end time but we're just going to provide it the series id which is s p 500 and pull that in and what we see here is now we have a panda series with the index as the date time column and the values as the price of that s p 500 so let's call this s p 500 that's our series and let's do some quick visualization just to make sure the data is what we expect it to be so we'll take this series and we'll run a plot on it pandas is nice because it'll just let us plot right out of the box and there are a few things i want to do to this plot to make it a little easier to read so let's change the fig size to be a little wider i like to do this in financial or time series data because usually the data is going from left to right and your eye kind of wants to see across the dates so making it a little wider helps let's also add a title called s p 500 and then let's um yeah let's just see what this looks like okay so it looks a little wider and bigger i think this is nicer looking and then another thing i want to do is change this line width of the line because it's a little too large here to really see the detail and we can change that using lw parameter in our plot okay so maybe it's a little too skinny there maybe a little bigger and plot so there we go did did we pulled in the raw data and we plotted it let's actually make this in plot great so number four what we're going to do is we're going to pull in a few different data time series data sets from fred and we're going to merge them and see how those things interact and there's a bunch of different interesting things we can pull from fred so let's take let's check it out and see what we have okay so the next thing we're going to do is try to pull in some data about multiple data series and then compare them side by side so we want to pull in some data that we have maybe monthly information about and maybe something that we have data about for each state so since fred has a lot of information about unemployment let's go ahead and try to pull in the unemployment rates and we're going to use fred search to do that we're going to search for unemployment and we are going to order let's just go ahead and do this results so let's call it unemployment results and we can see here that there are a handful of different unemployment data sets that we can pull in some are monthly some are in percentage and you can see some are seasonally adjusted and others are not let's go ahead and pull this seasonally adjusted monthly unemployment rate which is called on rates here and we can do this by doing our fred get series for on rate and call this unwrite now we can see we have the unemployment rate seasonally adjusted again just to quickly plot and see what it looks like there we go a little big jump here when covet 19 hit and then let's also look if we can see in this unemployment rates if we can get anything specific to a state so it looks like there are some states data in here if you look here's ohio and georgia so maybe let's search for fred unemployment state there we go so we have unemployment rate in georgia seattle a bunch of them and um we can also let's take the results and see how many there are there's a thousand rows so let's go ahead and filter this down a little bit so filter and we can see that this filter expects a tuple something like frequency and seasonal adjustment so we're going to filter down to the frequency as monthly this will only return results that have a frequency of monthly and let's search for unemployment rate all right so this is our unemployment data frame let's put this up here and we're going to want to filter this down and get the series for just a subset of this since this has um this has more than just unemployment rate we're gonna actually filter the data set on the title column so let's go ahead and try to find just monthly not seasonally adjusted values and percentage and we're going to run a query on this unemployment data frame in order to do that so let's look in the seasonally adjusted column we're going to query in the seasonally adjusted column and only pull when it's has seasonally adjusted is true or as as the value now how many values are we down to 309 rows and uh it looks like there are some these last ones that we do not want to include we want to make sure that the units is in percent so let's do this and units equals percent and the shape on this 156 and what we're going to do here is we're going to take this unemployment data frame that's filtered and save it off and then take this unemployment data frame and we're going to search for only when it has unemployment rate in the type title and the way we can do that is by looking at the title and applying a string filter on this so when let's do string contains unemployment rate so this will filter out any of these results that do not have the actual unemployment rate in the title and then we'll locate where that is true and now we have a data set that is much smaller but we don't know what size so let's run a shape on this 54. so we've got alma just about the 50 states probably washington dc is in there and in that main overall unemployment rate is there okay so now that we have all the search results that we want to pull the raw data of let's put this all together let's delete these since we don't need them anymore and call this our unemployment date metadata data frame so the next thing we want to do is actually go through each of these ids and pull the series data down and the way that we'll do that is we're going to actually have to loop through each of them and then run that get series fred command on it but it's not that hard let's try it out by doing the unemployment data frame index now the index are all of the different the index has all of the different ids that we want to pull down let's look at the length of this 54 yeah so for my id in these we're gonna wanna pull this series and we're gonna do this using the get series for that each id and then we're going to save this as our results now i'm going to put a break in here after the first loop just so we can see what the output of this is so the output is this series with the date index and the values as um as the value of the which is the unemployment rate and then we're we're going to want to join these all up together eventually so let's go ahead and make this to a data frame and we can do that by running the two frame command on this panda series and by default the column name is going to be a zero but we don't want it to be zero so we're gonna actually give the name equals my id so now we have a pandas data freight frame with the unemployment rate as the title of the column and the index as the date time so we're going to go ahead and do this make this results again and then we're going to make a list that will store all these results in called all results and every time we loop through we'll append this list and let's let that run okay so that's done running it's pulled for all of those 54 different unemployment rate features that we wanted to pull and now they are in if we look at this type of this all results they are in a list and each of the values in the list has one data frame with each of these different looks like california florida texas new york all of the unemployment rates it each is its own series or pandas data frame okay so now we want to take all these results that we have that are stored in a list and actually concatenate them together and we can do that pretty easily using pd concat on this list and we're going to make the axis one to make sure that they're all stacked side by side now some of these the don't have the values for the same dates and they'll just be filled in with null values for those dates so i'm going to run this and you can see that actually the unemployment rate goes way back to 1948 but the state level ones didn't start until later so those values are all null up until recently when we have values for each state but that's just a good thing to know there's also these two these two extra ones that i'm going to drop because i don't want this i'm not sure what these are so we're going to drop these last two i do want to drop these last two but uh let's just double check and see what they represent so if we do a tale of two on our unemployment data frame metadata we can see that um this is a series i think that is uh has been discontinued and the other one is the business cycle developments i don't think we want either of these so we're gonna drop both of those two just to only have state and overall data we'll drop that with an axis of one and this is our gonna be our unemployment results let's go ahead and save that move that over here so it's all in the same cell and now if we go that on results we can see that we have a data frame with all of the unemployment data that we might want to see and let's go ahead and use a different plotting tool to plot this so i'm going to actually drop the first one because it's not a state wide unemployment rate and we're gonna also drop n a axis equals one and we're let's just call this unemployment states and we're going to take your unemployment states and we're going to check to see where the null values are so we're going to say is an a and then a sum with the axis equals one this will tell us how many unit missing values we have for each date and we'll plot this just to make sure that we can drop this so yeah it looks like this these are the dates prior to when they started they prior to when they started provided monthly unemployment rates for states so we're gonna go ahead and that was just a track check to make sure we can drop n a and we'll go ahead and save this as our unemployment states again okay let's use plotly express to plot this as a line plot and we're going to do px line and we're going to give it the data frame of unemployment states and then we are going to yeah go ahead and plot this so the nice thing about using plotly express now is that we actually have an interactive plot where we can filters or certain states by double clicking on them we can show multiple states next to each other so yeah here's new york versus utah you can see the difference in unemployment rates and it's a lot of insights that you can put from this just by clicking around now i do have a dark theme on my web browser right now that makes it a little hard to read but um if you were in normal mode you'd be able to see this with a little bit clearer so this is plot states unemployment rate very cool and we know that there was a big jump in unemployment here in april of 2020. maybe an interesting thing that we could look at with this data is actually pulling each state's unemployment rate on that date and we can do that uh pretty easily let's let's title this pull april 2020 unemployment rate her state we'll take this unemployment states and we'll take where the index is equal to 2020.0401 and then we'll locate where this occurs and let's make this a bar plot so let's let's uh first transform it what this will do is this will flip the whole data set so that we have the value each column is now as different state and then we're going to sort the values of this date column now we have them in order of lowest to the highest unemployment rate and on that month and we can go ahead and plot i'm just putting this backslash so i can split up my lines here i like to split up my lines and we're to do the kind equals bar um with the fig size is 10 by 5. and you can see here now we have a bar plot with all the different states and their unemployment rates let's say the title is on unemployment rate by state april 2020 there we go a few other things is we want to remove this legend for this column so i'm just going to save off this plot as its axis and do legend remove from this which will remove the legend you can see the highest one here is nevada it looks like nevada was obviously hit the most since they have a lot of travel and wyoming was had the least highest unemployment rate in april of 2020 one thing i'd like to do is since each of these the text is not really easy to understand since this is the fred index id name for each of these let's try to switch this for the actual state name and this shouldn't be too hard what we're going to do here is take our results we're going to take our go back to this metadata and we see here that for each id there's also the column that's called title so the title looks like they all have what it says unemployment rate in and then the state name and the full state name so we can make a new column called state name by doing a string on this column and then strip and we strip out where it says unemployment rate in actually we're going to do a replace of unemployment in and replace it just with an empty string and now we have all of the ids matched up with the state names or in the case of this on rate it actually links to unemployment rate and then we're going to make this into a dictionary and you'll see why this dictionary will be helpful in a second because we can actually map all these names of the ids to the state name so we're going to call this id to state so we're going to take this mapping dictionary that we've created and leverage it to rename the columns in the data frame that we have above called unemployment states and i'm going to show you how we'll do that here right very easily so we'll take the unemployment states data frame and we'll take the columns from it by doing dot columns and we're going to use something called list comprehension here it's really handy in python if you've never used it before but we can actually go through each of these values in the columns and apply um the mapping to it so if we did c for c in the columns we'll just get a list of each of the state's ids in each column but we want to transform it so we're going to actually instead of doing c for c in unemployment rates we're going to do the id to state mapping value which will give us the state name for each state in the columns and then we can just write over what the column names are with this and voila now if we look at unempty states we have the actual state names here as the title so i'm going to go back and use this new use this new version of the unemployment states data frame to make our plots from before let's go up here to where we made it originally use this id mapping and redo our line plot there now we have full state names here which is nice and then we'll have this plot again i actually think this would look better as a horizontal bar chart so i'm going to change this from bar to bar h and yeah now we can read the states a little bit better and let's do the width as one this will make it fill out a little bit more maybe 0.9 i'm also adding in an edge color as black which is nice because then you can kind of see the outline of each a little bit better and i like this i think this looks nice and it shows us each state's unemployment rate at this date so x label now is percent unemployed that's nice another interesting economic metric that we can look compared to the unemployment rate is something called the participation rate and that's just another indicator of people who are participating in the job market so let's try to pull the participation rate similar to what we did before and we're going to actually reuse a lot of the code that we wrote before that's the nice thing about writing it in python and pandas is we can reuse the same ideas over and over again and not have to redo things like we would in a spreadsheet like we did before with unemployment rate let's search for the participation rate we're going to make sure it's monthly and that it seasonally adjusted in a percentage so let's see what the result is when we pulled this we have the same deal or we have a bunch of results what's the shape of this it's 51 so it's each state's result now let's do the same mapping that we did before where we would take the title and we're going to take out where it says labor force participation rate 4 and make a dictionary for this and that'll be our participation id to state and let's do our loop through where we'll get and combine all of our results for the unemployment similar to what we did with the unemployment rates and we'll call this states we're going to loop through instead of the unemployment rate the participation results and this should look good i don't think we need to drop this anymore since it's not in in there now that that's done running we're just going to do the same thing with the participation with states column renames okay so now we have the participation data for each state over time similar to like we did with the the unemployment data let's put this up here and let's think of uh maybe an interesting plot we could do with the combination of both the participation data and the unemployment data and that'll be our final plot so let's try to plot the unemployment rate versus the participation rate for each state over the years 2020 and 2021 i think we can do that pretty easily so let's um let's just start by taking an example of one state and we'll start with the unemployment rate remember we called this unemp states so we take this and we'll take new york and we'll query where the index is greater than 2020 greater than or equal to 2020 and the index is less than 2022 and we'll plot this so this is the unemployment rate for the state of new york for these two years and then we'll also we're gonna actually take this and make a figure and plot this on that axis of this figure and we'll do the exact same thing with our participation so you can see up here is the participation down here is the unemployment we can actually make a second axis off of this by using something called twin x and then plot the second plot on that second one let's make the color of this in our color palette the second value and there we go we have the unemployment rate versus the participation rate let's turn the grid off on that second one there we go and let's also put a title in for this axis set title and this is going to be new york but we want to do this for all 50 states and we can do this by creating this exact same plot but within subplots of a mi a bigger figure so the way we do this is we'll do um let's do 10 by five and let's also make the fig size as ten by ten and now these instead of an axis will be accesses there are going to be multiple axises that this will provide us and that will loop over we have to make sure we flatten this and now what we have here are a bunch of different accesses that we'll be able to plot on it's hard to see here but there's a bunch of different 50 different um accesses that we can plot on and we'll actually iterate over each column in our unemployment states to get this so let's take our columns which is each state for state and these and we're going to do everything we did before but instead of this we'll take the axis of the index location so this is our index location or i will make it start at zero and we'll increment it each time and then this is going to be our state name and the column that we'll pull in is our state column not just new york and at the end let's do plot.show so it looks like the unemployment states has a different name for the district of columbia it looks like district of columbia has the word the in it and the participation column is district of columbia so we'll just replace this column name by doing a rename rename of the columns and we can provide this a dictionary of the district of columbia for this unemployment states and rename it as district of columbia and we'll overwrite this call it fix dc and try rerunning this again since we only have 50 spots here we're gonna skip the district of colum columbia so so if the state equals district of columbia we're just going to continue here right and we have to actually change what we're referencing here for the axis and every part of this plot or else it won't work now things look a little jumbled up here we can fix this by doing a few things i'm going to make it so this figure does share x as true and share y is true because they're the same should be the same scales ish actually let's just share why [Music] because the dates are all the same and then we'll also do something here called plot tight layout this will make the titles not overlap as much and when we set the title let's make the font size a little bit smaller so let's make it eight that did not work so i think what we actually have to do here so here we're going to make this fig size a little bit bigger actually we want to do share x let's share x okay so this is pretty cool we can see every state here and how their participation participation rate and unemployment rate were affected over uh 2020 and 2021. you can see each state sort of has its own profile and if you dig into this deeper you might be able to explore why maybe some states responded differently than than others in both participation rate and unemployment rate all right that's the end now you explore thanks so much for watching this video about exploring economic data using python pandas and the fred api this is just scratching the surface of all the things you can do with these tools i challenge you to take the notebook that we worked on here today try editing it looking at some of the data and seeing if you can find some insights from either this or some other financial indicators that you can pull easily from fred let me know in the comments if you enjoyed this and if there are any other things that you would like me to go over in future videos and of course subscribe so you'll be alerted when my next videos go live thanks a lot for watching see you next time
Info
Channel: Rob Mulla
Views: 184,197
Rating: undefined out of 5
Keywords: Economic Data Analysis Project with Python Pandas, python pandas, economic data, economic data analysis, pandas data exploration, data science for beginners, rob mulla, economic data research, fred python api, fred economic data, pandas data cleaning, plotting with matplotlib, plotting with pandas, economic data science, unemployment data, python data analysis projects, economic indicators, kaggle notebook, econometrics, python econometrics, fed data, economics
Id: R67XuYc9NQ4
Channel Id: undefined
Length: 44min 54sec (2694 seconds)
Published: Tue Apr 26 2022
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.