Exploratory Data Analysis - A Case Study | Data Analysis with Python (6/6) | Free Certification

Captions
Hello and welcome to Data Analysis with Python: Zero to Pandas. This is a free certification course organized in collaboration between Jovian and freeCodeCamp. My name is Aakash. I'm your instructor for the course, and I'm also the CEO of Jovian. You can find me on Twitter. Jovian is a sharing and collaboration platform for data science: we've been sharing all the course materials with you using Jovian, and you've been submitting the assignments using Jovian as well. And freeCodeCamp.org is a place where you can learn to code for free by building great projects and interacting with a worldwide community, so do check it out. Now, the course Zero to Pandas is a coding-first introduction to data analysis with Python. We've had video lectures every week with live coding on cloud-based Jupyter notebooks, assignments for you to practice the concepts, and the course project that you're probably currently working on. Once you complete these, you will earn a certificate of accomplishment, all free of cost, and this is a verified certificate that you can showcase on your LinkedIn profile and on your resume. The topic for today, our final lecture, is exploratory data analysis: a case study, where we take everything we have learned in the entire course and bring it together into a single project. We will analyze a real-world dataset, ask and answer some interesting questions about the data, figure out how to use the right tools, techniques, libraries, and functions in the right places, and draw some inferences and conclusions. So this is, in some sense, a walkthrough of a real-world data analysis project. With that, let's get started. The first thing you should do is go to the course page, zerotopandas.com, where you will find all the course materials. Since we are looking at lesson six today, scroll down to lesson six, "Exploratory Data Analysis - A Case Study", and click open. This takes you to the lecture page, where the video you're watching right now will be available as a recording for you to review. For each lecture we have been using Jupyter notebooks, and today we will use the notebook "EDA on StackOverflow Developer Survey". You can click on that link to open the Jupyter notebook. You can view the notebook and read through it, but what we really want to do is run it. You need not run it right now; if you're watching live, just watch the lecture and experiment with the notebook afterwards. I'm going to run it now: click the "Run" button and select "Run on Binder" to start the notebook on the Binder platform. This might take a couple of minutes, so in the meantime, I just want to show you how to ask questions. If you have questions during or after the lecture, come to the lesson page and click on the discussion forum link. This takes you to a page on the discussion forum, where we've been having all our discussions, and you can use this thread for asking questions. All you need to do is click the reply button and post your question.
There's a big blue reply button: you just post your question and hit reply, and somebody from the course team or from the community will answer it. So please ask your questions on the forum, and also answer questions if you already know the answers to some of the questions that have been asked. Returning to the notebook: this notebook is called "Exploratory Data Analysis using Python - A Case Study", and we have clicked the Run button and selected Run on Binder. Finally, we have our Jupyter interface running, so let's open the notebook python-eda-stackoverflow-survey.ipynb. The first thing I'm going to do is click "Restart & Clear Output". This clears all the outputs so that only the code remains, and we can run the code and see the outputs for ourselves. I'm also going to hide the header and the toolbar. You need not do this; I'm just doing it so that we have a little more space to work with. In this notebook we will analyze responses from the Stack Overflow annual developer survey, and we will apply everything we have learned so far about Python, Jupyter, NumPy, Pandas, Matplotlib, and Seaborn, bringing it all together. You can run this code online, just as we have, using the Run button to run it on Binder, but you can also run it locally on your computer by following the instructions given under the relevant option. As I mentioned, we will be analyzing the Stack Overflow developer survey dataset. This is an annual survey conducted by Stack Overflow, and you can find the raw data and results at insights.stackoverflow.com. We will be analyzing the results from 2020, but you can also look at the surveys from past years. The first thing we want to do is download this dataset into Jupyter, and there are many ways to do this; we've seen several across different lectures. The first option is to download the CSV manually and upload it using Jupyter's graphical interface. Let's see how to do that. You click "download full dataset", which points to a Google Drive link, and you download the archive; here I'm downloading it to my desktop and unzipping it. Once you unzip it, you will see the folder developer_survey_2020, and if you open it, you will find a PDF, a bunch of CSV files, and a README file. You can go through the README to learn more about the dataset, but the interesting files for us are the CSV files, so we need to upload those to Jupyter. To do that, come back to the notebook, go to File > Open, and you will find an Upload button. Click Upload and select the file you want to upload; as an example, let's try uploading survey_results_schema.csv. One by one, you can upload each CSV file that you need, clicking the Upload button to complete each upload, but I won't do that right now. There's another way, if you have a direct link to the raw CSV file. This is important: it must be a link that points directly to the CSV file.
A Google Drive link or some other landing page will not do; it has to point to the raw file. Then you can use the urlretrieve function from the urllib.request module. This is what we've used in previous lectures, on NumPy and Pandas, so you can review lectures three and four to see how to use it, or just check the documentation. But today we are going to use a third method: a helper library called opendatasets. This is a library we have created to make data analysis easier for you; we are building a curated collection of datasets for data analysis and machine learning, and these datasets can be downloaded into Jupyter with a single Python command. Here is how it works. First we install the library with pip install opendatasets, so let's come back to the Jupyter notebook and run that; this installs the opendatasets library. Then we import opendatasets as od, and all we need to do is run od.download and pass in a dataset ID. We have currently added about six datasets, but we will be adding many more; we're trying to have at least a hundred datasets by the end of this week. We simply need to pick the ID of the dataset we want to use; here we have the Stack Overflow developer survey, so we paste its ID and run od.download. Under the hood, this fetches the list of files in the dataset and downloads all of them. You can see that the URLs have been fetched and the files have been downloaded into the folder stackoverflow-developer-survey-2020. So with that, we seem to have the dataset downloaded. Once again, you can go File > Open and check that there is a folder stackoverflow-developer-survey-2020 containing the README, the public results file, and the schema file. So those are different ways to download the dataset; the important point is to have the files next to your Jupyter notebook, irrespective of how you get them. Let's import the os module and verify once again that the dataset is actually downloaded. When we run os.listdir on the stackoverflow-developer-survey-2020 folder, we see a README, survey_results_public.csv, and survey_results_schema.csv. You can go through these three files: the README contains some information about the dataset, the schema file contains the list of questions (we'll load and view it as well), and survey_results_public.csv contains the full list of responses to those questions. We will load the CSV files using Pandas: we import pandas as pd and use the pd.read_csv function, passing in the path to survey_results_public.csv, which contains the survey responses. We are going to call the result survey_raw_df, because this is the raw, unprocessed dataset we are loading; we will then make some modifications to create a prepared dataset for analysis. So here we create survey_raw_df by calling pd.read_csv.
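In code, the download-and-load steps just described look roughly like this; a minimal sketch, assuming the dataset ID used in the lecture:

    import os
    import opendatasets as od
    import pandas as pd

    # Download the dataset via the opendatasets helper
    # (install it first with: pip install opendatasets)
    od.download('stackoverflow-developer-survey-2020')

    # Verify the files landed next to the notebook
    print(os.listdir('stackoverflow-developer-survey-2020'))

    # Load the raw survey responses
    survey_raw_df = pd.read_csv(
        'stackoverflow-developer-survey-2020/survey_results_public.csv')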
Let's take a look at it. The data frame has a whole bunch of columns; in fact, in total there are 61 columns. Sixty of them correspond to questions from the survey, and the column headers are not the full questions but just short forms. There is also one column containing a respondent ID. These survey results are anonymized, so there is no personally identifiable information like first name, last name, phone number, or email; every respondent has simply been given an ID, which is not really useful for us, but it's there if you want it. So there's one respondent ID column, and then 60 columns, one for each question. If you want to know what the actual question was, this is where you use the other file, survey_results_schema.csv, and we'll see that in just a moment. You can see at the bottom that there are over 64,000 responses and, as we said, 61 columns. Let's see the list of columns in the data frame: we can use survey_raw_df.columns to see the full list. A lot of these may not make sense on their own, so this is where we need the schema file to understand what the columns represent. Here we have the file survey_results_schema.csv; let me put it on a separate line, and let's first load the entire file with pd.read_csv and the schema file name. This is a CSV file with two columns. The first column is titled Column, and it corresponds to the names of the columns in the responses file: Respondent, MainBranch, Hobbyist, et cetera. For each of these you can see the actual question: for instance, MainBranch corresponds to "Which of the following options best describes you...", and Hobbyist means "Do you code as a hobby?". Now, it's okay to have it in this format, but the schema data frame is primarily going to be used to look up the question for each column. So what we might first want to do is set the index while loading the file, by passing index_col='Column', so that the column names themselves become the index and we're left with one proper column, QuestionText. Now we can access the question text using the column name as a key; for example, you can imagine using .loc and passing in a column name to retrieve the question text for that column. Let's simplify a little further: since we only have one column, we don't really need a data frame. Data frames are required when we want to work with multiple columns of data, so we can simply get the QuestionText column out of this, and that gives us a series, which is all we really need. The series has an index, which is the column name in the main data frame, and each value is the full text of the corresponding question.
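As a quick sketch, that schema lookup series can be built like this (the Column and QuestionText names come from the schema CSV):

    # Load the schema with the column names as the index, then keep
    # just the QuestionText column, giving a name -> question series
    schema_fname = 'stackoverflow-developer-survey-2020/survey_results_schema.csv'
    schema_raw = pd.read_csv(schema_fname, index_col='Column').QuestionText

    # Look up the full question for any column in survey_raw_df
    schema_raw['Hobbyist']   # 'Do you code as a hobby?'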
So that is what we've done here: we created schema_raw by reading the schema file with index_col='Column' and taking the QuestionText series. We can now use schema_raw to retrieve the full question text for any column in survey_raw_df. For example, we can check the YearsCodePro column: it corresponds to the question "NOT including education, how many years have you coded professionally (as a part of your work)?". All right, so with that, we have loaded our dataset into data frames and verified that we can work with it using the tools we know and understand. We are now ready to move on to the next step of preprocessing and cleaning the data for our analysis. Before we do that, it's always a good idea to save your work from time to time, because we're running this on an online service, Binder. We simply select a project name (here I'm selecting python-eda-stackoverflow-survey), install the jovian library, import jovian, and run jovian.commit(). The first time, this asks for an API key, which we can get from our Jovian profile and paste in, and it then commits the notebook to our profile. So now this notebook has been committed, and you can view it on your Jovian profile whenever you want. Whether you're coming from Binder or from your local computer, everything gets saved to your Jovian profile, and you can take it and run it on Binder whenever you need to continue your work. Moving ahead: we now have our data as data frames, and while the survey contains a wealth of information, with about 65,000 responses to some 60 questions, we will limit our analysis to a few areas. This is what you might want to do for your projects as well: pick a theme for your project and do not try to do a lot of different things with the dataset. We will limit our analysis to three areas. The first is understanding the demographics of the survey respondents: who it is that has taken the survey, what the global programming community looks like in general, and whether the survey responses are representative of that community. The second is the distribution of programming skills, experience, and preferences: things like which programming languages people like and which they don't. The third is employment-related information, preferences, and opinions: the kinds of roles people hold in the data science and programming fields. To do that, let us select a subset of the columns: here are some columns for demographics, here are some for programming experience, and here are some for employment. Remember that you can use the schema we created to check the questions, so do look at the full question for each of these before moving forward. Let's also check how many columns we have selected: about 20.
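A sketch of that selection and the copy that comes next; the exact list isn't read out in full in the lecture, so this selected_columns list is illustrative:

    # Hypothetical subset; the lecture selects ~20 such columns
    selected_columns = [
        # demographics
        'Country', 'Age', 'Gender', 'EdLevel', 'UndergradMajor',
        # programming experience
        'Hobbyist', 'Age1stCode', 'YearsCode', 'YearsCodePro',
        'LanguageWorkedWith', 'LanguageDesiredNextYear',
        # employment
        'Employment', 'DevType', 'WorkWeekHrs',
    ]

    # Copy, so edits don't affect the raw data frame
    survey_df = survey_raw_df[selected_columns].copy()
    schema = schema_raw[selected_columns]

    survey_df.info()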
Now we take our survey_raw_df, and if we simply pass a list of columns to it as an index, that selects a subset of columns; we then call .copy() on that data and call the result survey_df. We create a copy so that we can modify it without affecting the original data frame; if we make a mistake, we can always recreate it from the original. So we create survey_df from the selected columns, and then we also pick out the same selected columns from the schema, so that our schema too contains just the columns we need. Let's look at it now: here we have survey_df, which now has only the 20 columns we selected, but still all the rows. We can check the schema as well: it now has the question for each of the 20 columns, and if you check its shape, you will see it has 20 entries, while the survey data frame has about 64,000 rows and 20 columns. Now we can use the info method of survey_df to see the list of columns and their data types. These are the columns we just selected, and you can see that out of the 64,000-plus entries, not all are non-empty: for each column you'll see some null values, values that are empty in the CSV file. Pandas replaces empty values in the CSV file with np.nan, the token it uses for missing values. Because there are so many empty values, and because many columns contain mixed types of data, the data type detected for most columns is object. The object data type is fine while we are working with strings, but when we want to perform numeric computations or draw graphs that involve number processing, we may need to convert some of these columns to numeric data types. So far, only Age and WorkWeekHrs are numeric: both have dtype float64. But there are a few more columns that should be numeric: Age1stCode, YearsCode, and YearsCodePro are all numeric in nature, yet for some reason they've been classified as object. Let's investigate that a little. If we look at survey_df.Age1stCode and its unique values, it turns out that most of the values are numbers. If you're wondering what the question was, check schema.Age1stCode: "At what age did you write your first line of code or program?". The answers are mostly numbers, but there are also the options "Younger than 5 years" and "Older than 85", which are strings, and we want to either convert these into numbers or ignore them, because keeping them as strings will get in the way of our analysis. What we will choose to do here is replace those string options with empty values and convert the rest into numbers, and the way to do that is the pd.to_numeric function.
The pd.to_numeric function (you can check the documentation) takes a series, or a column, and converts it into numeric data: it will take all of these values and convert them into floats, and wherever it encounters a string, it will throw an error. But what we want is to ignore the errors and simply replace any non-numeric values with NaN, the empty placeholder value; that's why we pass errors='coerce'. So that's what we do for Age1stCode. Similarly, we have YearsCode and YearsCodePro; let's take a quick look at YearsCode (YearsCodePro is similar). The question is "Including any education, how many years have you been coding in total?", and if we check the unique values, once again we have options like "Less than 1 year" and "More than 50 years". Once again, we will convert those into empty values and the rest into numbers, so we use pd.to_numeric and assign the result back to survey_df.YearsCode. This way we replace the column with a new version of itself where every element is either a number or empty. And we do the same for the YearsCodePro column. Do go through the other columns and see if there are any others that are numeric, but for now we will convert just these three. We already had two numeric columns, Age and WorkWeekHrs, so in total we have five numeric columns, and we can see some basic statistics about them using the describe function. So we call survey_df.describe(), and you can see that the average age of the survey respondents is about 30, the average age at which they wrote their first line of code is around 15, years of coding is about 12 on average, and the work week is around 40 hours. But you will start to notice some problems. The minimum age listed is 1, and it seems quite unlikely that a one-year-old infant has filled out the survey; the maximum age appears to be 279, which is again quite unlikely. This is a common issue with real-world data in general, and with surveys in particular: surveys are filled in by people, and there is no obligation to enter the right information. Sometimes people intentionally enter wrong information, and other times there are accidental errors: maybe somebody meant to type 17 but missed the seven, ending up with 1, or meant to type 27 and accidentally pressed 9 as well, ending up with 279. So we should try to solve for these: for each column, go through it and try to figure out whether the values make sense. A simple fix in this case is to delete the rows where the value in the Age column is higher than 100 years or lower than 10 years; we are basically saying that those entire responses are invalid, whether unintentionally or intentionally.
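Stepping back, the numeric conversions and the summary described above look roughly like this in code:

    # Coerce the three string-typed columns into floats; with
    # errors='coerce', non-numeric answers such as 'Younger than
    # 5 years' become NaN instead of raising an error
    survey_df['Age1stCode'] = pd.to_numeric(survey_df.Age1stCode, errors='coerce')
    survey_df['YearsCode'] = pd.to_numeric(survey_df.YearsCode, errors='coerce')
    survey_df['YearsCodePro'] = pd.to_numeric(survey_df.YearsCodePro, errors='coerce')

    # Basic statistics for all numeric columns
    survey_df.describe()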
To delete the rows, we can use the drop method of the data frame: we call survey_df.drop, and let's try to understand the syntax. What's happening is that we first evaluate the condition survey_df.Age < 10, which gives us a boolean series of True or False for each row, indicating whether the age is less than 10. We then use that to select only those rows where the age is less than 10. And since survey_df.drop needs an index, we call .index on that selection: we're basically passing the IDs of all the rows that need to be dropped. Finally, we say inplace=True, which removes the rows from the same survey_df data frame rather than creating a new data frame with the rows removed. There's a lot to unpack here, and the way to work through it is to take each expression, run it in a separate code cell, see the result, and build it up step by step. You can also follow the link here, which explains this entire line of code, or check the documentation of survey_df.drop. So with that, we've removed the rows where age is less than 10, and the rows where age is greater than 100 as well. The same holds true for WorkWeekHrs. If you check the work week hours, the minimum is 1, which seems reasonable: some people might be working just one hour a week. But the maximum is 475, and that's certainly wrong, because the number of hours in a week is only 168. So we make another approximation: let us remove all the rows where the value of WorkWeekHrs is higher than 140 hours, which is about 20 hours per day. So again we use survey_df.drop, dropping the rows where the work week hours are more than 140. The Gender column also allows picking multiple options. If we check schema.Gender, the question was "Which of the following describe you, if any? Please check all that apply." The options are Man, Woman, and Non-binary, genderqueer, or gender non-conforming; those are the three options, but there are cases where people have picked more than one. For instance, there are about 120 responses where people picked both Man and Non-binary, genderqueer, or gender non-conforming. While this is perfectly acceptable, it is going to make our analysis a little more difficult, so we are going to do a small simplification: wherever multiple values have been selected, we will replace the entry with an empty value. The reason is that it simplifies our analysis, letting us look at one category at a time; it is not to say that any of these responses are invalid.
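In code, these three cleanups look roughly like this; the .where call, explained next, follows the form used in the lecture:

    import numpy as np

    # Drop responses with implausible ages
    survey_df.drop(survey_df[survey_df.Age < 10].index, inplace=True)
    survey_df.drop(survey_df[survey_df.Age > 100].index, inplace=True)

    # Drop responses reporting more than 140 working hours per week
    survey_df.drop(survey_df[survey_df.WorkWeekHrs > 140].index, inplace=True)

    # Where multiple genders were picked (answers are ';'-separated),
    # blank out the entry: .where keeps values where the condition
    # holds and substitutes NaN elsewhere
    survey_df.where(~(survey_df.Gender.str.contains(';', na=False)),
                    np.nan, inplace=True)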
The way we do this is survey_df.where, which takes a condition; wherever the condition is not satisfied, it replaces the value with the specific value we provide, and you can do it in place. I will leave it as an exercise for you to figure out exactly what this does, but the end result is that when we now check survey_df.Gender.value_counts(), we only see single options: Man, Woman, or Non-binary. So with that, we have now cleaned up the dataset and prepared it for our analysis. We've made a few assumptions and simplifications, removed certain rows, and replaced certain values with empty values. This is a typical process you will follow for any real-world dataset, and the lesson here is not to jump immediately into analysis. First go through the values, see where the missing values are, and see if you need to do something about them. For strings, often not; but for numbers you may want to do some kind of replacement for the missing values, or sometimes simply remove those rows altogether (we have not done that in this case). Also deal with any invalid values: anything you feel is outside a normal range, you should get rid of, either by dropping those rows entirely or by clearing out the values and leaving them empty. Finally, now that we've cleaned up our data frame, let's check a sample: I'm calling survey_df.sample(10) to see a random sample of ten rows. This is again a very good exercise: go through some sample data from your dataset just to get a sense of what the values in each column look like across different rows. Here we can see that the countries look like strings, and the age seems fine; it's a number, though there are also places where people have not filled in their age. Then we have the gender (remember, we've done a simplification here and reduced it to one answer), then the education level, the undergraduate major, and so on. Going through each column will help you make better inferences from the data. And with that, our data preprocessing and cleaning is complete, so let's commit our work once again. Moving forward: before we can ask any interesting questions about the survey responses, it would help to understand the demographics of the respondents, things like country, age, gender, anything you can use to pick out groups from the responses. It's important to explore these variables in order to understand how representative the survey is of the worldwide programming community, and of the worldwide population in general. The reason this matters is that a survey of this scale generally has some bias. In the world, a certain number of people are programmers; of those, a certain fraction use Stack Overflow, and that's not a uniformly random fraction, so there's already some selection happening there. And of those who use Stack Overflow, a certain number have taken the survey.
So once again, that is not a uniform selection from Stack Overflow's entire user base; there's probably some selection bias here too. People who are more likely to take a survey probably share three or four other qualities, so it's not a completely random sample. Then there's also how Stack Overflow publicized the survey, what the outreach process was, who saw it and who filled it in; and the language of the survey, the kinds of questions asked, the length of the survey, all of which make a big difference in who actually responded. All of this is called selection bias: the respondents of the survey do not come from a randomly picked sample of the overall population you want to study. So keep that in mind; that's why it's important to look at the demographics first. We're now going to do what is called exploratory analysis and visualization, where we don't really have a question in mind: we simply look at different rows and columns, compare things, and plot graphs. Since we will be plotting graphs, we will use Matplotlib and Seaborn. Here I'm importing Matplotlib and Seaborn and setting some basic styles: I've used the darkgrid style from Seaborn, and I've increased the font size and figure size for Matplotlib so that we can see the figures more easily. If you want to understand what all of these settings mean, I refer you back to the previous lecture on data visualization with Matplotlib and Seaborn. With the imports out of the way, let's first look at the number of countries from which there are responses in the survey, and maybe plot the countries with the highest number of responses. There was a question "Where do you live?", and the column name for it was Country; we can see from the schema that Country corresponds to that question. If we take the Country column of the survey data frame and call .nunique() on it, that gives us the number of distinct countries in the dataset: respondents from 183 countries have answered questions. That's pretty good; the coverage is quite wide. But it might be better to look at the distribution of responses across countries. We cannot plot the entire distribution for 183 countries, so we will simply look at the top 15 countries with the most responses. The way to do that is to take survey_df.Country and use .value_counts() on it. Let's see what that does: .value_counts() takes each distinct value in Country and counts the number of occurrences of each one, so for each country we get back a count, as a series, sorted in descending order. You can also control the sorting: you can say sort=True and ascending=False. By the way, if you want to see the documentation inline for any function you're using, all you need to do is press Shift+Tab; pressing Shift+Tab here shows me the documentation.
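A sketch of that setup and the country counts, following the style settings mentioned in the lecture:

    import seaborn as sns
    import matplotlib
    import matplotlib.pyplot as plt
    # Show plots inside the notebook
    %matplotlib inline

    sns.set_style('darkgrid')
    matplotlib.rcParams['font.size'] = 14
    matplotlib.rcParams['figure.figsize'] = (9, 5)

    survey_df.Country.nunique()      # number of distinct countries (183)
    top_countries = survey_df.Country.value_counts().head(15)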
So we take the value counts and pick the top 15 countries. Those are the top countries: you can see the United States at the top, then India, then the UK. It's good to look at this as a table, but you cannot really grasp the differences; what does the distribution look like visually? That is where a bar chart helps. We can use the index of the series as the x-axis and the values of the series as the y-axis. We create a figure with plt.figure(figsize=...) just to make it big, set the title to the question that was asked, and then use Seaborn's bar plot: Seaborn has been imported as sns, and for sns.barplot we pass an x-axis and a y-axis. For x we use the index (the names of the countries), and for y we use the values (the number of respondents from each country). There you go; now you have the graph. Before we analyze it, I just want to note that by default these labels are printed horizontally, and if we printed them horizontally there would be a lot of overlap. What we've done is call the plt.xticks function with rotation=75, which takes the labels and rotates them by 75 degrees; because of that, they're slightly slanted and we can read them all. Now, looking at the graph, it seems a disproportionately high number of responses came from the United States and India. The United States has around 12,000 responses, India has around 8,000, which is only about two-thirds of the US number, and next is the United Kingdom, with less than half of India's count. That already tells you that this survey is probably not really representative of programmers around the world: 12,000 plus 8,000 plus 4,000 is roughly 24,000 out of some 64,000, so more than a third of the responses have come from just these three or four countries. If you think about it a little, it makes sense. First, this survey was conducted only in English, so programmers from non-English-speaking countries probably did not get to hear of it. Second, Stack Overflow itself is a platform entirely in English, so its user base comes primarily from countries where English is spoken professionally, and those happen to be the top three countries: the United States, India, and the United Kingdom, where English is used day to day in professional life. So that's something to consider: the survey may not be representative of the entire programming community, especially the non-English-speaking countries. It's also something for Stack Overflow to consider: maybe they should try translating their questions and answers into different languages, and maybe they should translate this survey as well, so that they can get more representative results. Now there's an exercise for you here: try finding the percentage of responses from English-speaking versus non-English-speaking countries.
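The country bar chart described above, sketched using the top_countries series from earlier:

    plt.figure(figsize=(12, 6))
    plt.title(schema.Country)      # use the survey question as the title
    plt.xticks(rotation=75)        # slant the labels so they don't overlap
    sns.barplot(x=top_countries.index, y=top_countries);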
Here I've linked to a CSV file that contains the list of languages spoken in different countries. See if you can combine that data with this data to create a new column, say EnglishSpeaking, containing yes/no or true/false, and then see how many responses are from English speakers and how many are not. So that was about the different countries the data came from. Probably the next thing we can study is the distribution of the respondents' ages. Age is another important factor to look at, and because age is numeric, we can use a histogram to visualize it, so we are going to use plt.hist. Let's check what the question about age was: "What is your age (in years)?". A lot of people may have preferred not to answer this question; fortunately, when we use these Matplotlib functions, any empty values are automatically ignored. So we call plt.hist and pass in survey_df.Age, the column containing the age values, which are all numeric, remember. We also set the bins: we want to start the bins from age 10, because we've removed everything below 10, and go up to 80, or up to 100 if you wish; let's change that to 100. We split this entire range of 10 to 100 years into bins of five years (you could split it into ten-year bins as well, but we'll use five), and then count the number of responses in each age group. So this is what the chart looks like. It seems there are very few responses below 15 years of age, and only a couple of thousand in the 15-20 range, while the bulk of the responses are in the range of 20 to 45 years, or maybe 15 to 50. That seems to be the professional lifespan of a programmer, to a large extent. On the other hand, you can still see thousands of responses above the ages of 45 and 50: it's common for people to fall into the 20-to-50 range, but there are respondents all the way up to close to 80 years of age. Okay, so that's good: now we understand the distribution of age, and roughly, this is representative of the programming community in general, especially since a lot of young people have taken up computer science as a field of study or a profession in the last 20 years. Colleges now have computer science degrees, the number of jobs in computer science has increased, and most new jobs and new degrees tend to go to younger people; that's why the distribution skews young. You can do some research on how exactly representative it is and which age groups are left out. And here's an exercise for you: create specific age groups like 10-18 years, 18-30 years, 30-45, 45-60 and so on, in a new column called AgeGroup that contains one of these values based on the age. Then you can repeat the analysis in the rest of this notebook for each age group.
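A sketch of the age histogram discussed above, with five-year bin edges from 10 to 100:

    import numpy as np

    plt.figure(figsize=(12, 6))
    plt.title(schema.Age)
    plt.xlabel('Age')
    plt.ylabel('Number of respondents')

    # Bin edges every five years from 10 to 100; NaN ages are
    # ignored automatically by plt.hist
    plt.hist(survey_df.Age, bins=np.arange(10, 105, 5));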
And if you just want to pick out your own age group, say you're in the 30-45 group and you want to know what programmers in your age group think, then you can run this analysis just for your group. That would be an interesting thing to try; it's a project idea right there. Next, let's look at the distribution of gender. We've already done a small simplification here by excluding multiple responses. Now, it's a well-known fact that women and non-binary genders are underrepresented in the programming community, so we might expect to see a skewed distribution. If you check schema.Gender, the question asked was "Which of the following describe you, if any?", so people were free to leave it blank. Then we compute the gender counts by taking survey_df.Gender and calling value_counts, and already you can see there is a huge drop: there are about 45,000 responses where people selected Man, and only about 3,800 where people selected Woman. In value_counts you can also pass dropna=False, which additionally tells you how many people picked nothing at all. Now we can use a pie chart to visualize this distribution: we take these gender counts and call plt.pie, giving it labels and a title. Here you go: it seems around 71% picked Man, only about 6% picked Woman, and only about 0.6% picked Non-binary, genderqueer, or gender non-conforming. If we exclude the empty values, so among the people who answered, over 91% are men, which means only about 8% are women or non-binary. That is actually far more skewed than the overall figure: the share of women and non-binary genders in the programming community is estimated to be about 12%, so this survey still has some skew. What this number tells us in general, as does the overall figure, is that there is a diversity problem in the programming community. Roughly 50% of all people are women, and a further percentage (I'm not sure of the exact number) identify as one of the non-binary genders, so we definitely need more representation in the programming community, and we should support people from underrepresented communities and encourage them to be part of it. An interesting exercise now would be to compare the survey responses and preferences across genders and repeat this analysis with those breakdowns. For each bar chart, try to show men versus women side by side and see how things differ. For instance, how do relative education levels differ across genders? Do women hold similar degrees in terms of percentage, or do they hold higher degrees? You may be surprised. How do the salaries differ? That's another thing to figure out: we know there is a gender pay gap, so maybe you can discover it here; there's a column that covers salaries. There's also a linked analysis on the gender divide in data science.
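The gender pie chart from above, sketched; dropna=False adds a slice for respondents who left the question blank:

    gender_counts = survey_df.Gender.value_counts(dropna=False)

    plt.figure(figsize=(12, 6))
    plt.title(schema.Gender)
    plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%');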
You may find that useful; the gender divide is also an exploratory data analysis project in its own right, if you want to explore it a bit. So that was about gender. Now let's talk about the education level. Formal education in computer science is often considered an important requirement for becoming a programmer; computer science is one of the most sought-after degrees, both at bachelor's and master's level. Let's see if this is indeed the case, because on the other hand, a lot of you may have learned programming on your own, and there are many free resources and tutorials available online. What we'll do is use a horizontal bar chart to compare the education levels of the respondents. Check schema.EdLevel: there's an EdLevel column, and the question was "Which of the following best describes the highest level of formal education that you've completed?". Keep in mind this is the highest level. What we are going to use now is a count plot. What's a count plot? We can check the documentation: a count plot shows the counts of observations in each categorical bin using bars. What that means is, if you check the different values that EdLevel contains (take survey_df.EdLevel and look at the unique values), the count plot will tell us, for each of those values, how many observations there are, i.e., how many times that particular option shows up in the column. So we pass survey_df.EdLevel to countplot. You could do just that, but it would make vertical bars; if you want horizontal bars, pass it as y=survey_df.EdLevel instead. This is what the graph looks like; maybe we should increase the size of the figure a bit with plt.figure(figsize=...). Okay, this is a lot better. So, the question is which of the following describes the highest level of formal education you've completed, and you can see that out of the 65,000 respondents, over 25,000 hold a bachelor's degree, close to 12,000 hold a master's degree, and a smaller number, probably around 1,500 or so, hold a doctoral degree. Combining these, it seems over half of the respondents hold a bachelor's or master's degree, so most programmers definitely seem to have some college education, likely some kind of STEM education. But it's not clear from this graph alone whether they hold a degree in computer science, so let's dig a little deeper into that. One problem with this graph is that we are showing absolute numbers, and what we probably really want are percentages. So one exercise for you is to convert this graph to show percentages instead of raw counts. That will give a clearer picture, because what we really want to know is: out of the people who responded to this question, what percentage said a bachelor's degree is the highest degree they hold? That is the more relevant question, so try modifying this code to show percentages.
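The education-level count plot described above, as a sketch:

    plt.figure(figsize=(12, 6))
    plt.title(schema.EdLevel)

    # Horizontal bars: pass the column as y instead of x
    sns.countplot(y=survey_df.EdLevel)
    plt.xlabel('Number of respondents')
    plt.ylabel(None);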
Keeping that aside, we can tell that over half of the respondents hold a bachelor's or master's degree. All right, so now let's plot the undergraduate majors. This time we look at schema.UndergradMajor, which was "What was your primary field of study?", and we will convert the counts into percentages. To do that, we take the value counts for each of the values, survey_df.UndergradMajor.value_counts(); so for each major, like computer science, we have around 31,000 responses, and so on. Then we divide by the total number of responses given for undergraduate major, by taking survey_df.UndergradMajor and calling .count() on it; .count() counts the total number of non-empty values. If we do that division, we get back a fraction for each major, 0.61 and so on, and if we multiply by 100, we get a percentage. So now about 61% studied computer science, another 9.3% picked another engineering discipline, and so on. But it's probably better to look at this as a graph, so we put the result into a variable called undergrad_pct and use sns.barplot to plot it. Here we have it. In terms of the primary field of study, over 60% of the people who responded say computer science, software engineering, or computer engineering was their major. Now, this is like a glass-half-full, half-empty situation. The way I would interpret it is that close to 40% of programmers holding a college degree have a field of study other than computer science, which is very encouraging. A lot of people feel that after college you cannot switch your field; that is definitely not true for computer science. If you want to get into computer science and you have some sort of formal education, some sort of STEM education, you can absolutely pursue it. There are so many online resources, and close to 40% of people in the domain come from streams other than computer science. I think this is a very encouraging sign, and this number is only going to go higher, because there are better and better resources available, and programming is proliferating into pretty much every domain now. You end up doing a little programming no matter what you study, and that equips you to become a programmer as well; even data science, for example, is primarily a lot of programming. So what we understand in general is that while college education is helpful, you do not need to major in computer science to become a successful programmer. One trend you might have noticed during this exploratory analysis is that every time we plot a graph, there is some background to it, a reason why we're exploring a particular column; for example, the reason we looked at education level was to understand whether formal education is important or not. So have something in mind when you explore a particular column.
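A sketch of the percentage computation and plot just described:

    # Fraction of non-empty responses per major, as a percentage
    undergrad_pct = (survey_df.UndergradMajor.value_counts()
                     * 100 / survey_df.UndergradMajor.count())

    plt.figure(figsize=(12, 6))
    plt.title(schema.UndergradMajor)
    sns.barplot(x=undergrad_pct, y=undergrad_pct.index)
    plt.xlabel('Percentage of respondents')
    plt.ylabel(None);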
Then, once you've explored a column and plotted a graph, try to gain some insight from it: make an inference, an observation, or a hypothesis based on it. Sometimes you may need to do more research to check whether your hypothesis is correct; in other cases the inference is pretty clear. In this case, it is pretty clear that a lot of people do not have a computer science degree. And that is the best part of exploratory analysis: you get to find all these interesting inferences. Each time you draw a graph or look at a column, you learn something new. Now there's an exercise here for you. There's a column called NEWEdImpt; let's see what it is with schema.NEWEdImpt. The question is "How important is a formal education, such as a university degree in computer science, to your career?". What you can do is take this column and analyze the distribution of responses for people who hold a computer science degree versus those who don't: what percentage of people who hold a computer science degree said that a formal education in computer science is important to their career, and what percentage of people who do not hold one said so? See if you notice any difference in opinion. My guess is that somebody who holds a college degree may value it highly, whereas somebody who does not hold one but has still become a programmer will probably say it's not that important. So do check it out; there are more insights to be gained here. One last area we will look at is employment. Especially among programmers, freelancing, contract work, and part-time work are slowly becoming more popular choices, so it would be interesting to see the breakdown between full-time, part-time, and freelance work. Let's visualize the data from the Employment column. The employment question was "Which of the following best describes your current employment status?", and once again we are going to plot a simple horizontal bar chart. One thing I want you to take away from this is that simple charts are often good enough for a good analysis. You do not have to come up with fancy charts; it's good if you can find the best kind of graph for every situation, but even simple bar charts, line graphs, and scatter plots can give you a lot of information, and you are well equipped at this point if you've worked through this course. So now we take survey_df.Employment.value_counts and set normalize=True, which gives fractions instead of counts; we sort in ascending order and multiply by 100 to turn the fractions into percentages. Then we use the Pandas plotting function, .plot (just to show you a variation of the different ways of plotting), and we plot a horizontal bar chart in green.
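A sketch of that Pandas-native horizontal bar chart:

    # Normalized counts give fractions; *100 turns them into percentages
    (survey_df.Employment.value_counts(normalize=True, ascending=True) * 100
        ).plot(kind='barh', color='g')
    plt.title(schema.Employment)
    plt.xlabel('Percentage');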
Now you can see that, among the people who answered this question, about 70% are employed full-time. But there is a fair number of students as well, about 12%. Then there are people who are not employed but looking for work, people who are freelancers or part-timers, and people who are not looking for work at all, perhaps hobbyists, or retired. What you might want to do is create a new column, EmploymentType, that contains values like "enthusiast" (students, or people not employed but looking for work), "professional" (people employed full-time, part-time, or freelancing), and "other" (people not employed, or retired). Then you can see, for each of these graphs, how the preferences differ between students and professionals, between enthusiasts and professionals, especially for some of the things we'll do after this, like analyzing programming language preferences. That's a good exercise: in any survey, in any analysis, these breakdowns offer a lot of insight, and the best way to approach them is to identify which group you lie in, whether you're splitting by gender, by age, or by employment status, and then do the analysis just for yourself and people like yourself. That will give you a lot more insight. One interesting observation here is that if you take away students, then among people who are employed, at least 10% are working independently as contractors, freelancers, or self-employed, for instance people doing startups or running their own companies. And that is pretty encouraging: that's a high number for a technical field like this, to be able to work on your own without being formally associated with any company. It's also a way to try to break into the field: if you're looking to become a programmer or break into data science, you can initially try some freelance work, an internship, or part-time work, and that can help you transition into a full-time role. So that's something to consider as well. Now, for the actual roles held by the respondents, we can look at the DevType field. You can see here there's a DevType column, "Which of the following describes you?", and a bunch of different values for the roles provided. The problem is that this question allows selecting multiple values. If you check survey_df.DevType.unique(), or value_counts, which is probably a bit easier, you'll see a lot of different possibilities. There are some simple options that were picked alone, like "Developer, full-stack" by itself, or "Developer, back-end"; but people could select multiple options, and a semicolon indicates that multiple options were picked.
Looking at the DevType value counts: about 3,000 people have picked three options together, developer back-end, developer front-end, and developer full-stack, and about 2,000 have picked developer back-end and developer full-stack. A lot of people have picked many different combinations, so it's not really clear from the data alone how many options were available; this person seems to have picked a whole bunch of options, and so has this one, and so on. So we might need to do some more processing. We might need to take this column, which contains lists of values separated by semicolons, and split it out into multiple columns. For that, we define a helper function called split_multicolumn, which takes a series, a column of data containing list-like values, such as survey_df.DevType, and splits the values of that column into multiple columns. I won't go over the code here; you can try running each line one by one and try to understand it. By this point I hope you're well equipped to understand the code; just run each line in a different cell. If not, you can ask on the forum, share where you're stuck or which part you don't understand, and have a discussion to figure it out. But let's look at the output. The input to this split_multicolumn function is a series where people have picked none, one, or more than one option for their job role. We call split_multicolumn on survey_df.DevType, and passing in this column returns a data frame, dev_type_df. If we check this data frame, it has one column for each of the options provided for the question: one column for "Developer, desktop or enterprise applications", then "Developer, full-stack", "Developer, mobile", "Designer", and so on. In total there are 23 columns, one for each of the 23 options given for your job role. For each respondent, each column holds either True or False. For instance, this respondent has not selected developer desktop or enterprise, but this respondent has selected developer full-stack; this respondent has selected developer mobile, and has not selected developer front-end or back-end. So this is how we sometimes need to do a little more processing of our data: break one column into an entire data frame so that we can do our analysis. Now that we have dev_type_df, we can use it to identify the most common roles. A simple way to do that is to count the number of Trues in each column; True, when converted to a number, becomes 1 and False becomes 0. So we can simply take the column-wise sums, dev_type_df.sum(), and then sort those values in descending order. That gives us dev_type_totals, and you see that developer back-end, developer full-stack, developer front-end, and so on seem to be the most common roles.
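For reference, here is a minimal version of what such a helper might look like, along with the role totals just described (a sketch, not necessarily the exact code in the notebook):

```python
import pandas as pd

def split_multicolumn(col_series):
    """Turn a series of semicolon-separated answers into one boolean column per option."""
    result_df = pd.DataFrame(index=col_series.index)
    for idx, value in col_series.dropna().items():
        for option in value.split(';'):
            if option not in result_df.columns:
                result_df[option] = False  # first time we see this option: new column
            result_df.at[idx, option] = True
    return result_df

dev_type_df = split_multicolumn(survey_df.DevType)

# True counts as 1 and False as 0, so column-wise sums count respondents per role
dev_type_totals = dev_type_df.sum().sort_values(ascending=False)
```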
And it's not surprising: Stack Overflow is primarily a tool used by developers, professional developers, for finding answers to questions while writing code, so it's no surprise that developer roles are the most common. One interesting thing for you to figure out would be what percentage of respondents work in roles related to data science. You can probably also try to figure out which role has the highest percentage of women. Now that you have this data frame, you can merge it back into the original data frame: create a new merged data frame which contains the columns from survey_df but also these columns for each role, and then find out which role has the maximum percentage of women. That's an interesting thing to figure out. So with that, we end our exploratory analysis. We've only explored a handful of the 20 columns we selected, about five or six, so you can explore and visualize the remaining columns. You have some empty cells here, and you can always add new cells using Insert Cell Below. Please do that: the more you explore, the more you'll learn. It's possible that while you work through this notebook you find five or ten interesting columns and want to do a project using just those, and that's perfectly acceptable. You can use this dataset for your project; just do not repeat the same analysis that is done here, do something a little more interesting. And before we continue, let's upload our work: from time to time, keep running jovian.commit so that you do not lose your work. Now we come to a slightly more interesting part, although I think the exploratory analysis was pretty interesting as well: we can ask some questions and then answer them. We've already gained several insights about the respondents and the programming community in general simply by exploring individual columns, but now let's ask specific questions and try to answer them using data frame operations and interesting visualizations. The first question we'll ask: what were the most popular programming languages in 2020? This survey was conducted in February of 2020, so technically this is 2019's data. If we look at schema.LanguageWorkedWith, the question asked was "Which programming, scripting, and markup languages have you done extensive development work in over the past year?" But this is a two-part question; the second part is "and which do you want to work in over the next year?" The respondents were presented with a list of options, and for each option they had two checkboxes: the first for whether they've worked with the language in the past year, and the second for whether they want to work with it over the next year. The responses were then broken into two columns: LanguageWorkedWith contains the answers to the first part, which languages have you worked with in the past year, and LanguageDesiredNextYear has the exact same question text but contains the responses to the second part.
I'm showing you this because it's something you'll notice with real-world data, especially with how surveys are conducted. Without the context you might not understand the difference, because LanguageWorkedWith and LanguageDesiredNextYear appear to have the exact same question. You may want to go through the README, or take the survey yourself, to understand that there are two parts: the first part is covered in the first column and the second part in the second. Putting that aside, let's look at some values in the LanguageWorkedWith column. It looks like, once again, people could select multiple options, multiple languages. The first person selected C#, then HTML/CSS, then JavaScript, separated by semicolons. So this is similar to the DevType field, and the first thing we might want to do is split it into multiple columns. We call split_multicolumn on survey_df.LanguageWorkedWith and get languages_worked_df. This is another data frame, with 25 columns, so it seems 25 languages were presented to the respondents. For each respondent and each language we have True or False: True indicating the respondent has used the language, False indicating they have not. Now, back to the question: which were the most popular programming languages in 2020? What we can do is identify percentages: what percentage of people selected JavaScript, what percentage selected Swift, Python, and so on, and then plot that as a bar chart. To get these percentages we simply say languages_worked_df.mean(). This works because True becomes 1 and False becomes 0, so if we take a column-wise mean we get the fraction of True values in the column: the mean is the sum of all the values divided by the total number of values, the zeros from the Falses drop out, and we're left with the number of Trues divided by the total, which is the fraction of people who used the language. To convert the fraction into a percentage we multiply by a hundred, and then we sort the values in descending order. That gives us the percentage for each language. It seems JavaScript is the most popular language, followed by HTML/CSS, SQL, and so on. Let's visualize this, once again using a horizontal bar chart. Among the languages used in the past year, JavaScript was the most popular, followed by HTML/CSS. This is no surprise, because today a lot of software has moved to the web; you probably spend most of your time in the browser, and even the Jupyter notebook platform we're using runs in the browser. To write code for the browser you have to write HTML and CSS, and you have to write JavaScript for interactivity. JavaScript might rank higher than HTML/CSS because you can also use JavaScript on the server side, using a framework called Node.js.
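In code, that computation looks roughly like this (assuming the raw survey column is named LanguageWorkedWith, as spoken in the lecture):

```python
languages_worked_df = split_multicolumn(survey_df.LanguageWorkedWith)

# The mean of a boolean column is the fraction of True values
languages_worked_percentages = languages_worked_df.mean().sort_values(ascending=False) * 100
```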
And because of all these reasons, JavaScript is the most popular language: it is the de facto language of the web. So we've plotted this chart, and based on it we can make some inferences. Next, today almost all applications need some kind of a database, and the most popular kind is the relational database, the tabular database. A lot of data, say the data of your Facebook account, your Twitter, your Instagram, or any platform that you use, the data that you put into Jovian, is saved in SQL databases, and the way to interact with these databases is the SQL language. That is why SQL is pretty popular as well. But beyond that, if you take away web development and database access, then for the actual application development, back-end development, data science, and the other non-web areas, it seems Python is the most popular language. This is again no surprise, because Python is a general-purpose language, and it has beaten out Java. Java was the de facto language for development for about 20 years, but it seems Python has now overtaken it. So it's a good thing you're learning Python; it is definitely an in-demand language. And there is a whole wealth of information you can gather just by digging a little deeper into this question. For example, what are the most common languages used by students, and how does that list compare with the most common languages used by professional developers? Is there a gap between what students learn and what professional developers use? You might want to answer: what are the most common languages among respondents who do not describe themselves as front-end developers? In front-end development you don't really have a choice, you have to use JavaScript (TypeScript is a choice, but a limited one), so if you exclude front-end developers, what are the most common languages people use? Can you find out the most common languages used by respondents working in a field related to data science? And maybe also look at age: for developers older than 35, or developers with more than 10 years of programming experience, what are the most common languages, and what are the most common languages used by younger people? Do you see a shifting trend? And what are the most common languages used in your home country? There are responses from over 180 countries here; is there a difference between, say, the US and India and China and other countries? Moving ahead, a similar question we could ask is: which languages are people most interested in learning over the next year? For this we can use the LanguageDesiredNextYear column, and the processing is pretty much identical: we split it into multiple columns, then get percentages for each language, once again by taking the mean, multiplying by a hundred, and sorting the values. You can see the resulting percentages here; let's jump straight into the visualization.
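The processing mirrors the previous question almost exactly; a sketch, again assuming the raw column name LanguageDesiredNextYear, with a Seaborn bar chart as one possible way to draw it:

```python
import seaborn as sns

languages_interested_df = split_multicolumn(survey_df.LanguageDesiredNextYear)
languages_interested_percentages = languages_interested_df.mean().sort_values(ascending=False) * 100

# One horizontal bar per language
sns.barplot(x=languages_interested_percentages, y=languages_interested_percentages.index);
```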
The visualization is once again pretty much identical, and it seems Python is the language most people are interested in learning, with JavaScript and HTML/CSS close behind, followed by SQL and TypeScript. It's no surprise that Python is the most sought-after language: it's an easy-to-learn, general-purpose programming language, well suited to a variety of domains like application development, numerical computing, and data analysis. In fact, we are using Python for this very analysis, so you're in good company; if you're learning Python, you can apply it to a whole host of domains. What you can do now is repeat the same exercises we discussed for the most common languages, just replacing them with the languages people are most interested in learning. The next question combines these two things: which are the most loved languages? That is, for which languages do we see a higher percentage of people who have used the language and want to continue using it over the next year? This may seem like a somewhat complicated question: people who have used the language in the past year and also want to continue with it, how am I going to figure that out? It may seem a little tricky, but it's actually really easy to solve using pandas data frame operations. Here's what we'll do. We will create a new data frame, languages_loved_df, which has the same structure as languages_worked_df and languages_interested_df: a column for every language and a row for every respondent, with True only where the corresponding values in both languages_worked_df and languages_interested_df are True. So if somebody has worked in a language in the past year and wants to continue using it, we put True there. The way to do that is really simple: you take languages_worked_df, put in an ampersand, and then languages_interested_df. This performs an element-wise Boolean AND: if the two respective values are both True, you get True; a True and a False, or two Falses, give False. This happens on a per-element, per-value level. Now if we look at languages_loved_df, for example, this respondent has True for C#, because they have worked in C# and want to continue working in it. We're saying that's a proxy for loving the language. Let's convert that into percentages. For each language we want to find how many people love it, out of the number of people who used it in the past year. So we take languages_loved_df.sum(), the column-wise sums, and divide by languages_worked_df.sum(). For the C# column, we count how many people love the language, the number of Trues in that column, divided by how many people have used the language.
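Here is the whole computation in one place, a sketch of the steps just described:

```python
# True only where a language was both used last year AND desired next year
languages_loved_df = languages_worked_df & languages_interested_df

# For each language: lovers as a fraction of users, as a percentage
languages_loved_percentages = (languages_loved_df.sum()
                               / languages_worked_df.sum()).sort_values(ascending=False) * 100
```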
And then we also multiply by a hundred to convert it into a percentage, and sort in descending order. Let's take a look: we have languages_loved_percentages, with a percentage for each language, and you can see the winner. The winner seems to be a language called Rust; let's look at it in a plot. You may not have heard of Rust. It is a low-level language for systems programming that provides the performance of languages like C++, but with many conveniences and a type system comparable to some of the best languages, like Scala and Java. It's a pretty useful language, a lot of people enjoy using it, and it's interesting to see that a small language with a growing community is one of the most loved languages. You can see hints of this in the earlier graphs: in the languages-worked-with chart, Rust is used by a small fraction of people, far smaller than, say, JavaScript. But in the chart of languages people are interested in learning, Rust is way up near the top, at about a third of JavaScript's level or higher. Rust definitely seems to be a language gaining a lot of popularity and interest, so if you're looking for a new language to learn, Rust may be a good choice. This metric that we just calculated is something Stack Overflow computes every year from their survey results, and Rust has been Stack Overflow's most loved language for four years in a row, followed by TypeScript, which again is a language that offers an alternative for web development. This is something you should do: once you get an answer, once you get a graph, search online for why the result might be what it is. That way you can learn a little more about Rust and TypeScript. What I find perhaps even more interesting is that Python features at number three, despite already being such a widely used language. That's generally not true for widely used languages: JavaScript's love score is fairly low, and Java's is far lower, whereas Python has remained at number three. It seems people who use Python enjoy Python, because the language has a solid foundation and is really easy to learn and use. I hope you've been able to learn Python and can now say you're comfortable with it in just these six weeks. It also has a strong ecosystem of libraries for various domains and a massive worldwide community of developers who enjoy using it, which now includes you and me. I've been using Python for the last 12 years, I definitely want to continue using it for the next 12 as well, and I hope you'll feel the same way. So that's about the most loved languages; we now have some insights there. The next simple exercise you can try is to identify the most dreaded languages: languages which people have used in the past year but do not want to learn or use over the next year. There's a small hint here: you can simply invert the languages_interested_df data frame.
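A sketch of that hint, following the same pattern as the loved languages:

```python
# ~ flips every boolean: used in the past year, but NOT desired next year
languages_dreaded_df = languages_worked_df & ~languages_interested_df
languages_dreaded_percentages = (languages_dreaded_df.sum()
                                 / languages_worked_df.sum()).sort_values(ascending=False) * 100
```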
The way to invert it is using the tilde (~) operator: just invert that data frame and then do the same thing we did for the loved languages. See if you get the same result as the official Stack Overflow results; you can always refer to them. Moving further along, the next question is: in which countries do developers work the highest number of hours per week? For this question we need to use the groupby method of a data frame. And there's a small caveat: we only want to consider countries with more than 250 responses, because otherwise the average isn't really representative. There are definitely lots of countries with thousands of responses, and we set a threshold of 250 so that a country where only 10 people responded is not considered when computing the average hours per week. So we take survey_df and group it by Country. What this does is: for each unique value of Country, and there are 184 of them, it takes all the rows with that country and separates them into groups. So far we've not performed any operation, which is why you don't see a result yet. Now, the column we're interested in is WorkWeekHrs, so we select it just as we'd select a column of a data frame, and just as an example, let's select the Age column as well. So for each group we have all its rows, we've selected just the WorkWeekHrs and Age columns, and now we need to aggregate them. One way to aggregate is the mean. If we use the mean, we get back a new data frame where the index is the country names, all the unique values of Country, and the values for WorkWeekHrs and Age are the averages from those groups. So for all the rows from Afghanistan, we've taken the average of WorkWeekHrs and put that value here; similarly, we've taken the average Age for Afghanistan and put it here. That is how groupby works, and you can learn more about it in the pandas lecture, lesson four. What we want, though, is to look only at countries with more than 250 responses. First we create a countries data frame: exactly what we just did, group by Country, keep just WorkWeekHrs, aggregate with the mean, and then sort by WorkWeekHrs in descending order. That's countries_df. You can see the countries with the highest averages and the ones with the lowest, but it's possible that for many of these countries the number of responses is really low. So we create a new data frame, high_response_countries_df, where we select only the rows where the value counts are greater than 250: we take survey_df.Country, find the value counts, the number of responses from each country, and filter to those with more than 250 responses.
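Putting those steps together (a sketch, assuming the column names Country and WorkWeekHrs):

```python
# Average working hours per country, sorted
countries_df = (survey_df.groupby('Country')[['WorkWeekHrs']]
                .mean()
                .sort_values('WorkWeekHrs', ascending=False))

# Keep only countries with more than 250 responses; both objects are indexed
# by country name, so the boolean mask aligns by index inside .loc
high_response_countries_df = countries_df.loc[survey_df.Country.value_counts() > 250].head(15)
```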
Then we use the .loc function to pick from countries_df only the rows with value counts greater than 250, and take the top 15. So now we have the top 15 countries. Once again, if you do not understand this, there are a couple of things you can do: revise the pandas lecture, or look at the documentation for .loc and split the expression into small parts. First take survey_df.Country.value_counts() and run it in a cell; then compare that with 250 and see what the result is; then put that into countries_df.loc and see what you get; and then add the .head(15). With all of these expressions, it's a question of breaking them down step by step: the more you break them down, the more you understand, and the more you'll be able to use them. So now we have the high-response countries, and these are the 15 countries with the highest number of working hours. It seems some Asian countries like Iran, Israel, and China have the highest working hours, followed by the United States. That's interesting. And then we have Greece, so programmers are probably working a lot in Greece. Once again we see a bunch of Asian countries; actually a majority of these seem to be Asian countries, then a few European countries, and the United States. But overall there isn't too much variation: 44 is the highest value, and if you treat the first three as outliers, the rest range from about 41 down to 40. So on average people are working about 40 hours per week; there's no country with an average of 60 hours or one averaging 20 hours, at least among the top 15. Now, a few exercises you can try. Compare how the average working hours differ across continents; you may find a list of countries in each continent useful. Try to find out which role has the highest average number of hours worked per week, out of all the roles we looked at; you may need to merge with the dev_type_df data frame we created, which had one column for each role. And try comparing how the hours worked differ between freelancers and full-time developers. It's possible you'll find that the average for full-time developers is around 40, but one of the reasons people take up freelance or even part-time work is that they want free time for other things. Try to verify whether this is true: do freelancers work less, or more? That could even help you decide between a freelance and a full-time role. And then let's ask one more question: how important is it to start young to build a career in programming? This is something a lot of people wonder about, not just in programming but in data science and, in general, in any field: can you enter the field if you have not studied it in college? I think we've established that even if you have not taken it in college, you can still enter the field. But can you enter it if you've worked in a different domain for a few years?
Can you enter this field in your thirties and your forties? To answer this, we will create a scatter plot of age versus years of coding experience. The YearsCodePro column asks: "Not including education, how many years have you coded professionally, as a part of your work?" So we'll plot age on the x-axis and years of professional coding experience on the y-axis, and that should give us a hint. Here is the chart; let's look at some values and try to understand it. We've also added a color here, which we'll get to in a moment. If you look at, say, age 40, there are several people at that age with less than 10 years of professional programming experience. In fact, at every age, from around 15 all the way to close to 50, there are people who have just started working as programmers. What that means is that you can start programming professionally at any age. There is no rule that you have to start early: people are starting in their twenties, thirties, and forties, and everybody's welcome. And this is professional experience, not just programming experience in general. So if you put in the work, if you're open to learning, if you're excited about it, you can definitely get into the domain. We have also added a color for each of these dots; each dot represents one response. If a person is a hobbyist, we show them in blue, and if not, in orange. Once again, it turns out that a lot of programmers also say that programming is a hobby for them, especially in the initial years. To get through the initial years, it really helps to have programming as a hobby: something you do just to build things, to solve problems, something you do on the weekend. If you do that, you'll probably also have a long and fulfilling career in programming. So these are some inferences we can draw from the scatter plot. We can also look at the distribution of the Age1stCode column, just to see when people tried programming for the first time. As you might suspect, a lot of people have some exposure to programming in their teens: maybe they've written a first line of code, a simple HTML page, or a hello-world program. Python's hello world is just one line: you type python into a terminal and say print('Hello world'). That by itself doesn't tell you what programming is really like, but it is a first exposure. And in pretty much every field you end up writing some code: in Excel you write formulas, in different streams of engineering you probably use MATLAB or some kind of numerical computing package, and of course in computer science you write programs. So it's not surprising that people get some exposure to programming at an early age.
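Both plots can be drawn in a couple of lines (a sketch; it assumes Age, YearsCodePro, and Age1stCode were already converted to numeric columns during the cleaning step, since the raw CSV stores some of them as strings):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Age vs. professional coding experience, colored by hobbyist status
sns.scatterplot(x='Age', y='YearsCodePro', hue='Hobbyist', data=survey_df)
plt.show()

# Distribution of the age at which respondents wrote their first line of code
plt.hist(survey_df.Age1stCode.dropna(), bins=30)
plt.xlabel('Age at first line of code');
```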
But there are also people who first become exposed to programming after a certain age: you can see a lot of people doing so after the age of 30 and after the age of 40, and a small number who become exposed to it even much later in life. So there are people from all ages and walks of life who are learning to code. Now, here are a few questions you can try to answer. How does experience change opinions and preferences? Maybe you can repeat the entire analysis comparing the responses of people with more than 10 years of professional programming experience, like which languages they use and which they want to learn, with those who have less. This goes back to students versus professionals; now you'd have three categories: students, professionals, and experienced professionals. Do you see any interesting trends, what one might call a generation gap between coders from the old days and people who are learning right now? What kinds of roles do they occupy? What kinds of languages do they prefer? And maybe you can also compare the years of professional coding experience across different genders. My guess is that although women are underrepresented right now, you will see that more women are entering the field than there used to be earlier, so things are definitely improving, and you can try to validate that. So with that, we have barely scratched the surface. We've been talking for about 90 minutes and we've already gathered a huge load of insights, and hopefully you're thinking of many more questions that you'd like to ask and answer using the data. We have not even used all 20 columns, only about 12 or 13, and there are another 45 columns to pick from. You can use the empty cells below to ask and answer more questions, and you can try all of these exercises. There's really no end to this: the more you do, the more you experiment, the more you exercise these skills, the better you get. Now, I've used fairly simple charts and haven't done many breakdowns; I wanted to leave that as an exercise for you. Try to replace each chart we've drawn with a different kind of graph: go through the Matplotlib gallery and the Seaborn gallery and pick which graph might be interesting to draw in each case. These are all exercises for you to try. There's a lot of depth in data analysis, and you could probably spend at least a few months just exploring different ways to slice, dice, analyze, and visualize data. So please do that; the best way to learn is by doing. Now, let's summarize some of our inferences and conclusions; this is always a good thing to do at the end of your analysis. Based on the demographic data, we can infer that the survey is somewhat representative of the overall programming community, although it definitely has fewer responses from programmers in non-English-speaking countries and from women and non-binary genders. We have also learned that the programming community is probably not as diverse as it could be.
In terms of gender, in terms of age, maybe in terms of the different languages and countries represented, we should probably take more effort to support and encourage members of underrepresented communities; race is another factor, one we've not looked at, where there's a lot of disparity. We've learned that most programmers hold a college degree, although a fairly large percentage of them did not have computer science as their major, so a computer science degree isn't compulsory for learning to code or building a career in programming, but some STEM education definitely helps. A significant percentage of programmers work part-time or as freelancers (10% is actually a pretty good number), and this can be a great way to break into the field, not just in programming but also in data science, which is a closely related field. We learned that JavaScript and HTML/CSS were the most popular programming languages in 2020, that Python is the language most people are interested in learning, and that Rust and TypeScript are the most loved languages, both with small but fast-growing communities. We saw that programmers around the world seem to work about 40 hours per week on average, with slight variations by country. And finally, we learned that you can learn and start programming professionally at any age, and you're likely to have a long and fulfilling career if you also enjoy programming as a hobby, especially during the first few years. So that's our analysis, and as I said, there's a wealth of information still to be discovered; we've barely scratched the surface. Here are a few more ideas I wanted to share. You can repeat the analysis for different age groups and genders and compare the results. Specifically, try to pick a slice of responses that represents you, maybe the country, gender, or age group you're in, see what the preferences of those people are, and see if that reflects your own opinions. Try choosing a different set of columns: we've chosen 20 out of 65 and used about 12 of them, so look at the other columns, and go through the README and the survey. Or repeat the analysis focusing on diversity: identify the areas where underrepresented communities are at par with the majority. You might see that in education there's probably not a big difference in the percentages of different degrees held, but there are places where there are differences, like salaries, where you will see a definite gap; you can try to validate that. And try comparing the results of this year's survey with previous years to identify interesting trends, because this is data you get every year. Once again, you can go back to this link, insights.stackoverflow.com, and download the raw data for every year. One interesting exercise is to look at the official survey results: it's a pretty long analysis, covering pretty much the same questions we've answered, so see the survey results and try to replicate them, graph for graph.
This could be a great way to check whether you're doing the same kind of data cleaning, analysis, and simplification, and to see how those choices affect your results. If you can replicate all of those results, that's a great sign: this is real-world data, a large dataset, and a real analysis, so you'll have done something significant in data analysis, and you can proudly showcase it on your professional profile. I just want to share a few references. We've used pandas, Matplotlib, and Seaborn, so you can refer to the previous lectures; just go to zerotopandas.com and you can find and watch them there, or you can go through the documentation and the user guides for pandas, Matplotlib, and Seaborn. Also go through the galleries on their websites: the galleries show you all the different types of charts you can create with these libraries. And finally, as I mentioned, we are building the opendatasets Python package, a curated collection of datasets for data analysis and machine learning. So far we have about six or seven datasets, but we are planning to add about a hundred over the next few days. We released this library just yesterday, and it's something we put together quickly to make it easy for you to download these datasets, so you can use them for your course project as well. And that is the next step I want to talk about. What we saw today, the exploratory data analysis, is basically what you need to do for your course project. You can repeat it with the same dataset, asking different questions, doing different analyses, picking different columns, or you can pick a different dataset. So let's open up the course project page. Once again, I want to remind you that the deadline has been extended to October 3rd, 11:59 PM GMT, so you have more than a couple of weeks to work on this. The objective of the course project is exactly what we did today: you find a real-world dataset of your choice, use NumPy and pandas to parse, clean, and analyze the data, use Matplotlib and Seaborn to create visualizations, and then ask and answer interesting questions about the data. An optional but highly recommended step, because you've put in so much work: consolidate all of your learning into a blog post to showcase your work. I just want to give you a quick overview of the course project, and then we have a few exciting things to close out. The course project has a starter notebook, and by the way, we did a walkthrough of the course project in last week's video as well, so you can check that out too. You can take the starter notebook and just run it on Binder, or you can run it on your local computer; you do not have to run it on Binder. I want to show you how to run it on your computer. Here I have a terminal; let me just zoom in a bit. This would be a terminal, a command prompt, or an Anaconda prompt.
If you want to download this to your local computer and run it locally, click on the Clone button and copy that command. To run it you need the Jovian command-line tool installed, so first I'm going to run pip install jovian --upgrade. That upgrades the Jovian Python library, and once it's installed, you have a command-line tool called jovian that you can use. So now I can copy the clone command, come back here, and run jovian clone with the title of the project, username slash project name, and press Enter. That downloads the files; you can see these files got downloaded to my desktop, into the course project starter folder. If you wish, you can change the name of this folder. Let's say I'm going to analyze the State of JavaScript survey, so I'll call this folder state-of-javascript-2019; that's the data I'm going to analyze for my project. Then I go into this folder. Now you need to install the various libraries, and we suggest installing them inside a conda environment. You can create one manually with conda create -n followed by a name, say course-project, and you can specify a Python version for it; these are the same instructions provided in each lecture notebook. So let's create a conda environment: conda create -n course-project. That creates a Python environment where we can install all our libraries. Once the environment has been created, we activate it: conda activate course-project. Inside this environment we install the libraries we want to use: jovian, jupyter, opendatasets (you don't have to, but you might), pandas, NumPy, Seaborn, and Matplotlib. So we install all the libraries after activating the environment. Let's give that a second; sometimes this can take a while, which is one of the reasons we recommend Binder, where all of these steps are taken care of for you. Once the libraries are installed, we can start a Jupyter notebook by typing jupyter notebook. So, a quick recap of what we did. Once you run jupyter notebook, it prints out a URL which you can open in your browser. Step one was to install the Jovian Python library, and of course you also need the Anaconda distribution of Python installed. Step two is to clone the notebook using the jovian clone command. Step three is to enter the directory and create a conda environment using conda create.
Step four is to activate the environment with conda activate and the environment name, and then install the libraries using pip install. And step five is to open up the Jupyter notebook by typing jupyter notebook. That prints this URL, which you open in a new browser tab, and there you can see the zerotopandas-course-project.ipynb file. At this point it's pretty similar to where we get when we click the Run button and choose Run on Binder. Run on Binder is a one-click experience, which is why we recommend it, but with a few more steps you can run it on your local computer, and I think you're now comfortable enough to figure these things out; if not, you can always ask on the forum. So now we have the zerotopandas-course-project file. The first cell is a text cell, a markdown cell, which you should remove before submission. This cell describes what you need to do; it gives you the guidelines. The first step is to select a real-world dataset: you have to find and download an interesting real-world dataset, and we've given some recommendations in the forum topic on recommended datasets for the course project. We've linked many places from which you can download datasets: there's Kaggle, there's the UCI Machine Learning Repository, there's a GitHub repository with a big list of datasets, and of course we're sharing the opendatasets library with you, where we'll be adding more datasets. So there are a lot of places to get datasets, and we've also picked out some interesting ones for you. Today we've used the Stack Overflow developer survey, but there's also COVID-19 data, updated on a daily basis, the State of JavaScript survey, stocks data, country-wise COVID data, agriculture data, data science job data, sports data, and video game data. There's a lot of interesting data to analyze, and of course there are also places where you can download your own personal information and analyze it. So please go through this list and try to identify something you find interesting. Make sure the dataset contains tabular data, preferably CSV or Excel files, so that it can be read using pandas. Then you perform some data preparation and cleaning, just as we did: load up the data frame, look at the number of rows and columns, decide which columns you want to use, and decide how you're going to handle any missing or invalid data. Maybe you'll want to parse some dates, like we parsed some numbers; maybe create additional columns or merge multiple datasets. Then you perform exploratory analysis and visualization: explore the distributions of numeric columns using histograms, use bar charts to visualize categorical columns, use scatter plots to see relationships across multiple columns, and take note of interesting insights from this analysis. And then ask and answer questions about the data.
You have to ask at least five interesting questions and answer them, either by computing the results using NumPy and pandas or by plotting graphs using Matplotlib and Seaborn. Whenever you use a library function, explain briefly what it does; it's always helpful for the reader. Finally, take your inferences and summarize them in a conclusion. This is a really important part: consolidating everything you've learned into a single paragraph or section. Also share ideas for future work on the same topic, using the same dataset or maybe other relevant datasets, and make sure to share links to any resources you think might be helpful for people reading your analysis: documentation, maybe a link to the course page, so that people who are not familiar with pandas and NumPy can use it. And then the last step is to make a submission and share your work. Whether you're using Binder or running on your local computer, from time to time you need to run jovian.commit. You set a project name (mine might be, say, stateofjs-survey-2019-analysis) and then use the Jovian library to run jovian.commit() and commit the project. That takes the notebook, either from Binder or from your local computer, and puts it on your Jovian profile. Then you take that link, go back to the course project page, paste the link in, and click Submit. Once you do that, you will see it show up in your submission history. Make sure it is a Jovian link: don't put in a localhost link, a Binder link, or a Kaggle or Colab link. Please commit to Jovian and submit the Jovian link, and make sure the link is to a notebook hosted on your own profile; otherwise the submission will be rejected. What we will do then is evaluate your project. So what does evaluation look like? We've shared the evaluation criteria: your dataset must contain at least 3 columns and 150 rows of data; you must ask and answer at least five questions about the dataset; your submission should include at least five visualizations; and your submission should include explanations using markdown cells, apart from just code. Just code is not good enough, so please write explanations: that helps you understand your data and gather insights, and it helps others too. Tomorrow, if you want to showcase this project on your public profile, share it on LinkedIn, or link to it from your resume, you want it presented well, because presentation, believe it or not, is a very important part of data science, so do not skip it. It's not just about writing code; it's about gathering inferences and presenting them, making interesting observations, forming hypotheses, and digging deeper. The data gives you just some facts, but you have to analyze them and work out what those facts mean, in the context of the dataset itself, or your company, or whatever you're working on.
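As a quick reference for the submission step, the commit cell looks something like this (the project name here is just the example from the lecture):

```python
import jovian

# The project name determines the notebook's URL on your Jovian profile
jovian.commit(project='stateofjs-survey-2019-analysis')
```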
And finally, your work should not be plagiarized, so do not copy-paste from somewhere else. Of course you can borrow functions, and every dataset has been analyzed by many different people, so you can do analysis that others have already done, and you can even look at their notebooks. But do not plagiarize, and I think you will be able to tell best whether you are plagiarizing. The entire project should not be a copy-paste job; the biggest loser in that case will be you yourself, so please don't do that. Apart from that, do share your work online. You've put a lot of effort into this project, so please share it on your social media. Once you've committed it to Jovian, you can simply use the Share button and share it on any of these platforms. Share your work on the forum as well: there are tens of thousands of course participants, and we would love to see thousands of projects shared on that thread. So do share your work on the forum, browse through projects by other participants, and maybe give feedback. That's also a great way to learn: when you see other people creating similar or different visualizations, you get to learn from their code and their analysis. So please do that. And one highly recommended step is to write a blog post; a blog post is a great way to present and showcase your work. You can sign up on medium.com to write a blog post for your project. It's really simple: you sign up, click New Story, and start typing. As a starting point, you can simply copy over the explanations from your Jupyter notebook into your blog post. As for the code and the graphs, you can embed them: take your code and graphs from the Jupyter notebook submitted to Jovian and embed them within your blog post. Just watch this video for a quick tutorial, or follow this guide. You can see here a blog post on Medium with a code cell embedded inside it; you can embed some code along with its outputs, like graphs, or embed just the graphs if you want. The benefit of writing a blog post is that a Jupyter notebook contains a lot of code and pre-processing steps, whereas in your blog post you decide the narrative: you pick just the right code blocks and the right graphs from your Jupyter notebook and embed them to tell the story you want. It doesn't have to follow the same structure as the notebook, and it will be a great thing to showcase on your profile, share on your social media, put on your resume, or just mention when you're applying for an internship or things like that. And you can check out our Medium publication, which we've linked to, for many good examples of blog posts. All of these were written by people from the community during a previous course, and in a lot of cases these were the first or second blog posts their authors had ever written.
So please do check it out, and don't be afraid: you can write one too. It just takes a little more effort, maybe a few more hours after you finish your project, but do write one if at all possible. As I mentioned, we've shared some recommended datasets and some example projects, so you can go through those projects, and you can keep revisiting this video to get a sense of how you should analyze your data. You can use the project notebook as a starting point; we've created a template for you, where you can put in the project title, write some introduction, and fill in the sections for each step, and remember to commit your work at each step so that you do not lose it. Or you can start from a blank notebook; that is perfectly all right, it's all a question of what you feel most comfortable with. Just remember to remove the instructions cell before your submission, so that the instructions are not included in your final notebook. So with that, we've reviewed the course project as well. You have time till the 3rd of October, which should be sufficient, so please do put in the work: if you've come this far, you should definitely do the course project while you have all of these ideas in your head, and it will really reinforce everything we've learned. Now, I just want to do a quick recap of the course for a couple of minutes. A lot of you started out without any Python programming experience, so we began with an introduction to programming with Python: first steps with Python and Jupyter notebooks, using them like a calculator; exploring data types and variables; and branching with conditional statements and loops. In the second lecture we looked at writing reusable code with functions, and working with the OS and files. Then came the first assignment, where we solved some word problems using variables and arithmetic operations, manipulated data types using methods and operators, and used branching and iteration to translate ideas into code. We also learned how to explore the documentation and how to get help from the community. After learning Python, we looked at NumPy: how to go from Python lists to NumPy arrays, how to work with multidimensional arrays, the different array and matrix operations you can do, slicing, and broadcasting. NumPy by itself is a very powerful library, and we also saw how to work with CSV data files. Then we did an assignment on NumPy array operations, where you explored the NumPy documentation and demonstrated the usage of five NumPy functions, creating a Jupyter notebook with explanations of how to use them and how not to use them. Hundreds of notebooks were shared with the community, and we probably learned a lot from each other. Then we learned how to analyze tabular data with pandas: reading and writing CSV data, and querying, filtering, and sorting data frames. Pandas data frames are really powerful, and even today we've seen a lot of functions that we probably did not explore earlier. We also looked at grouping and aggregation for summarizing data, and at merging and joining data from multiple sources, and then we did an assignment on pandas where we applied all of these things that we learned.
Finally, we had a lecture on visualization with Matplotlib and Seaborn, where we learned how to create basic visualizations with Matplotlib and more advanced ones with Seaborn: line charts, scatter plots, bar charts, heatmaps, and histograms. We also saw how to customize and style charts to make them beautiful, how to plot images, and how to plot multiple charts in a grid (see the short sketch at the end of this recap).

All of these things were then tied together in today's lecture on exploratory data analysis, where we took a real-world dataset, the Stack Overflow developer survey with around 65,000 responses, loaded the data, cleaned and pre-processed it, did exploratory analysis and visualization, asked and answered questions, and made a bunch of inferences. And now you're working on the course project, where you will repeat this process on a real-world dataset of your choice. So that was a quick recap of the course.
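As a quick illustration of the chart types covered in the visualization lecture, here is a minimal sketch using Seaborn's built-in tips dataset; the dataset and the specific charts are my choice for this example, not taken from the course notebooks:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Load a small sample dataset that ships with Seaborn.
tips = sns.load_dataset("tips")

# Plot multiple charts in a grid.
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Scatter plot: tip amount vs. total bill.
sns.scatterplot(data=tips, x="total_bill", y="tip", ax=axes[0, 0])
axes[0, 0].set_title("Scatter plot")

# Bar chart: average total bill per day of the week.
sns.barplot(data=tips, x="day", y="total_bill", ax=axes[0, 1])
axes[0, 1].set_title("Bar chart")

# Histogram of total bill amounts.
sns.histplot(data=tips, x="total_bill", ax=axes[1, 0])
axes[1, 0].set_title("Histogram")

# Heatmap: average tip by day and party size.
pivot = tips.pivot_table(values="tip", index="day", columns="size", aggfunc="mean")
sns.heatmap(pivot, annot=True, fmt=".2f", ax=axes[1, 1])
axes[1, 1].set_title("Heatmap")

plt.tight_layout()
plt.show()
```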
Now, what should you do next? Try out the notebooks yourself. You can revise any of the previous lectures, watch the videos, and run the notebooks; they're just one click away at any point. Definitely try out the Stack Overflow survey notebook and some of the exercises in it. And if you have any questions, you can always ask on the forum. I've been saying this from the start: the people who are active on the forum are the most likely to complete the course. If you haven't checked out the forum yet, go to zerotopandas.com and follow the link to the course community discussion forum. We have topics for each lecture, so for any question you have, go to the relevant topic and search through it first; it's likely that your question has already been answered. If not, just reply to that topic and somebody will answer it. We have an amazing course community, and we've seen huge contributions from people who spend hours answering questions from others, so I just want to give them a big thank you. Please do participate in the forum.

If you complete all the assignments and the course project, and you get a pass grade in all of them once we evaluate them, you will be issued a certificate of accomplishment. This will be a verified link hosted on Jovian, a page that is part of your profile, and it will also be available for download as a PDF, so you can print it out if you like. You will be able to add it to your LinkedIn profile, so that anybody who looks at your profile can see that you have completed a certification in data analysis with Python, and you can share it on Twitter, Facebook, or wherever you like. We've all put in a lot of effort, so do celebrate once you've earned the certificate, share it, and encourage your friends to take the course in future sessions. This is what the certificate looks like; it will be embedded into a webpage from which you can download it and share the page itself as well.

The one thing you should not do is immediately jump to another course. Work on a project, and make your project as large and as interesting as possible, because it's not enough to say that you've done a course, that you have a certification, or that you've done a small project. You should have a significant data analysis project under your belt, something you have documented and presented well and put up on your public profiles. So do something that you feel proud of, and then put it up on your public profile.

Next, build and improve your professional profile: write blog posts, tutorials, and guides. You can do this on Medium or GitHub Pages; there are many platforms where you can write blogs, and you can use Jovian to share your Jupyter notebooks (more on that in a moment). Do write guides that can help people who are where you once were: look back at your earlier self and write a small tutorial for that person, to encourage them, to demystify data science for them, to tell them that pandas is not as scary as it might seem, or simply to point them to this course. You now know about a lot of resources available online, so curate them and share them with your community. If you're a student, share them with your classmates; if you're a professional, share them with your colleagues and within your company. The more you share your knowledge, the better you get at it as well.

Then showcase your certificate and your project on your LinkedIn profile, your GitHub profile, and your resume. And once you feel like you've really done a substantial amount of work on this topic, that is the point at which you should take more data science and machine learning courses. Do not fall into the trap of doing a bunch of courses without any real output from them. The best way to learn is by doing, and by doing good projects.

You can use Jovian to build your professional data science profile, and we are working on some very interesting improvements, a lot of which we've already added. If you open Jovian and log in, you can go to your profile, where you'll find an edit button that lets you add your current designation, your university or company, and a link to your GitHub profile. You'll also see a collection of all the notebooks you've created so far. Any time you create a Jupyter notebook or do some interesting analysis, just upload it to Jovian: run jovian.commit() inside it (as in the sketch below) and it will get added to your profile. You can also upload notebooks directly, say from GitHub or from your computer, or import one from a URL. And you can create collections, where you gather interesting notebooks together; for instance, I have a collection on deep learning and one on data analysis, to which I'm going to add a few more notebooks. That's a useful way to organize your notebooks. You also get access to the forum as part of the profile.
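Here is a minimal sketch of committing a notebook, assuming the jovian package is installed (for example via pip install jovian); the project name is a hypothetical example:

```python
# Run this inside a cell of the Jupyter notebook you want to save.
import jovian

# Captures a snapshot of the current notebook and uploads it to your
# Jovian profile; on the first run it prompts for your Jovian API key.
# The project name below is a hypothetical example.
jovian.commit(project="stackoverflow-survey-eda")
```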
So try to answer questions on the forum, because that will also reflect on your professional profile: the more questions you've answered, the more knowledgeable you appear. All the projects hosted on Jovian get a really nice, optimized view of the Jupyter notebook by default. These views are mobile friendly, so even if you open them on a phone they load up really fast; we've spent a lot of time tuning the performance of these pages. If you're sharing a link to a Jupyter notebook, a lot of people will probably open it on mobile, where other platforms might not render it well, so please do use Jovian. It's a great way to share your work, and it's also a window into how much work you've done, because the version history of a notebook shows the effort you've put in. If you have a notebook with some 20, 30, or 50 versions and you share it with somebody, they can go through the history and see that you've really put a lot of effort into it. A lot of building your professional profile is simply about making what you've done visible, so that people can discover it and learn more about you.

You can also follow us on Twitter; we are Jovian there. We keep tweeting interesting notebooks and interesting resources for data science. In fact, if you share your notebook online and tag us, we will definitely try to retweet it; we retweet at least four or five interesting notebooks every week.

We hope you'll find Jovian useful. On behalf of the entire course team, I want to say a big thank you to you, for being so active on the forum, for doing all the assignments, for working on the project, for going through all the lectures, and for just being awesome overall. We were really excited to run this course, and this is not the end: we're hoping to have a long association with you, and we have a lot of interesting things planned. So I will see you in the forums; this is the end of our lectures, but we will still be able to interact via the forums. You can follow freeCodeCamp, follow Jovian on Twitter, or follow me.

With that, we come to the end of Data Analysis with Python. Thanks a lot for joining, all the best to those of you still working on the assignments and the course project, and I hope to see you soon in a future course. Thank you and goodbye.
Info
Channel: Jovian
Views: 8,055
Keywords: data science tutorial, data science python, data science, data science full course, data analysis, data analyst, data science course, data analytics for beginners, data analyst skills, data analytics course, python for data analysis, matplotlib, jupyter notebook, data processing, ai, ml, exploratory data analysis, data science project, data analysis project, data science projects, machine learning project, kaggle, data science portfolio, exploratory data analysis python, python
Id: B4GbWjUFUGk
Length: 128min 5sec (7685 seconds)
Published: Thu Dec 31 2020